PEP 3131: Supporting Non-ASCII Identifiers

  • Anders J. Munch

    #31
    Re: PEP 3131: Supporting Non-ASCII Identifiers

    Michael Torrie wrote:
    > So given that people can already transliterate their language for use as
    > identifiers, I think avoiding non-ASCII character sets is a good idea.

    Transliteration makes people choose bad variable names, I see it all the time
    with Danish programmers. Say e.g. the most descriptive name for a process is
    "kør forlæns" (run forward). But "koer_forlaens" is ugly, so instead he'll
    write "run_fremad", combining an English word with a slightly less appropriate
    Danish word. Sprinkle in some English spelling errors and badly-chosen English
    words, and you have the sorry state of the art that is today.

    - Anders

    • Steven D'Aprano

      #32
      Re: PEP 3131: Supporting Non-ASCII Identifiers

      On Sun, 13 May 2007 15:35:15 -0700, Alex Martelli wrote:
      > Homoglyphic characters _introduced by accident_ should not be discounted
      > as a risk
      > ....
      > But when something similar
      > happens to somebody using a sufficiently fancy text editor to input
      > source in a programming language allowing arbitrary Unicode letters in
      > identifiers, the damage (the sheer waste of developer time) can be much
      > more substantial -- there will be two separate identifiers around, both
      > looking exactly like each other but actually distinct, and unbounded
      > amount of programmer time can be spent chasing after this extremely
      > elusive and tricky bug -- why doesn't a rebinding appear to "take", etc.
      > With some copy-and-paste during development and attempts at debugging,
      > several copies of each distinct version of the identifier can be spread
      > around the code, further hampering attempts at understanding.

      How is that different from misreading "disk_burnt = True" as "disk_bumt =
      True"? In the right (or perhaps wrong) font, like the ever-popular Arial,
      the two can be visually indistinguishable. Or "call" versus "cal1"?

      Surely the correct solution is something like pylint or pychecker? Or
      banning the use of lower-case L and digit 1 in identifiers. I'm good with
      both.
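
A linter-style check for ASCII look-alikes can be sketched in a few lines. This is only an illustration, not how pylint or pychecker work: the `LOOKALIKES` table and the `skeleton` and `confusable_groups` helpers are hypothetical names, and which character pairs actually collide depends on the font.

```python
from collections import defaultdict

# Hypothetical look-alike table; the real set of collisions is font-dependent.
LOOKALIKES = (("rn", "m"), ("1", "l"), ("0", "O"))


def skeleton(name):
    """Reduce a name to a canonical form in which look-alikes compare equal."""
    for seen, canonical in LOOKALIKES:
        name = name.replace(seen, canonical)
    return name


def confusable_groups(names):
    """Group distinct names that share a skeleton and so may be misread."""
    groups = defaultdict(set)
    for name in names:
        groups[skeleton(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(confusable_groups(["disk_burnt", "disk_bumt", "call", "cal1", "total"]))
# [['disk_bumt', 'disk_burnt'], ['cal1', 'call']]
```

The same grouping idea would extend to non-ASCII homoglyphs by adding entries to the table.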


      --
      Steven.

      • Anders J. Munch

        #33
        Re: PEP 3131: Supporting Non-ASCII Identifiers

        Alex Martelli wrote:
        > Homoglyphic characters _introduced by accident_ should not be discounted
        > as a risk, as, it seems to me, was done early in this thread after the
        > issue had been mentioned. In the past, it has happened to me to
        > erroneously introduce such homoglyphs in a document I was preparing with
        > a word processor, by a slight error in the use of the system-provided
        > way for inserting characters not present on the keyboard; I found out
        > when later I went looking for the name I _thought_ I had input (but I
        > was looking for it spelled with the "right" glyph, not the one I had
        > actually used which looked just the same) and just could not find it.

        There's any number of things to be done about that.
        1. # -*- encoding: ascii -*-
        (I'd like to see you sneak those homoglyphic characters past *that*.)
        2. pychecker and pylint - I'm sure you realise what they could do for you.
        3. Use a font that doesn't have those characters or deliberately makes them
        distinct (that could help web browsing safety too).

        I'm not discounting the problem, I just don't believe it's a big one. Can we
        choose a codepoint subset that doesn't have these dupes?
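
One way a checker could support this is to list every identifier that contains a non-ASCII character, so a reviewer knows exactly where to look. A minimal sketch, assuming Python 3.7 or later (for `str.isascii`) and the standard `tokenize` module; the function name `non_ascii_identifiers` is made up for illustration and is not part of pychecker or pylint.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line, name) pairs for each NAME token with non-ASCII characters."""
    found = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


sample = "total = 1\nk\u00f8r_forl\u00e6ns = total + 1\n"
print(non_ascii_identifiers(sample))
# [(2, 'kør_forlæns')]
```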

        - Anders

        • Paul Rubin

          #34
          Re: PEP 3131: Supporting Non-ASCII Identifiers

          Alexander Schmolck <a.schmolck@gmail.com> writes:
          > Plenty of programming languages already support unicode identifiers,

          Could you name a few? Thanks.

          • Steven D'Aprano

            #35
            Re: PEP 3131: Supporting Non-ASCII Identifiers

            On Sun, 13 May 2007 10:52:12 -0700, Paul Rubin wrote:
            > "Martin v. Löwis" <martin@v.loewis.de> writes:
            >> This is a commonly-raised objection, but I don't understand why people
            >> see it as a problem. The phishing issue surely won't apply, as you
            >> normally don't "click" on identifiers, but rather type them. In a
            >> phishing case, it is normally difficult to type the fake character
            >> (because the phishing relies on you mistaking the character for another
            >> one, so you would type the wrong identifier).
            >
            > It certainly does apply, if you're maintaining a program and someone
            > submits a patch. In that case you neither click nor type the
            > character. You'd normally just make sure the patched program passes
            > the existing test suite, and examine the patch on the screen to make
            > sure it looks reasonable. The phishing possibilities are obvious.

            Not to me, I'm afraid. Can you explain how it works? A phisher might be
            able to fool a casual reader, but how does he fool the compiler into
            executing the wrong code?

            As for project maintainers, surely a patch using some unexpected Unicode
            locale would fail the "looks reasonable" test? That could even be
            automated -- if the patch uses an unexpected "#-*- coding: blah" line, or
            includes characters outside of a pre-defined range, ring alarm bells.
            ("Why is somebody patching my Turkish module in Korean?")
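
An automated check of that kind might look roughly like this. It is a sketch under assumptions: the patch is a unified diff held in a string, the project expects `utf-8` source with ASCII-only content, and `review_patch` is a hypothetical helper. The declaration pattern follows the one given in PEP 263, and `str.isascii` needs Python 3.7 or later.

```python
import re

# Source-encoding declaration pattern, following PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def review_patch(patch_text, expected_coding="utf-8", is_allowed=str.isascii):
    """Yield (line_number, reason, text) for added lines worth a closer look."""
    for number, line in enumerate(patch_text.splitlines(), start=1):
        # Inspect only lines the patch adds; skip the "+++ b/file" header.
        # (Sketch limitation: added content that itself starts with "++" is skipped.)
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        match = CODING_RE.match(added)
        if match and match.group(1).lower() != expected_coding:
            yield number, "unexpected coding declaration", added
        if not is_allowed(added):
            yield number, "character outside allowed range", added


patch = (
    "--- a/mod.py\n"
    "+++ b/mod.py\n"
    "@@ -1,2 +1,3 @@\n"
    "+# -*- coding: euc-kr -*-\n"
    " total = 1\n"
    "+t\u043etal = 2\n"  # the second letter is U+043E CYRILLIC SMALL LETTER O
)

for number, reason, text in review_patch(patch):
    print(number, reason, ascii(text))
```

Passing a different `is_allowed` callable would let a project accept, say, Latin-1 letters while still flagging other scripts.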



            --
            Steven

            • Marc 'BlackJack' Rintsch

              #36
              Re: PEP 3131: Supporting Non-ASCII Identifiers

              In <mailman.7627.1179086416.32031.python-list@python.org>, Michael Torrie
              wrote:
              > I think non-ASCII characters make the problem far far worse. While I
              > may not understand what the function is by its name in your example,
              > allowing non-ASCII characters makes it worse by forcing all would-be
              > code readers to have all kinds of necessary fonts just to view the
              > source code. Things like reporting exceptions too. At least in your
              > example I know the exception occurred in zieheDreiAbVon. But if that
              > identifier is some UTF-8 string, how do I go about finding it in my text
              > editor, or even reporting the message to the developers? I don't happen
              > to have that particular keymap installed in my linux system, so I can't
              > even type the letters!

              You find it in the sources by the line number from the traceback and the
              letters can be copy'n'pasted if you don't know how to input them with your
              keymap or keyboard layout.

              Ciao,
              Marc 'BlackJack' Rintsch

              • Aldo Cortesi

                #37
                Re: PEP 3131: Supporting Non-ASCII Identifiers

                Thus spake "Martin v. Löwis" (martin@v.loewis.de):
                > - should non-ASCII identifiers be supported? why?

                No! I believe that:

                - The security implications have not been sufficiently explored. I don't
                want to be in a situation where I need to mechanically "clean" code (say,
                from a submitted patch) with a tool because I can't reliably verify it by
                eye. We should learn from the plethora of Unicode-related security
                problems that have cropped up in the last few years.
                - Non-ASCII identifiers would be a barrier to code exchange. If I know
                Python I should be able to easily read any piece of code written in it,
                regardless of the linguistic origin of the author. If PEP 3131 is
                accepted, this will no longer be the case. A Python project that uses
                Urdu identifiers throughout is just as useless to me, from a
                code-exchange point of view, as one written in Perl.
                - Unicode is harder to work with than ASCII in ways that are more important
                in code than in human-language text. Human eyes don't care if two
                visually indistinguishable characters are used interchangeably.
                Interpreters do. There is no doubt that people will accidentally
                introduce mistakes into their code because of this.

                > - would you use them if it was possible to do so? in what cases?

                No.

                Regards,

                Aldo

                --
                Aldo Cortesi
                aldo@nullcube.com

                Mob: 0419 492 863

                • Paul Rubin

                  #38
                  Re: PEP 3131: Supporting Non-ASCII Identifiers

                  Steven D'Aprano <steve@REMOVE.THIS.cybersource.com.au> writes:
                  >> It certainly does apply, if you're maintaining a program and someone
                  >> submits a patch. In that case you neither click nor type the
                  >> character. You'd normally just make sure the patched program passes
                  >> the existing test suite, and examine the patch on the screen to make
                  >> sure it looks reasonable. The phishing possibilities are obvious.
                  >
                  > Not to me, I'm afraid. Can you explain how it works? A phisher might be
                  > able to fool a casual reader, but how does he fool the compiler into
                  > executing the wrong code?

                  The compiler wouldn't execute the wrong code; it would execute the code
                  that the phisher intended it to execute. That might be different from
                  what it looked like to the reviewer.

                  • Terry Reedy

                    #39
                    Re: PEP 3131: Supporting Non-ASCII Identifiers


                    "Alan Franzoni" <alan.franzoni_invalid@geemail.invalid> wrote in message
                    news:1u9kz7l2gcz1p.1e0kxqeikfp97.dlg@40tude.net...
                    | On Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" wrote:
                    | Also, there should be a way to convert source files in any 'exotic'
                    | encoding to a pseudo-intelligible encoding for any reader, a kind of
                    | translittering (is that a proper english word) system out-of-the-box, not
                    | requiring any other tool that's not included in the Python distro. This
                    | will let people retain their usual working environments even though
                    | they're dealing with source code with identifiers in a really different
                    | charset.
                    =============================

                    When I proposed that PEP3131 include transliteration support, Martin
                    rejected the idea.

                    tjr



                    • Neil Hodgson

                      #40
                      Re: PEP 3131: Supporting Non-ASCII Identifiers

                      Paul Rubin wrote:
                      >> Plenty of programming languages already support unicode identifiers,
                      >
                      > Could you name a few? Thanks.

                      C#, Java, Ecmascript, Visual Basic.

                      Neil

                      • Steven D'Aprano

                        #41
                        Re: PEP 3131: Supporting Non-ASCII Identifiers

                        On Mon, 14 May 2007 09:42:13 +1000, Aldo Cortesi wrote:
                        > I don't
                        > want to be in a situation where I need to mechanically "clean"
                        > code (say, from a submitted patch) with a tool because I can't
                        > reliably verify it by eye.

                        But you can't reliably verify by eye. That's orders of magnitude more
                        difficult than debugging by eye, and we all know that you can't reliably
                        debug anything but the most trivial programs by eye.

                        If you're relying on cursory visual inspection to recognize harmful code,
                        you're already vulnerable to trojans.


                        > We should learn from the plethora of
                        > Unicode-related security problems that have cropped up in the last
                        > few years.

                        Of course we should. And one of the things we should learn is when and
                        how Unicode is a risk, and not imagine that Unicode is some sort of
                        mystical contamination that creates security problems just by being used.


                        > - Non-ASCII identifiers would be a barrier to code exchange. If I know
                        >   Python I should be able to easily read any piece of code written
                        >   in it, regardless of the linguistic origin of the author. If PEP
                        >   3131 is accepted, this will no longer be the case.

                        But it isn't the case now, so that's no different. Code exchange
                        regardless of human language is a nice principle, but it doesn't work in
                        practice. How do you use "any piece of code ... regardless of the
                        linguistic origin of the author" when you don't know what the functions
                        and classes and arguments _mean_?

                        Here's a tiny doc string from one of the functions in the standard
                        library, translated (more or less) to Portuguese. If you can't read
                        Portuguese at least well enough to get by, how could you possibly use
                        this function? What would you use it for? What does it do? What arguments
                        does it take?

                        def dirsorteinsercao(a, x, baixo=0, elevado=None):
                            """da o artigo x insercao na lista a, e mantem-na a
                            supondo classificado e classificado. Se x estiver ja em a,
                            introduza-o a direita do x direita mais. Os args opcionais
                            baixos (defeito 0) e elevados (len(a) do defeito) limitam
                            a fatia de a a ser procurarado.
                            """
                            # not a non-ASCII character in sight (unless I missed one...)

                        [Apologies to Portuguese speakers for the dog's breakfast I'm sure Babelfish
                        and I made of the translation.]

                        The particular function I chose is probably small enough and obvious
                        enough that you could work out what it does just by following the
                        algorithm. You might even be able to guess what it is, because Portuguese
                        is similar enough to other Latin languages that most people can guess
                        what some of the words might mean (elevados could be height, maybe?). Now
                        multiply this difficulty by a thousand for a non-trivial module with
                        multiple classes and dozens of methods and functions. And you might not
                        even know what language it is in.

                        No, code exchange regardless of natural language is a nice principle, but
                        it doesn't exist except in very special circumstances.


                        > A Python
                        > project that uses Urdu identifiers throughout is just as useless
                        > to me, from a code-exchange point of view, as one written in Perl.

                        That's because you can't read it, not because it uses Unicode. It could
                        be written entirely in ASCII, and still be unreadable and impossible to
                        understand.


                        > - Unicode is harder to work with than ASCII in ways that are more
                        >   important in code than in human-language text. Human eyes don't care
                        >   if two visually indistinguishable characters are used interchangeably.
                        >   Interpreters do. There is no doubt that people will accidentally
                        >   introduce mistakes into their code because of this.

                        That's no different from typos in ASCII. There's no doubt that we'll give
                        the same answer we've always given for this problem: unit tests, pylint
                        and pychecker.



                        --
                        Steven.

                        • Steven D'Aprano

                          #42
                          Re: PEP 3131: Supporting Non-ASCII Identifiers

                          On Sun, 13 May 2007 17:59:23 -0700, Paul Rubin wrote:
                          > Steven D'Aprano <steve@REMOVE.THIS.cybersource.com.au> writes:
                          >>> It certainly does apply, if you're maintaining a program and someone
                          >>> submits a patch. In that case you neither click nor type the
                          >>> character. You'd normally just make sure the patched program passes
                          >>> the existing test suite, and examine the patch on the screen to make
                          >>> sure it looks reasonable. The phishing possibilities are obvious.
                          >>
                          >> Not to me, I'm afraid. Can you explain how it works? A phisher might be
                          >> able to fool a casual reader, but how does he fool the compiler into
                          >> executing the wrong code?
                          >
                          > The compiler wouldn't execute the wrong code; it would execute the code
                          > that the phisher intended it to execute. That might be different from
                          > what it looked like to the reviewer.

                          How? Just repeating in more words your original claim doesn't explain a
                          thing.

                          It seems to me that your argument is, only slightly exaggerated, akin to
                          the following:

                          "Unicode identifiers are bad because phishers will no longer need to
                          write call_evil_func() but can write call_ƎvĬľ_func() instead."

                          Maybe I'm naive, but I don't see how giving phishers the ability to
                          insert a call to ƒunction() in some module is any more dangerous than
                          them inserting a call to function() instead.

                          If I'm mistaken, please explain why I'm mistaken, not just repeat your
                          claim in different words.


                          --
                          Steven.

                          • Paul Rubin

                            #43
                            Re: PEP 3131: Supporting Non-ASCII Identifiers

                            Neil Hodgson <nyamatongwe+thunder@gmail.com> writes:
                            >>> Plenty of programming languages already support unicode identifiers,
                            >> Could you name a few? Thanks.
                            > C#, Java, Ecmascript, Visual Basic.

                            Java (and C#?) have mandatory declarations so homoglyphic identifiers aren't
                            nearly as bad a problem. Ecmascript is a horrible bug-prone language and
                            we want Python to move away from resembling it, not towards it. VB: well,
                            same as Ecmascript, I guess.

                            • Paul Rubin

                              #44
                              Re: PEP 3131: Supporting Non-ASCII Identifiers

                              Steven D'Aprano <steven@REMOVE.THIS.cybersource.com.au> writes:
                              > If I'm mistaken, please explain why I'm mistaken, not just repeat your
                              > claim in different words.

                              if user_entered_password != stored_password_from_database:
                                  password_is_correct = False
                              ...
                              if password_is_correct:
                                  log_user_in()

                              Does "password_is_correct" refer to the same variable in both places?
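
A sketch of the scenario this example hints at, assuming Python 3 (where non-ASCII identifiers are accepted) and a hypothetical look-alike in which the first "c" of "correct" is U+0441 CYRILLIC SMALL LETTER ES. The two names render almost identically but are bound separately, so what looks like a rebinding leaves the original variable untouched.

```python
import unicodedata

LATIN = "password_is_correct"
# Hypothetical look-alike: the first "c" of "correct" is Cyrillic U+0441.
LOOKALIKE = "password_is_\u0441orrect"

source = (
    f"{LATIN} = True\n"
    f"{LOOKALIKE} = False\n"  # looks like a rebinding, but creates a second name
)

namespace = {}
exec(source, namespace)


def non_ascii_names(identifier):
    """Return the Unicode names of the non-ASCII characters in an identifier."""
    return [unicodedata.name(ch) for ch in identifier if ord(ch) > 127]


print(LATIN == LOOKALIKE)          # False
print(namespace[LATIN])            # True: the "rebinding" did not take
print(namespace[LOOKALIKE])        # False
print(non_ascii_names(LOOKALIKE))  # ['CYRILLIC SMALL LETTER ES']
```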

                              • Steven D'Aprano

                                #45
                                Re: PEP 3131: Supporting Non-ASCII Identifiers

                                On Sun, 13 May 2007 20:12:23 -0700, Paul Rubin wrote:
                                > Steven D'Aprano <steven@REMOVE.THIS.cybersource.com.au> writes:
                                >> If I'm mistaken, please explain why I'm mistaken, not just repeat your
                                >> claim in different words.
                                >
                                > if user_entered_password != stored_password_from_database:
                                >     password_is_correct = False
                                > ...
                                > if password_is_correct:
                                >     log_user_in()
                                >
                                > Does "password_is_correct" refer to the same variable in both places?

                                No way of telling without a detailed code inspection. Who knows what
                                happens in the ... ? If a black hat has access to the code, he could
                                insert anything he liked in there, ASCII or non-ASCII.

                                How is this a problem with non-ASCII identifiers? password_is_correct is
                                all ASCII. How can you justify saying that non-ASCII identifiers
                                introduce a security hole that already exists in all-ASCII Python?


                                --
                                Steven.
