PEP 3131: Supporting Non-ASCII Identifiers

**Anders J. Munch** · May 13 '07, 11:05 PM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Michael Torrie wrote:

> So given that people can already transliterate their language for use as
> identifiers, I think avoiding non-ASCII character sets is a good idea.

Transliteration makes people choose bad variable names; I see it all the time
with Danish programmers. Say the most descriptive name for a process is
"kør forlæns" (run forward). But "koer_forlaens" is ugly, so instead the
programmer writes "run_fremad", combining an English word with a slightly less
appropriate Danish word. Sprinkle in some English spelling errors and
badly-chosen English words, and you have the sorry state of the art that is today.

- Anders

**Steven D'Aprano** · May 13 '07, 11:35 PM

Re: PEP 3131: Supporting Non-ASCII Identifiers

On Sun, 13 May 2007 15:35:15 -0700, Alex Martelli wrote:

> Homoglyphic characters _introduced by accident_ should not be discounted
> as a risk

....

> But when something similar
> happens to somebody using a sufficiently fancy text editor to input
> source in a programming language allowing arbitrary Unicode letters in
> identifiers, the damage (the sheer waste of developer time) can be much
> more substantial -- there will be two separate identifiers around, both
> looking exactly like each other but actually distinct, and unbounded
> amount of programmer time can be spent chasing after this extremely
> elusive and tricky bug -- why doesn't a rebinding appear to "take", etc.
> With some copy-and-paste during development and attempts at debugging,
> several copies of each distinct version of the identifier can be spread
> around the code, further hampering attempts at understanding.

How is that different from misreading "disk_burnt = True" as "disk_bumt =
True"? In the right (or perhaps wrong) font, like the ever-popular Arial,
the two can be visually indistinguishable. Or "call" versus "cal1"?

Surely the correct solution is something like pylint or pychecker? Or
banning the use of lower-case L and digit 1 in identifiers. I'm good with
both.

--
Steven.

**Anders J. Munch** · May 13 '07, 11:35 PM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Alex Martelli wrote:

> Homoglyphic characters _introduced by accident_ should not be discounted
> as a risk, as, it seems to me, was done early in this thread after the
> issue had been mentioned. In the past, it has happened to me to
> erroneously introduce such homoglyphs in a document I was preparing with
> a word processor, by a slight error in the use of the system-provided
> way for inserting characters not present on the keyboard; I found out
> when later I went looking for the name I _thought_ I had input (but I
> was looking for it spelled with the "right" glyph, not the one I had
> actually used which looked just the same) and just could not find it.

There's any number of things to be done about that.
1. # -*- encoding: ascii -*-
   (I'd like to see you sneak those homoglyphic characters past *that*.)
2. pychecker and pylint - I'm sure you realise what they could do for you.
3. Use a font that doesn't have those characters or deliberately makes them
   distinct (that could help web browsing safety too).

I'm not discounting the problem, I just don't believe it's a big one. Can we
choose a codepoint subset that doesn't have these dupes?
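
A minimal sketch of how a check in the spirit of options 1 and 2 might be automated, assuming Python 3.7+ and only the standard library. The helper name `non_ascii_identifiers` is made up for illustration; it is not part of pychecker or pylint.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line, column, name) for every NAME token that is not pure ASCII."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits


# The second assignment uses CYRILLIC SMALL LETTER O (U+043E) instead of Latin 'o'.
sample = "total = 1\nt\u043etal = 2\nprint(total)\n"

for line, col, name in non_ascii_identifiers(sample):
    escaped = name.encode("unicode_escape").decode()
    print(f"line {line}, col {col}: {name!r} (escaped: {escaped})")
```

This only flags that a name leaves ASCII; deciding which non-ASCII names are legitimate would still be up to the project.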

- Anders

**Paul Rubin** · May 13 '07, 11:35 PM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Alexander Schmolck <a.schmolck@gmail.com> writes:

> Plenty of programming languages already support unicode identifiers,

Could you name a few? Thanks.

**Steven D'Aprano** · May 13 '07, 11:45 PM

Re: PEP 3131: Supporting Non-ASCII Identifiers

On Sun, 13 May 2007 10:52:12 -0700, Paul Rubin wrote:

> "Martin v. Löwis" <martin@v.loewis.de> writes:

>> This is a commonly-raised objection, but I don't understand why people
>> see it as a problem. The phishing issue surely won't apply, as you
>> normally don't "click" on identifiers, but rather type them. In a
>> phishing case, it is normally difficult to type the fake character
>> (because the phishing relies on you mistaking the character for another
>> one, so you would type the wrong identifier).

> It certainly does apply, if you're maintaining a program and someone
> submits a patch. In that case you neither click nor type the
> character. You'd normally just make sure the patched program passes
> the existing test suite, and examine the patch on the screen to make
> sure it looks reasonable. The phishing possibilities are obvious.

Not to me, I'm afraid. Can you explain how it works? A phisher might be
able to fool a casual reader, but how does he fool the compiler into
executing the wrong code?

As for project maintainers, surely a patch using some unexpected Unicode
locale would fail the "looks reasonable" test? That could even be
automated -- if the patch uses an unexpected "#-*- coding: blah" line, or
includes characters outside of a pre-defined range, ring alarm bells.
("Why is somebody patching my Turkish module in Korean?")
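
A rough sketch of how that automation might look, assuming the patch arrives as unified-diff text and the project expects ASCII-only additions. The function name `audit_patch` and its parameters are hypothetical; the regular expression is the one PEP 263 gives for coding declarations.

```python
import re
import unicodedata

# Pattern for source-encoding declarations, as given in PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def audit_patch(patch_text, expected_codings=("ascii", "utf-8"), allowed=str.isascii):
    """Yield (line_number, message) for added lines that deserve a closer look."""
    for number, line in enumerate(patch_text.splitlines(), start=1):
        # Only inspect lines the patch adds; skip the "+++ filename" header.
        # (An added line whose own text starts with "++" would be skipped too.)
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        match = CODING_RE.match(added)
        if match and match.group(1).lower() not in expected_codings:
            yield number, f"unexpected coding declaration: {match.group(1)}"
        for ch in added:
            if not allowed(ch):
                yield number, f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


# Hypothetical patch: a Korean string and coding line land in a "Turkish" module.
patch = """\
--- a/turkish.py
+++ b/turkish.py
@@ -1,2 +1,3 @@
+# -*- coding: euc-kr -*-
 def greet():
-    return "merhaba"
+    return "\uc548\ub155"
"""

findings = list(audit_patch(patch))
for number, message in findings:
    print(f"patch line {number}: {message}")
```

Passing a different `allowed` predicate (say, one that accepts a particular script) would let a project that does use non-ASCII identifiers define its own "pre-defined range".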

--
Steven

**Marc 'BlackJack' Rintsch** · May 13 '07, 11:55 PM

Re: PEP 3131: Supporting Non-ASCII Identifiers

In <mailman.7627.1179086416.32031.python-list@python.org>, Michael Torrie
wrote:

> I think non-ASCII characters makes the problem far far worse. While I
> may not understand what the function is by it's name in your example,
> allowing non-ASCII characters makes it works by forcing all would-be
> code readers have to have all kinds of necessary fonts just to view the
> source code. Things like reporting exceptions too. At least in your
> example I know the exception occurred in zieheDreiAbVon. But if that
> identifier is some UTF-8 string, how do I go about finding it in my text
> editor, or even reporting the message to the developers? I don't happen
> to have that particular keymap installed in my linux system, so I can't
> even type the letters!

You find it in the sources by the line number from the traceback and the
letters can be copy'n'pasted if you don't know how to input them with your
keymap or keyboard layout.

Ciao,
Marc 'BlackJack' Rintsch

**Aldo Cortesi** · May 14 '07, 12:05 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Thus spake "Martin v. Löwis" (martin@v.loewis.de):

> - should non-ASCII identifiers be supported? why?

No! I believe that:

- The security implications have not been sufficiently explored. I don't
  want to be in a situation where I need to mechanically "clean" code (say,
  from a submitted patch) with a tool because I can't reliably verify it by
  eye. We should learn from the plethora of Unicode-related security
  problems that have cropped up in the last few years.
- Non-ASCII identifiers would be a barrier to code exchange. If I know
  Python I should be able to easily read any piece of code written in it,
  regardless of the linguistic origin of the author. If PEP 3131 is
  accepted, this will no longer be the case. A Python project that uses
  Urdu identifiers throughout is just as useless to me, from a
  code-exchange point of view, as one written in Perl.
- Unicode is harder to work with than ASCII in ways that are more important
  in code than in human-language text. Human eyes don't care if two
  visually indistinguishable characters are used interchangeably.
  Interpreters do. There is no doubt that people will accidentally
  introduce mistakes into their code because of this.

> - would you use them if it was possible to do so? in what cases?

No.

Regards,

Aldo

--
Aldo Cortesi
aldo@nullcube.com
http://www.nullcube.com
Mob: 0419 492 863

**Paul Rubin** · May 14 '07, 01:05 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Steven D'Aprano <steve@REMOVE.THIS.cybersource.com.au> writes:

>> It certainly does apply, if you're maintaining a program and someone
>> submits a patch. In that case you neither click nor type the
>> character. You'd normally just make sure the patched program passes
>> the existing test suite, and examine the patch on the screen to make
>> sure it looks reasonable. The phishing possibilities are obvious.

> Not to me, I'm afraid. Can you explain how it works? A phisher might be
> able to fool a casual reader, but how does he fool the compiler into
> executing the wrong code?

The compiler wouldn't execute the wrong code; it would execute the code
that the phisher intended it to execute. That might be different from
what it looked like to the reviewer.

**Terry Reedy** · May 14 '07, 02:15 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

"Alan Franzoni" <alan.franzoni_invalid@geemail.invalid> wrote in message
news:1u9kz7l2gcz1p.1e0kxqeikfp97.dlg@40tude.net...
On Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" wrote:
| Also, there should be a way to convert source files in any 'exotic'
| encoding to a pseudo-intellegibile encoding for any reader, a kind of
| translittering (is that a proper english word) system out-of-the-box, not
| requiring any other tool that's not included in the Python distro. This
| will let people to retain their usual working environments even though
| they're dealing with source code with identifiers in a really different
| charset.
=============================

When I proposed that PEP3131 include transliteration support, Martin
rejected the idea.

tjr

**Neil Hodgson** · May 14 '07, 02:45 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Paul Rubin wrote:

>> Plenty of programming languages already support unicode identifiers,

> Could you name a few? Thanks.

C#, Java, Ecmascript, Visual Basic.

Neil

**Steven D'Aprano** · May 14 '07, 02:45 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

On Mon, 14 May 2007 09:42:13 +1000, Aldo Cortesi wrote:

> I don't want to be in a situation where I need to mechanically "clean"
> code (say, from a submitted patch) with a tool because I can't
> reliably verify it by eye.

But you can't reliably verify by eye. That's orders of magnitude more
difficult than debugging by eye, and we all know that you can't reliably
debug anything but the most trivial programs by eye.

If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.

> We should learn from the plethora of Unicode-related security problems
> that have cropped up in the last few years.

Of course we should. And one of the things we should learn is when and
how Unicode is a risk, and not imagine that Unicode is some sort of
mystical contamination that creates security problems just by being used.

> - Non-ASCII identifiers would be a barrier to code exchange. If I know
>   Python I should be able to easily read any piece of code written
>   in it, regardless of the linguistic origin of the author. If PEP
>   3131 is accepted, this will no longer be the case.

But it isn't the case now, so that's no different. Code exchange
regardless of human language is a nice principle, but it doesn't work in
practice. How do you use "any piece of code ... regardless of the
linguistic origin of the author" when you don't know what the functions
and classes and arguments _mean_?

Here's a tiny doc string from one of the functions in the standard
library, translated (more or less) to Portuguese. If you can't read
Portuguese at least well enough to get by, how could you possibly use
this function? What would you use it for? What does it do? What arguments
does it take?

def dirsorteinsercao(a, x, baixo=0, elevado=None):
    """da o artigo x insercao na lista a, e mantem-na a
    supondo classificado e classificado. Se x estiver ja em a,
    introduza-o a direita do x direita mais. Os args opcionais
    baixos (defeito 0) e elevados (len(a) do defeito) limitam
    a fatia de a a ser procurarado.
    """
    # not a non-ASCII character in sight (unless I missed one...)

[Apologies to Portuguese speakers for the dog's breakfast I'm sure Babelfish
and I made of the translation.]

The particular function I chose is probably small enough and obvious
enough that you could work out what it does just by following the
algorithm. You might even be able to guess what it is, because Portuguese
is similar enough to other Latin languages that most people can guess
what some of the words might mean (elevados could be height, maybe?). Now
multiply this difficulty by a thousand for a non-trivial module with
multiple classes and dozens of methods and functions. And you might not
even know what language it is in.

No, code exchange regardless of natural language is a nice principle, but
it doesn't exist except in very special circumstances.

> A Python project that uses Urdu identifiers throughout is just as useless
> to me, from a code-exchange point of view, as one written in Perl.

That's because you can't read it, not because it uses Unicode. It could
be written entirely in ASCII, and still be unreadable and impossible to
understand.

> - Unicode is harder to work with than ASCII in ways that are more important
>   in code than in human-language text. Human eyes don't care if two
>   visually indistinguishable characters are used interchangeably.
>   Interpreters do. There is no doubt that people will accidentally
>   introduce mistakes into their code because of this.

That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.

--
Steven.

**Steven D'Aprano** · May 14 '07, 02:55 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

On Sun, 13 May 2007 17:59:23 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve@REMOVE.THIS.cybersource.com.au> writes:

>>> It certainly does apply, if you're maintaining a program and someone
>>> submits a patch. In that case you neither click nor type the
>>> character. You'd normally just make sure the patched program passes
>>> the existing test suite, and examine the patch on the screen to make
>>> sure it looks reasonable. The phishing possibilities are obvious.

>> Not to me, I'm afraid. Can you explain how it works? A phisher might be
>> able to fool a casual reader, but how does he fool the compiler into
>> executing the wrong code?

> The compiler wouldn't execute the wrong code; it would execute the code
> that the phisher intended it to execute. That might be different from
> what it looked like to the reviewer.

How? Just repeating in more words your original claim doesn't explain a
thing.

It seems to me that your argument is, only slightly exaggerated, akin to
the following:

"Unicode identifiers are bad because phishers will no longer need to
write call_evil_func() but can write call_ƎvĬľ_func() instead."

Maybe I'm naive, but I don't see how giving phishers the ability to
insert a call to ƒunction() in some module is any more dangerous than
them inserting a call to function() instead.

If I'm mistaken, please explain why I'm mistaken, not just repeat your
claim in different words.

--
Steven.

**Paul Rubin** · May 14 '07, 03:15 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Neil Hodgson <nyamatongwe+thunder@gmail.com> writes:

>>> Plenty of programming languages already support unicode identifiers,

>> Could you name a few? Thanks.

> C#, Java, Ecmascript, Visual Basic.

Java (and C#?) have mandatory declarations so homoglyphic identifiers aren't
nearly as bad a problem. Ecmascript is a horrible bug-prone language and
we want Python to move away from resembling it, not towards it. VB: well,
same as Ecmascript, I guess.

**Paul Rubin** · May 14 '07, 03:15 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

Steven D'Aprano <steven@REMOVE.THIS.cybersource.com.au> writes:

> If I'm mistaken, please explain why I'm mistaken, not just repeat your
> claim in different words.

if user_entered_password != stored_password_from_database:
    password_is_correct = False
...
if password_is_correct:
    log_user_in()

Does "password_is_correct" refer to the same variable in both places?
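
To make the question concrete, here is a small sketch assuming Python 3 and assuming, purely for illustration, that one occurrence was typed with CYRILLIC SMALL LETTER ES (U+0441) in place of the Latin "c". The names `latin`, `lookalike`, and `odd_characters` are invented for this example.

```python
import unicodedata

latin = "password_is_correct"
lookalike = "password_is_\u0441orrect"  # U+0441 replaces the Latin 'c'

# The two strings usually render identically, yet they are different names.
print(latin == lookalike)
print(latin.isidentifier(), lookalike.isidentifier())

# NFKC normalization, which PEP 3131 specifies for identifiers,
# does not fold a Cyrillic letter into its Latin lookalike.
print(unicodedata.normalize("NFKC", lookalike) == latin)

# Binding both names creates two independent variables.
namespace = {}
exec(f"{latin} = True\n{lookalike} = False", namespace)
print(namespace[latin], namespace[lookalike])


def odd_characters(name):
    """List (index, Unicode name) for each non-ASCII character in an identifier."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(name) if not ch.isascii()]


print(odd_characters(lookalike))
```

A reviewer reading the rendered patch would see one variable; the interpreter sees two.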

**Steven D'Aprano** · May 14 '07, 03:45 AM

Re: PEP 3131: Supporting Non-ASCII Identifiers

On Sun, 13 May 2007 20:12:23 -0700, Paul Rubin wrote:

> Steven D'Aprano <steven@REMOVE.THIS.cybersource.com.au> writes:

>> If I'm mistaken, please explain why I'm mistaken, not just repeat your
>> claim in different words.

> if user_entered_password != stored_password_from_database:
>     password_is_correct = False
> ...
> if password_is_correct:
>     log_user_in()
>
> Does "password_is_correct" refer to the same variable in both places?

No way of telling without a detailed code inspection. Who knows what
happens in the ... ? If a black hat has access to the code, he could
insert anything he liked in there, ASCII or non-ASCII.

How is this a problem with non-ASCII identifiers? password_is_correct is
all ASCII. How can you justify saying that non-ASCII identifiers
introduce a security hole that already exists in all-ASCII Python?

--
Steven.
