PEP 263 status check

  • John Roth

    PEP 263 status check

    PEP 263 is marked finished in the PEP index, however
    I haven't seen the specified Phase 2 in the list of changes
    for 2.4 which is when I expected it.

    Did phase 2 get cancelled, or is it just not in the
    changes document?

    John Roth


  • Martin v. Löwis

    #2
    Re: PEP 263 status check

    John Roth wrote:
    > PEP 263 is marked finished in the PEP index, however
    > I haven't seen the specified Phase 2 in the list of changes
    > for 2.4 which is when I expected it.
    >
    > Did phase 2 get cancelled, or is it just not in the
    > changes document?

    Neither, nor. Although this hasn't been discussed widely,
    I personally believe it is too early yet to make lack of
    encoding declarations a syntax error. I'd like to
    reconsider the issue with Python 2.5.

    OTOH, not many people have commented either way: would you
    be outraged if a script that has given you a warning about
    missing encoding declarations for some time fails with a
    strict SyntaxError in 2.4? Has everybody already corrected
    their scripts?

    Regards,
    Martin


    • Fernando Perez

      #3
      Re: PEP 263 status check

      "Martin v. Löwis" wrote:
      [color=blue]
      > I personally believe it is too early yet to make lack of
      > encoding declarations a syntax error. I'd like to[/color]

      +1

      Making this an all-out failure is pretty brutal, IMHO. You could change the
      warning message to be more stringent about it soon becoming an error. But if
      someone upgrades to 2.4 because of other benefits, and some large third-party
      code they rely on (and which is otherwise perfectly fine with 2.4) fails
      catastrophically because of these warnings becoming errors, I suspect they
      will be very unhappy.

      I see the need to nudge people in the right direction, but there's no need to
      do it with a 10,000-volt stick :)

      Best,

      f


      • John Roth

        #4
        Re: PEP 263 status check


        "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
        news:4112AB53.6 010701@v.loewis .de...[color=blue]
        > John Roth wrote:[color=green]
        > > PEP 263 is marked finished in the PEP index, however
        > > I haven't seen the specified Phase 2 in the list of changes
        > > for 2.4 which is when I expected it.
        > >
        > > Did phase 2 get cancelled, or is it just not in the
        > > changes document?[/color]
        >
        > Neither, nor. Although this hasn't been discussed widely,
        > I personally believe it is too early yet to make lack of
        > encoding declarations a syntax error. I'd like to
        > reconsider the issue with Python 2.5.
        >
        > OTOH, not many people have commented either way: would you
        > be outraged if a script that has given you a warning about
        > missing encoding declarations for some time fails with a
        > strict SyntaxError in 2.4? Has everybody already corrected
        > their scripts?[/color]

        Well, I don't particularly have that problem because I don't
        have a huge number of scripts, and for the ones I do have it would be
        relatively simple to do a scan and update - or just run them
        with the unit tests and see if they break!

        In fact, I think that a scan and update program in the tools
        directory might be a very good idea - just walk through a
        Python library, scan and update everything that doesn't
        have a declaration.
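A minimal sketch of the scan-and-update tool John describes, written here in modern Python for illustration (the function names and the choice of utf-8 as the default declaration are invented, not from the thread). It walks a tree and prepends a PEP 263 coding declaration to any .py file whose first two lines lack one:

```python
import os
import re

# Pattern PEP 263 gives for recognising a coding declaration
# in the first or second line of a source file.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def has_declaration(path):
    """True if the file's first two lines carry a coding declaration."""
    with open(path, "rb") as f:
        first_two = [f.readline() for _ in range(2)]
    return any(CODING_RE.search(line.decode("ascii", "replace"))
               for line in first_two)

def add_declaration(path, encoding="utf-8"):
    """Prepend a coding declaration if missing; return True if changed."""
    if has_declaration(path):
        return False
    with open(path, "rb") as f:
        body = f.read()
    header = ("# -*- coding: %s -*-\n" % encoding).encode("ascii")
    with open(path, "wb") as f:
        f.write(header + body)
    return True

def scan_and_update(root):
    """Walk a directory tree and update every .py file found."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                add_declaration(os.path.join(dirpath, name))
```

A real tool would also need to guess the existing encoding of each file (as Vincent's reply below points out, that is the hard part); this sketch only handles the mechanical insertion.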

        The issue has popped in and out of my awareness a few
        times; what brought it up this time was Hallvard's thread.

        My specific question there was how the code handles the
        combination of UTF-8 as the encoding and a non-ascii
        character in an 8-bit string literal. Is this an error? The
        PEP does not say so. If it isn't, what encoding will
        it use to translate from unicode back to an 8-bit
        encoding?

        Another project for people who care about this
        subject: tools. Of the half zillion editors, pretty printers
        and so forth out there, how many check for the encoding
        line and do the right thing with it? Which ones need to
        be updated?

        John Roth



        • Vincent Wehren

          #5
          Re: PEP 263 status check


          "John Roth" <newsgroups@jhr othjr.com> schrieb im Newsbeitrag
          news:10h5hgvpaf m8a64@news.supe rnews.com...
          |
          | "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
          | news:4112AB53.6 010701@v.loewis .de...
          | > John Roth wrote:
          | > > PEP 263 is marked finished in the PEP index, however
          | > > I haven't seen the specified Phase 2 in the list of changes
          | > > for 2.4 which is when I expected it.
          | > >
          | > > Did phase 2 get cancelled, or is it just not in the
          | > > changes document?
          | >
          | > Neither, nor. Although this hasn't been discussed widely,
          | > I personally believe it is too early yet to make lack of
          | > encoding declarations a syntax error. I'd like to
          | > reconsider the issue with Python 2.5.
          | >
          | > OTOH, not many people have commented either way: would you
          | > be outraged if a script that has given you a warning about
          | > missing encoding declarations for some time fails with a
          | > strict SyntaxError in 2.4? Has everybody already corrected
          | > their scripts?
          |
          | Well, I don't particularly have that problem because I don't
          | have a huge number of scripts and for the ones I do it would be
          | relatively simple to do a scan and update - or just run them
          | with the unit tests and see if they break!

          Here's another thought: the company I work for uses (embedded) Python as
          the scripting language for their report writer (among other things). Users
          can add little scripts to their document templates, which are used for
          printing database data. This means there are literally hundreds of little
          Python scripts embedded within the document templates, which themselves
          are stored in whatever database is used as the backend. In such a case,
          "scan and update" when upgrading gets a little more complicated ;)

          |
          | In fact, I think that a scan and update program in the tools
          | directory might be a very good idea - just walk through a
          | Python library, scan and update everything that doesn't
          | have a declaration.
          |
          | The issue has popped in and out of my awareness a few
          | times, what brought it up this time was Hallvard's thread.
          |
          | My specific question there was how the code handles the
          | combination of UTF-8 as the encoding and a non-ascii
          | character in an 8-bit string literal. Is this an error? The
          | PEP does not say so. If it isn't, what encoding will
          | it use to translate from unicode back to an 8-bit
          | encoding?

          Isn't this covered by:

          "Embedding of differently encoded data is not allowed and will
          result in a decoding error during compilation of the Python
          source code."

          --
          Vincent Wehren


          |
          | Another project for people who care about this
          | subject: tools. Of the half zillion editors, pretty printers
          | and so forth out there, how many check for the encoding
          | line and do the right thing with it? Which ones need to
          | be updated?
          |
          | John Roth



          • Martin v. Löwis

            #6
            Re: PEP 263 status check

            John Roth wrote:
            > In fact, I think that a scan and update program in the tools
            > directory might be a very good idea - just walk through a
            > Python library, scan and update everything that doesn't
            > have a declaration.

            Good idea. I'll see whether I can write something before 2.4,
            but contributions are definitely welcome.
            > My specific question there was how the code handles the
            > combination of UTF-8 as the encoding and a non-ascii
            > character in an 8-bit string literal. Is this an error? The
            > PEP does not say so. If it isn't, what encoding will
            > it use to translate from unicode back to an 8-bit
            > encoding?

            UTF-8 is not in any way special wrt. the PEP. Notice that
            UTF-8 is *not* Unicode - it is an encoding of Unicode, just
            like ISO-8859-1 or us-ascii (although the latter two only
            encode a subset of Unicode). Yes, the byte string literals
            will be converted back to an "8-bit encoding", but the 8-bit
            encoding will be UTF-8! IOW, byte string literals are always
            converted back to the source encoding before execution.
            > Another project for people who care about this
            > subject: tools. Of the half zillion editors, pretty printers
            > and so forth out there, how many check for the encoding
            > line and do the right thing with it? Which ones need to
            > be updated?

            I know IDLE, Eric, Komodo, and Emacs do support encoding
            declarations. I know PythonWin doesn't, although I once
            had written patches to add such support. A number of editors
            (like notepad.exe) do the right thing only if the document
            has the UTF-8 signature.

            Of course, editors don't necessarily need to actively
            support the feature as long as the declared encoding is
            the one they use, anyway. They won't display source in
            other encodings correctly, but some of them don't have
            the notion of multiple encodings, anyway.

            Regards,
            Martin


            • Martin v. Löwis

              #7
              Re: PEP 263 status check

              Vincent Wehren wrote:
              > Here's another thought: the company I work for uses (embedded) Python as
              > the scripting language for their report writer (among other things).
              > Users can add little scripts to their document templates, which are used
              > for printing database data. This means there are literally hundreds of
              > little Python scripts embedded within the document templates, which
              > themselves are stored in whatever database is used as the backend. In
              > such a case, "scan and update" when upgrading gets a little more
              > complicated ;)

              At the same time, it might get also more simple. If the user interface
              to edit these scripts is encoding-aware, and/or the database to store
              them in is encoding-aware, an automated tool would not need to guess
              what the encoding in the source is.
              > | My specific question there was how the code handles the
              > | combination of UTF-8 as the encoding and a non-ascii
              > | character in an 8-bit string literal. Is this an error? The
              > | PEP does not say so. If it isn't, what encoding will
              > | it use to translate from unicode back to an 8-bit
              > | encoding?
              >
              > Isn't this covered by:
              >
              > "Embedding of differently encoded data is not allowed and will
              > result in a decoding error during compilation of the Python
              > source code."

              No. It is perfectly legal to have non-ASCII data in 8-bit string
              literals (aka byte string literals, aka <type 'str'>). Of course,
              these non-ASCII data also need to be encoded in UTF-8. Whether UTF-8
              is an 8-bit encoding, I don't know - it is more precisely described
              as a multibyte encoding. At execution time, the byte string literals
              then have the source encoding again, i.e. UTF-8.

              Regards,
              Martin
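Martin's distinction can be checked directly in modern Python, where the bytes/str split is explicit (a comparison sketch, not code from the thread): valid UTF-8 bytes decode fine even though they are non-ASCII, while the same character encoded differently (here, Latin-1) triggers exactly the decoding error the quoted PEP sentence describes.

```python
# "ö" encoded as UTF-8 is the two-byte sequence C3 B6; decoding it works,
# so non-ASCII data in a nominally UTF-8 byte string is perfectly legal.
assert b"\xc3\xb6".decode("utf-8") == "\u00f6"

# The same character encoded as Latin-1 is the single byte F6. Embedded
# in data declared as UTF-8, it is rejected with a decoding error.
rejected = False
try:
    b"\xf6".decode("utf-8")
except UnicodeDecodeError:
    rejected = True
assert rejected
```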


              • John Roth

                #8
                Re: PEP 263 status check


                "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                news:41133C76.8 040302@v.loewis .de...[color=blue]
                > John Roth wrote:[/color]
                [color=blue][color=green]
                > > My specific question there was how the code handles the
                > > combination of UTF-8 as the encoding and a non-ascii
                > > character in an 8-bit string literal. Is this an error? The
                > > PEP does not say so. If it isn't, what encoding will
                > > it use to translate from unicode back to an 8-bit
                > > encoding?[/color]
                >
                > UTF-8 is not in any way special wrt. the PEP.[/color]

                That's what I thought.
                > Notice that
                > UTF-8 is *not* Unicode - it is an encoding of Unicode, just
                > like ISO-8859-1 or us-ascii (although the latter two only
                > encode a subset of Unicode).

                I disagree, but I think this is a definitional issue.
                > Yes, the byte string literals
                > will be converted back to an "8-bit encoding", but the 8-bit
                > encoding will be UTF-8! IOW, byte string literals are always
                > converted back to the source encoding before execution.

                If I understand you correctly, if I put, say, a mixture of
                Cyrillic, Hebrew, Arabic and Greek into a byte string
                literal, at run time that character string will contain the
                proper unicode at each character position?

                Or are you trying to say that the character string will
                contain the UTF-8 encoding of these characters; that
                is, if I do a subscript, I will get one character of the
                multi-byte encoding?

                The point of this is that I don't think that either behavior
                is what one would expect. It's also an open invitation
                for someone to make an unchecked mistake! I think this
                may be Hallvard's underlying issue in the other thread.

                John Roth



                • Michael Hudson

                  #9
                  Re: PEP 263 status check

                  "John Roth" <newsgroups@jhr othjr.com> writes:
                  [color=blue]
                  > If I understand you correctly, if I put, say, a mixture of
                  > Cyrillic, Hebrew, Arabic and Greek into a byte string
                  > literal, at run time that character string will contain the
                  > proper unicode at each character position?[/color]

                  Uh, I seem to be making a habit of labelling things you suggest
                  impossible :-)
                  > Or are you trying to say that the character string will
                  > contain the UTF-8 encoding of these characters; that
                  > is, if I do a subscript, I will get one character of the
                  > multi-byte encoding?

                  This is what happens, indeed.

                  Cheers,
                  mwh

                  --
                  This is the fixed point problem again; since all some implementors
                  do is implement the compiler and libraries for compiler writing, the
                  language becomes good at writing compilers and not much else!
                  -- Brian Rogoff, comp.lang.functional


                  • Martin v. Löwis

                    #10
                    Re: PEP 263 status check

                    John Roth wrote:
                    > Or are you trying to say that the character string will
                    > contain the UTF-8 encoding of these characters; that
                    > is, if I do a subscript, I will get one character of the
                    > multi-byte encoding?

                    Michael is almost right: this is what happens. Except that
                    what you get, I wouldn't call a "character". Instead, it
                    is always a single byte - even if that byte is part of
                    a multi-byte character.

                    Unfortunately, the things that constitute a byte string
                    are also called characters in the literature.

                    To be more specific: in a UTF-8 source file, doing

                    print "ö" == "\xc3\xb6"
                    print "ö"[0] == "\xc3"

                    would print "True" twice, and len("ö") is 2.
                    OTOH, len(u"ö") == 1.
                    > The point of this is that I don't think that either behavior
                    > is what one would expect. It's also an open invitation
                    > for someone to make an unchecked mistake! I think this
                    > may be Hallvard's underlying issue in the other thread.

                    What would you expect instead? Do you think your expectation
                    is implementable?

                    Regards,
                    Martin


                    • John Roth

                      #11
                      Re: PEP 263 status check


                      "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                      news:41137799.7 0808@v.loewis.d e...[color=blue]
                      > John Roth wrote:[color=green]
                      > > Or are you trying to say that the character string will
                      > > contain the UTF-8 encoding of these characters; that
                      > > is, if I do a subscript, I will get one character of the
                      > > multi-byte encoding?[/color]
                      >
                      > Michael is almost right: this is what happens. Except that
                      > what you get, I wouldn't call a "character" . Instead, it
                      > is always a single byte - even if that byte is part of
                      > a multi-byte character.
                      >
                      > Unfortunately, the things that constitute a byte string
                      > are also called characters in the literature.
                      >
                      > To be more specific: In an UTF-8 source file, doing
                      >
                      > print "ö" == "\xc3\xb6"
                      > print "ö"[0] == "\xc3"
                      >
                      > would print two times "True", and len("ö") is 2.
                      > OTOH, len(u"ö")==1.
                      >[color=green]
                      > > The point of this is that I don't think that either behavior
                      > > is what one would expect. It's also an open invitation
                      > > for someone to make an unchecked mistake! I think this
                      > > may be Hallvard's underlying issue in the other thread.[/color]
                      >
                      > What would you expect instead? Do you think your expectation
                      > is implementable?[/color]

                      I'd expect that the compiler would reject anything that
                      wasn't either in the 7-bit ascii subset, or else defined
                      with a hex escape.

                      The reason for this is simply that wanting to put characters
                      outside of the 7-bit ascii subset into a byte character string
                      isn't portable. It just pushes the need for a character set
                      (encoding) declaration down one level of recursion.
                      There's already a way of doing this: use a unicode string,
                      so it's not like we need two ways of doing it.
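The check John is proposing is easy to prototype; here is a sketch in modern Python using the standard tokenize module (the function name is invented for illustration). It flags plain string literals that contain raw non-ASCII characters; literals written with hex escapes such as \xc3 pass, since the escape itself is 7-bit ASCII in the source:

```python
import io
import tokenize

def non_ascii_string_literals(source):
    """Return (line_number, literal) pairs for string literals in
    `source` that contain raw characters outside 7-bit ASCII."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # tok.string is the literal as written, quotes and all, so a
        # hex-escaped character never triggers the ord() > 127 check.
        if tok.type == tokenize.STRING and any(ord(c) > 127 for c in tok.string):
            hits.append((tok.start[0], tok.string))
    return hits
```

For example, on a two-line source where only the first literal holds a raw "ö", the sketch reports just line 1; turning such a report into a SyntaxError (or only warning, as the thread debates) would be a policy choice layered on top.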

                      Now I will grant you that there is a need for representing
                      the utf-8 encoding in a character string, but do we need
                      to support that in the source text when it's much more
                      likely that it's a programming mistake?

                      As far as implementation goes, it should have been done
                      at the beginning. Prior to 2.3, there was no way of writing
                      a program using the utf-8 encoding (I think - I might be
                      wrong on that) so there were no programs out there that
                      put non-ascii subset characters into byte strings.

                      Today it's one more forward migration hurdle to jump over.
                      I don't think it's a particularly large one, but I don't have
                      any real world data at hand.

                      John Roth



                      • Martin v. Löwis

                        #12
                        Re: PEP 263 status check

                        John Roth wrote:
                        >> What would you expect instead? Do you think your expectation
                        >> is implementable?
                        >
                        > I'd expect that the compiler would reject anything that
                        > wasn't either in the 7-bit ascii subset, or else defined
                        > with a hex escape.

                        Are we still talking about PEP 263 here? If the entire source
                        code has to be in the 7-bit ASCII subset, then what is the point
                        of encoding declarations?

                        If you were suggesting that anything except Unicode literals
                        should be in the 7-bit ASCII subset, then this is still
                        unacceptable: Comments should also be allowed to contain non-ASCII
                        characters, don't you agree?

                        If you think that only Unicode literals and comments should be
                        allowed to contain non-ASCII, I disagree: At some point, I'd
                        like to propose support for non-ASCII in identifiers. This would
                        allow people to make identifiers that represent words from their
                        native language, which is helpful for people who don't speak
                        English well.

                        If you think that only Unicode literals, comments, and identifiers
                        should be allowed non-ASCII: perhaps, but this is out of scope
                        of PEP 263, which *only* introduces encoding declarations,
                        and explains what they mean for all current constructs.
                        > The reason for this is simply that wanting to put characters
                        > outside of the 7-bit ascii subset into a byte character string
                        > isn't portable.

                        Define "is portable". With an encoding declaration, I can move
                        the source code from one machine to another, open it in an editor,
                        and have it display correctly. This was not portable without
                        encoding declarations (likewise for comments); with PEP 263,
                        such source code became portable.

                        Also, the run-time behaviour is fully predictable (which it
                        even was without PEP 263): At run-time, the string will have
                        exactly the same bytes that it does in the .py file. This
                        is fully portable.
                        > It just pushes the need for a character set
                        > (encoding) declaration down one level of recursion.

                        It depends on the program. E.g. if the program was to generate
                        HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
                        then the resulting program is absolutely, 100% portable.

                        For messages directly output to a terminal, portability
                        might not be important.
                        > There's already a way of doing this: use a unicode string,
                        > so it's not like we need two ways of doing it.

                        Using a Unicode string might not work, because a library might
                        crash when confronted with a Unicode string. You are proposing
                        to break existing applications for no good reason, and with
                        no simple fix.
                        > Now I will grant you that there is a need for representing
                        > the utf-8 encoding in a character string, but do we need
                        > to support that in the source text when it's much more
                        > likely that it's a programming mistake?

                        But it isn't! People do put KOI-8R into source code, into
                        string literals, and it works perfectly fine for them. There
                        is no reason to arbitrarily break their code.
                        > As far as implementation goes, it should have been done
                        > at the beginning. Prior to 2.3, there was no way of writing
                        > a program using the utf-8 encoding (I think - I might be
                        > wrong on that)

                        You are wrong. You were always able to put UTF-8 into byte
                        strings, even at a time where UTF-8 was not yet an RFC
                        (say, in Python 1.1).
                        > so there were no programs out there that
                        > put non-ascii subset characters into byte strings.

                        That is just not true. If it were true, there would be no
                        need to introduce a grace period in the PEP. However,
                        *many* scripts in the world use non-ASCII in string literals;
                        it was always possible (although the documentation was
                        wishy-washy on what it actually meant).
                        > Today it's one more forward migration hurdle to jump over.
                        > I don't think it's a particularly large one, but I don't have
                        > any real world data at hand.

                        Trust me: the outcry for banning non-ASCII from string literals
                        would be, by far, louder than the one for a proposed syntax
                        on decorators. That would break many production systems, CGI
                        scripts would suddenly stop working, GUIs would crash, etc.

                        Regards,
                        Martin


                        • Hallvard B Furuseth

                          #13
                          Re: PEP 263 status check

                          An addition to Martin's reply:

                          John Roth wrote:
                          > "Martin v. Löwis" <martin@v.loewis.de> wrote in message
                          > news:41137799.70808@v.loewis.de...
                          >> John Roth wrote:
                          >>
                          >> To be more specific: In an UTF-8 source file, doing
                          >>
                          >> print "ö" == "\xc3\xb6"
                          >> print "ö"[0] == "\xc3"
                          >>
                          >> would print two times "True", and len("ö") is 2.
                          >> OTOH, len(u"ö")==1.
                          >>
                          >>> The point of this is that I don't think that either behavior
                          >>> is what one would expect. It's also an open invitation
                          >>> for someone to make an unchecked mistake! I think this
                          >>> may be Hallvard's underlying issue in the other thread.
                          >>
                          >> What would you expect instead? Do you think your expectation
                          >> is implementable?
                          >
                          > I'd expect that the compiler would reject anything that
                          > wasn't either in the 7-bit ascii subset, or else defined
                          > with a hex escape.

                          Then you should also expect a lot of people to move to
                          another language - one whose designers live in the real
                          world instead of your Utopian Unicode world.
                          > The reason for this is simply that wanting to put characters
                          > outside of the 7-bit ascii subset into a byte character string
                          > isn't portable.

                          Unicode isn't portable either.
                          Try to output a Unicode string to a device (e.g. your terminal)
                          whose character encoding is not known to the program.
                          The program will fail, or just output the raw utf-8 string or
                          something, or just guess some character set the program's author
                          is fond of.

                          For that matter, tell me why my programs should spend any time
                          on converting between UTF-8 and the character set the
                          application actually works with just because you are fond of
                          Unicode. That might be a lot more time than just the time spent
                          parsing the program. Or tell me why I should spell quite normal
                          text strings with hex escaping or something, if that's what you
                          mean.

                          And tell me why I shouldn't be allowed to work easily with raw
                          UTF-8 strings, if I do use coding:utf-8.

                          --
                          Hallvard

                          Comment

                          • John Roth

                            #14
                            Re: PEP 263 status check


                            "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                            news:4113D8DF.8 080106@v.loewis .de...[color=blue]
                            > John Roth wrote:[color=green][color=darkred]
                            > >>What would you expect instead? Do you think your expectation
                            > >>is implementable?[/color]
                            > >
                            > >
                            > > I'd expect that the compiler would reject anything that
                            > > wasn't either in the 7-bit ascii subset, or else defined
                            > > with a hex escape.[/color]
                            >
                            > Are we still talking about PEP 263 here? If the entire source
                            > code has to be in the 7-bit ASCII subset, then what is the point
                            > of encoding declarations?[/color]

                            Martin, I think you misinterpreted what I said at the
                            beginning. I'm only, and I need to repeat this, ONLY
                            dealing with the case where the encoding declaration
                            specifically says that the script is in UTF-8. No other
                            case.

                            I'm going to deal with your response point by point,
                            but I don't think most of this is really relevant. Your
                            response only makes sense if you missed the point that
                            I was talking about scripts that explicitly declared their
                            encoding to be UTF-8, and no other scripts in no
                            other circumstances.

                            I didn't mean the entire source was in 7-bit ascii. What
                            I meant was that if the encoding was utf-8 then the source
                            for 8-bit string literals must be in 7-bit ascii. Nothing more.
                            [color=blue]
                            > If you were suggesting that anything except Unicode literals
                            > should be in the 7-bit ASCII subset, then this is still
                            > unacceptable: Comments should also be allowed to contain non-ASCII
                            > characters, don't you agree?[/color]

                            Of course.
                            [color=blue]
                            > If you think that only Unicode literals and comments should be
                            > allowed to contain non-ASCII, I disagree: At some point, I'd
                            > like to propose support for non-ASCII in identifiers. This would
                            > allow people to make identifiers that represent words from their
                            > native language, which is helpful for people who don't speak
                            > English well.[/color]

                             Likewise. I never thought otherwise; in fact I'd like to expand
                             the available operators to include the set operators as well as
                            the logical operators and the "real" division operator (the one
                            you learned in grade school - the dash with a dot above and
                            below the line.)
                            [color=blue]
                             > If you think that only Unicode literals, comments, and identifiers
                            > should be allowed non-ASCII: perhaps, but this is out of scope
                            > of PEP 263, which *only* introduces encoding declarations,
                            > and explains what they mean for all current constructs.
                            >[color=green]
                            > > The reason for this is simply that wanting to put characters
                            > > outside of the 7-bit ascii subset into a byte character string
                            > > isn't portable.[/color]
                            >
                            > Define "is portable". With an encoding declaration, I can move
                            > the source code from one machine to another, open it in an editor,
                            > and have it display correctly. This was not portable without
                            > encoding declarations (likewise for comments); with PEP 263,
                            > such source code became portable.[/color]
                            [color=blue]
                            > Also, the run-time behaviour is fully predictable (which it
                            > even was without PEP 263): At run-time, the string will have
                            > exactly the same bytes that it does in the .py file. This
                            > is fully portable.[/color]

                            It's predictable, but as far as I'm concerned, that's
                             not only useless behavior, it's counterproductive
                            behavior. I find it difficult to imagine any case
                            where the benefit of having normal character
                            literals accidentally contain utf-8 multi-byte
                            characters outweighs the pain of having it happen
                            accidentally, and then figuring out why your program
                             is giving you weird behavior.

                            I would grant that there are cases where you
                            might want this behavior. I am pretty sure they
                            are in the distinct minority.
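
The failure mode being described can be sketched in Python 3 (with `bytes` standing in for 2.x byte strings, and a hypothetical literal typed in a UTF-8 source file): one accidental non-ASCII character makes byte length diverge from character length, and byte-wise slicing can cut a character in half.

```python
# Hypothetical literal from a UTF-8 source file: 5 characters, 6 bytes.
s = "naïve"
b = s.encode("utf-8")
print(len(s))   # 5 characters
print(len(b))   # 6 bytes -- "ï" became the two bytes 0xC3 0xAF

# Byte-wise slicing can split the multi-byte character in half:
try:
    b[:3].decode("utf-8")       # b"na\xc3" ends mid-character
except UnicodeDecodeError:
    print("sliced a character in half")
```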

                            [color=blue][color=green]
                            > > It just pushes the need for a character set
                            > > (encoding) declaration down one level of recursion.[/color]
                            >
                            > It depends on the program. E.g. if the program was to generate
                            > HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
                            > then the resulting program is absolutely, 100% portable.[/color]

                            It's portable, but that's not the normal case. See above.
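
Martin's HTML case can be sketched as follows; a hypothetical Python 3 fragment, assuming Latin-1-representable text. Because the output declares its own charset and the text is encoded explicitly on the way out, the result is portable regardless of the source file's encoding.

```python
# The page declares its own charset, and the text is encoded explicitly
# at the boundary, so no byte-level assumption leaks through.
title = "Überschrift"   # hypothetical Latin-1-representable text
html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1"></head>'
        "<body><h1>%s</h1></body></html>" % title)
payload = html.encode("iso-8859-1")
assert b"\xdcberschrift" in payload   # Ü is the single byte 0xDC in Latin-1
```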
                            [color=blue]
                            > For messages directly output to a terminal, portability
                            > might not be important.[/color]

                             Portability is less of an issue for me than the likelihood
                            of making a mistake in coding a literal and then having
                            to debug unexpected behavior when one byte no longer
                            equals one character.

                            [color=blue][color=green]
                            > > There's already a way of doing this: use a unicode string,
                            > > so it's not like we need two ways of doing it.[/color]
                            >
                            > Using a Unicode string might not work, because a library might
                            > crash when confronted with a Unicode string. You are proposing
                            > to break existing applications for no good reason, and with
                            > no simple fix.[/color]

                            There's no reason why you have to have a utf-8
                            encoding declaration. If you want your source to
                            be utf-8, you need to accept the consequences.
                            I fully expect Python to support the usual mixture
                            of encodings until 3.0 at least. At that point, everything
                            gets to be rewritten anyway.
                            [color=blue][color=green]
                            > > Now I will grant you that there is a need for representing
                            > > the utf-8 encoding in a character string, but do we need
                            > > to support that in the source text when it's much more
                            > > likely that it's a programming mistake?[/color]
                            >
                            > But it isn't! People do put KOI-8R into source code, into
                            > string literals, and it works perfectly fine for them. There
                            > is no reason to arbitrarily break their code.
                            >[color=green]
                            > > As far as implementation goes, it should have been done
                            > > at the beginning. Prior to 2.3, there was no way of writing
                            > > a program using the utf-8 encoding (I think - I might be
                            > > wrong on that)[/color]
                            >
                            > You are wrong. You were always able to put UTF-8 into byte
                            > strings, even at a time where UTF-8 was not yet an RFC
                            > (say, in Python 1.1).[/color]

                            Were you able to write your entire program in UTF-8?
                            I think not.
                            [color=blue]
                            >[color=green]
                            > > so there were no programs out there that
                            > > put non-ascii subset characters into byte strings.[/color]
                            >
                            > That is just not true. If it were true, there would be no
                            > need to introduce a grace period in the PEP. However,
                            > *many* scripts in the world use non-ASCII in string literals;
                            > it was always possible (although the documentation was
                            > wishy-washy on what it actually meant).
                            >[color=green]
                            > > Today it's one more forward migration hurdle to jump over.
                            > > I don't think it's a particularly large one, but I don't have
                            > > any real world data at hand.[/color]
                            >
                            > Trust me: the outcry for banning non-ASCII from string literals
                            > would be, by far, louder than the one for a proposed syntax
                            > on decorators. That would break many production systems, CGI
                            > scripts would suddenly stop working, GUIs would crash, etc.[/color]

                            ..


                            [color=blue]
                            >
                            > Regards,
                            > Martin[/color]


                            Comment

                            • John Roth

                              #15
                              Re: PEP 263 status check


                              "Hallvard B Furuseth" <h.b.furuseth@u sit.uio.no> wrote in message
                              news:HBF.200408 06qchc@bombur.u io.no...[color=blue]
                              > An addition to Martin's reply:
                              >
                              > John Roth wrote:[color=green]
                              > >"Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                              > >news:41137799. 70808@v.loewis. de...[color=darkred]
                              > >>John Roth wrote:
                              > >>
                              > >> To be more specific: In an UTF-8 source file, doing
                              > >>
                              > >> print "ö" == "\xc3\xb6"
                              > >> print "ö"[0] == "\xc3"
                              > >>
                              > >> would print two times "True", and len("ö") is 2.
                              > >> OTOH, len(u"ö")==1.
                              > >>
                              > >>> The point of this is that I don't think that either behavior
                              > >>> is what one would expect. It's also an open invitation
                              > >>> for someone to make an unchecked mistake! I think this
                              > >>> may be Hallvard's underlying issue in the other thread.
                              > >>
                              > >> What would you expect instead? Do you think your expectation
                              > >> is implementable?[/color]
                              > >
                              > > I'd expect that the compiler would reject anything that
                              > > wasn't either in the 7-bit ascii subset, or else defined
                              > > with a hex escape.[/color]
                              >
                              > Then you should also expect a lot of people to move to
                              > another language - one whose designers live in the real
                              > world instead of your Utopian Unicode world.[/color]

                               Rudeness objection to your characterization.

                              Please see my response to Martin - I'm talking only,
                              and I repeat ONLY, about scripts that explicitly
                              say they are encoded in utf-8. Nothing else. I've
                              been in this business for close to 40 years, and I'm
                              quite well aware of backwards compatibility issues
                              and issues with breaking existing code.

                              Programmers in general have a very strong, and
                              let me repeat that, VERY STRONG assumption
                              that an 8-bit string contains one byte per character
                              unless there is a good reason to believe otherwise.
                              This assumption is built into various places, including
                              all of the string methods.

                              The current design allows accidental inclusion of
                               a character that is not in the 7-bit ascii subset ***IN
                              A PROGRAM THAT HAS A UTF-8 CHARACTER
                              ENCODING DECLARATION*** to break that
                              assumption without any kind of notice. That in
                              turn will break all of the assumptions that the string
                              module and string methods are based on. That in
                              turn is likely to break lots of existing modules and
                              cause a lot of debugging time that could be avoided
                              by proper design.
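
The broken assumption surfaces directly in the string methods mentioned above. A Python 3 sketch (again with `bytes` standing in for 2.x byte strings): the same search returns different answers depending on whether it counts bytes or characters.

```python
text = "Herr Müller"                # "ü" is two bytes in UTF-8
data = text.encode("utf-8")

print(text.find("ller"))            # 7: a character offset
print(data.find(b"ller"))           # 8: a byte offset -- they diverge
```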

                              One of Python's strong points is that it's difficult
                              to get into trouble unless you deliberately try (then
                              it's quite easy, fortunately.)

                              I'm not worried about this causing people to
                              abandon Python. I'm more worried about the
                              current situation causing enough grief that people
                               will decide that utf-8 source code encoding isn't
                              worth it.
                              [color=blue]
                              > And tell me why I shouldn't be allowed to work easily with raw
                              > UTF-8 strings, if I do use coding:utf-8.[/color]

                              First, there's nothing that's stopping you. All that
                              my proposal will do is require you to do a one
                              time conversion of any strings you put in the
                              program as literals. It doesn't affect any other
                              strings in any other way at any other time.

                              I'll withdraw my objection if you can seriously
                              assure me that working with raw utf-8 in
                              8-bit character string literals is what most programmers
                              are going to do most of the time.

                              I'm not going to accept the very common need
                              of converting unicode strings to 8-bit strings so
                              they can be written to disk or stored in a data base
                              or whatnot (or reversing the conversion for reading.)
                              That has nothing to do with the current issue - it's
                              something that everyone who deals with unicode
                              needs to do, regardless of the encoding of the
                              source program.
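
The boundary conversion set aside above looks roughly like this in Python 3 (the file path here is hypothetical): encode exactly once on the way out, decode exactly once on the way in, and internal code never handles raw bytes.

```python
import os
import tempfile

text = "Grüße"                                       # Unicode text inside the program
path = os.path.join(tempfile.mkdtemp(), "out.txt")   # hypothetical output file

# Encode exactly once, at the boundary, when writing...
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# ...and decode exactly once when reading back.
with open(path, "rb") as f:
    assert f.read().decode("utf-8") == text
```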

                              John Roth[color=blue]
                              >
                              > --
                              > Hallvard[/color]


                              Comment
