PEP 263 status check

  • John Roth

    PEP 263 status check

    PEP 263 is marked finished in the PEP index, however
    I haven't seen the specified Phase 2 in the list of changes
    for 2.4 which is when I expected it.

    Did phase 2 get cancelled, or is it just not in the
    changes document?

    John Roth


  • Martin v. Löwis

    #2
    Re: PEP 263 status check

    John Roth wrote:
    > PEP 263 is marked finished in the PEP index, however
    > I haven't seen the specified Phase 2 in the list of changes
    > for 2.4 which is when I expected it.
    >
    > Did phase 2 get cancelled, or is it just not in the
    > changes document?

    Neither, nor. Although this hasn't been discussed widely,
    I personally believe it is too early yet to make lack of
    encoding declarations a syntax error. I'd like to
    reconsider the issue with Python 2.5.

    OTOH, not many people have commented either way: would you
    be outraged if a script that has given you a warning about
    missing encoding declarations for some time fails with a
    strict SyntaxError in 2.4? Has everybody already corrected
    their scripts?

    Regards,
    Martin


    • Fernando Perez

      #3
      Re: PEP 263 status check

      "Martin v. Löwis" wrote:
      [color=blue]
      > I personally believe it is too early yet to make lack of
      > encoding declarations a syntax error. I'd like to[/color]

      +1

      Making this an all-out failure is pretty brutal, IMHO. You could change the
      warning message to be more stringent about it soon becoming an error. But if
      someone upgrades to 2.4 because of other benefits, and some large third-party
      code they rely on (and which is otherwise perfectly fine with 2.4) fails
      catastrophically because of these warnings becoming errors, I suspect they
      will be very unhappy.

      I see the need to nudge people in the right direction, but there's no need to
      do it with a 10,000-volt stick :)

      Best,

      f


      • John Roth

        #4
        Re: PEP 263 status check


        "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
        news:4112AB53.6 010701@v.loewis .de...[color=blue]
        > John Roth wrote:[color=green]
        > > PEP 263 is marked finished in the PEP index, however
        > > I haven't seen the specified Phase 2 in the list of changes
        > > for 2.4 which is when I expected it.
        > >
        > > Did phase 2 get cancelled, or is it just not in the
        > > changes document?[/color]
        >
        > Neither, nor. Although this hasn't been discussed widely,
        > I personally believe it is too early yet to make lack of
        > encoding declarations a syntax error. I'd like to
        > reconsider the issue with Python 2.5.
        >
        > OTOH, not many people have commented either way: would you
        > be outraged if a script that has given you a warning about
        > missing encoding declarations for some time fails with a
        > strict SyntaxError in 2.4? Has everybody already corrected
        > their scripts?[/color]

        Well, I don't particularly have that problem because I don't
        have a huge number of scripts, and for the ones I do have it would be
        relatively simple to do a scan and update - or just run them
        with the unit tests and see if they break!

        In fact, I think that a scan and update program in the tools
        directory might be a very good idea - just walk through a
        Python library, scan and update everything that doesn't
        have a declaration.
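A minimal sketch of the scan-and-update tool John describes, written here in modern Python for illustration (the function names and the choice of utf-8 as the default declaration are invented, not from the thread). It walks a tree and prepends a PEP 263 coding declaration to any .py file whose first two lines lack one:

```python
import os
import re

# Pattern PEP 263 gives for recognising a coding declaration
# in the first or second line of a source file.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def has_declaration(path):
    """True if the file's first two lines carry a coding declaration."""
    with open(path, "rb") as f:
        first_two = [f.readline() for _ in range(2)]
    return any(CODING_RE.search(line.decode("ascii", "replace"))
               for line in first_two)

def add_declaration(path, encoding="utf-8"):
    """Prepend a coding declaration if missing; return True if changed."""
    if has_declaration(path):
        return False
    with open(path, "rb") as f:
        body = f.read()
    header = ("# -*- coding: %s -*-\n" % encoding).encode("ascii")
    with open(path, "wb") as f:
        f.write(header + body)
    return True

def scan_and_update(root):
    """Walk a directory tree and update every .py file found."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                add_declaration(os.path.join(dirpath, name))
```

A real tool would also need to guess the existing encoding of each file (as Vincent's reply below points out, that is the hard part); this sketch only handles the mechanical insertion.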

        The issue has popped in and out of my awareness a few
        times; what brought it up this time was Hallvard's thread.

        My specific question there was how the code handles the
        combination of UTF-8 as the encoding and a non-ascii
        character in an 8-bit string literal. Is this an error? The
        PEP does not say so. If it isn't, what encoding will
        it use to translate from unicode back to an 8-bit
        encoding?

        Another project for people who care about this
        subject: tools. Of the half zillion editors, pretty printers
        and so forth out there, how many check for the encoding
        line and do the right thing with it? Which ones need to
        be updated?

        John Roth



        • Vincent Wehren

          #5
          Re: PEP 263 status check


          "John Roth" <newsgroups@jhr othjr.com> schrieb im Newsbeitrag
          news:10h5hgvpaf m8a64@news.supe rnews.com...
          |
          | "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
          | news:4112AB53.6 010701@v.loewis .de...
          | > John Roth wrote:
          | > > PEP 263 is marked finished in the PEP index, however
          | > > I haven't seen the specified Phase 2 in the list of changes
          | > > for 2.4 which is when I expected it.
          | > >
          | > > Did phase 2 get cancelled, or is it just not in the
          | > > changes document?
          | >
          | > Neither, nor. Although this hasn't been discussed widely,
          | > I personally believe it is too early yet to make lack of
          | > encoding declarations a syntax error. I'd like to
          | > reconsider the issue with Python 2.5.
          | >
          | > OTOH, not many people have commented either way: would you
          | > be outraged if a script that has given you a warning about
          | > missing encoding declarations for some time fails with a
          | > strict SyntaxError in 2.4? Has everybody already corrected
          | > their scripts?
          |
          | Well, I don't particularly have that problem because I don't
          | have a huge number of scripts and for the ones I do it would be
          | relatively simple to do a scan and update - or just run them
          | with the unit tests and see if they break!

          Here's another thought: the company I work for uses (embedded) Python as
          the scripting language for their report writer (among other things). Users
          can add little scripts to their document templates, which are used for
          printing database data. This means there are literally hundreds of little
          Python scripts embedded within the document templates, which themselves
          are stored in whatever database is used as the backend. In such a case,
          "scan and update" when upgrading gets a little more complicated ;)

          |
          | In fact, I think that a scan and update program in the tools
          | directory might be a very good idea - just walk through a
          | Python library, scan and update everything that doesn't
          | have a declaration.
          |
          | The issue has popped in and out of my awareness a few
          | times, what brought it up this time was Hallvard's thread.
          |
          | My specific question there was how the code handles the
          | combination of UTF-8 as the encoding and a non-ascii
          | character in an 8-bit string literal. Is this an error? The
          | PEP does not say so. If it isn't, what encoding will
          | it use to translate from unicode back to an 8-bit
          | encoding?

          Isn't this covered by:

          "Embedding of differently encoded data is not allowed and will
          result in a decoding error during compilation of the Python
          source code."

          --
          Vincent Wehren


          |
          | Another project for people who care about this
          | subject: tools. Of the half zillion editors, pretty printers
          | and so forth out there, how many check for the encoding
          | line and do the right thing with it? Which ones need to
          | be updated?
          |
          | John Roth



          • Martin v. Löwis

            #6
            Re: PEP 263 status check

            John Roth wrote:
            > In fact, I think that a scan and update program in the tools
            > directory might be a very good idea - just walk through a
            > Python library, scan and update everything that doesn't
            > have a declaration.

            Good idea. I'll see whether I can write something before 2.4,
            but contributions are definitely welcome.
            > My specific question there was how the code handles the
            > combination of UTF-8 as the encoding and a non-ascii
            > character in an 8-bit string literal. Is this an error? The
            > PEP does not say so. If it isn't, what encoding will
            > it use to translate from unicode back to an 8-bit
            > encoding?

            UTF-8 is not in any way special wrt. the PEP. Notice that
            UTF-8 is *not* Unicode - it is an encoding of Unicode, just
            like ISO-8859-1 or us-ascii (although the latter two only
            encode a subset of Unicode). Yes, the byte string literals
            will be converted back to an "8-bit encoding", but the 8-bit
            encoding will be UTF-8! IOW, byte string literals are always
            converted back to the source encoding before execution.
            > Another project for people who care about this
            > subject: tools. Of the half zillion editors, pretty printers
            > and so forth out there, how many check for the encoding
            > line and do the right thing with it? Which ones need to
            > be updated?

            I know IDLE, Eric, Komodo, and Emacs do support encoding
            declarations. I know PythonWin doesn't, although I once
            had written patches to add such support. A number of editors
            (like notepad.exe) do the right thing only if the document
            has the UTF-8 signature.

            Of course, editors don't necessarily need to actively
            support the feature as long as the declared encoding is
            the one they use, anyway. They won't display source in
            other encodings correctly, but some of them don't have
            the notion of multiple encodings, anyway.

            Regards,
            Martin


            • Martin v. Löwis

              #7
              Re: PEP 263 status check

              Vincent Wehren wrote:
              > Here's another thought: the company I work for uses (embedded) Python as
              > the scripting language for their report writer (among other things).
              > Users can add little scripts to their document templates, which are used
              > for printing database data. This means there are literally hundreds of
              > little Python scripts embedded within the document templates, which
              > themselves are stored in whatever database is used as the backend. In
              > such a case, "scan and update" when upgrading gets a little more
              > complicated ;)

              At the same time, it might get also more simple. If the user interface
              to edit these scripts is encoding-aware, and/or the database to store
              them in is encoding-aware, an automated tool would not need to guess
              what the encoding in the source is.
              > | My specific question there was how the code handles the
              > | combination of UTF-8 as the encoding and a non-ascii
              > | character in an 8-bit string literal. Is this an error? The
              > | PEP does not say so. If it isn't, what encoding will
              > | it use to translate from unicode back to an 8-bit
              > | encoding?
              >
              > Isn't this covered by:
              >
              > "Embedding of differently encoded data is not allowed and will
              > result in a decoding error during compilation of the Python
              > source code."

              No. It is perfectly legal to have non-ASCII data in 8-bit string
              literals (aka byte string literals, aka <type 'str'>). Of course,
              these non-ASCII data also need to be encoded in UTF-8. Whether UTF-8
              is an 8-bit encoding, I don't know - it is more precisely described
              as a multibyte encoding. At execution time, the byte string literals
              then have the source encoding again, i.e. UTF-8.

              Regards,
              Martin
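Martin's distinction can be checked directly in modern Python, where the bytes/str split is explicit (a comparison sketch, not code from the thread): valid UTF-8 bytes decode fine even though they are non-ASCII, while the same character encoded differently (here, Latin-1) triggers exactly the decoding error the quoted PEP sentence describes.

```python
# "ö" encoded as UTF-8 is the two-byte sequence C3 B6; decoding it works,
# so non-ASCII data in a nominally UTF-8 byte string is perfectly legal.
assert b"\xc3\xb6".decode("utf-8") == "\u00f6"

# The same character encoded as Latin-1 is the single byte F6. Embedded
# in data declared as UTF-8, it is rejected with a decoding error.
rejected = False
try:
    b"\xf6".decode("utf-8")
except UnicodeDecodeError:
    rejected = True
assert rejected
```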


              • John Roth

                #8
                Re: PEP 263 status check


                "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                news:41133C76.8 040302@v.loewis .de...[color=blue]
                > John Roth wrote:[/color]
                [color=blue][color=green]
                > > My specific question there was how the code handles the
                > > combination of UTF-8 as the encoding and a non-ascii
                > > character in an 8-bit string literal. Is this an error? The
                > > PEP does not say so. If it isn't, what encoding will
                > > it use to translate from unicode back to an 8-bit
                > > encoding?[/color]
                >
                > UTF-8 is not in any way special wrt. the PEP.[/color]

                That's what I thought.
                > Notice that
                > UTF-8 is *not* Unicode - it is an encoding of Unicode, just
                > like ISO-8859-1 or us-ascii (although the latter two only
                > encode a subset of Unicode).

                I disagree, but I think this is a definitional issue.
                > Yes, the byte string literals
                > will be converted back to an "8-bit encoding", but the 8-bit
                > encoding will be UTF-8! IOW, byte string literals are always
                > converted back to the source encoding before execution.

                If I understand you correctly, if I put, say, a mixture of
                Cyrillic, Hebrew, Arabic and Greek into a byte string
                literal, at run time that character string will contain the
                proper unicode at each character position?

                Or are you trying to say that the character string will
                contain the UTF-8 encoding of these characters; that
                is, if I do a subscript, I will get one character of the
                multi-byte encoding?

                The point of this is that I don't think that either behavior
                is what one would expect. It's also an open invitation
                for someone to make an unchecked mistake! I think this
                may be Hallvard's underlying issue in the other thread.

                John Roth



                • Michael Hudson

                  #9
                  Re: PEP 263 status check

                  "John Roth" <newsgroups@jhr othjr.com> writes:
                  [color=blue]
                  > If I understand you correctly, if I put, say, a mixture of
                  > Cyrillic, Hebrew, Arabic and Greek into a byte string
                  > literal, at run time that character string will contain the
                  > proper unicode at each character position?[/color]

                  Uh, I seem to be making a habit of labelling things you suggest
                  impossible :-)
                  > Or are you trying to say that the character string will
                  > contain the UTF-8 encoding of these characters; that
                  > is, if I do a subscript, I will get one character of the
                  > multi-byte encoding?

                  This is what happens, indeed.

                  Cheers,
                  mwh

                  --
                  This is the fixed point problem again; since all some implementors
                  do is implement the compiler and libraries for compiler writing, the
                  language becomes good at writing compilers and not much else!
                  -- Brian Rogoff, comp.lang.functional


                  • Martin v. Löwis

                    #10
                    Re: PEP 263 status check

                    John Roth wrote:
                    > Or are you trying to say that the character string will
                    > contain the UTF-8 encoding of these characters; that
                    > is, if I do a subscript, I will get one character of the
                    > multi-byte encoding?

                    Michael is almost right: this is what happens. Except that
                    what you get, I wouldn't call a "character". Instead, it
                    is always a single byte - even if that byte is part of
                    a multi-byte character.

                    Unfortunately, the things that constitute a byte string
                    are also called characters in the literature.

                    To be more specific: in a UTF-8 source file, doing

                    print "ö" == "\xc3\xb6"
                    print "ö"[0] == "\xc3"

                    would print "True" twice, and len("ö") is 2.
                    OTOH, len(u"ö") == 1.
                    > The point of this is that I don't think that either behavior
                    > is what one would expect. It's also an open invitation
                    > for someone to make an unchecked mistake! I think this
                    > may be Hallvard's underlying issue in the other thread.

                    What would you expect instead? Do you think your expectation
                    is implementable?

                    Regards,
                    Martin


                    • John Roth

                      #11
                      Re: PEP 263 status check


                      "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                      news:41137799.7 0808@v.loewis.d e...[color=blue]
                      > John Roth wrote:[color=green]
                      > > Or are you trying to say that the character string will
                      > > contain the UTF-8 encoding of these characters; that
                      > > is, if I do a subscript, I will get one character of the
                      > > multi-byte encoding?[/color]
                      >
                      > Michael is almost right: this is what happens. Except that
                      > what you get, I wouldn't call a "character" . Instead, it
                      > is always a single byte - even if that byte is part of
                      > a multi-byte character.
                      >
                      > Unfortunately, the things that constitute a byte string
                      > are also called characters in the literature.
                      >
                      > To be more specific: In an UTF-8 source file, doing
                      >
                      > print "ö" == "\xc3\xb6"
                      > print "ö"[0] == "\xc3"
                      >
                      > would print two times "True", and len("ö") is 2.
                      > OTOH, len(u"ö")==1.
                      >[color=green]
                      > > The point of this is that I don't think that either behavior
                      > > is what one would expect. It's also an open invitation
                      > > for someone to make an unchecked mistake! I think this
                      > > may be Hallvard's underlying issue in the other thread.[/color]
                      >
                      > What would you expect instead? Do you think your expectation
                      > is implementable?[/color]

                      I'd expect that the compiler would reject anything that
                      wasn't either in the 7-bit ascii subset, or else defined
                      with a hex escape.

                      The reason for this is simply that wanting to put characters
                      outside of the 7-bit ascii subset into a byte character string
                      isn't portable. It just pushes the need for a character set
                      (encoding) declaration down one level of recursion.
                      There's already a way of doing this: use a unicode string,
                      so it's not like we need two ways of doing it.
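The check John is proposing is easy to prototype; here is a sketch in modern Python using the standard tokenize module (the function name is invented for illustration). It flags plain string literals that contain raw non-ASCII characters; literals written with hex escapes such as \xc3 pass, since the escape itself is 7-bit ASCII in the source:

```python
import io
import tokenize

def non_ascii_string_literals(source):
    """Return (line_number, literal) pairs for string literals in
    `source` that contain raw characters outside 7-bit ASCII."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # tok.string is the literal as written, quotes and all, so a
        # hex-escaped character never triggers the ord() > 127 check.
        if tok.type == tokenize.STRING and any(ord(c) > 127 for c in tok.string):
            hits.append((tok.start[0], tok.string))
    return hits
```

For example, on a two-line source where only the first literal holds a raw "ö", the sketch reports just line 1; turning such a report into a SyntaxError (or only warning, as the thread debates) would be a policy choice layered on top.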

                      Now I will grant you that there is a need for representing
                      the utf-8 encoding in a character string, but do we need
                      to support that in the source text when it's much more
                      likely that it's a programming mistake?

                      As far as implementation goes, it should have been done
                      at the beginning. Prior to 2.3, there was no way of writing
                      a program using the utf-8 encoding (I think - I might be
                      wrong on that) so there were no programs out there that
                      put non-ascii subset characters into byte strings.

                      Today it's one more forward migration hurdle to jump over.
                      I don't think it's a particularly large one, but I don't have
                      any real world data at hand.

                      John Roth



                      • Martin v. Löwis

                        #12
                        Re: PEP 263 status check

                        John Roth wrote:
                        >> What would you expect instead? Do you think your expectation
                        >> is implementable?
                        >
                        > I'd expect that the compiler would reject anything that
                        > wasn't either in the 7-bit ascii subset, or else defined
                        > with a hex escape.

                        Are we still talking about PEP 263 here? If the entire source
                        code has to be in the 7-bit ASCII subset, then what is the point
                        of encoding declarations?

                        If you were suggesting that anything except Unicode literals
                        should be in the 7-bit ASCII subset, then this is still
                        unacceptable: Comments should also be allowed to contain non-ASCII
                        characters, don't you agree?

                        If you think that only Unicode literals and comments should be
                        allowed to contain non-ASCII, I disagree: At some point, I'd
                        like to propose support for non-ASCII in identifiers. This would
                        allow people to make identifiers that represent words from their
                        native language, which is helpful for people who don't speak
                        English well.

                        If you think that only Unicode literals, comments, and identifiers
                        should be allowed non-ASCII: perhaps, but this is out of scope
                        of PEP 263, which *only* introduces encoding declarations,
                        and explains what they mean for all current constructs.
                        > The reason for this is simply that wanting to put characters
                        > outside of the 7-bit ascii subset into a byte character string
                        > isn't portable.

                        Define "is portable". With an encoding declaration, I can move
                        the source code from one machine to another, open it in an editor,
                        and have it display correctly. This was not portable without
                        encoding declarations (likewise for comments); with PEP 263,
                        such source code became portable.

                        Also, the run-time behaviour is fully predictable (which it
                        even was without PEP 263): At run-time, the string will have
                        exactly the same bytes that it does in the .py file. This
                        is fully portable.
                        > It just pushes the need for a character set
                        > (encoding) declaration down one level of recursion.

                        It depends on the program. E.g. if the program was to generate
                        HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
                        then the resulting program is absolutely, 100% portable.

                        For messages directly output to a terminal, portability
                        might not be important.
                        > There's already a way of doing this: use a unicode string,
                        > so it's not like we need two ways of doing it.

                        Using a Unicode string might not work, because a library might
                        crash when confronted with a Unicode string. You are proposing
                        to break existing applications for no good reason, and with
                        no simple fix.
                        > Now I will grant you that there is a need for representing
                        > the utf-8 encoding in a character string, but do we need
                        > to support that in the source text when it's much more
                        > likely that it's a programming mistake?

                        But it isn't! People do put KOI-8R into source code, into
                        string literals, and it works perfectly fine for them. There
                        is no reason to arbitrarily break their code.
                        > As far as implementation goes, it should have been done
                        > at the beginning. Prior to 2.3, there was no way of writing
                        > a program using the utf-8 encoding (I think - I might be
                        > wrong on that)

                        You are wrong. You were always able to put UTF-8 into byte
                        strings, even at a time where UTF-8 was not yet an RFC
                        (say, in Python 1.1).
                        > so there were no programs out there that
                        > put non-ascii subset characters into byte strings.

                        That is just not true. If it were true, there would be no
                        need to introduce a grace period in the PEP. However,
                        *many* scripts in the world use non-ASCII in string literals;
                        it was always possible (although the documentation was
                        wishy-washy on what it actually meant).
                        > Today it's one more forward migration hurdle to jump over.
                        > I don't think it's a particularly large one, but I don't have
                        > any real world data at hand.

                        Trust me: the outcry for banning non-ASCII from string literals
                        would be, by far, louder than the one for a proposed syntax
                        on decorators. That would break many production systems, CGI
                        scripts would suddenly stop working, GUIs would crash, etc.

                        Regards,
                        Martin


                        • Hallvard B Furuseth

                          #13
                          Re: PEP 263 status check

                          An addition to Martin's reply:

                          John Roth wrote:
                          > "Martin v. Löwis" <martin@v.loewis.de> wrote in message
                          > news:41137799.70808@v.loewis.de...
                          >> John Roth wrote:
                          >>
                          >> To be more specific: In an UTF-8 source file, doing
                          >>
                          >> print "ö" == "\xc3\xb6"
                          >> print "ö"[0] == "\xc3"
                          >>
                          >> would print two times "True", and len("ö") is 2.
                          >> OTOH, len(u"ö")==1.
                          >>
                          >>> The point of this is that I don't think that either behavior
                          >>> is what one would expect. It's also an open invitation
                          >>> for someone to make an unchecked mistake! I think this
                          >>> may be Hallvard's underlying issue in the other thread.
                          >>
                          >> What would you expect instead? Do you think your expectation
                          >> is implementable?
                          >
                          > I'd expect that the compiler would reject anything that
                          > wasn't either in the 7-bit ascii subset, or else defined
                          > with a hex escape.

                          Then you should also expect a lot of people to move to
                          another language - one whose designers live in the real
                          world instead of your Utopian Unicode world.
                          > The reason for this is simply that wanting to put characters
                          > outside of the 7-bit ascii subset into a byte character string
                          > isn't portable.

                          Unicode isn't portable either.
                          Try to output a Unicode string to a device (e.g. your terminal)
                          whose character encoding is not known to the program.
                          The program will fail, or just output the raw utf-8 string or
                          something, or just guess some character set the program's author
                          is fond of.

                          For that matter, tell me why my programs should spend any time
                          on converting between UTF-8 and the character set the
                          application actually works with just because you are fond of
                          Unicode. That might be a lot more time than just the time spent
                          parsing the program. Or tell me why I should spell quite normal
                          text strings with hex escaping or something, if that's what you
                          mean.

                          And tell me why I shouldn't be allowed to work easily with raw
                          UTF-8 strings, if I do use coding:utf-8.

                          --
                          Hallvard

                          Comment

                          • John Roth

                            #14
                            Re: PEP 263 status check


                            "Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                            news:4113D8DF.8 080106@v.loewis .de...[color=blue]
                            > John Roth wrote:[color=green][color=darkred]
                            > >>What would you expect instead? Do you think your expectation
                            > >>is implementable?[/color]
                            > >
                            > >
                            > > I'd expect that the compiler would reject anything that
                            > > wasn't either in the 7-bit ascii subset, or else defined
                            > > with a hex escape.[/color]
                            >
                            > Are we still talking about PEP 263 here? If the entire source
                            > code has to be in the 7-bit ASCII subset, then what is the point
                            > of encoding declarations?[/color]

                            Martin, I think you misinterpreted what I said at the
                            beginning. I'm only, and I need to repeat this, ONLY
                            dealing with the case where the encoding declaration
                            specifically says that the script is in UTF-8. No other
                            case.

                            I'm going to deal with your response point by point,
                            but I don't think most of this is really relevant. Your
                            response only makes sense if you missed the point that
                            I was talking about scripts that explicitly declared their
                            encoding to be UTF-8, and no other scripts in no
                            other circumstances.

                            I didn't mean the entire source was in 7-bit ascii. What
                            I meant was that if the encoding was utf-8 then the source
                            for 8-bit string literals must be in 7-bit ascii. Nothing more.
                            [color=blue]
                            > If you were suggesting that anything except Unicode literals
                            > should be in the 7-bit ASCII subset, then this is still
                            > unacceptable: Comments should also be allowed to contain non-ASCII
                            > characters, don't you agree?[/color]

                            Of course.
                            [color=blue]
                            > If you think that only Unicode literals and comments should be
                            > allowed to contain non-ASCII, I disagree: At some point, I'd
                            > like to propose support for non-ASCII in identifiers. This would
                            > allow people to make identifiers that represent words from their
                            > native language, which is helpful for people who don't speak
                            > English well.[/color]

                             Likewise. I never thought otherwise; in fact I'd like to expand
                             the available operators to include the set operators as well as
                            the logical operators and the "real" division operator (the one
                            you learned in grade school - the dash with a dot above and
                            below the line.)
                            [color=blue]
                             > If you think that only Unicode literals, comments, and identifiers
                            > should be allowed non-ASCII: perhaps, but this is out of scope
                            > of PEP 263, which *only* introduces encoding declarations,
                            > and explains what they mean for all current constructs.
                            >[color=green]
                            > > The reason for this is simply that wanting to put characters
                            > > outside of the 7-bit ascii subset into a byte character string
                            > > isn't portable.[/color]
                            >
                            > Define "is portable". With an encoding declaration, I can move
                            > the source code from one machine to another, open it in an editor,
                            > and have it display correctly. This was not portable without
                            > encoding declarations (likewise for comments); with PEP 263,
                            > such source code became portable.[/color]
                            [color=blue]
                            > Also, the run-time behaviour is fully predictable (which it
                            > even was without PEP 263): At run-time, the string will have
                            > exactly the same bytes that it does in the .py file. This
                            > is fully portable.[/color]

                            It's predictable, but as far as I'm concerned, that's
                             not only useless behavior, it's counterproductive
                            behavior. I find it difficult to imagine any case
                            where the benefit of having normal character
                            literals accidentally contain utf-8 multi-byte
                            characters outweighs the pain of having it happen
                            accidentally, and then figuring out why your program
                             is giving you weird behavior.

                            I would grant that there are cases where you
                            might want this behavior. I am pretty sure they
                            are in the distinct minority.
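
The failure mode being described can be sketched in Python 3 (with `bytes` standing in for 2.x byte strings, and a hypothetical literal typed in a UTF-8 source file): one accidental non-ASCII character makes byte length diverge from character length, and byte-wise slicing can cut a character in half.

```python
# Hypothetical literal from a UTF-8 source file: 5 characters, 6 bytes.
s = "naïve"
b = s.encode("utf-8")
print(len(s))   # 5 characters
print(len(b))   # 6 bytes -- "ï" became the two bytes 0xC3 0xAF

# Byte-wise slicing can split the multi-byte character in half:
try:
    b[:3].decode("utf-8")       # b"na\xc3" ends mid-character
except UnicodeDecodeError:
    print("sliced a character in half")
```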

                            [color=blue][color=green]
                            > > It just pushes the need for a character set
                            > > (encoding) declaration down one level of recursion.[/color]
                            >
                            > It depends on the program. E.g. if the program was to generate
                            > HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
                            > then the resulting program is absolutely, 100% portable.[/color]

                            It's portable, but that's not the normal case. See above.
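
Martin's HTML case can be sketched as follows; a hypothetical Python 3 fragment, assuming Latin-1-representable text. Because the output declares its own charset and the text is encoded explicitly on the way out, the result is portable regardless of the source file's encoding.

```python
# The page declares its own charset, and the text is encoded explicitly
# at the boundary, so no byte-level assumption leaks through.
title = "Überschrift"   # hypothetical Latin-1-representable text
html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1"></head>'
        "<body><h1>%s</h1></body></html>" % title)
payload = html.encode("iso-8859-1")
assert b"\xdcberschrift" in payload   # Ü is the single byte 0xDC in Latin-1
```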
                            [color=blue]
                            > For messages directly output to a terminal, portability
                            > might not be important.[/color]

                             Portability is less of an issue for me than the likelihood
                            of making a mistake in coding a literal and then having
                            to debug unexpected behavior when one byte no longer
                            equals one character.

                            [color=blue][color=green]
                            > > There's already a way of doing this: use a unicode string,
                            > > so it's not like we need two ways of doing it.[/color]
                            >
                            > Using a Unicode string might not work, because a library might
                            > crash when confronted with a Unicode string. You are proposing
                            > to break existing applications for no good reason, and with
                            > no simple fix.[/color]

                            There's no reason why you have to have a utf-8
                            encoding declaration. If you want your source to
                            be utf-8, you need to accept the consequences.
                            I fully expect Python to support the usual mixture
                            of encodings until 3.0 at least. At that point, everything
                            gets to be rewritten anyway.
                            [color=blue][color=green]
                            > > Now I will grant you that there is a need for representing
                            > > the utf-8 encoding in a character string, but do we need
                            > > to support that in the source text when it's much more
                            > > likely that it's a programming mistake?[/color]
                            >
                            > But it isn't! People do put KOI-8R into source code, into
                            > string literals, and it works perfectly fine for them. There
                            > is no reason to arbitrarily break their code.
                            >[color=green]
                            > > As far as implementation goes, it should have been done
                            > > at the beginning. Prior to 2.3, there was no way of writing
                            > > a program using the utf-8 encoding (I think - I might be
                            > > wrong on that)[/color]
                            >
                            > You are wrong. You were always able to put UTF-8 into byte
                            > strings, even at a time where UTF-8 was not yet an RFC
                            > (say, in Python 1.1).[/color]

                            Were you able to write your entire program in UTF-8?
                            I think not.
                            [color=blue]
                            >[color=green]
                            > > so there were no programs out there that
                            > > put non-ascii subset characters into byte strings.[/color]
                            >
                            > That is just not true. If it were true, there would be no
                            > need to introduce a grace period in the PEP. However,
                            > *many* scripts in the world use non-ASCII in string literals;
                            > it was always possible (although the documentation was
                            > wishy-washy on what it actually meant).
                            >[color=green]
                            > > Today it's one more forward migration hurdle to jump over.
                            > > I don't think it's a particularly large one, but I don't have
                            > > any real world data at hand.[/color]
                            >
                            > Trust me: the outcry for banning non-ASCII from string literals
                            > would be, by far, louder than the one for a proposed syntax
                            > on decorators. That would break many production systems, CGI
                            > scripts would suddenly stop working, GUIs would crash, etc.[/color]

                            ..


                            [color=blue]
                            >
                            > Regards,
                            > Martin[/color]


                            Comment

                            • John Roth

                              #15
                              Re: PEP 263 status check


                              "Hallvard B Furuseth" <h.b.furuseth@u sit.uio.no> wrote in message
                              news:HBF.200408 06qchc@bombur.u io.no...[color=blue]
                              > An addition to Martin's reply:
                              >
                              > John Roth wrote:[color=green]
                              > >"Martin v. Löwis" <martin@v.loewi s.de> wrote in message
                              > >news:41137799. 70808@v.loewis. de...[color=darkred]
                              > >>John Roth wrote:
                              > >>
                              > >> To be more specific: In an UTF-8 source file, doing
                              > >>
                              > >> print "ö" == "\xc3\xb6"
                              > >> print "ö"[0] == "\xc3"
                              > >>
                              > >> would print two times "True", and len("ö") is 2.
                              > >> OTOH, len(u"ö")==1.
                              > >>
                              > >>> The point of this is that I don't think that either behavior
                              > >>> is what one would expect. It's also an open invitation
                              > >>> for someone to make an unchecked mistake! I think this
                              > >>> may be Hallvard's underlying issue in the other thread.
                              > >>
                              > >> What would you expect instead? Do you think your expectation
                              > >> is implementable?[/color]
                              > >
                              > > I'd expect that the compiler would reject anything that
                              > > wasn't either in the 7-bit ascii subset, or else defined
                              > > with a hex escape.[/color]
                              >
                              > Then you should also expect a lot of people to move to
                              > another language - one whose designers live in the real
                              > world instead of your Utopian Unicode world.[/color]

                               Rudeness objection to your characterization.

                              Please see my response to Martin - I'm talking only,
                              and I repeat ONLY, about scripts that explicitly
                              say they are encoded in utf-8. Nothing else. I've
                              been in this business for close to 40 years, and I'm
                              quite well aware of backwards compatibility issues
                              and issues with breaking existing code.

                              Programmers in general have a very strong, and
                              let me repeat that, VERY STRONG assumption
                              that an 8-bit string contains one byte per character
                              unless there is a good reason to believe otherwise.
                              This assumption is built into various places, including
                              all of the string methods.

                              The current design allows accidental inclusion of
                               a character that is not in the 7-bit ascii subset ***IN
                              A PROGRAM THAT HAS A UTF-8 CHARACTER
                              ENCODING DECLARATION*** to break that
                              assumption without any kind of notice. That in
                              turn will break all of the assumptions that the string
                              module and string methods are based on. That in
                              turn is likely to break lots of existing modules and
                              cause a lot of debugging time that could be avoided
                              by proper design.
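
The broken assumption surfaces directly in the string methods mentioned above. A Python 3 sketch (again with `bytes` standing in for 2.x byte strings): the same search returns different answers depending on whether it counts bytes or characters.

```python
text = "Herr Müller"                # "ü" is two bytes in UTF-8
data = text.encode("utf-8")

print(text.find("ller"))            # 7: a character offset
print(data.find(b"ller"))           # 8: a byte offset -- they diverge
```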

                              One of Python's strong points is that it's difficult
                              to get into trouble unless you deliberately try (then
                              it's quite easy, fortunately.)

                              I'm not worried about this causing people to
                              abandon Python. I'm more worried about the
                              current situation causing enough grief that people
                               will decide that utf-8 source code encoding isn't
                              worth it.
                              [color=blue]
                              > And tell me why I shouldn't be allowed to work easily with raw
                              > UTF-8 strings, if I do use coding:utf-8.[/color]

                              First, there's nothing that's stopping you. All that
                              my proposal will do is require you to do a one
                              time conversion of any strings you put in the
                              program as literals. It doesn't affect any other
                              strings in any other way at any other time.

                              I'll withdraw my objection if you can seriously
                              assure me that working with raw utf-8 in
                              8-bit character string literals is what most programmers
                              are going to do most of the time.

                              I'm not going to accept the very common need
                              of converting unicode strings to 8-bit strings so
                              they can be written to disk or stored in a data base
                              or whatnot (or reversing the conversion for reading.)
                              That has nothing to do with the current issue - it's
                              something that everyone who deals with unicode
                              needs to do, regardless of the encoding of the
                              source program.
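
The boundary conversion set aside above looks roughly like this in Python 3 (the file path here is hypothetical): encode exactly once on the way out, decode exactly once on the way in, and internal code never handles raw bytes.

```python
import os
import tempfile

text = "Grüße"                                       # Unicode text inside the program
path = os.path.join(tempfile.mkdtemp(), "out.txt")   # hypothetical output file

# Encode exactly once, at the boundary, when writing...
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# ...and decode exactly once when reading back.
with open(path, "rb") as f:
    assert f.read().decode("utf-8") == text
```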

                              John Roth[color=blue]
                              >
                              > --
                              > Hallvard[/color]


                              Comment
