a question about Chinese characters in a Python Program

  • Liang Chen

    a question about Chinese characters in a Python Program

    Hope you all had a nice weekend.

    I have a question that I hope someone can help me with. I want to run a Python program that uses Tkinter for the user interface (GUI). The program allows me to type Chinese characters, but is nevertheless unable to show them on screen. The following is some of the error message I received after I logged off the program:

    "Could not write output: <type "exceptions : UnicodeEncodeEr ror'>, 'ascii' codec can't encode characters in position 0-1: ordinal not in range (128)"

    Any suggestion will be appreciated.

    Sincerely,

    Liang


    Liang Chen, Ph.D.
    Assistant Professor
    University of Georgia
    Communication Sciences and Special Education
    542 Aderhold Hall
    Athens, GA 30602

    Phone: 706-542-4566
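
    For illustration, a minimal Python 2 sketch (not taken from the original
    program; the widget layout and file name are made up) of the situation
    described above: Tkinter hands back a unicode object, and writing it to a
    byte stream with the default 'ascii' codec raises exactly this
    UnicodeEncodeError. Encoding explicitly is one common way out:

    # -*- coding: utf-8 -*-
    # Minimal sketch, Python 2.x assumed.
    import Tkinter

    root = Tkinter.Tk()
    entry = Tkinter.Entry(root)
    entry.pack()

    def save_text():
        text = entry.get()               # a unicode object, e.g. u'\u4e2d\u6587'
        out = open('output.txt', 'wb')   # hypothetical output file
        # out.write(text)                # implicit ASCII conversion -> UnicodeEncodeError
        out.write(text.encode('utf-8'))  # state the encoding explicitly instead
        out.close()

    Tkinter.Button(root, text='Save', command=save_text).pack()
    root.mainloop()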


  • est

    #2
    Re: a question about Chinese characters in a Python Program

    On Oct 20, 10:48 am, Liang Chen <c...@uga.edu> wrote:
    Hope you all had a nice weekend.
    >
    I have a question that I hope someone can help me with. I want to run a Python program that uses Tkinter for the user interface (GUI). The program allows me to type Chinese characters, but is nevertheless unable to show them on screen. The following is some of the error message I received after I logged off the program:
    >
    "Could not write output: <type "exceptions : UnicodeEncodeEr ror'>, 'ascii'codec can't encode characters in position 0-1: ordinal not in range (128)"
    >
    Any suggestion will be appreciated.
    >
    Sincerely,
    >
    Liang
    >
    Liang Chen, Ph.D.
    Assistant Professor
    University of Georgia
    Communication Sciences and Special Education
    542 Aderhold Hall
    Athens, GA 30602
    >
    Phone: 706-542-4566
    Personally I call it a serious bug in Python, but sadly most of the Python
    community members do not agree. It may be an internal str() call that
    caused this issue.





    • Paul Boddie

      #3
      Re: a question about Chinese characters in a Python Program

      On 20 Oct, 07:32, est <electronix...@gmail.com> wrote:
      >
      Personally I call it a serious bug in python
      Normally I'd entertain the possibility of bugs in Python, but your
      reasoning is a bit thin (in http://bugs.python.org/issue3648): "Why
      cann't Python just define ascii to range(256)"

      I do accept that it can be awkward to output text to the console, for
      example, but you have to consider that the console might not be
      configured to display any character you can throw at it. My console is
      configured for ISO-8859-15 (something like your magical "ascii to
      range(256)" only where someone has to decide what those 256 characters
      actually are), but that isn't going to help me display CJK characters.
      A solution might be to generate UTF-8 and then get the user to display
      the output in an appropriately configured application, but even then
      someone has to say that it's UTF-8 and not some other encoding that's
      being used. As discussed in another recent thread, Python 2.x does
      make some reasonable guesses about such matters to the extent that
      it's possible automatically (without magical knowledge).

      There is also the problem about use of the "str" built-in function or
      any operation where some Unicode object may be converted to a plain
      string. It is now recommended that you only convert to plain strings
      when you need to produce a sequence of bytes (for output, for
      example), and that you indicate how the Unicode values are encoded as
      bytes (by specifying an encoding). Python 3.x doesn't really change
      this: it just makes the Unicode/text vs. bytes distinction more
      obvious.

      Paul
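
      To make the distinction concrete, a short interpreter sketch (Python 2.x
      assumed; the two characters are arbitrary CJK examples) of converting a
      Unicode object to bytes with an explicit encoding rather than relying on
      the implicit ASCII conversion that str() performs:

      >>> text = u'\u4e2d\u6587'        # two CJK characters
      >>> str(text)                     # implicit conversion uses the ASCII table
      Traceback (most recent call last):
        ...
      UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
      >>> text.encode('utf-8')          # state the encoding explicitly
      '\xe4\xb8\xad\xe6\x96\x87'
      >>> text.encode('gb2312')         # or whatever the receiving program expects
      '\xd6\xd0\xce\xc4'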


      • est

        #4
        Re: a question about Chinese characters in a Python Program

        On Oct 20, 6:47 pm, Paul Boddie <p...@boddie.org.uk> wrote:
        On 20 Oct, 07:32, est <electronix...@gmail.com> wrote:
        >
        >
        >
        Personally I call it a serious bug in python
        >
        Normally I'd entertain the possibility of bugs in Python, but your
        reasoning is a bit thin (in http://bugs.python.org/issue3648): "Why
        cann't Python just define ascii to range(256)"
        >
        I do accept that it can be awkward to output text to the console, for
        example, but you have to consider that the console might not be
        configured to display any character you can throw at it. My console is
        configured for ISO-8859-15 (something like your magical "ascii to
        range(256)" only where someone has to decide what those 256 characters
        actually are), but that isn't going to help me display CJK characters.
        A solution might be to generate UTF-8 and then get the user to display
        the output in an appropriately configured application, but even then
        someone has to say that it's UTF-8 and not some other encoding that's
        being used. As discussed in another recent thread, Python 2.x does
        make some reasonable guesses about such matters to the extent that
        it's possible automatically (without magical knowledge).
        >
        There is also the problem about use of the "str" built-in function or
        any operation where some Unicode object may be converted to a plain
        string. It is now recommended that you only convert to plain strings
        when you need to produce a sequence of bytes (for output, for
        example), and that you indicate how the Unicode values are encoded as
        bytes (by specifying an encoding). Python 3.x doesn't really change
        this: it just makes the Unicode/text vs. bytes distinction more
        obvious.
        >
        Paul
        Thanks for the long comment Paul, but it didn't help with the massive
        encoding errors in Python.

        IMHO it's even better to output wrong encodings than to halt the
        WHOLE damn program with an exception.

        When debugging encoding problems, the solution is simple. If
        characters display wrongly, switch to another encoding; one of them
        must be right.

        But it's tiring to deal with encodings in Python: you have to wrap
        EVERY SINGLE string expression in try ... except ... Just imagine
        what a pain that is.

        Just like the example I gave in Google Groups, u'\ue863' can NEVER be
        encoded into '\xfe\x9f'. Not a chance, because Python REFUSES to
        handle a byte value outside range(128).

        Strangely, the 'mbcs' encoding system can. Does 'mbcs' have magic or
        something? But it's Windows-specific.

        Dealing with character encodings is really simple. AFAIK the early
        encodings that predate Unicode, although they have many names, are
        all based on hacks. Take Chinese characters as an example. They use
        the GB2312 encoding, which is in fact totally compatible with
        range(256) ANSI. (There are minor issues, like half of a wide
        character showing up as a question mark, but at least it's readable.)
        If you just output the series of bytes, it IS GB2312. The same is
        true of BIG5, JIS, etc.


        Like I said, str() should NOT throw an exception BY DESIGN; it's a
        basic language standard. str() is not only a convert-to-string
        function but also a serialization in most cases (e.g. sockets). My
        simple suggestion is: if it's a unicode character, output it as
        UTF-8; otherwise just output the byte array. Please do not encode it
        with the really stupid range(128) ASCII. It's not guessing, it's
        totally wrong.
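
        For what it's worth, the behaviour being asked for here can be spelled
        today as a small helper instead of a change to str() itself; a rough
        sketch (Python 2.x assumed, the helper name is made up):

        def to_bytes(obj, encoding='utf-8'):
            # Unicode objects are encoded explicitly, byte strings pass
            # through untouched, and everything else goes through str().
            if isinstance(obj, unicode):
                return obj.encode(encoding)
            if isinstance(obj, str):
                return obj
            return str(obj)

        # e.g. sock.sendall(to_bytes(message)) rather than sock.sendall(str(message))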


        • Paul Boddie

          #5
          Re: a question about Chinese characters in a Python Program

          On 20 Oct, 15:30, est <electronix...@gmail.com> wrote:
          >
          Thanks for the long comment Paul, but it didn't help massive errors in
          Python encoding.
          >
          IMHO it's even better to output wrong encodings rather than halt the
          WHOLE damn program by an exception
          I disagree. Maybe I'll now get round to uploading an amusing pictorial
          example of this strategy just to illustrate where it can lead. CJK
          characters may be more demanding to deal with than various European
          characters, but I've seen public advertisements (admittedly aimed at
          IT course applicants) which made jokes about stuff like "å" and "ø"
          appearing in documents instead of the intended European characters, so
          it's fairly safe to say that people do care what gets written out from
          computer programs.
          When debugging encoding problems, the solution is simple. If
          characters display wrong, switch to another encoding, one of them must
          be right.
          >
          But it's tiring in python to deal with encodings, you have to wrap
          EVERY SINGLE character expression with try ... except ... just imagine
          what pain it is.
          If everything is in Unicode then you don't have to think about
          encodings. I recommend using things like codecs.open to ensure that
          input and output even produce and consume Unicode objects when dealing
          with files.
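
          A short sketch of that approach (Python 2.x assumed; the file names
          are made up): codecs.open returns file objects that decode on read
          and encode on write, so the program only ever handles Unicode
          objects internally.

          import codecs

          # Bytes on disk are decoded to unicode using the declared encoding.
          f = codecs.open('notes.txt', 'r', encoding='utf-8')
          text = f.read()           # a unicode object
          f.close()

          # Unicode objects are encoded back to bytes with the same encoding.
          out = codecs.open('notes-copy.txt', 'w', encoding='utf-8')
          out.write(text)           # no implicit ASCII conversion involved
          out.close()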
          Just like the example I gave in Google Groups, u'\ue863' can NEVER be
          encoded into '\xfe\x9f'. Not a chance, because python REFUSE to handle
          a byte that is greater than range(128).
          Aside from the matter of which encoding you'd need to use to convert
          u'\ue863' into '\xfe\x9f', it has nothing to do with any implicit byte
          value range. To get from a Unicode object to a sequence of bytes
          (since that is the external representation of the text for other
          programs), Python has to perform a conversion. As a safe (but
          obviously conservative) default, Python only attempts to convert each
          Unicode character to a byte value using the ASCII character value
          table which is only defined for characters 0 to 127 - there's no such
          thing as "8-bit ASCII".

          Python doesn't attempt to automatically convert using other character
          tables (encodings, in other words), since there is quite a large
          possibility that the result, if not produced for the correct encoding,
          will not produce the desired visual effect. If I start with, say,
          character "ø" and encode it using UTF-8, I get a sequence of bytes
          which, if interpreted by a program expecting ISO-8859-15, will appear
          as the two characters "Ãž" instead. If I encode the character using
          ISO-8859-15 and then feed the
          resulting byte sequence to a program expecting UTF-8, it will probably
          either complain or produce an incorrect visual effect. The reason why
          ASCII is safer (although not entirely safe) is because many encodings
          support ASCII as a subset of themselves.
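
          An interpreter sketch of that round trip (Python 2.x assumed); which
          wrong characters you get depends on the receiving encoding, which is
          the point:

          >>> ch = u'\xf8'                       # LATIN SMALL LETTER O WITH STROKE, "ø"
          >>> utf8_bytes = ch.encode('utf-8')
          >>> utf8_bytes
          '\xc3\xb8'
          >>> utf8_bytes.decode('iso-8859-15')   # wrong reader: two unrelated characters
          u'\xc3\u017e'
          >>> ch.encode('iso-8859-15')           # the reverse mistake: a bare 0xf8 byte
          '\xf8'
          >>> '\xf8'.decode('utf-8')             # ...which the UTF-8 reader rejects
          Traceback (most recent call last):
            ...
          UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 0: ...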
          Strangely the 'mbcs' encoding system can. Does 'mbcs' have magic or
          something? But it's Windows-specific
          I thought Microsoft used some UTF-16 variant. That would explain how
          it can handle more or less everything.
          Dealing with character encodings is really simple. AFAIK early
          encoding before Unicode, although they have many names, are all based
          on hacks. Take Chinese characters as an example. They are called
          GB2312 encoding, in fact it is totally compatible with range(256)
          ANSI. (There are minor issues like display half of a wide-character in
          a question mark ? but at least it's readable) If you just output
          serials of byte array, it IS GB2312. The same is true with BIG5, JIS,
          etc.
          From the Wikipedia page, it appears that you need to convert GB2312
          values to EUC-CN by a relatively straightforward process, and can then
          output the resulting byte sequence in an ASCII compatible way,
          provided that you filter out all the byte values greater than 127:
          these filtered bytes would produce nonsense for anyone using a program
          not expecting EUC-CN. UTF-8 has some similar properties, but as I
          noted above, you wouldn't want to read most of the output if your
          program wasn't expecting UTF-8.
          Like I said, str() should NOT throw an exception BY DESIGN, it's a
          basic language standard. str() is not only a convert to string
          function, but also a serialization in most cases.(e.g. socket) My
          simple suggestion is: If it's a unicode character, output as UTF-8;
          other wise just ouput byte array, please do not encode it with really
          stupid range(128) ASCII. It's not guessing, it's totally wrong.
          I think it's unfortunate that "str" is now potentially unreliable for
          certain uses, but to just output an arbitrary byte sequence (unless by
          byte array you mean a representation of the numeric values) is the
          wrong thing to do unless you don't care about the output; in which
          case, you could just as well use "repr" instead. I think the output of
          "str" vs. "unicode" especially with regard to Unicode objects was
          discussed extensively on the python-dev mailing list at one point.

          I don't disagree that people sometimes miss a way of having Python or
          some library "do the right thing" when writing stuff out. I could
          imagine a wrapper for Python accepting UTF-8 whose purpose is to
          "blank out" characters which the console cannot handle, and people
          might use this wrapper explicitly because that is the "right thing"
          for them. Indeed, such a program may already exist for a more general
          audience since I imagine that it could be fairly useful.

          Paul


          • Steven D'Aprano

            #6
            Re: a question about Chinese characters in a Python Program

            On Mon, 20 Oct 2008 06:30:09 -0700, est wrote:
            Like I said, str() should NOT throw an exception BY DESIGN, it's a basic
            language standard.
            int() is also a basic language standard, but it is perfectly acceptable
            for int() to raise an exception if you ask it to convert something into
            an integer that can't be converted:

            int("cat")

            What else would you expect int() to do but raise an exception?

            If you ask str() to convert something into a string which can't be
            converted, then what else should it do other than raise an exception?
            Whatever answer you give, somebody else will argue it should do another
            thing. Maybe I want failed characters replaced with '?'. Maybe Fred wants
            failed characters deleted altogether. Susan wants UTF-16. George wants
            Latin-1.
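
            For the record, each of those choices already has an explicit
            spelling in Python 2.x, which is arguably the point: the caller has
            to say which one it wants. A quick interpreter sketch (the sample
            text is arbitrary; Susan's UTF-16 and George's Latin-1 are just
            different first arguments to encode()):

            >>> text = u'abc\u4e2d\u6587'
            >>> text.encode('ascii', 'replace')            # failed characters become '?'
            'abc??'
            >>> text.encode('ascii', 'ignore')             # failed characters are dropped
            'abc'
            >>> text.encode('ascii', 'xmlcharrefreplace')  # or become character references
            'abc&#20013;&#25991;'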

            The simple fact is that there is no 1:1 mapping from all 65,000+ Unicode
            characters to the 256 bytes used by byte strings, so there *must* be an
            encoding, otherwise you don't know which characters map to which bytes.

            ASCII has the advantage of being the lowest common denominator. Perhaps
            it doesn't make too many people very happy, but it makes everyone equally
            unhappy.


            str() is not only a convert to string function, but
            also a serialization in most cases.(e.g. socket) My simple suggestion
            is: If it's a unicode character, output as UTF-8;
            Why UTF-8? That will never do. I want it output as UCS-4.

            other wise just ouput
            byte array, please do not encode it with really stupid range(128) ASCII.
            It's not guessing, it's totally wrong.
            If you start with a byte string, you can always get a byte string:
            >>> s = '\x96 \xa0 \xaa'  # not ASCII characters
            >>> s
            '\x96 \xa0 \xaa'
            >>> str(s)
            '\x96 \xa0 \xaa'



            --
            Steven


            • est

              #7
              Re: a question about Chinese characters in a Python Program

              On Oct 20, 11:46 pm, Steven D'Aprano <st...@REMOVE-THIS-
              cybersource.com.au> wrote:
              On Mon, 20 Oct 2008 06:30:09 -0700, est wrote:
              Like I said, str() should NOT throw an exception BY DESIGN, it's a basic
              language standard.
              >
              int() is also a basic language standard, but it is perfectly acceptable
              for int() to raise an exception if you ask it to convert something into
              an integer that can't be converted:
              >
              int("cat")
              >
              What else would you expect int() to do but raise an exception?
              >
              If you ask str() to convert something into a string which can't be
              converted, then what else should it do other than raise an exception?
              Whatever answer you give, somebody else will argue it should do another
              thing. Maybe I want failed characters replaced with '?'. Maybe Fred wants
              failed characters deleted altogether. Susan wants UTF-16. George wants
              Latin-1.
              >
              The simple fact is that there is no 1:1 mapping from all 65,000+ Unicode
              characters to the 256 bytes used by byte strings, so there *must* be an
              encoding, otherwise you don't know which characters map to which bytes.
              >
              ASCII has the advantage of being the lowest common denominator. Perhaps
              it doesn't make too many people very happy, but it makes everyone equally
              unhappy.
              >
              str() is not only a convert to string function, but
              also a serialization in most cases.(e.g. socket) My simple suggestion
              is: If it's a unicode character, output as UTF-8;
              >
              Why UTF-8? That will never do. I want it output as UCS-4.
              >
              other wise just ouput
              byte array, please do not encode it with really stupid range(128) ASCII..
              It's not guessing, it's totally wrong.
              >
              If you start with a byte string, you can always get a byte string:
              >
              >>> s = '\x96 \xa0 \xaa'  # not ASCII characters
              >>> s
              '\x96 \xa0 \xaa'
              >>> str(s)
              '\x96 \xa0 \xaa'
              >
              --
              Steven
              In fact Python handles characters better than most other open-source
              programming languages. But still:

              1. You can explain str() in 1000 ways, but there are 1001 more
              confusing errors in all kinds of Python apps. (Not only some of the
              scripts I've written, but also famous enough apps like Boa
              Constructor: http://i36.tinypic.com/1gqekh.jpg. This sucks hard, right?)


              2. Would anyone please kindly tell me how I can define a customized
              encoding (namely 'ansi') which handles range(256), so I can call
              sys.setdefaultencoding('ansi') once and for all?
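
              For what it's worth, a byte-transparent 256-value encoding already
              exists under the name 'latin-1' (it maps byte values 0-255 directly
              onto the first 256 code points), so one sketch of the request above
              is simply to register 'ansi' as an alias for it, rather than to
              touch the default encoding at all (Python 2.x assumed; relying on
              sys.setdefaultencoding is widely discouraged):

              import codecs

              def _search(name):
                  # Hypothetical alias: treat 'ansi' as latin-1.
                  if name == 'ansi':
                      return codecs.lookup('latin-1')
                  return None

              codecs.register(_search)

              print repr(u'\xe9'.encode('ansi'))   # '\xe9'
              print repr('\xe9'.decode('ansi'))    # u'\xe9'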


              • Lie Ryan

                #8
                Re: a question about Chinese characters in a Python Program

                On Sun, 19 Oct 2008 22:32:20 -0700, est wrote:
                On Oct 20, 10:48 am, Liang Chen <c...@uga.edu> wrote:
                >Hope you all had a nice weekend.
                >>
                >I have a question that I hope someone can help me with. I want to run a
                >Python program that uses Tkinter for the user interface (GUI). The
                >program allows me to type Chinese characters, but is nevertheless unable
                >to show them on screen. The following is some of the error message I
                >received after I logged off the program:
                >>
                >"Could not write output: <type "exceptions : UnicodeEncodeEr ror'>,
                >'ascii' codec can't encode characters in position 0-1: ordinal not in
                >range (128)"
                >>
                >Any suggestion will be appreciated.
                >>
                >Sincerely,
                >>
                >Liang
                >>
                >Liang Chen, Ph.D.
                >Assistant Professor
                >University of Georgia
                >Communication Sciences and Special Education 542 Aderhold Hall
                >Athens, GA 30602
                >>
                >Phone: 706-542-4566
                >
                Personally I call it a serious bug in Python, but sadly most of the Python
                community members do not agree. It may be an internal str() call that
                caused this issue.
                No, it's not a bug; it's the correct behavior, although some people
                might not be able to immediately grasp the reasons why it is correct
                and why defining ascii as range(256) is plain wrong.

                Anyway, if you haven't noticed, str() is capable of emitting all
                characters in range(256), e.g. str('\xff'). ASCII, though, doesn't
                allow that, as ASCII is a 7-bit encoding; latin-1, ansi, and other
                ASCII extensions are 8-bit encodings, but ASCII itself is not.
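
                A quick interpreter illustration of that distinction (Python 2.x
                assumed):

                >>> str('\xff')                  # byte strings pass through str() untouched
                '\xff'
                >>> u'\xff'.encode('ascii')      # but ASCII itself stops at code point 127
                Traceback (most recent call last):
                  ...
                UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)
                >>> u'\xff'.encode('latin-1')    # an 8-bit extension covers 0-255
                '\xff'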


                • Ben Finney

                  #9
                  Re: a question about Chinese characters in a Python Program

                  est <electronixtar@gmail.com> writes:
                  IMHO it's even better to output wrong encodings rather than halt the
                  WHOLE damn program by an exception
                  I can't agree with this. The correct thing to do in the face of
                  ambiguity is for Python to refuse to guess.
                  When debugging encoding problems, the solution is simple. If
                  characters display wrong, switch to another encoding, one of them
                  must be right.
                  That's debugging problems not in the program but in the *data*, which
                  Python is helping with by making the problems apparent as soon as
                  feasible to do so.
                  But it's tiring in python to deal with encodings, you have to wrap
                  EVERY SINGLE character expression with try ... except ... just imagine
                  what pain it is.
                  That sounds like a rather poor program design. Much better to sanitise
                  the inputs to the program at a few well-defined points, and know from
                  that point that the program is dealing internally with Unicode.
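
                  A sketch of that "decode at the boundary" shape (Python 2.x
                  assumed; the function names and the choice of UTF-8 are made
                  up for the example):

                  def read_request(raw_bytes, encoding='utf-8'):
                      # The one well-defined point where outside bytes are decoded;
                      # everything past this line deals in unicode objects.
                      return raw_bytes.decode(encoding)

                  def write_response(text, encoding='utf-8'):
                      # And the one point where unicode is encoded back to bytes
                      # for the outside world (socket, file, console, ...).
                      return text.encode(encoding)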
                  Dealing with character encodings is really simple.
                  Given that your solutions are baroque and complicated, I don't think
                  even you yourself can believe that statement.
                  Like I said, str() should NOT throw an exception BY DESIGN, it's a
                  basic language standard.
                  Any code should throw an exception if the input is both ambiguous and
                  invalid by the documented specification.
                  str() is not only a convert to string function, but also a
                  serialization in most cases.(e.g. socket) My simple suggestion is:
                  If it's a unicode character, output as UTF-8; other wise just ouput
                  byte array, please do not encode it with really stupid range(128)
                  ASCII. It's not guessing, it's totally wrong.
                  Your assumption would require that UTF-8 be a lowest *common*
                  denominator for most output devices Python will be connected to.
                  That's simply not the case; the lowest common denominator is still
                  ASCII.

                  I yearn for a future where all output devices can be assumed, in the
                  absence of other information, to understand a common Unicode encoding
                  (e.g. UTF-8), but we're not there yet and it would be a grave mistake
                  for Python to falsely behave as though we were.

                  --
                  \ “I went to a fancy French restaurant called ‘Déjà Vu’. The head |
                  `\ waiter said, ‘Don't I know you?’” —Steven Wright |
                  _o__) |
                  Ben Finney


                  • John Machin

                    #10
                    Re: a question about Chinese characters in a Python Program

                    On Oct 21, 1:45 am, Paul Boddie <p...@boddie.org.uk> wrote:
                    From the Wikipedia page, it appears that you need to convert GB2312
                    values to EUC-CN by a relatively straightforward process, and can then
                    output the resulting byte sequence in an ASCII compatible way,
                    provided that you filter out all the byte values greater than 127:
                    these filtered bytes would produce nonsense for anyone using a program
                    not expecting EUC-CN. UTF-8 has some similar properties, but as I
                    noted above, you wouldn't want to read most of the output if your
                    program wasn't expecting UTF-8.
                    What the Wikipedia page doesn't say is that the number of people who
                    grok the concept of a GB2312 codepoint is vanishingly small, and the
                    number of people who would actually have GB2312 codepoints in a file
                    is smaller still. When people say their data is GB2312, they mean
                    "GB<somethingen coded as EUC-CN". So the relatively straightforward
                    process is not required in practice.

                    I don't understand the point or value of filtering out all byte values
                    greater than 127:

                    If the data is really GB2312, this would throw out all the Chinese
                    characters.

                    If the GB<something> is, as is likely, really GBK aka cp936 (a
                    superset of GB2312), then the second byte of a Chinese character may
                    be in the ASCII range, and the result of the filter would comprise the
                    true ASCII characters plus some garbage ASCII characters.
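
                    For instance, an interpreter sketch (Python 2.x assumed,
                    using the cp936 codec) of a GBK character whose trailing
                    byte falls in the ASCII range, and of what byte-value
                    filtering would leave behind:

                    >>> '\x81\x40'.decode('cp936')    # lead byte 0x81, trail byte 0x40 ('@')
                    u'\u4e02'
                    >>> ''.join(c for c in '\x81\x40' if ord(c) < 128)
                    '@'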


                    • Ben Finney

                      #11
                      Re: a question about Chinese characters in a Python Program

                      John Machin <sjmachin@lexicon.net> writes:
                      I don't understand the point or value of filtering out all byte values
                      greater than 127
                      That's only done if the encoding isn't otherwise specified. In which
                      case, ASCII is the documented default encoding. In which case, it
                      *must* be restricted to code points 0+IBM-127, otherwise it's not ASCII.

                      The value of doing this is to make it rapidly and repeatably apparent
                      when the programmer's assumptions about character encoding are false,
                      allowing the programming error to be fixed early rather than late.
                      This is, in my estimation, of more value than heuristic magic to
                      +IBw-guess+IB0- the encoding, and the resultant debugging nightmare when
                      that guesswork fails in unpredictable ways later in the program's
                      life.

                      --
                      +AFw- +IBw-My girlfriend has a queen sized bed; I have a court jester |
                      `+AFw- sized bed. It's red and green and has bells on it, and the ends |
                      _o__) curl up.+IB0- +IBQ-Steven Wright |
                      Ben Finney


                      • John Machin

                        #12
                        Re: a question about Chinese characters in a Python Program

                        On Oct 21, 11:03 pm, Ben Finney <bignose+hates-s...@benfinney.id.au>
                        wrote:
                        John Machin <sjmac...@lexicon.net> writes:
                        I don't understand the point or value of filtering out all byte values
                        greater than 127
                        >
                        That's only done if the encoding isn't otherwise specified. In which
                        case, ASCII is the documented default encoding. In which case, it
                        *must* be restricted to code points 0+IBM-127, otherwise it's not ASCII.
                        >
                        The value of doing this is to make it rapidly and repeatably apparent
                        when the programmer's assumptions about character encoding are false,
                        allowing the programming error to be fixed early rather than late.
                        "make it rapidly and repeatably apparent ..." is much better achieved
                        by raising an exception.
                        This is, in my estimation, of more value than heuristic magic to
                        +IBw-guess+IB0- the encoding, and the resultant debugging nightmare when
                        that guesswork fails in unpredictable ways later in the program's
                        life.
                        Was I suggesting "heuristic magic"?

                        What is that 0+IBM-127 +IBw-guess+IB0- gibberish in your posting?


                        • Ben Finney

                          #13
                          Re: a question about Chinese characters in a Python Program

                          John Machin <sjmachin@lexicon.net> writes:
                          On Oct 21, 11:03 pm, Ben Finney <bignose+hates-s...@benfinney.id.au>
                          wrote:
                          John Machin <sjmac...@lexicon.net> writes:
                          I don't understand the point or value of filtering out all byte values
                          greater than 127
                          That's only done if the encoding isn't otherwise specified. In which
                          case, ASCII is the documented default encoding. In which case, it
                          *must* be restricted to code points 0+IBM-127, otherwise it's not ASCII.

                          The value of doing this is to make it rapidly and repeatably apparent
                          when the programmer's assumptions about character encoding are false,
                          allowing the programming error to be fixed early rather than late.
                          >
                          "make it rapidly and repeatably apparent ..." is much better achieved
                          by raising an exception.
                          Ah, I misread; I thought you were asking about the value of defaulting
                          to ASCII and therefore raising an exception. It seems we agree on
                          that, then.
                          What is that 0+IBM-127 +IBw-guess+IB0- gibberish in your posting?
                          It wasn't in my message as sent to my news server, nor as I read the
                          message in comp.lang.python. The message was encoded using UTF-8.
                          Perhaps it's since been munged in transit to your eyeballs by any of a
                          number of intermediaries.

                          --
                          \ “I bought some batteries, but they weren't included; so I had |
                          `\ to buy them again.” —Steven Wright |
                          _o__) |
                          Ben Finney


                          • John Machin

                            #14
                            Re: a question about Chinese characters in a Python Program

                            On Oct 22, 11:07 am, Ben Finney <bignose+hates-s...@benfinney.id.au>
                            wrote:
                            John Machin <sjmac...@lexicon.net> writes:
                            What is that 0+IBM-127 +IBw-guess+IB0- gibberish in your posting?
                            >
                            It wasn't in my message as sent to my news server, nor as I read the
                            message in comp.lang.python. The message was encoded using UTF-8.
                            Perhaps it's since been munged in transit to your eyeballs by any of a
                            number of intermediaries.
                            Would you believe:
                            >>> '0+IBM-127 +IBw-guess+IB0-'.decode('utf7')
                            u'0\u2013127 \u201cguess\u201d'

