Unicode problems, yet again

  • Ivan Voras

    Unicode problems, yet again

    I have a string fetched from database, in iso8859-2, with 8bit
    characters, and I'm trying to send it over the network, via a socket:

    File "E:\Python24\lib\socket.py", line 249, in write
    data = str(data) # XXX Should really reject non-string non-buffers
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in
    position 123: ordinal not in range(128)

    The other end knows it should expect this encoding, so how to send it?

    (Does anyone else feel that python's unicode handling is, well...
    suboptimal at least?)
  • Kent Johnson

    #2
    Re: Unicode problems, yet again

    Ivan Voras wrote:
    > I have a string fetched from database, in iso8859-2, with 8bit
    > characters, and I'm trying to send it over the network, via a socket:
    >
    > File "E:\Python24\lib\socket.py", line 249, in write
    > data = str(data) # XXX Should really reject non-string non-buffers
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in
    > position 123: ordinal not in range(128)
    >
    > The other end knows it should expect this encoding, so how to send it?

    I think the string from the database may be a unicode string, not an 8-bit one. What happens if you write
    data.encode('iso8859-2') ?

    > (Does anyone else feel that python's unicode handling is, well...
    > suboptimal at least?)

    It can be confusing and surprising, yes. Suboptimal...well, I wouldn't want to say that I could do
    better...

    Kent


    • John Machin

      #3
      Re: Unicode problems, yet again

      On Sun, 24 Apr 2005 03:15:02 +0200, Ivan Voras
      <ivoras@something.ortheother> wrote:

      >I have a string fetched from database, in iso8859-2, with 8bit
      >characters,

      "8bit characters"?? Maybe you did once, or you thought you did, but
      what you have now is a Unicode string, and socket.write() is expecting
      an ordinary string.

      > and I'm trying to send it over the network, via a socket:
      >
      > File "E:\Python24\lib\socket.py", line 249, in write
      > data = str(data) # XXX Should really reject non-string non-buffers
      >UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in
      >position 123: ordinal not in range(128)

      Like it says, you have passed it a *UNICODE* string that has u'\u0161'
      (the small s with caron) at position 123.

      >The other end knows it should expect this encoding, so how to send it?

      If the other end wants an encoding, then you should *encode* it, like
      this:

      >>> us = u'\u0161'
      >>> s = us.encode('iso8859_2')
      >>> s
      '\xb9'
      >>> str(us)
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in position 0: ordinal not in range(128)
      >>> str(s)
      '\xb9'
      >>> # looks like socket.write() might be happier with this.
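
For readers on Python 3, where `str` is always Unicode text and `bytes` is the 8-bit type, the session above can be approximated by the sketch below. It is not part of the original reply and assumes Python 3 rather than the Python 2.4 used in the thread; the variable names only mirror the ones in the session, and `ascii_error` and `roundtrip` are added for illustration.

```python
# Python 3 sketch of the interactive session (assumption: Python 3,
# not the Python 2.4 interpreter shown in the thread).
us = "\u0161"                   # text: LATIN SMALL LETTER S WITH CARON
s = us.encode("iso8859_2")      # explicit encode yields bytes for the socket
print(repr(s))

# In Python 2, str(us) implicitly used the 'ascii' codec. Asking for that
# codec explicitly reproduces the error from the traceback.
try:
    us.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
    print(exc.reason)

# Decoding with the same codec recovers the original text.
roundtrip = s.decode("iso8859_2")
```

Here `s` is the single byte `b'\xb9'`, the same value the Python 2 session shows as `'\xb9'`.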

      >(Does anyone else feel that python's unicode handling is, well...
      >suboptimal at least?)

      Your posting gives no evidence for such a conclusion.



      • Ivan Voras

        #4
        Re: Unicode problems, yet again

        John Machin wrote:

        >>(Does anyone else feel that python's unicode handling is, well...
        >>suboptimal at least?)
        >
        >
        > Your posting gives no evidence for such a conclusion.

        Sorry, that was just steam venting from my ears - I often get bitten by
        the "ordinal not in range(128)" error. :)


        • Martin v. Löwis

          #5
          Re: Unicode problems, yet again

          Ivan Voras wrote:
          > Sorry, that was just steam venting from my ears - I often get bitten by
          > the "ordinal not in range(128)" error. :)

          I think I'm glad to hear that. Errors should never pass silently, unless
          explicitly silenced. When you get that error, it means there is a bug in
          your code (just like a ValueError, a TypeError, or an IndexError). The
          best way to deal with them is to fix them.

          Now, the troubling part is clearly that you are getting *bitten* by
          this specific error, and often so. I presume you also get other kinds
          of errors often, but they don't bite :-) This suggests that you should
          really try to understand what the error message is trying to tell you,
          and what precisely the underlying error is.

          For other errors, you have already come to an understanding of what
          they mean: NameError, ah, there must be a typo. AttributeError on None, ah,
          forgot to check for a None result somewhere. ordinal not in range(128),
          hmm, let's try different variations of the code and see which ones
          work. This is going to continue biting you until you really understand
          what it means.

          The most "sane" mental model (and architecture) is one where you always
          have Unicode strings in your code, and decode/encode only at system
          interfaces (sockets, databases, ...). It turns out that the database
          you use already follows this strategy (i.e. it decodes for you), so
          you now only need to design the other interfaces so it is clear when
          you have Unicode characters and when you have bytes.
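
A minimal sketch of that architecture for the socket side, assuming Python 3 and the iso8859-2 wire encoding mentioned at the start of the thread. The names `WIRE_ENCODING`, `send_text` and `recv_text` are hypothetical helpers made up for illustration, not part of the `socket` module or of anyone's code in this thread.

```python
import socket

# Assumption: both ends have agreed on this encoding, as in the original post.
WIRE_ENCODING = "iso8859-2"


def send_text(sock, text, encoding=WIRE_ENCODING):
    """Encode a Unicode string at the socket boundary and send the bytes."""
    if not isinstance(text, str):
        # Refuse bytes here so it stays clear which side of the
        # interface holds characters and which holds bytes.
        raise TypeError("send_text expects str, got %s" % type(text).__name__)
    sock.sendall(text.encode(encoding))


def recv_text(sock, nbytes, encoding=WIRE_ENCODING):
    """Receive up to nbytes bytes and decode them back into a Unicode string.

    Decoding chunk by chunk is only safe because iso8859-2 is a
    single-byte encoding; a multi-byte encoding would need buffering.
    """
    return sock.recv(nbytes).decode(encoding)
```

With helpers like these, the rest of the program only ever handles Unicode strings, and an accidental attempt to send already-encoded bytes fails loudly with a `TypeError` instead of surfacing later as an encoding error.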

          Regards,
          Martin
