sys.stdout, urllib and unicode... I don't understand.

  • Thierry

    sys.stdout, urllib and unicode... I don't understand.

    Hello fellow pythonists,

    I'm a relatively new Python developer, and I'm trying to adjust my
    understanding of "how things work" to Python, but I have hit a
    block that I cannot understand.
    I needed to output Unicode data returned from a web service, and could
    not get Unicode/multibyte text back before applying a hack that I
    don't understand (thank you, Google).

    I have written a simple wxPython application that takes the user's
    input, sends it to a web service, and gets back translations in
    several languages.
    The service itself is fully UTF-8.

    The "source" string is first encoded to "latin1" after a pass through
    unicodedata.normalize(), as urllib.quote() cannot work on unicode:
    >>srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')
    After that, a urllib2 request is sent with this encoded string to the
    web service:
    >>con=urllib2.Request(self.url, headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
    >>req=urllib2.urlopen(con)
    First problem: how do I determine the encoding of the response?
    If I inspect a request from Firefox, I see that the server's response
    header specifies UTF-8.
    But if I use this code:
    >>ret=U''
    >>for line in req:
    >>    ret=ret+string.replace(line.strip(),'\n',chr(10))
    I end up with a UnicodeDecodeError. I tried various line.decode(),
    line.normalize and such, but could not make this error disappear.
    Until now I have avoided that problem, as the service always seems to
    return 1 line, but I am wondering.

    Second problem: if I try a
    >>print line
    inside the loop, I get the same error too. I thought that unicode() would
    force Python to consider the given text as unicode, not try to
    convert it to unicode.
    Here again, trying several normalize/decode combinations did not help
    at all.

    Then, looking for help through google, I have found this post:

    and I gave it a try. What I did, though, was not to override
    sys.stdout, but to declare a new writer stream as a property of my
    main class:
    >>self.out=OutStreamEncoder(sys.stdout, 'utf-8')
    But what is strange is that since I did that, even without using this
    self.out writer, the unicode translations are working as I was
    expecting them to. Except in the for loop, where a concatenation still
    triggers the UnicodeDecodeError exception.
    I know the "explicit is better than implicit" Python motto, and I
    really like it.
    But here, I don't understand what is going on.

    Does defining that writer object initialize the standard sys.stdout
    object?
    Is it related to an internal use of it, maybe in urllib?
    I tried to find more on the subject, but fell short.
    Can someone explain to me what is happening?
    The full script source can be found at http://www.webalis.com/translator/translator.pyw
  • Tino Wildenhain

    #2
    Re: sys.stdout, urllib and unicode... I don't understand.

    Thierry wrote:
    > Hello fellow pythonists,
    >
    > I'm a relatively new Python developer, and I'm trying to adjust my
    > understanding of "how things work" to Python, but I have hit a
    > block that I cannot understand.
    > I needed to output Unicode data returned from a web service, and could
    > not get Unicode/multibyte text back before applying a hack that I
    > don't understand (thank you, Google).
    >
    > I have written a simple wxPython application that takes the user's
    > input, sends it to a web service, and gets back translations in
    > several languages.
    > The service itself is fully UTF-8.
    >
    > The "source" string is first encoded to "latin1" after a pass through
    > unicodedata.normalize(), as urllib.quote() cannot work on unicode:
    > >>srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')

    urllib.quote() operates on byte strings. If your web service is UTF-8,
    it would make sense to use UTF-8 as the input encoding, not latin1,
    wouldn't it? unicodeinput.encode("utf-8")

    > After that, a urllib2 request is sent with this encoded string to the
    > web service:
    > >>con=urllib2.Request(self.url, headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
    > >>req=urllib2.urlopen(con)
    >
    > First problem: how do I determine the encoding of the response?

    It is sent as part of the headers, e.g. Content-Type: text/html;
    charset=utf-8

    > If I inspect a request from Firefox, I see that the server's response
    > header specifies UTF-8.
    > But if I use this code:
    > >>ret=U''
    > >>for line in req:
    > >>    ret=ret+string.replace(line.strip(),'\n',chr(10))
    > I end up with a UnicodeDecodeError. I tried various line.decode(),
    > line.normalize and such, but could not make this error disappear.
    > Until now I have avoided that problem, as the service always seems to
    > return 1 line, but I am wondering.

    The web server's answer is an encoded byte stream too (usually UTF-8, but
    you can check the headers), so

    line.decode("utf-8") should give you unicode to operate on (always
    do string operations on the canonical form).

    > Second problem: if I try a
    > >>print line
    > inside the loop, I get the same error too. I thought that unicode() would
    > force Python to consider the given text as unicode, not try to
    > convert it to unicode.

    But that is what it does. Basically, unicode() is a constructor for
    unicode objects.

    > Here again, trying several normalize/decode combinations did not help
    > at all.

    It's not too complicated; you just need to keep unicode and byte strings
    separate and draw a clean line between the two (the line is decode()
    and encode()).

    > Then, looking for help through Google, I have found this post:
    >
    > and I gave it a try. What I did, though, was not to override
    > sys.stdout, but to declare a new writer stream as a property of my
    > main class:
    > >>self.out=OutStreamEncoder(sys.stdout, 'utf-8')

    This is fancy but not needed if you take care like above.

    HTH
    Tino
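
As a rough illustration of the two points above (encode to UTF-8 before quoting, and read the charset from the Content-Type header before decoding), here is a minimal sketch. The helper names `quote_text`, `charset_from_content_type` and `decode_body` are hypothetical and are not part of the original script; the import fallback is only there so the sketch runs on both Python 2 and Python 3.

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3


def quote_text(text):
    """Percent-encode a unicode string by way of its UTF-8 bytes."""
    return quote(text.encode('utf-8'))


def charset_from_content_type(value, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value.

    Falling back to UTF-8 is an assumption made for this sketch.
    """
    for part in value.split(';')[1:]:
        name, _, val = part.strip().partition('=')
        if name.lower() == 'charset' and val:
            return val.strip('"\'').lower()
    return default


def decode_body(raw_bytes, content_type):
    """Turn the response bytes into unicode using the declared charset."""
    return raw_bytes.decode(charset_from_content_type(content_type))
```

With a real response object, the header value would come from the response headers; here it is passed in as a plain string to keep the sketch free of network access.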


    • Marc 'BlackJack' Rintsch

      #3
      Re: sys.stdout, urllib and unicode... I don't understand.

      On Tue, 11 Nov 2008 12:18:26 -0800, Thierry wrote:
      > I have written a simple wxPython application that takes the user's
      > input, sends it to a web service, and gets back translations in several
      > languages.
      > The service itself is fully UTF-8.
      >
      > The "source" string is first encoded to "latin1" after a pass through
      > unicodedata.normalize(), as urllib.quote() cannot work on unicode:
      > >>srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')

      If the service uses UTF-8, why don't you just encode the data you send as
      UTF-8 instead of Latin-1, which potentially throws away data because of
      the 'ignore' argument!? Make that ``src_text = srcText.encode('utf-8')``.

      > >>req=urllib2.urlopen(con)
      >
      > First problem: how do I determine the encoding of the response? If I
      > inspect a request from Firefox, I see that the server's response header
      > specifies UTF-8.
      > But if I use this code:
      > >>ret=U''
      > >>for line in req:
      > >>    ret=ret+string.replace(line.strip(),'\n',chr(10))
      > I end up with a UnicodeDecodeError.

      Because `line` contains bytes and `ret` is a `unicode` object. If you
      add a `unicode` object and a `str` object, Python tries to convert the
      `str` to `unicode` using the default (ASCII) encoding. And this fails
      if there are byte values >127. *You* have to decode `line` from a bunch
      of bytes to a bunch of (unicode) characters before you concatenate the
      strings.

      BTW: ``line.strip()`` removes all whitespace at both ends *including
      newlines*, so there are no '\n' to replace anymore. And functions in the
      `string` module that are also implemented as methods on `str` or `unicode`
      are deprecated.

      Ciao,
      Marc 'BlackJack' Rintsch
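
The failure described above can be reproduced without a web service. This is only a sketch: Python 2 performs the ASCII decode implicitly when a `str` is added to a `unicode` object, whereas the code below spells that step out so that it behaves the same on Python 3 (which would otherwise raise a TypeError for mixing the two types). The sample bytes are made up for illustration.

```python
line = b'caf\xc3\xa9'   # UTF-8 bytes, as they would arrive from the server
ret = u''

# What Python 2 attempts behind the scenes for `ret + line`:
# decode the bytes with the default ASCII codec.
try:
    ret + line.decode('ascii')
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False   # byte 0xC3 is above 127, so ASCII decoding fails

# Decoding explicitly with the encoding the server actually used works.
ret = ret + line.decode('utf-8')
```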


      • Thierry

        #4
        Re: sys.stdout, urllib and unicode... I don't understand.

        Thank you to both of you (Marc and Tino).

        I feel a bit stupid right now, because as both of you said, encoding
        my source string to UTF-8 does not produce an exception when I pass it
        to urllib.quote(), and is what it should be.
        I was certain that this had raised an error earlier, and I did not try
        it again.
        That's the result of 2 days making random changes and hoping it works.
        I know, thinking should have come first. My bad...

        The same goes for my handling of the iteration over the request
        result.
        I now have a
        >>line=line.decode('utf-8')
        and no errors (as long as I don't try to print this to stdout, which I
        understand).
        So I'm now really getting back a unicode string that I can handle as
        such.

        I really am confused about what I was trying to do...
        I cannot understand what I did that caused those errors, because the
        state the script is in now corresponds to what I had in mind originally.

        > BTW: ``line.strip()`` removes all whitespace at both ends *including
        > newlines*, so there are no '\n' to replace anymore.

        Not exactly...
        I receive a string with 2 literal characters in it: "\" and "n".
        What I (want to) do here is replace those 2 characters with 1
        chr(10).

        > And functions in the
        > `string` module that are also implemented as methods on `str` or `unicode`
        > are deprecated.

        I had actually read that, but had not modified my code.
        Thanks for pointing it out.

        Anyway, thanks again to both of you.
        I'm quite happy to see it working the way I intended.
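
On the remaining point about printing to stdout: the same decode/encode boundary applies on the way out, so unicode text has to be encoded before it reaches a byte-oriented stream. The sketch below uses the standard library's `codecs.getwriter` as a stand-in for that idea; it is not the `OutStreamEncoder` class from the post mentioned earlier, whose code is not shown in this thread, and the `io.BytesIO` object merely plays the role of a byte-oriented stdout.

```python
import codecs
import io

raw = io.BytesIO()                      # stands in for a byte-oriented stdout
out = codecs.getwriter('utf-8')(raw)    # wraps the stream; encodes on write

out.write(u'caf\xe9\n')                 # unicode in, UTF-8 bytes out
written = raw.getvalue()
```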


        • Steve Holden

          #5
          Re: sys.stdout, urllib and unicode... I don't understand.

          Thierry wrote:
          > Thank you to both of you (Marc and Tino).
          >
          > I feel a bit stupid right now, because as both of you said, encoding
          > my source string to UTF-8 does not produce an exception when I pass it
          > to urllib.quote(), and is what it should be.
          > I was certain that this had raised an error earlier, and I did not try
          > it again.
          > That's the result of 2 days making random changes and hoping it works.
          > I know, thinking should have come first. My bad...
          >
          > The same goes for my handling of the iteration over the request
          > result.
          > I now have a
          > >>line=line.decode('utf-8')
          > and no errors (as long as I don't try to print this to stdout, which I
          > understand).
          > So I'm now really getting back a unicode string that I can handle as
          > such.
          >
          > I really am confused about what I was trying to do...
          > I cannot understand what I did that caused those errors, because the
          > state the script is in now corresponds to what I had in mind originally.
          >
          > > BTW: ``line.strip()`` removes all whitespace at both ends *including
          > > newlines*, so there are no '\n' to replace anymore.
          > Not exactly...
          > I receive a string with 2 literal characters in it: "\" and "n".
          > What I (want to) do here is replace those 2 characters with 1
          > chr(10).

          In that case you would need the following code:

          ret=U''
          for line in req:
              ret=ret+string.replace(line.strip(),'\\n', '\n')

          Otherwise you just replace chr(10)'s with chr(10)'s, which won't help you.

          Are you sure that Python wasn't just printing out "\n" because you'd
          asked it to show you the repr() of a string containing newlines?

          > > And functions in the
          > > `string` module that are also implemented as methods on `str` or `unicode`
          > > are deprecated.
          > I had actually read that, but had not modified my code.
          > Thanks for pointing it out.
          >
          > Anyway, thanks again to both of you.
          > I'm quite happy to see it working the way I intended.

          regards
          Steve
          --
          Steve Holden +1 571 484 6266 +1 800 494 3119
          Holden Web LLC http://www.holdenweb.com/
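
Putting the advice from this thread together, a loop that decodes each line first and then swaps the two-character sequence backslash + "n" for a real newline could look like the sketch below. The function name `join_lines`, the UTF-8 default and the sample data are assumptions for illustration; the `str`/`unicode` method `replace` is used instead of the deprecated `string.replace` function.

```python
def join_lines(lines, encoding='utf-8'):
    """Decode each byte line, then turn literal backslash-n pairs into newlines."""
    ret = u''
    for line in lines:
        text = line.decode(encoding).strip()   # bytes -> unicode first
        ret += text.replace(u'\\n', u'\n')     # 2 characters -> 1 newline
    return ret


# A fake response body: one line containing a literal "\" followed by "n".
body = [b'first\\nsecond \xc3\xa9\r\n']
result = join_lines(body)
```

Because the decode happens before any concatenation, `ret` and `text` are both unicode and no implicit ASCII conversion is attempted.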


          • Thierry

            #6
            Re: sys.stdout, urllib and unicode... I don't understand.

            > Are you sure that Python wasn't just printing out "\n" because you'd
            > asked it to show you the repr() of a string containing newlines?

            Yes, I am sure, because I dumped the ord() values to check them.
            But again, I'm stumped at how complicated I have made this.
            I should not try to code at 2am anymore.
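
For reference, the ord() check mentioned above distinguishes the two cases unambiguously, whereas repr() output alone is easy to misread. A small sketch with made-up sample strings:

```python
real_newline = u'a\nb'     # 3 characters: 'a', newline, 'b'
literal_pair = u'a\\nb'    # 4 characters: 'a', backslash, 'n', 'b'

codes_real = [ord(c) for c in real_newline]
codes_literal = [ord(c) for c in literal_pair]
```

A real newline shows up as the single code 10, while the literal pair shows up as 92 followed by 110.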
