Binary strings, unicode and encodings

  • Laurent Therond

    Binary strings, unicode and encodings

    Maybe you have a minute to clarify the following matter...

    Consider:

    ---

    from cStringIO import StringIO

    def bencode_rec(x, b):
        t = type(x)

        if t is str:
            b.write('%d:%s' % (len(x), x))
        else:
            assert 0

    def bencode(x):
        b = StringIO()

        bencode_rec(x, b)

        return b.getvalue()

    ---

    Now, if I write bencode('failure reason') into a socket, what will I get
    on the other side of the connection?

    a) A sequence of bytes where each byte represents an ASCII character

    b) A sequence of bytes where each byte represents the UTF-8 encoding of a
    Unicode character

    c) It depends on the system locale/it depends on what the site module
    specifies using setdefaultencoding(name)

    ---

    So, if a Python client in China connects to a Python server in Europe,
    must they be careful to specify a common encoding on both sides of the
    connection?

    Regards,

    L.
  • Peter Hansen

    #2
    Re: Binary strings, unicode and encodings

    Laurent Therond wrote:
    >
    > Consider:
    > ---
    > from cStringIO import StringIO
    >
    > def bencode_rec(x, b):
    >     t = type(x)
    >     if t is str:
    >         b.write('%d:%s' % (len(x), x))
    >     else:
    >         assert 0

    The above is confusing. Why not just do

    def bencode_rec(x, b):
        assert type(x) is str
        b.write(.....)

    Why the if/else etc?

    > def bencode(x):
    >     b = StringIO()
    >     bencode_rec(x, b)
    >     return b.getvalue()
    >
    > ---
    > Now, if I write bencode('failure reason') into a socket, what will I get
    > on the other side of the connection?

    This is Python. Why not try it and see? I wrote a quick test at
    the interactive prompt and concluded that StringIO converts to
    strings, so if your input is Unicode it has to be encodeable or
    you'll get the usual exception.

    > a) A sequence of bytes where each byte represents an ASCII character

    Yes, provided your input is exclusively ASCII (7-bit) data.

    > b) A sequence of bytes where each byte represents the UTF-8 encoding of a
    > Unicode character

    Yes, if UTF-8 is your default encoding and you're using Unicode input.

    > c) It depends on the system locale/it depends on what the site module
    > specifies using setdefaultencoding(name)

    Yes, as it always does if you are using Unicode but converting to byte strings
    as it appears StringIO does.

    -Peter


    • Laurent Therond

      #3
      Re: Binary strings, unicode and encodings

      Peter Hansen <peter@engcorp.com> wrote in message news:<4006F13C.7D432B98@engcorp.com>...
      > The above is confusing. Why not just do
      >
      > def bencode_rec(x, b):
      >     assert type(x) is str
      >     b.write(.....)
      >
      > Why the if/else etc?

      That's a code extract. The real code was more complicated.

      > This is Python. Why not try it and see? I wrote a quick test at
      > the interactive prompt and concluded that StringIO converts to
      > strings, so if your input is Unicode it has to be encodeable or
      > you'll get the usual exception.

      Good point. Sorry, I don't have those good reflexes--I am new to
      Python.

      So, your test revealed that StringIO converts to byte strings.
      Does that mean:
      - If the input string contains characters that cannot be encoded
      in ASCII, bencode_rec will fail?

      Yet, if your locale specifies UTF-8 as the default encoding, it should
      not fail, right?

      Hence, I conclude your test was made on a system that uses ASCII/ISO
      8859-1 as its default encoding. Is that right?

      > > a) A sequence of bytes where each byte represents an ASCII character
      >
      > Yes, provided your input is exclusively ASCII (7-bit) data.

      OK.

      > > b) A sequence of bytes where each byte represents the UTF-8 encoding of a
      > > Unicode character
      >
      > Yes, if UTF-8 is your default encoding and you're using Unicode input.

      OK.

      > > c) It depends on the system locale/it depends on what the site module
      > > specifies using setdefaultencoding(name)
      >
      > Yes, as it always does if you are using Unicode but converting to byte strings
      > as it appears StringIO does.

      Umm...not sure here...I think StringIO must behave differently
      depending on your locale and depending on how you assigned the string.

      Thanks for your help!

      L.


      • Laurent Therond

        #4
        Re: Binary strings, unicode and encodings

        I forgot to ask something else...

        If a client and a server run on locales/platforms that use different
        encodings, they are bound to wrongly interpret string bytes. Correct?


        • Laurent Therond

          #5
          Re: Binary strings, unicode and encodings

          I used the interpreter on my system:

          >>> import sys
          >>> sys.getdefaultencoding()
          'ascii'

          OK

          >>> from cStringIO import StringIO
          >>> b = StringIO()
          >>> b.write('%d:%s' % (len('string'), 'string'))
          >>> print b.getvalue()
          6:string

          OK

          >>> c = StringIO()
          >>> c.write('%d:%s' % (len('stringé'), 'stringé'))
          >>> print c.getvalue()
          7:stringé

          OK

          Did StringIO just recognize Extended ASCII?
          Did StringIO just recognize ISO 8859-1?

          é belongs to Extended ASCII AND ISO 8859-1.

          >>> print c.getvalue().decode('US-ASCII')
          Traceback (most recent call last):
            File "<stdin>", line 1, in ?
          UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 8: ordinal not in range(128)

          >>> print c.getvalue().decode('ISO-8859-1')
          Traceback (most recent call last):
            File "<stdin>", line 1, in ?
            File "C:\Python23\lib\encodings\cp437.py", line 18, in encode
              return codecs.charmap_encode(input,errors,encoding_map)
          UnicodeEncodeError: 'charmap' codec can't encode character u'\x82' in position 8: character maps to <undefined>
          >>>

          OK

          It must have been Extended ASCII, then.

          I must do other tests.


          • Jp Calderone

            #6
            Re: Binary strings, unicode and encodings

            On Thu, Jan 15, 2004 at 11:38:39AM -0800, Laurent Therond wrote:
            > Maybe you have a minute to clarify the following matter...
            >
            > Consider:
            >
            > ---
            >
            > from cStringIO import StringIO
            >
            > def bencode_rec(x, b):
            >     t = type(x)
            >
            >     if t is str:
            >         b.write('%d:%s' % (len(x), x))
            >     else:
            >         assert 0
            >
            > def bencode(x):
            >     b = StringIO()
            >
            >     bencode_rec(x, b)
            >
            >     return b.getvalue()
            >
            > ---
            >
            > Now, if I write bencode('failure reason') into a socket, what will I get
            > on the other side of the connection?
            >
            > a) A sequence of bytes where each byte represents an ASCII character

            Yes.

            > b) A sequence of bytes where each byte represents the UTF-8 encoding of a
            > Unicode character

            Coincidentally, yes. This is not because the unicode you wrote to the
            socket is encoded as UTF-8 before it is sent, but because the *non*-unicode
            you wrote to the socket *happened* to be a valid UTF-8 byte string (All
            ASCII byte strings fall into this coincidental case).
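The coincidence described above can be checked directly. This is a minimal sketch in Python 3 (the thread's code targets Python 2, where 'failure reason' is already a byte string); the variable names are illustrative only.

```python
# Python 3 sketch: text must be encoded explicitly to obtain bytes.
text = "failure reason"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure-ASCII text the two encodings yield identical bytes.
print(ascii_bytes == utf8_bytes)

# Decoding the ASCII bytes as UTF-8 round-trips without error.
print(ascii_bytes.decode("utf-8") == text)
```

The equality holds only because every character in the text is in the 7-bit ASCII range; it says nothing about strings that contain other characters.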

            > c) It depends on the system locale/it depends on what the site module
            > specifies using setdefaultencoding(name)

            Not at all. 'failure reason' isn't unicode, there are no unicode
            transformations going on in the example program, the default encoding is
            never used and has no effect on the program's behavior.

            bencode_rec has an assert in it for a reason. *Only* byte strings can be
            sent using it. If you want to send unicode, you'll have to encode it
            yourself and send the encoded bytes, then decode it on the other end. If
            you choose to depend on the default system encoding, you'll probably end up
            with problems, but if you explicitly select an encoding yourself, you won't.
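A rough sketch of the "encode it yourself, decode it on the other end" advice, written for Python 3. The names WIRE_ENCODING, bencode_bytes and bdecode_bytes are hypothetical helpers, not part of the original code, and the choice of UTF-8 is an assumption that both peers would have to agree on.

```python
from io import BytesIO

WIRE_ENCODING = "utf-8"  # assumption: both peers agree on this in advance


def bencode_bytes(data: bytes) -> bytes:
    """Length-prefix a byte string, mirroring the thread's bencode()."""
    if not isinstance(data, bytes):
        raise TypeError("only byte strings can be sent")
    buf = BytesIO()
    buf.write(b"%d:" % len(data))  # the prefix counts bytes, not characters
    buf.write(data)
    return buf.getvalue()


def bdecode_bytes(payload: bytes) -> bytes:
    """Inverse of bencode_bytes for a single string (hypothetical helper)."""
    prefix, _, rest = payload.partition(b":")
    length = int(prefix.decode("ascii"))
    return rest[:length]


# Sender: encode the text explicitly, then frame the resulting bytes.
message = "string\u00e9"
wire = bencode_bytes(message.encode(WIRE_ENCODING))

# Receiver: unframe the bytes, then decode with the same agreed encoding.
received = bdecode_bytes(wire).decode(WIRE_ENCODING)
print(wire, received == message)
```

Note that the length prefix is computed after encoding: the seven-character text becomes eight bytes in UTF-8, and it is the byte count that the receiver needs.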

            Jp


            • Martin v. Löwis

              #7
              Re: Binary strings, unicode and encodings

              Laurent Therond wrote:

              > Now, if I write bencode('failure reason') into a socket, what will I get
              > on the other side of the connection?

              Jp has already explained this, but let me stress his observations.

              > a) A sequence of bytes where each byte represents an ASCII character

              A sequence of bytes, period. 'failure reason' is a byte string. The
              bytes in this string are literally copied from the source code .py file
              to the cStringIO object.

              If your source code was in an encoding that is an ASCII superset
              (such as ascii, iso-8859-1, cp1252), then yes: the text 'failure reason'
              will come out as a byte string representing ASCII characters.

              Python has a second, independent string type, called unicode. Literals
              of that type are not simply written in quotes, but with a leading u''.

              You should never use the unicode type in a place where byte strings
              are expected. Python will apply the system default encoding to these,
              which gives exceptions if the Unicode characters are outside the
              characters supported in the system default encoding (which is us-ascii).
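The failure mode described above can be reproduced with an explicit call. This is a Python 3 sketch: Python 2 performs the conversion implicitly with the default encoding, whereas Python 3 refuses to mix the two types, so the ASCII encoding step is spelled out here as an assumption about what the implicit conversion amounts to when the default is us-ascii.

```python
text = "string\u00e9"  # ends in LATIN SMALL LETTER E WITH ACUTE

failed_at = None
try:
    # Explicit equivalent of the implicit conversion under an ASCII default.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the first character that cannot be encoded
    print("cannot encode:", exc.reason)
```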

              You also should avoid byte string literals with non-ASCII characters
              such as 'stringé'; use unicode literals. The user invoking your script
              may use a different encoding on his system, so he would get mojibake,
              as the last character in the string literal does *not* denote
              LATIN SMALL LETTER E WITH ACUTE, but instead denotes the byte '\xe9'
              (which is that character only if you use a latin-1-like encoding).
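To illustrate why a byte does not identify a character by itself, here is a small Python 3 sketch that encodes the same character with several codecs and decodes the same byte two ways. The codec list is just a sample chosen for illustration.

```python
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The same character maps to different bytes depending on the codec.
for codec in ("latin-1", "cp1252", "cp437", "utf-8"):
    print(codec, e_acute.encode(codec))

# The same byte maps to different characters depending on the codec.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")
print(as_latin1 == e_acute, as_cp437 == e_acute)
```

In cp437, the code page that appears in the traceback earlier in the thread, the character is stored as the byte 0x82, which matches the byte reported in the UnicodeDecodeError.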

              HTH,
              Martin


              • Peter Hansen

                #8
                Re: Binary strings, unicode and encodings

                Laurent Therond wrote:
                >
                > So, your test revealed that StringIO converts to byte strings.
                > Does that mean:
                > - If the input string contains characters that cannot be encoded
                > in ASCII, bencode_rec will fail?

                Yes, if your default encoding is ASCII.

                > Yet, if your locale specifies UTF-8 as the default encoding, it should
                > not fail, right?

                True, provided you are actually creating UTF-8 strings... just sticking
                in a character that has the 8th bit set doesn't mean the string is UTF-8
                of course.

                > Hence, I conclude your test was made on a system that uses ASCII/ISO
                > 8859-1 as its default encoding. Is that right?

                Correct, Windows 98, sys.getdefaultencoding() returns 'ascii'.

                > > > c) It depends on the system locale/it depends on what the site module
                > > > specifies using setdefaultencoding(name)
                > >
                > > Yes, as it always does if you are using Unicode but converting to byte strings
                > > as it appears StringIO does.
                >
                > Umm...not sure here...I think StringIO must behave differently
                > depending on your locale and depending on how you assigned the string.

                It's always possible that StringIO takes locale into account in some
                special way, but I suspect it does not. As for "how you assigned the string"
                I'm not sure I understand what that might mean. How many ways do you know
                to assign a string in Python?

                -Peter


                • Peter Hansen

                  #9
                  Re: Binary strings, unicode and encodings

                  Laurent Therond wrote:
                  >
                  > I forgot to ask something else...
                  >
                  > If a client and a server run on locales/platforms that use different
                  > encodings, they are bound to wrongly interpret string bytes. Correct?

                  Since the byte strings are by definition *encoded* forms of the Unicode
                  data, they definitely need to have a shared frame of reference or they
                  will misinterpret the data, as you surmise. You can't decode something
                  if you don't know how it was encoded.
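A short Python 3 sketch of what happens without a shared frame of reference. The sample text and the two codecs are assumptions picked for illustration: the sender encodes as UTF-8 and the receiver guesses ISO 8859-1.

```python
sent_text = "caf\u00e9"
wire_bytes = sent_text.encode("utf-8")   # sender's choice of encoding

wrong = wire_bytes.decode("latin-1")     # receiver assumes a different encoding
right = wire_bytes.decode("utf-8")       # receiver uses the agreed encoding
print(repr(wrong), repr(right))
```

The mismatched decode raises no exception; it silently yields different text, which is why the encoding has to be agreed on rather than discovered from errors.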

                  -Peter


                  • Peter Hansen

                    #10
                    Re: Binary strings, unicode and encodings

                    Laurent Therond wrote:
                    >
                    > I used the interpreter on my system:
                    > >>> c = StringIO()
                    > >>> c.write('%d:%s' % (len('stringé'), 'stringé'))
                    > >>> print c.getvalue()
                    > 7:stringé
                    >
                    > OK
                    >
                    > Did StringIO just recognize Extended ASCII?
                    > Did StringIO just recognize ISO 8859-1?
                    >
                    > é belongs to Extended ASCII AND ISO 8859-1.

                    No, StringIO didn't "recognize" anything but a simple string. There is
                    no issue of codecs and encoding and such going on here, because you are
                    sending in a string (as it happens, one that's not 8-bit clean, but that's
                    irrelevant though it may be the cause of your confusion) and getting out
                    a string. StringIO does not make any attempt to "encode" something that
                    is already a string.
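The "bytes in, bytes out" behaviour can be sketched in Python 3 with io.BytesIO, used here as a stand-in for the Python 2 cStringIO module in the thread. The payload reuses the byte 0x82 reported in the traceback.

```python
from io import BytesIO

payload = b"7:string\x82"  # 0x82 is the byte reported in the thread's traceback

buf = BytesIO()
buf.write(payload)

# The buffer returns exactly the bytes it was given; no codec is involved.
print(buf.getvalue() == payload)
```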

                    > >>> print c.getvalue().decode('US-ASCII')
                    > Traceback (most recent call last):
                    >   File "<stdin>", line 1, in ?
                    > UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 8: ordinal not in range(128)
                    >
                    > >>> print c.getvalue().decode('ISO-8859-1')
                    > Traceback (most recent call last):
                    >   File "<stdin>", line 1, in ?
                    >   File "C:\Python23\lib\encodings\cp437.py", line 18, in encode
                    >     return codecs.charmap_encode(input,errors,encoding_map)
                    > UnicodeEncodeError: 'charmap' codec can't encode character u'\x82' in position 8: character maps to <undefined>
                    > >>>
                    >
                    > OK
                    >
                    > It must have been Extended ASCII, then.

                    Hmm... note that when you decode that string, you are
                    attempting to print a unicode rather than a string. When you try to
                    print that on your console, it must be encoded again using the
                    console's encoding (cp437, judging by your traceback). I think you know
                    this, but in case you didn't: it explains why you got a DecodeError in
                    the first case, but an EncodeError in the second. The decode in the
                    second example worked, treating the string as having been encoded using
                    ISO-8859-1, and returned a unicode. If you had assigned it instead of
                    printing it, you should have seen no errors.
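The two different errors can be separated into their individual steps. This is a Python 3 sketch under the assumption that the console codec was cp437, as the file name in the traceback suggests; the explicit encode call stands in for what print did implicitly.

```python
data = b"7:string\x82"

# Step 1: decoding as ASCII fails on the byte 0x82.
decode_error_pos = None
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error_pos = exc.start

# Step 2: decoding as ISO-8859-1 succeeds, since every byte value is defined.
text = data.decode("iso-8859-1")

# Step 3: encoding the result for a cp437 console fails,
# because U+0082 has no mapping in that code page.
encode_error_pos = None
try:
    text.encode("cp437")
except UnicodeEncodeError as exc:
    encode_error_pos = exc.start

print(decode_error_pos, encode_error_pos)
```

Assigning the result of step 2 to a variable, as suggested above, stops before step 3 and therefore raises nothing.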

                    -Peter


                    • Laurent Therond

                      #11
                      Re: Binary strings, unicode and encodings

                      Peter, thank you for taking the time to answer.

                      I will need some time to digest this information.

                      From where I stand, a Python newbie who knows more about Java, this
                      concept of binary string is puzzling. I wish Python dealt in Unicode
                      natively, as Java does. It makes things a lot easier to comprehend.
                      Having strings be byte arrays, on the other hand, seems to confuse me.


                      • Serge Orlov

                        #12
                        Re: Binary strings, unicode and encodings


                        "Laurent Therond" <google@axiomatize.com> wrote in message news:265368cb.0401161604.58099d89@posting.google.com...
                        > Peter, thank you for taking the time to answer.
                        >
                        > I will need some time to digest this information.
                        >
                        > From where I stand, a Python newbie who knows more about Java, this
                        > concept of binary string is puzzling. I wish Python dealt in Unicode
                        > natively, as Java does. It makes things a lot easier to comprehend

                        Python does deal with Unicode natively. You just need to put a u
                        character before the string. This is of course a violation of the rule
                        "There should be one-- and preferably only one --obvious way to do it."
                        'a' == u'a'. But remember that Python appeared before Unicode,
                        so strings in Python could not be unicode strings from the beginning.

                        > Having strings be byte arrays, on the other hand, seems to confuse me.

                        Use unicode strings only.

                        -- Serge.

