Why does the "".join(r) do this?

  • Jim Hefferon

    Why does the "".join(r) do this?

    Hello,

    I'm getting an error join-ing strings and wonder if someone can
    explain why the function is behaving this way? If I .join in a string
    that contains a high character then I get an ascii codec decoding
    error. (The code below illustrates.) Why doesn't it just
    concatenate?

    I'm building up a web page by stuffing an array and then doing
    "".join(r) at
    the end. I intend to later encode it as 'latin1', so I'd like it to
    just concatenate. While I can work around this error, the reason for
    it escapes me.

    Thanks,
    Jim

    ================= program: try.py
    #!/usr/bin/python2.3 -u
    t="abc"+chr(174)+"def"
    print(u"next: %s :there" % (t.decode('latin1'),))
    print t
    r=["x",'y',u'z']
    r.append(t)
    k="".join(r)
    print k

    ================== command line (on my screen between the first abc
    and def is a circle-R, while between the second two is a black oval
    with a white question mark, in case anyone cares):
    jim@joshua:~$ ./try.py
    next: abc®def :there
    abc�def
    Traceback (most recent call last):
      File "./try.py", line 7, in ?
        k="".join(r)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 3: ordinal not in range(128)
  • Peter Hansen

    #2
    Re: Why does the "".join(r) do this?

    Jim Hefferon wrote:

    > I'm getting an error join-ing strings and wonder if someone can
    > explain why the function is behaving this way? If I .join in a string
    > that contains a high character then I get an ascii codec decoding
    > error. (The code below illustrates.) Why doesn't it just
    > concatenate?

    It can't just concatenate because your list contains other
    items which are unicode strings. Python is attempting to convert
    your strings to unicode strings to do the join, and it fails
    because your strings contain characters which don't have
    meaning to the default decoder.

    -Peter
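
A minimal sketch of the behaviour Peter describes, written for Python 3, where the thread's `str` corresponds to `bytes` and `unicode` to `str`. The helper `py2_style_join` is hypothetical: it only imitates the Python 2 rule (decode every byte string with the default codec as soon as one item is text) and is not how the interpreter implements join.

```python
def py2_style_join(sep, items, default_encoding="ascii"):
    """Hypothetical helper imitating Python 2's join over mixed items.

    If every item is bytes, the result is bytes (plain concatenation).
    If any item is text, every bytes item is first decoded with the
    default codec, which is where the error in the thread comes from.
    """
    if all(isinstance(item, bytes) for item in items):
        return sep.join(items)
    text_sep = sep.decode(default_encoding)
    return text_sep.join(
        item.decode(default_encoding) if isinstance(item, bytes) else item
        for item in items
    )


t = b"abc" + bytes([174]) + b"def"  # same bytes as "abc"+chr(174)+"def"

# Bytes only: nothing needs decoding, so the pieces are just concatenated.
plain = py2_style_join(b"", [b"x", b"y", t])

# One text item forces decoding of every bytes item with the ASCII codec.
try:
    py2_style_join(b"", [b"x", b"y", "z", t])
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
```

With the single text item removed, the same list joins without any codec being involved.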


    • Skip Montanaro

      #3
      Re: Why does the "".join(r) do this?


      Jim> I'm building up a web page by stuffing an array and then doing
      Jim> "".join(r) at the end. I intend to later encode it as 'latin1', so
      Jim> I'd like it to just concatenate. While I can work around this
      Jim> error, the reason for it escapes me.

      Try

      u"".join(r)

      instead. I think the join operation is trying to convert the Unicode bits
      in your list of strings to strings by encoding using the default codec,
      which appears to be ASCII.

      Skip


      • Peter Otten

        #4
        Re: Why does the "".join(r) do this?

        Jim Hefferon wrote:

        > I'm getting an error join-ing strings and wonder if someone can
        > explain why the function is behaving this way? If I .join in a string
        > that contains a high character then I get an ascii codec decoding
        > error. (The code below illustrates.) Why doesn't it just
        > concatenate?

        Let's reduce the problem to its simplest case:

        >>> unichr(174) + chr(174)
        Traceback (most recent call last):
          File "<stdin>", line 1, in ?
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

        So why doesn't it just concatenate? Because there is no way of knowing how
        to properly decode chr(174) or any other non-ascii character to unicode:

        >>> chr(174).decode("latin1")
        u'\xae'
        >>> chr(174).decode("latin2")
        u'\u017d'

        Use either unicode or str, but don't mix them. That should keep you out of
        trouble.

        Peter


        • Peter Otten

          #5
          Re: Why does the "".join(r) do this?

          Skip Montanaro wrote:

          > Try
          >
          > u"".join(r)
          >
          > instead. I think the join operation is trying to convert the Unicode bits
          > in your list of strings to strings by encoding using the default codec,
          > which appears to be ASCII.

          This is bound to fail when the first non-ascii str occurs:

          >>> u"".join(["a", "b"])
          u'ab'
          >>> u"".join(["a", chr(174)])
          Traceback (most recent call last):
            File "<stdin>", line 1, in ?
          UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

          Apart from that, Python automatically switches to unicode if the list
          contains unicode items:

          >>> "".join(["a", u"o"])
          u'ao'

          Peter


          • moma

            #6
            Re: Why does the "".join(r) do this?

            Jim Hefferon wrote:
            > Hello,
            >
            > I'm getting an error join-ing strings and wonder if someone can
            > explain why the function is behaving this way? If I .join in a string
            > that contains a high character then I get an ascii codec decoding
            > error. (The code below illustrates.) Why doesn't it just
            > concatenate?
            >
            > I'm building up a web page by stuffing an array and then doing
            > "".join(r) at
            > the end. I intend to later encode it as 'latin1', so I'd like it to
            > just concatenate. While I can work around this error, the reason for
            > it escapes me.
            >
            > Thanks,
            > Jim
            >
            > ================= program: try.py
            > #!/usr/bin/python2.3 -u
            > t="abc"+chr(174)+"def"
            > print(u"next: %s :there" % (t.decode('latin1'),))
            > print t
            > r=["x",'y',u'z']
            > r.append(t)
            > k="".join(r)
            > print k
            >
            > ================== command line (on my screen between the first abc
            > and def is a circle-R, while between the second two is a black oval
            > with a white question mark, in case anyone cares):
            > jim@joshua:~$ ./try.py
            > next: abc®def :there
            > abc�def
            > Traceback (most recent call last):
            > File "./try.py", line 7, in ?
            > k="".join(r)
            > UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 3: ordinal not in range(128)

            What about unichr() ?


            #!/usr/bin/python2.3 -u
            t="abc"+unichr(174)+"def"
            print t
            print(u"next: %s :there" % (t),)
            print t
            r=["x",'y',u'z']
            r.append(t)
            k="".join(r)
            print k

            • moma

              #7
              Re: Why does the "".join(r) do this?

              Jim Hefferon wrote:
              > Hello,
              >
              > I'm getting an error join-ing strings and wonder if someone can
              > explain why the function is behaving this way? If I .join in a string
              > that contains a high character then I get an ascii codec decoding
              > error. (The code below illustrates.) Why doesn't it just
              > concatenate?
              >
              > I'm building up a web page by stuffing an array and then doing
              > "".join(r) at
              > the end. I intend to later encode it as 'latin1', so I'd like it to
              > just concatenate. While I can work around this error, the reason for
              > it escapes me.
              >
              > Thanks,
              > Jim
              >
              > ================= program: try.py
              > #!/usr/bin/python2.3 -u
              > t="abc"+chr(174)+"def"
              > print(u"next: %s :there" % (t.decode('latin1'),))
              > print t
              > r=["x",'y',u'z']
              > r.append(t)
              > k="".join(r)
              > print k
              >
              > ================== command line (on my screen between the first abc
              > and def is a circle-R, while between the second two is a black oval
              > with a white question mark, in case anyone cares):
              > jim@joshua:~$ ./try.py
              > next: abc®def :there
              > abc�def
              > Traceback (most recent call last):
              > File "./try.py", line 7, in ?
              > k="".join(r)
              > UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 3: ordinal not in range(128)

              What about unichr() ?


              #!/usr/bin/python2.3 -u
              t="abc"+unichr(174)+"def"
              print t
              print(u"next: %s :there" % (t),)
              print t
              r=["x",'y',u'z']
              r.append(t)
              # k=u"".join(r)
              k="".join(r)
              print k


              // moma

              • Skip Montanaro

                #8
                Re: Why does the "".join(r) do this?


                Peter> Skip Montanaro wrote:
                >> Try
                >>
                >> u"".join(r)
                >>
                >> instead. I think the join operation is trying to convert the Unicode bits
                >> in your list of strings to strings by encoding using the default codec,
                >> which appears to be ASCII.

                Peter> This is bound to fail when the first non-ascii str occurs:

                ...

                Yeah I realized that later. I missed that he was appending non-ASCII
                strings to his list. I thought he was only appending unicode objects and
                ASCII strings (in which case what he was trying should have worked). Serves
                me right for trying to respond with a head cold.

                Skip


                • Ivan Voras

                  #9
                  Re: Why does the "".join(r) do this?

                  Peter Otten wrote:

                  > Skip Montanaro wrote:
                  >
                  >>Try
                  >>
                  >> u"".join(r)
                  >>
                  >>instead. I think the join operation is trying to convert the Unicode bits
                  >>in your list of strings to strings by encoding using the default codec,
                  >>which appears to be ASCII.
                  >
                  > This is bound to fail when the first non-ascii str occurs:

                  Is there a way to change the default codec in a part of a program?
                  (Meaning that different parts of program deal with strings they know are
                  in a specific different code pages?)


                  --
                  C isn't that hard: void (*(*f[])())() defines f as an array of
                  unspecified size, of pointers to functions that return pointers to
                  functions that return void.


                  • John Roth

                    #10
                    Re: Why does the "".join(r) do this?

                    "Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
                    news:c8itrm$epg$1@bagan.srce.hr...
                    > Peter Otten wrote:
                    >
                    > > Skip Montanaro wrote:
                    > >
                    > >>Try
                    > >>
                    > >> u"".join(r)
                    > >>
                    > >>instead. I think the join operation is trying to convert the Unicode bits
                    > >>in your list of strings to strings by encoding using the default codec,
                    > >>which appears to be ASCII.
                    > >
                    > > This is bound to fail when the first non-ascii str occurs:
                    >
                    > Is there a way to change the default codec in a part of a program?
                    > (Meaning that different parts of program deal with strings they know are
                    > in a specific different code pages?)

                    Does the encoding line (1st or second line of program) do this?
                    I don't remember if it does or not - although I'd suspect not.
                    Otherwise it seems like a reasonably straightforward function
                    to write.
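
One way such a function could look, as a sketch in Python 3 terms (bytes in, text out). The names `to_text`, `from_latin1` and `from_latin2` are made up for illustration. Instead of changing a global default, each part of the program binds the code page it knows its data uses.

```python
from functools import partial


def to_text(value, encoding):
    """Decode bytes with the given codec; pass text through unchanged."""
    return value.decode(encoding) if isinstance(value, bytes) else value


# Each part of the program binds the code page it knows its data uses.
from_latin1 = partial(to_text, encoding="latin-1")
from_latin2 = partial(to_text, encoding="iso-8859-2")

raw = bytes([174])  # the byte 0xAE from the thread

registered_sign = from_latin1(raw)   # 0xAE read as Latin-1
z_with_caron = from_latin2(raw)      # the same byte read as Latin-2
already_text = from_latin1("z")      # text is returned as-is
```

Binding the codec per caller keeps the choice visible at the call site, which a process-wide default would hide.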

                    John Roth


                    • Peter Otten

                      #11
                      Re: Why does the "".join(r) do this?

                      John Roth wrote:

                      > "Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
                      > news:c8itrm$epg$1@bagan.srce.hr...

                      >> Is there a way to change the default codec in a part of a program?
                      >> (Meaning that different parts of program deal with strings they know are
                      >> in a specific different code pages?)
                      >
                      > Does the encoding line (1st or second line of program) do this?
                      > I don't remember if it does or not - although I'd suspect not.
                      > Otherwise it seems like a reasonably straightforward function
                      > to write.

                      As a str does not preserve information about the encoding, the
                      # -*- coding: XXX -*-
                      comment does not help here. It does however control the decoding of unicode
                      strings. I suppose using unicode for non-ascii literals plus the above
                      coding comment is as close as you can get to the desired effect.

                      With some more work you could probably automate string conversion like it is
                      done with quixote's htmltext. Not sure if that would be worth the effort,
                      though.

                      Peter
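
A small sketch of the coding comment's effect, assuming Python 3, where every string literal is text and so behaves like the thread's unicode literals. The helper `literal_from_source` is hypothetical; it compiles a tiny module from raw bytes so that the only thing that varies is the declared source encoding.

```python
def literal_from_source(coding):
    """Compile a two-line module whose literal holds the raw byte 0xAE.

    The coding comment on the first line tells the compiler how to
    decode the source bytes, and therefore which character the literal
    ends up containing.
    """
    source = b"# -*- coding: " + coding.encode("ascii") + b" -*-\ns = '\xae'\n"
    namespace = {}
    exec(compile(source, "<sketch>", "exec"), namespace)
    return namespace["s"]


as_latin1 = literal_from_source("latin-1")      # byte read as Latin-1
as_latin2 = literal_from_source("iso-8859-2")   # same byte read as Latin-2
```

The declaration only affects literals in that source file; byte strings arriving at run time, such as filenames, carry no such information.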


                      • Jim Hefferon

                        #12
                        Re: Why does the "".join(r) do this?

                        Peter Otten <__peter__@web.de> wrote
                        > So why doesn't it just concatenate? Because there is no way of knowing how
                        > to properly decode chr(174) or any other non-ascii character to unicode:
                        >
                        > >>> chr(174).decode("latin1")
                        > u'\xae'
                        > >>> chr(174).decode("latin2")
                        > u'\u017d'

                        Forgive me, Peter, but you've only rephrased my question: I'm going to
                        decode them later, so why does the concatenator insist on decoding
                        them now? As I understand it (perhaps this is my error),
                        encoding/decoding is stuff that you do external to manipulating the
                        arrays of characters.

                        > Use either unicode or str, but don't mix them. That should keep you out of
                        > trouble.

                        Well, I got this string as the filename of some kind of Macintosh file
                        (I'm on Linux but I'm working with an archive that contains some pre-X
                        Mac stuff) while calling some os and os.path functions. So I'm taking
                        strings from a Python library function (and using % to stuff them into
                        strings that will end up on the web, which should preserve
                        unicode-type-ness, right?) and then .join-ing them.

                        I didn't go into the whole story when posting, because I tried to boil
                        the question down. Perhaps I should have.

                        Thanks; I am often struck by how helpful this group is,
                        Jim


                        • John Roth

                          #13
                          Re: Why does the "".join(r) do this?

                          "Jim Hefferon" <jhefferon@smcvt.edu> wrote in message
                          news:545cb8c2.0405201645.16ac3364@posting.google.com...
                          > Peter Otten <__peter__@web.de> wrote
                          > > So why doesn't it just concatenate? Because there is no way of knowing how
                          > > to properly decode chr(174) or any other non-ascii character to unicode:
                          > >
                          > > >>> chr(174).decode("latin1")
                          > > u'\xae'
                          > > >>> chr(174).decode("latin2")
                          > > u'\u017d'
                          >
                          > Forgive me, Peter, but you've only rephrased my question: I'm going to
                          > decode them later, so why does the concatenator insist on decoding
                          > them now? As I understand it (perhaps this is my error),
                          > encoding/decoding is stuff that you do external to manipulating the
                          > arrays of characters.

                          Maybe I can simplify it? The result has to be in a single encoding,
                          which will be UTF-8 if any of the strings is a unicode string.
                          Ascii-7 is a proper subset of UTF-8, so there is no difficulty with
                          the concatenation. 8-bit encodings are not, so the concatenation
                          checks that any normal strings are, in fact, Ascii-7. The encoding
                          is actually doing the validity check, not an encoding conversion.

                          The only way the system could do a clean concatenation between
                          unicode and one of the 8-bit encodings is to know beforehand which
                          of the 8-bit encodings it is dealing with, and there is no way that it
                          currently has of knowing that.

                          The people who implemented unicode (in 2.0, I believe) seem to
                          have decided not to guess. That's in line with the "explicit is better
                          than implicit" principle.

                          > > Use either unicode or str, but don't mix them. That should keep you out of
                          > > trouble.
                          >
                          > Well, I got this string as the filename of some kind of Macintosh file
                          > (I'm on Linux but I'm working with an archive that contains some pre-X
                          > Mac stuff) while calling some os and os.path functions. So I'm taking
                          > strings from a Python library function (and using % to stuff them into
                          > strings that will end up on the web, which should preserve
                          > unicode-type-ness, right?) and then .join-ing them.

                          Ah. The issue then is rather simple: what is the encoding of the normal
                          strings? I'd presume Latin-1. So simply run the list of strings through a
                          function that converts any normal string to unicode using the Latin-1
                          codec, and then they should concatenate fine.

                          As far as the web goes, I'd suggest you make sure you specify UTF-8
                          in both the HTTP headers and in a <meta> tag in the HTML header,
                          and make sure that what you write out is, indeed, UTF-8.

                          John Roth
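
A rough sketch of the conversion pass John suggests, in Python 3 terms (bytes for the thread's normal strings, `str` for unicode). The helper name `as_unicode` and the sample list are illustrative, and Latin-1 is only John's presumption about the data.

```python
def as_unicode(item, encoding="latin-1"):
    """Decode byte strings with the presumed codec; leave text untouched."""
    return item.decode(encoding) if isinstance(item, bytes) else item


# Same shape as the list in the original post: byte strings plus one text item.
r = [b"x", b"y", "z", b"abc\xaedef"]

# Convert everything to text first, then join; no implicit decoding happens.
page = "".join(as_unicode(item) for item in r)

# Encode once, at the point where the page is written out.
body = page.encode("utf-8")
```

Decoding on the way in and encoding on the way out keeps the join itself free of codec decisions.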

                          > I didn't go into the whole story when posting, because I tried to boil
                          > the question down. Perhaps I should have.
                          >
                          > Thanks; I am often struck by how helpful this group is,
                          > Jim


                          • Erik Max Francis

                            #14
                            Re: Why does the "".join(r) do this?

                            Jim Hefferon wrote:

                            > Forgive me, Peter, but you've only rephrased my question: I'm going to
                            > decode them later, so why does the concatenator insist on decoding
                            > them now?

                            Because you're mixing normal strings and Unicode strings. To do that,
                            it needs to convert the normal strings to Unicode, and to do that it has
                            to know what encoding you want.

                            > As I understand it (perhaps this is my error),
                            > encoding/decoding is stuff that you do external to manipulating the
                            > arrays of characters.

                            It's the process by which you turn an arbitrary string into a Unicode
                            string and back. When you're adding normal strings and Unicode strings,
                            you end up with a Unicode string, which means the normal strings have to
                            be implicitly converted. That's why you're getting the error.

                            Work with strings or Unicode strings, not a mixture, and you won't have
                            this problem.
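
To illustrate the "not a mixture" advice, here is a short sketch in Python 3, where the two types are `bytes` and `str` and mixing them is rejected outright instead of triggering an implicit decode. The variable names are illustrative.

```python
# Byte strings only: join is pure concatenation, high bytes included.
pieces = [b"x", b"y", b"z", b"abc\xaedef"]
joined = b"".join(pieces)

# Mixing the two types is where the trouble starts; Python 3 refuses it.
mixed_failed = False
try:
    "z" + b"abc\xaedef"
except TypeError:
    mixed_failed = True

# Decode once, explicitly, when text is actually needed.
text = joined.decode("latin-1")
```

This is the "just concatenate" behaviour the original post expected; it holds as long as no text item is in the list.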

                            --
                            __ Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
                            / \ San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
                            \__/ She glanced at her watch ... It was 9:23.
                            -- James Clavell


                            • Peter Otten

                              #15
                              Re: Why does the "".join(r) do this?

                              Jim Hefferon wrote:

                              > Peter Otten <__peter__@web.de> wrote
                              >> So why doesn't it just concatenate? Because there is no way of knowing
                              >> how to properly decode chr(174) or any other non-ascii character to
                              >> unicode:
                              >>
                              >> >>> chr(174).decode("latin1")
                              >> u'\xae'
                              >> >>> chr(174).decode("latin2")
                              >> u'\u017d'
                              >
                              > Forgive me, Peter, but you've only rephrased my question: I'm going to
                              > decode them later, so why does the concatenator insist on decoding
                              > them now? As I understand it (perhaps this is my error),
                              > encoding/decoding is stuff that you do external to manipulating the
                              > arrays of characters.

                              Perhaps another example will help in addition to the answers already given:

                              >>> 1 + 2.0
                              3.0

                              In the above 1 is converted to 1.0 before it can be added to 2.0, i.e. we
                              have

                              >>> float(1) + 2.0
                              3.0

                              In the same spirit

                              >>> u"a" + "b"
                              u'ab'

                              "b" is converted to unicode before u"a" and u"b" can be concatenated. The
                              same goes for string formatting:

                              >>> "a%s" % u"b"
                              u'ab'
                              >>> u"a%s" % "b"
                              u'ab'

                              The following might be the conversion function:

                              >>> def tounicode(s, encoding="ascii"):
                              ...     return s.decode(encoding)
                              ...
                              >>> u"a" + tounicode("b")
                              u'ab'

                              Of course it would fail with non-ascii characters in the string that shall
                              be converted. Why not allow strings with all 256 chars? Again, as stated in
                              my above post, that would be ambiguous:

                              >>> u"a" + tounicode(chr(174), "latin1")
                              u'a\xae'
                              >>> u"a" + tounicode(chr(174), "latin2")
                              u'a\u017d'

                              By the way, in the real conversion routine the encoding isn't hardcoded, see
                              sys.get/setdefaultencoding() for the details. Therefore you _could_ modify
                              site.py to assume e.g. latin1 as the encoding of 8 bit strings. The
                              practical benefit of that is limited as you cannot make assumptions about
                              machines not under your control and therefore are stuck with ascii as the
                              least common denominator for scripts meant to be portable - which brings us
                              back to:

                              >> Use either unicode or str, but don't mix them. That should keep you out
                              >> of trouble.

                              Or make all conversions explicit with the str.decode()/unicode.encode()
                              methods.
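
As a sketch of the explicit route on the output side, assuming Python 3 (`str.encode` plays the role of the thread's `unicode.encode`) and a page that must be served as Latin-1. The sample text and variable names are illustrative; `xmlcharrefreplace` is a standard codec error handler that suits HTML output.

```python
# Text containing one character Latin-1 can represent (U+00AE)
# and one it cannot (U+017D).
text = "abc\u00aedef \u017d"

# Strict encoding refuses the character that has no Latin-1 byte.
strict_failed = False
try:
    text.encode("latin-1")
except UnicodeEncodeError:
    strict_failed = True

# For a web page, unencodable characters can become numeric references.
latin1_page = text.encode("latin-1", errors="xmlcharrefreplace")
```

Choosing the error handler at the encode call makes the policy for unrepresentable characters a visible decision instead of a surprise at join time.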

                              > Well, I got this string as the filename of some kind of Macintosh file
                              > (I'm on Linux but I'm working with an archive that contains some pre-X
                              > Mac stuff) while calling some os and os.path functions. So I'm taking
                              > strings from a Python library function (and using % to stuff them into
                              > strings that will end up on the web, which should preserve
                              > unicode-type-ness, right?) and then .join-ing them.
                              >
                              > I didn't go into the whole story when posting, because I tried to boil
                              > the question down. Perhaps I should have.

                              While details are often helpful to identify a problem that is different from
                              the poster's guess, unicode handling is pretty general, and it was rather
                              my post that was lacking clarity.

                              Peter
