Why does the "".join(r) do this?

**Peter Hansen** · Jul 18 '05, 11:03 AM

Re: Why does the "".join(r) do this?

Jim Hefferon wrote:

> I'm getting an error join-ing strings and wonder if someone can
> explain why the function is behaving this way? If I .join in a string
> that contains a high character then I get an ascii codec decoding
> error. (The code below illustrates.) Why doesn't it just
> concatenate?

It can't just concatenate because your list contains other
items which are unicode strings. Python is attempting to convert
your strings to unicode strings to do the join, and it fails
because your strings contain characters which don't have
meaning to the default decoder.
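
A minimal sketch of the step the join performs behind the scenes, written with an explicit byte-string literal so that it also runs on interpreters newer than the 2.3 used in the thread (on Python 3 the implicit conversion is gone and mixing the two types raises a TypeError instead). The names `raw`, `error` and `text` are illustrative and not taken from the original program.

```python
# The byte string from the original program: "abc" + chr(174) + "def".
raw = b"abc\xaedef"

try:
    # In Python 2, joining str and unicode items implicitly decodes each
    # str item with the default codec (ASCII). This line does that by hand.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # Byte 0xae at index 3 has no meaning in ASCII, so decoding fails.
    error = exc
    print(exc)

# Naming the codec explicitly removes the guesswork.
text = raw.decode("latin1")
print(repr(text))
```

The failing position reported by the exception matches the traceback in the original post, because the offending byte sits at index 3 of the string.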

-Peter

**Skip Montanaro** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Jim> I'm building up a web page by stuffing an array and then doing
Jim> "".join(r) at the end. I intend to later encode it as 'latin1', so
Jim> I'd like it to just concatenate. While I can work around this
Jim> error, the reason for it escapes me.

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

Skip

**Peter Otten** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Jim Hefferon wrote:

> I'm getting an error join-ing strings and wonder if someone can
> explain why the function is behaving this way? If I .join in a string
> that contains a high character then I get an ascii codec decoding
> error. (The code below illustrates.) Why doesn't it just
> concatenate?

Let's reduce the problem to its simplest case:

>>> unichr(174) + chr(174)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:

>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'

Use either unicode or str, but don't mix them. That should keep you out of
trouble.

Peter

**Peter Otten** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Skip Montanaro wrote:

> Try
>
> u"".join(r)
>
> instead. I think the join operation is trying to convert the Unicode bits
> in your list of strings to strings by encoding using the default codec,
> which appears to be ASCII.

This is bound to fail when the first non-ascii str occurs:

>>> u"".join(["a", "b"])
u'ab'
>>> u"".join(["a", chr(174)])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

Apart from that, Python automatically switches to unicode if the list
contains unicode items:

>>> "".join(["a", u"o"])
u'ao'

Peter

**moma** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Jim Hefferon wrote:
> Hello,
>
> I'm getting an error join-ing strings and wonder if someone can
> explain why the function is behaving this way? If I .join in a string
> that contains a high character then I get an ascii codec decoding
> error. (The code below illustrates.) Why doesn't it just
> concatenate?
>
> I'm building up a web page by stuffing an array and then doing
> "".join(r) at the end. I intend to later encode it as 'latin1', so
> I'd like it to just concatenate. While I can work around this error,
> the reason for it escapes me.
>
> Thanks,
> Jim
>
> ================= program: try.py
> #!/usr/bin/python2.3 -u
> t="abc"+chr(174)+"def"
> print(u"next: %s :there" % (t.decode('latin1'),))
> print t
> r=["x",'y',u'z']
> r.append(t)
> k="".join(r)
> print k
>
> ================== command line (on my screen between the first abc
> and def is a circle-R, while between the second two is a black oval
> with a white question mark, in case anyone cares):
> jim@joshua:~$ ./try.py
> next: abc®def :there
> abc�def
> Traceback (most recent call last):
>   File "./try.py", line 7, in ?
>     k="".join(r)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
> 3: ordinal not in range(128)

What about unichr() ?

#!/usr/bin/python2.3 -u
t="abc"+unichr(174)+"def"
print t
print(u"next: %s :there" % (t,))
print t
r=["x",'y',u'z']
r.append(t)
# k=u"".join(r)
k="".join(r)
print k

// moma


**Skip Montanaro** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Peter> Skip Montanaro wrote:
>> Try
>>
>> u"".join(r)
>>
>> instead. I think the join operation is trying to convert the Unicode bits
>> in your list of strings to strings by encoding using the default codec,
>> which appears to be ASCII.

Peter> This is bound to fail when the first non-ascii str occurs:

...

Yeah I realized that later. I missed that he was appending non-ASCII
strings to his list. I thought he was only appending unicode objects and
ASCII strings (in which case what he was trying should have worked). Serves
me right for trying to respond with a head cold.

Skip

**Ivan Voras** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Peter Otten wrote:

> Skip Montanaro wrote:
>
>> Try
>>
>> u"".join(r)
>>
>> instead. I think the join operation is trying to convert the Unicode bits
>> in your list of strings to strings by encoding using the default codec,
>> which appears to be ASCII.
>
> This is bound to fail when the first non-ascii str occurs:

Is there a way to change the default codec in a part of a program?
(Meaning that different parts of the program deal with strings they know
are in specific, different code pages.)

--
C isn't that hard: void (*(*f[])())() defines f as an array of
unspecified size, of pointers to functions that return pointers to
functions that return void.

**John Roth** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

"Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
news:c8itrm$epg$1@bagan.srce.hr...
> Peter Otten wrote:
>
> > Skip Montanaro wrote:
> >
> >> Try
> >>
> >> u"".join(r)
> >>
> >> instead. I think the join operation is trying to convert the Unicode bits
> >> in your list of strings to strings by encoding using the default codec,
> >> which appears to be ASCII.
> >
> > This is bound to fail when the first non-ascii str occurs:
>
> Is there a way to change the default codec in a part of a program?
> (Meaning that different parts of program deal with strings they know are
> in a specific different code pages?)

Does the encoding line (1st or second line of program) do this?
I don't remember if it does or not - although I'd suspect not.
Otherwise it seems like a reasonably straightforward function
to write.

John Roth
>
>
> --
> C isn't that hard: void (*(*f[])())() defines f as an array of
> unspecified size, of pointers to functions that return pointers to
> functions that return void.

**Peter Otten** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

John Roth wrote:

> "Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
> news:c8itrm$epg$1@bagan.srce.hr...
>
>> Is there a way to change the default codec in a part of a program?
>> (Meaning that different parts of program deal with strings they know are
>> in a specific different code pages?)
>
> Does the encoding line (1st or second line of program) do this?
> I don't remember if it does or not - although I'd suspect not.
> Otherwise it seems like a reasonably straightforward function
> to write.

As a str does not preserve information about its encoding, the
# -*- coding: XXX -*-
comment does not help here. It does, however, control how unicode literals
in the source file are decoded. I suppose using unicode for non-ascii
literals plus the above coding comment is as close as you can get to the
desired effect.

With some more work you could probably automate string conversion like it is
done with quixote's htmltext. Not sure if that would be worth the effort,
though.
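
One way to approximate "a default codec per part of the program" without touching any global setting is to bind the codec into a small decoding function that each part uses. This is only a sketch: `decode_with`, `decode_latin1` and `decode_latin2` are hypothetical names, and byte-string literals are used so that it also runs on newer interpreters.

```python
from functools import partial


def decode_with(encoding, data):
    """Decode a byte string with an explicitly named codec."""
    return data.decode(encoding)


# Each part of the program binds the codec it knows its own data uses.
decode_latin1 = partial(decode_with, "latin1")
decode_latin2 = partial(decode_with, "latin2")

raw = b"\xae"
print(repr(decode_latin1(raw)))  # the registered sign, U+00AE
print(repr(decode_latin2(raw)))  # capital Z with caron, U+017D
```

The same byte yields two different characters, which is why each part has to say which codec it means rather than rely on a shared default.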

Peter

**Jim Hefferon** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Peter Otten <__peter__@web.de> wrote
> So why doesn't it just concatenate? Because there is no way of knowing how
> to properly decode chr(174) or any other non-ascii character to unicode:
>
> >>> chr(174).decode("latin1")
> u'\xae'
> >>> chr(174).decode("latin2")
> u'\u017d'

Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.

> Use either unicode or str, but don't mix them. That should keep you out of
> trouble.

Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.

Thanks; I am often struck by how helpful this group is,
Jim

**John Roth** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

"Jim Hefferon" <jhefferon@smcvt.edu> wrote in message
news:545cb8c2.0405201645.16ac3364@posting.google.com...
> Peter Otten <__peter__@web.de> wrote
> > So why doesn't it just concatenate? Because there is no way of knowing how
> > to properly decode chr(174) or any other non-ascii character to unicode:
> >
> > >>> chr(174).decode("latin1")
> > u'\xae'
> > >>> chr(174).decode("latin2")
> > u'\u017d'
>
> Forgive me, Peter, but you've only rephrased my question: I'm going to
> decode them later, so why does the concatenator insist on decoding
> them now? As I understand it (perhaps this is my error),
> encoding/decoding is stuff that you do external to manipulating the
> arrays of characters.

Maybe I can simplify it? The result has to be of a single type,
which will be unicode if any of the strings is a unicode string.
7-bit ASCII maps directly onto the first 128 Unicode code points, so
there is no difficulty with the concatenation. 8-bit encodings disagree
about what the upper 128 byte values mean, so the concatenation checks
that any normal strings are, in fact, 7-bit ASCII. The ASCII decoding is
effectively a validity check rather than a real encoding conversion.

The only way the system could do a clean concatenation between
unicode and one of the 8-bit encodings is to know beforehand which
of the 8-bit encodings it is dealing with, and it currently has no
way of knowing that.

The people who implemented unicode (in 2.0, I believe) seem to
have decided not to guess. That's in line with the "explicit is better
than implicit" principle.

> > Use either unicode or str, but don't mix them. That should keep you out
> > of trouble.
>
> Well, I got this string as the filename of some kind of Macintosh file
> (I'm on Linux but I'm working with an archive that contains some pre-X
> Mac stuff) while calling some os and os.path functions. So I'm taking
> strings from a Python library function (and using % to stuff them into
> strings that will end up on the web, which should preserve
> unicode-type-ness, right?) and then .join-ing them.

Ah. The issue then is rather simple: what is the encoding of the normal
strings? I'd presume Latin-1. So simply run the list of strings through a
function that converts any normal string to unicode using the Latin-1
codec, and then they should concatenate fine.
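
A minimal sketch of such a conversion function, assuming (as presumed above, not confirmed) that the byte strings really are Latin-1. The helper name `to_text` is hypothetical, and byte-string literals are used so the sketch also runs on newer interpreters.

```python
def to_text(item, encoding="latin1"):
    """Return item as text, decoding byte strings with the given codec."""
    if isinstance(item, bytes):
        return item.decode(encoding)
    # Already a unicode string: pass it through untouched.
    return item


# A mixed list like the one in the original program.
r = [b"x", b"y", u"z", b"abc\xaedef"]

# Every item is unicode by the time the join sees it, so no implicit
# ASCII decoding takes place.
k = u"".join(to_text(item) for item in r)
print(repr(k))
```

If the file names turn out to use a different codec, only the `encoding` argument needs to change.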

As far as the web goes, I'd suggest you make sure you specify UTF-8
in both the HTTP headers and in a <meta> tag in the HTML header,
and make sure that what you write out is, indeed, UTF-8.
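
As a rough illustration of keeping the declared and the actual encoding in step, the sketch below builds a tiny page from unicode fragments, declares the charset in a `<meta>` tag, and encodes once at the end. The page fragments and the names `parts`, `meta`, `page`, `body` and `header` are made up for the example; how the header line is actually sent depends on the web server or framework in use.

```python
# Hypothetical page fragments, already unicode after decoding.
parts = [u"<p>", u"abc\xaedef", u"</p>"]

charset = "utf-8"
meta = u'<meta http-equiv="Content-Type" content="text/html; charset=%s">' % charset
page = u"".join(
    [u"<html><head>", meta, u"</head><body>"] + parts + [u"</body></html>"]
)

# Encode exactly once, at the output boundary, with the declared charset.
body = page.encode(charset)

# The same charset goes into the HTTP header line.
header = "Content-Type: text/html; charset=%s" % charset
```

Deriving the header, the `<meta>` tag and the final encode call from one `charset` variable makes it harder for the three to drift apart.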

John Roth

>
> I didn't go into the whole story when posting, because I tried to boil
> the question down. Perhaps I should have.
>
> Thanks; I am often struck by how helpful this group is,
> Jim

**Erik Max Francis** · Jul 18 '05, 11:04 AM

Re: Why does the "".join(r) do this?

Jim Hefferon wrote:

> Forgive me, Peter, but you've only rephrased my question: I'm going to
> decode them later, so why does the concatenator insist on decoding
> them now?

Because you're mixing normal strings and Unicode strings. To do that,
it needs to convert the normal strings to Unicode, and to do that it has
to know what encoding you want.

> As I understand it (perhaps this is my error),
> encoding/decoding is stuff that you do external to manipulating the
> arrays of characters.

It's the process by which you turn an arbitrary string into a Unicode
string and back. When you're adding normal strings and Unicode strings,
you end up with a Unicode string, which means the normal strings have to
be implicitly converted. That's why you're getting the error.

Work with strings or Unicode strings, not a mixture, and you won't have
this problem.

--
__ Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
/ \ San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
\__/ She glanced at her watch ... It was 9:23.
-- James Clavell

**Peter Otten** · Jul 18 '05, 11:05 AM

Re: Why does the "".join(r) do this?

Jim Hefferon wrote:

> Peter Otten <__peter__@web.de> wrote
>> So why doesn't it just concatenate? Because there is no way of knowing
>> how to properly decode chr(174) or any other non-ascii character to
>> unicode:
>>
>> >>> chr(174).decode("latin1")
>> u'\xae'
>> >>> chr(174).decode("latin2")
>> u'\u017d'
>
> Forgive me, Peter, but you've only rephrased my question: I'm going to
> decode them later, so why does the concatenator insist on decoding
> them now? As I understand it (perhaps this is my error),
> encoding/decoding is stuff that you do external to manipulating the
> arrays of characters.

Perhaps another example will help in addition to the answers already given:

>>> 1 + 2.0
3.0

In the above 1 is converted to 1.0 before it can be added to 2.0, i.e. we
have

>>> float(1) + 2.0
3.0

In the same spirit

>>> u"a" + "b"
u'ab'

"b" is converted to unicode before u"a" and u"b" can be concatenated. The
same goes for string formatting:

>>> "a%s" % u"b"
u'ab'
>>> u"a%s" % "b"
u'ab'

The following might be the conversion function:

>>> def tounicode(s, encoding="ascii"):
...     return s.decode(encoding)
...
>>> u"a" + tounicode("b")
u'ab'

Of course it would fail with non-ascii characters in the string that shall
be converted. Why not allow strings with all 256 chars? Again, as stated in
my above post, that would be ambiguous:

>>> u"a" + tounicode(chr(174), "latin1")
u'a\xae'
>>> u"a" + tounicode(chr(174), "latin2")
u'a\u017d'

By the way, in the real conversion routine the encoding isn't hardcoded; see
sys.getdefaultencoding() and sys.setdefaultencoding() for the details.
Therefore you _could_ modify site.py to assume e.g. latin1 as the encoding
of 8 bit strings. The practical benefit of that is limited, as you cannot
make assumptions about machines not under your control and therefore are
stuck with ascii as the least common denominator for scripts meant to be
portable - which brings us back to:

>> Use either unicode or str, but don't mix them. That should keep you out
>> of trouble.

Or make all conversions explicit with the str.decode()/unicode.encode()
methods.
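
A small sketch of what "explicit at both ends" could look like for the original program: decode where the bytes enter, work in unicode only, and encode once on the way out. Latin-1 is used because that is the target encoding mentioned in the original question; `raw_name` stands in for a file name obtained from the os functions and is a made-up value. Byte-string literals are used so the sketch also runs on newer interpreters.

```python
# Input boundary: bytes arrive from outside (here a hypothetical
# Latin-1 encoded file name) and are decoded immediately.
raw_name = b"abc\xaedef"
name = raw_name.decode("latin1")

# Inside the program: unicode only, so formatting and joining never
# trigger an implicit ASCII decode.
r = [u"x", u"y", u"z", u"next: %s :there" % name]
k = u"".join(r)

# Output boundary: encode once, with the codec the consumer expects.
out = k.encode("latin1")
```

Because nothing in the middle section is a byte string, the default codec never comes into play.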

> Well, I got this string as the filename of some kind of Macintosh file
> (I'm on Linux but I'm working with an archive that contains some pre-X
> Mac stuff) while calling some os and os.path functions. So I'm taking
> strings from a Python library function (and using % to stuff them into
> strings that will end up on the web, which should preserve
> unicode-type-ness, right?) and then .join-ing them.
>
> I didn't go into the whole story when posting, because I tried to boil
> the question down. Perhaps I should have.

While details are often helpful to identify a problem that is different from
the poster's guess, unicode handling is pretty general, and it was rather
my post that was lacking clarity.

Peter
