Python parsing iTunes XML/COM

**william tanksley** · Jul 29 '08, 05:55 PM

Re: Python parsing iTunes XML/COM

To ask another way: how do I convert from a file:// URL to a local
path in a standard way, so that filepaths from two different sources
will work the same way in a dictionary?

Right now I'm using the following source:

track_id = url2pathname(urlparse(track_id).path)

url2pathname is from urllib; urlparse is from the urlparse module.

The problems occur when the filenames have non-ascii characters in
them -- I suspect that the URLs are having some encoding placed on
them that Python's decoder doesn't know about.

Thank you all in advance, and thank you for Python.

-Wm
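
For later readers, here is a minimal sketch of the conversion being asked about. It is written for Python 3, which is an assumption: the thread uses Python 2, where these functions live in `urllib` and `urlparse`. The helper name `file_url_to_path` and the sample URL are made up for illustration, and the sketch relies on `unquote` treating percent-escapes as UTF-8, which may not match what every producer of such URLs does.

```python
from urllib.parse import urlparse, unquote


def file_url_to_path(url):
    """Return the percent-decoded path component of a file:// URL as text."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError("not a file:// URL: %r" % (url,))
    # In Python 3, unquote() decodes %XX escapes as UTF-8 by default.
    return unquote(parsed.path)


# Hypothetical sample URL; %C2%A0 is a UTF-8 encoded no-break space.
sample = "file://localhost/C:/Music/Annual%20Shareholders%C2%A0L.mp3"
print(repr(file_url_to_path(sample)))
```

The result still carries the leading slash in front of the drive letter; `urllib.request.url2pathname` is the standard-library call for turning that into a native path on the current platform.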

**John Machin** · Jul 29 '08, 11:55 PM

Re: Python parsing iTunes XML/COM

On Jul 30, 3:53 am, william tanksley <wtanksle...@gmail.com> wrote:

> To ask another way: how do I convert from a file:// URL to a local
> path in a standard way, so that filepaths from two different sources
> will work the same way in a dictionary?
>
> Right now I'm using the following source:
>
> track_id = url2pathname(urlparse(track_id).path)
>
> url2pathname is from urllib; urlparse is from the urlparse module.
>
> The problems occur when the filenames have non-ascii characters in
> them -- I suspect that the URLs are having some encoding placed on
> them that Python's decoder doesn't know about.

WHAT problems? WHAT non-ASCII characters?? Consider e.g.

# track_id = url2pathname(urlparse(track_id).path)
print repr(track_id)
parse_result = urlparse(track_id).path
print repr(parse_result)
track_id_replacement = url2pathname(parse_result)
print repr(track_id_replacement)

and copy/paste the results into your next posting.

**pyshib@googlemail.com** · Jul 30 '08, 09:15 AM

Re: Python parsing iTunes XML/COM

If you want to convert the file names which use standard URL encoding
(with %20 for space, etc) use:

from urllib import unquote
new_filename = unquote(filename)

I have found this does not convert character references of the form
'&#CC;', so you may have to do that manually. These are XML numeric
character references: a Unicode code point written in decimal
('&#204;') or, with an 'x' prefix, in hexadecimal ('&#xCC;').
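
A short sketch of the two kinds of escaping mentioned here, using Python 3 names (an assumption; in Python 2 `unquote` comes from `urllib`). The sample strings are invented. Note that an XML parser such as ElementTree normally resolves character references on its own, so the second step should only matter for text that never went through a parser.

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical file name fragments, chosen only to show the two escape styles.
url_escaped = "My%20Song%C2%A0Live.mp3"   # percent-escapes (URL encoding)
xml_escaped = "Caf&#233; &#xCC;.mp3"      # XML numeric character references

decoded_url = unquote(url_escaped)    # handles %XX escapes only
decoded_xml = unescape(xml_escaped)   # handles &#NNN; and &#xHH; references

print(repr(decoded_url))
print(repr(decoded_xml))
```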

**william tanksley** · Jul 30 '08, 03:05 PM

Re: Python parsing iTunes XML/COM

Thank you for the response. Here's some more info, including a little
that you didn't ask me for but which might be useful.

John Machin <sjmac...@lexicon.net> wrote:

> william tanksley <wtanksle...@gmail.com> wrote:
>
> > To ask another way: how do I convert from a file:// URL to a local
> > path in a standard way, so that filepaths from two different sources
> > will work the same way in a dictionary?
> > The problems occur when the filenames have non-ascii characters in
> > them -- I suspect that the URLs are having some encoding placed on
> > them that Python's decoder doesn't know about.
>
> # track_id = url2pathname(urlparse(track_id).path)
> print repr(track_id)
> parse_result = urlparse(track_id).path
> print repr(parse_result)
> track_id_replacement = url2pathname(parse_result)
> print repr(track_id_replacement)

The "important" value here is track_id_replacement; it contains the
data that's throwing me. It appears that some UTF-8 characters are
being read as multiple bytes by ElementTree rather than being decoded
into Unicode. Could this be a bug in ElementTree's Unicode support? If
so, can I work around it?

Here's one example. The others are similar -- they have the same
things that look like problems to me.

"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

Note some problems here:

1. This isn't Unicode; it's missing the u"" (I printed using repr).
2. It's got the UTF-8 bytes there in the middle.

I tried doing track_id.encode("utf-8"), but it doesn't seem to make
any difference at all.

Of course, my ultimate goal is to compare the track_id to the track_id
I get from iTunes' COM interface, including hashing to the same value
for dict lookups.

> and copy/paste the results into your next posting.

In addition to the above results, while trying to get more diagnostic
printouts I got the following warning from Python:

C:\projects\podcasts\podstrand\podcast.py:280: UnicodeWarning: Unicode
equal comparison failed to convert both arguments to Unicode -
interpreting them as being unequal
  return track.databaseID == trackLocation

The code that triggered this is as follows:

if trackLocation in self.podcasts:
    track = self.podcasts[trackLocation]
    if trackRelease:
        track.release_date = trackRelease
    elif track.is_podcast:
        print "No release date:", repr(track.name)
else:
    # For the sake of diagnostics, try to find the track.
    def track_has_location(track):
        return track.databaseID == trackLocation
    fillers = filter(track_has_location, self.fillers)
    if len(fillers):
        return
    disabled = filter(track_has_location, self.deferred)
    if len(disabled):
        return
    print "Location not known:", repr(trackLocation)

-Wm
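
The UnicodeWarning quoted above is what Python 2 reports when a byte string containing non-ASCII bytes is compared with a unicode string. One possible way to get both sources hashing to the same dictionary key is to pass every location through a single function that always returns text. This is only a sketch in Python 3 syntax: the name `location_key`, the UTF-8 default and the NFC normalization step are assumptions, not something established in the thread.

```python
import unicodedata


def location_key(value, encoding="utf-8"):
    """Return a text key so byte and text spellings of a path compare equal."""
    if isinstance(value, bytes):
        # Assumption: byte strings coming from the XML side are UTF-8.
        value = value.decode(encoding)
    # NFC folds composed and decomposed forms of a character together.
    return unicodedata.normalize("NFC", value)


# Hypothetical values: the same file name as bytes and as text.
from_xml = b"Annual Shareholders\xc2\xa0L.mp3"
from_com = "Annual Shareholders\xa0L.mp3"

podcasts = {location_key(from_com): "track object"}
print(location_key(from_xml) in podcasts)
```

Keys built this way are ordinary text, so the dictionary lookup and the `==` comparison in the diagnostic code would both see the same type on each side.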

**Jerry Hill** · Jul 30 '08, 03:15 PM

Re: Python parsing iTunes XML/COM

On Wed, Jul 30, 2008 at 10:58 AM, william tanksley
<wtanksleyjr@gmail.com> wrote:

> Here's one example. The others are similar -- they have the same
> things that look like problems to me.
>
> "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
>
> Note some problems here:
>
> 1. This isn't Unicode; it's missing the u"" (I printed using repr).
> 2. It's got the UTF-8 bytes there in the middle.
>
> I tried doing track_id.encode("utf-8"), but it doesn't seem to make
> any difference at all.

I don't have anything to say about your iTunes problems, but encode()
is the wrong method to turn a byte string into a unicode string.
Instead, use decode(), like this:

>>> track_id = "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
>>> utrack_id = track_id.decode('utf-8')
>>> type(utrack_id)
<type 'unicode'>
>>> print utrack_id
Buffett Time - Annual Shareholders L.mp3
>>> print repr(utrack_id)
u'Buffett Time - Annual Shareholders\xa0L.mp3'
>>>

--
Jerry
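
For readers on Python 3 (an assumption; the session above is Python 2), roughly the same experiment looks like the sketch below. There the two types are `bytes` and `str`, and `bytes` has no `encode()` method at all, so it is harder to call the wrong one.

```python
raw = b"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"  # UTF-8 bytes

text = raw.decode("utf-8")   # bytes -> text
print(type(text).__name__)
print(repr(text))

# The two bytes \xc2\xa0 became one character, U+00A0 (no-break space).
print(len(raw), len(text))

# encode() goes the other way: text -> bytes.
round_trip = text.encode("utf-8")
print(round_trip == raw)
```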

**Stefan Behnel** · Jul 30 '08, 08:55 PM

Re: Python parsing iTunes XML/COM

william tanksley wrote:

> Okay, so you decode to go from raw
> bytes into a given encoding, and you encode to go from a given encoding
> to raw bytes.

No, decoding goes from a byte sequence to a Unicode string and encoding goes
from a Unicode string to a byte sequence.

Unicode is not an encoding. A Unicode string is a character sequence, not a
byte sequence.

Stefan

**Jerry Hill** · Jul 30 '08, 09:05 PM

Re: Python parsing iTunes XML/COM

On Wed, Jul 30, 2008 at 2:27 PM, william tanksley <wtanksleyjr@gmail.com> wrote:

> Awesome... Thank you! I had my mental model of Python turned around
> backwards. That's an odd feeling. Okay, so you decode to go from raw
> bytes into a given encoding, and you encode to go from a given encoding
> to raw bytes. Not what I thought it was, but that's cool, makes sense.

That's not quite right. Decoding takes a byte string that is already
in a particular encoding and transforms it to unicode. Unicode isn't
an encoding of its own. Encoding takes a unicode string (which
doesn't have any encoding associated with it), and gives you back a
sequence of bytes in a particular encoding.

This article isn't specific to Python, but it provides a good overview
of unicode and character encodings that may be useful:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

http://www.joelonsoftware.com/articles/Unicode.html


--
Jerry

**william tanksley** · Jul 31 '08, 12:35 AM

Re: Python parsing iTunes XML/COM

"Jerry Hill" <malaclyp...@gmail.com> wrote:

> On Wed, Jul 30, 2008 at 2:27 PM, william tanksley <wtanksle...@gmail.com> wrote:
>
> > Awesome... Thank you! I had my mental model of Python turned around
> > backwards. That's an odd feeling. Okay, so you decode to go from raw
> > bytes into a given encoding, and you encode to go from a given encoding
> > to raw bytes. Not what I thought it was, but that's cool, makes sense.
>
> That's not quite right. Decoding takes a byte string that is already
> in a particular encoding and transforms it to unicode. Unicode isn't
> an encoding of its own. Encoding takes a unicode string (which
> doesn't have any encoding associated with it), and gives you back a
> sequence of bytes in a particular encoding.

Okay, this is useful. Thank you for straightening out my mental model.
It makes sense to define strings as just naturally Unicode... and
anything else is in some ways not really a string, although it's
something that might have many of the same methods. I guess this
mental model is being implemented more thoroughly in Py3K... Anyhow,
it makes sense.

I'm still puzzled why I'm getting some non-Unicode out of an
ElementTree's text, though.

> Jerry

-Wm

**william tanksley** · Jul 31 '08, 12:55 AM

Re: Python parsing iTunes XML/COM

william tanksley <wtanksle...@gmail.com> wrote:

> I'm still puzzled why I'm getting some non-Unicode out of an
> ElementTree's text, though.

Now I know.

Okay, my answer is that cElementTree (in Python 2.5) is simply
deranged when it comes to Unicode. It assumes everything's ASCII.

Reference: http://codespeak.net/lxml/compatibility.html

(Note that the lxml version also doesn't handle Unicode correctly; it
errors when XML declares its encoding.)

This is unpleasant, but at least now I know WHY it was driving me
insane.

-Wm

**Stefan Behnel** · Jul 31 '08, 06:15 AM

Re: Python parsing iTunes XML/COM

william tanksley wrote:

> william tanksley <wtanksle...@gmail.com> wrote:
>
> > I'm still puzzled why I'm getting some non-Unicode out of an
> > ElementTree's text, though.
>
> Now I know.
>
> Okay, my answer is that cElementTree (in Python 2.5) is simply
> deranged when it comes to Unicode. It assumes everything's ASCII.

It does not "assume" that. It *requires* byte strings to be ASCII. If it
didn't enforce that, how could it possibly know what encoding they were using,
i.e. what they were supposed to mean at all? Read the Zen of Python: in the face
of ambiguity, ElementTree refuses the temptation to guess. Python 2.x does
exactly the same thing when it comes to implicit conversion between encoded
strings and Unicode strings.

If you want to pass plain ASCII strings, you can either pass a byte string or
a Unicode string (that's a plain convenience feature). If you want to pass
anything that's not ASCII, you *must* pass a Unicode string.

> Reference: http://codespeak.net/lxml/compatibility.html
>
> (Note that the lxml version also doesn't handle Unicode correctly; it
> errors when XML declares its encoding.)

It definitely does "handle Unicode correctly". Let me guess, you tried passing
XML as a Unicode string into the parser, and your XML declared itself as
having a byte encoding (<?xml encoding="..."?>). How can that *not* be an error?

> This is unpleasant, but at least now I know WHY it was driving me
> insane.

You should *really* read a bit about Unicode and byte encodings. Not
understanding a topic is not a good excuse for complaining about it being
broken for you.

Stefan
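
To make the bytes-versus-text point concrete, here is a small sketch using Python 3's `xml.etree.ElementTree` (an assumption; the thread concerns cElementTree on Python 2.5). The XML fragment is invented in the style of an iTunes library entry. Given bytes, the parser honours the declared encoding and hands back decoded text at the API level.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, supplied to the parser as UTF-8 encoded bytes.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<dict><key>Name</key>"
    b"<string>Annual Shareholders\xc2\xa0L</string></dict>"
)

root = ET.fromstring(xml_bytes)
name = root.find("string").text

# The text comes back as a character string, not as UTF-8 bytes.
print(type(name).__name__, repr(name))
```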

**John Machin** · Jul 31 '08, 08:45 AM

Re: Python parsing iTunes XML/COM

On Jul 31, 12:58 am, william tanksley <wtanksle...@gmail.com> wrote:

> Thank you for the response. Here's some more info, including a little
> that you didn't ask me for but which might be useful.
>
> John Machin <sjmac...@lexicon.net> wrote:
>
> > william tanksley <wtanksle...@gmail.com> wrote:
> >
> > > To ask another way: how do I convert from a file:// URL to a local
> > > path in a standard way, so that filepaths from two different sources
> > > will work the same way in a dictionary?
> > > The problems occur when the filenames have non-ascii characters in
> > > them -- I suspect that the URLs are having some encoding placed on
> > > them that Python's decoder doesn't know about.
> >
> > # track_id = url2pathname(urlparse(track_id).path)
> > print repr(track_id)
> > parse_result = urlparse(track_id).path
> > print repr(parse_result)
> > track_id_replacement = url2pathname(parse_result)
> > print repr(track_id_replacement)
>
> The "important" value here is track_id_replacement; it contains the
> data that's throwing me. It appears that some UTF-8 characters are
> being read as multiple bytes by ElementTree rather than being decoded
> into Unicode.

Appearances can be deceptive. You present no evidence.

> Could this be a bug in ElementTree's Unicode support?

It could, yes, but the probability is extremely low.

> If so, can I work around it?
>
> Here's one example. The others are similar -- they have the same
> things that look like problems to me.
>
> "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
>
> Note some problems here:

Where?

> 1. This isn't Unicode; it's missing the u"" (I printed using repr).
> 2. It's got the UTF-8 bytes there in the middle.
>
> I tried doing track_id.encode("utf-8"), but it doesn't seem to make
> any difference at all.
>
> Of course, my ultimate goal is to compare the track_id to the track_id
> I get from iTunes' COM interface, including hashing to the same value
> for dict lookups.
>
> > and copy/paste the results into your next posting.
>
> In addition to the above results,

*WHAT* results? I don't see any repr() output, just your
interpretation of what you think you saw!

**william tanksley** · Jul 31 '08, 01:55 PM

Re: Python parsing iTunes XML/COM

John Machin <sjmac...@lexicon.net> wrote:

> william tanksley <wtanksle...@gmail.com> wrote:
>
> > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
> > 1. This isn't Unicode; it's missing the u"" (I printed using repr).
> > 2. It's got the UTF-8 bytes there in the middle.
> > In addition to the above results,
>
> *WHAT* results? I don't see any repr() output, just your
> interpretation of what you think you saw!

That *is* the repr. I said it's the repr, and it IS. It's not an
interpretation; it's a screenscrape. Really, truly. If I paste it in
again it'll look the same.

What do you want? Can I post something that will convince you it's a
repr?

Oh well. You guys have been immensely helpful; my mental model of how
Python works was vastly backwards, so it's a relief to get it
corrected. Thanks to that, I was able to hack my code into working. I
wish I could get entirely correct behavior, but at this point the
miscommunication is too strong. I'll settle for the hack I've got now,
and hope iTunes doesn't ever change its XML encoding (hey, I think
I've got cause to be optimistic).

-Wm

**Stefan Behnel** · Jul 31 '08, 06:05 PM

Re: Python parsing iTunes XML/COM

william tanksley wrote:

> I didn't
> pass a string. I passed a file. It didn't error out; instead, it
> produced bytestring-encoded output (not Unicode).

From my experience (and from the source code I have seen so far), ElementTree
does not return UTF-8 encoded strings at the API level. Can you produce any
evidence for your claims? Some code and an XML file that together produce the
result you are talking about? From what you have written so far, it seems far
more likely to me that your code is messed up than that you found a bug in
ElementTree.

Stefan

**John Machin** · Jul 31 '08, 08:45 PM

Re: Python parsing iTunes XML/COM

On Jul 31, 11:54 pm, william tanksley <wtanksle...@gmail.com> wrote:

> John Machin <sjmac...@lexicon.net> wrote:
>
> > william tanksley <wtanksle...@gmail.com> wrote:
> >
> > > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
> > > 1. This isn't Unicode; it's missing the u"" (I printed using repr).
> > > 2. It's got the UTF-8 bytes there in the middle.
> > > In addition to the above results,
> >
> > *WHAT* results? I don't see any repr() output, just your
> > interpretation of what you think you saw!
>
> That *is* the repr. I said it's the repr, and it IS. It's not an
> interpretation; it's a screenscrape. Really, truly. If I paste it in
> again it'll look the same.
>
> What do you want? Can I post something that will convince you it's a
> repr?
>

Let's try again:

> > # track_id = url2pathname(urlparse(track_id).path)
> > print repr(track_id)
> > parse_result = urlparse(track_id).path
> > print repr(parse_result)
> > track_id_replacement = url2pathname(parse_result)
> > print repr(track_id_replacement)
>
> The "important" value here is track_id_replacement; it contains the
> data that's throwing me. It appears that some UTF-8 characters are
> being read as multiple bytes by ElementTree rather than being decoded
> into Unicode.
>
> Here's one example. The others are similar -- they have the same
> things that look like problems to me.
>
> "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

ROTFL! I thought the Buffett thing was a Windows filename! What I was
expecting was THREE lots of repr() output, and I'm quite unused to
seeing repr() output with quotes around it instead of apostrophes; how
did you achieve that?

So you're saying that track_id_replacement contains utf8 characters.
It is obtained by track_id_replacement = url2pathname(parse_result).
You don't show us what is in parse_result. url2pathname() is nothing
to do with ElementTree. urlparse() is nothing to do with ElementTree.
You have provided no evidence that ElementTree is doing what you
accuse it of.

Please try again. Backtrack in your code to where you are pulling the
url out of an element. Do print repr(some_element.some_attribute).
Show us.
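
The stage-by-stage check being asked for could look something like the sketch below. It is written for Python 3 (an assumption; the thread is Python 2), the one-track XML fragment and its file name are invented, and `unquote` stands in for `url2pathname` so that the output does not depend on the operating system.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, unquote

# Hypothetical one-track fragment; a real library file is much larger.
xml_bytes = (
    b"<dict><key>Location</key>"
    b"<string>file://localhost/C:/Podcasts/Annual%20Shareholders%C2%A0L.mp3"
    b"</string></dict>"
)

element = ET.fromstring(xml_bytes).find("string")

stage1 = element.text            # straight out of the element
stage2 = urlparse(stage1).path   # after urlparse
stage3 = unquote(stage2)         # after percent-decoding

# Print the type and repr at every stage, so nothing is left to interpretation.
for label, value in (("element text", stage1),
                     ("urlparse path", stage2),
                     ("unquoted path", stage3)):
    print("%-14s %s %r" % (label, type(value).__name__, value))
```

In this sketch the element text and the parsed path are plain ASCII; a non-ASCII character only appears once the percent-escapes are decoded, which is the kind of detail the three repr() lines are meant to expose.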
