Python parsing iTunes XML/COM

  • william tanksley

    Python parsing iTunes XML/COM

    I'm trying to convert the URLs contained in iTunes' XML file into a
    form comparable with the filenames returned by iTunes' COM interface.

    I'm writing a podcast sorter in Python; I'm using iTunes under Windows
    right now. iTunes' COM provides most of my data input and all of my
    mp3/aac editing capabilities; the one thing I can't access through COM
    is the Release Date, which is my primary sorting field. So I read
    everything in through COM, then read all the release dates from the
    iTunes XML file, then try to join the two together... But so far I
    have zero success.

    Is there _any_ way to match up tracks between iTunes COM and iTunes
    XML? I've spent far too much effort on this. I'm not stuck on using
    filenames, if that's a bad idea... But I haven't found anything else
    that works, and filenames seem like an obvious solution.

    -Wm
  • william tanksley

    #2
    Re: Python parsing iTunes XML/COM

    To ask another way: how do I convert from a file:// URL to a local
    path in a standard way, so that filepaths from two different sources
    will work the same way in a dictionary?

    Right now I'm using the following source:

    track_id = url2pathname(urlparse(track_id).path)

    url2pathname is from urllib; urlparse is from the urlparse module.

    The problems occur when the filenames have non-ascii characters in
    them -- I suspect that the URLs are having some encoding placed on
    them that Python's decoder doesn't know about.

    Thank you all in advance, and thank you for Python.

    -Wm
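A minimal sketch of the conversion asked about above, written for Python 3 rather than the Python 2 used in this thread. The helper name `file_url_to_windows_path` and the sample URL are hypothetical, and the drive-letter handling is a simplification that assumes URLs of the form `file://localhost/C:/...`. In Python 3, `urllib.parse.unquote` decodes percent-escapes as UTF-8 by default, so the result is a text string that can be compared directly with other text strings.

```python
from urllib.parse import urlparse, unquote


def file_url_to_windows_path(url):
    """Convert a file:// URL into a Windows-style path (text string).

    Hypothetical helper: it only handles the simple drive-letter form,
    e.g. file://localhost/C:/dir/name.mp3.
    """
    # unquote() turns %XX escapes into characters, decoding them as UTF-8.
    path = unquote(urlparse(url).path)
    # Drop the slash that precedes the drive letter ("/C:/..." -> "C:/...").
    if len(path) > 2 and path[0] == "/" and path[2] == ":":
        path = path[1:]
    return path.replace("/", "\\")


# Made-up example URL containing an encoded space and an encoded U+00A0.
url = "file://localhost/C:/Podcasts/Annual%20Shareholders%C2%A0L.mp3"
path = file_url_to_windows_path(url)
print(repr(path))
```

Whether the resulting string matches what the COM interface reports also depends on details such as letter case and Unicode normalization, which this sketch does not address.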

    • John Machin

      #3
      Re: Python parsing iTunes XML/COM

      On Jul 30, 3:53 am, william tanksley <wtanksle...@gmail.com> wrote:
      > To ask another way: how do I convert from a file:// URL to a local
      > path in a standard way, so that filepaths from two different sources
      > will work the same way in a dictionary?
      >
      > Right now I'm using the following source:
      >
      > track_id = url2pathname(urlparse(track_id).path)
      >
      > url2pathname is from urllib; urlparse is from the urlparse module.
      >
      > The problems occur when the filenames have non-ascii characters in
      > them -- I suspect that the URLs are having some encoding placed on
      > them that Python's decoder doesn't know about.

      WHAT problems? WHAT non-ASCII characters?? Consider e.g.

      # track_id = url2pathname(urlparse(track_id).path)
      print repr(track_id)
      parse_result = urlparse(track_id).path
      print repr(parse_result)
      track_id_replacement = url2pathname(parse_result)
      print repr(track_id_replacement)

      and copy/paste the results into your next posting.

      • pyshib@googlemail.com

        #4
        Re: Python parsing iTunes XML/COM

        If you want to convert the file names which use standard URL encoding
        (with %20 for space, etc) use:

        from urllib import unquote
        new_filename = unquote(filename)

        I have found this does not convert encoded characters of the form
        '&#CC;' so you may have to do that manually. I think these are just
        ascii encodings in hexadecimal.
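A short Python 3 sketch of handling both kinds of escape. Sequences of the form `&#NNN;` are numeric character references; the number is decimal unless it is written as `&#xHH;`, in which case it is hexadecimal. The standard library's `html.unescape` resolves them, so no manual conversion should be needed. The sample string is made up, and an XML parser such as ElementTree resolves these references on its own, so this step would only matter when working on raw text.

```python
import html
from urllib.parse import unquote

# Made-up sample combining a percent-escape and a numeric character reference.
raw = "Annual%20Shareholders&#160;L.mp3"

without_percent = unquote(raw)             # %20 becomes a space
decoded = html.unescape(without_percent)   # &#160; becomes U+00A0

print(repr(decoded))
```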

        • william tanksley

          #5
          Re: Python parsing iTunes XML/COM

          Thank you for the response. Here's some more info, including a little
          that you didn't ask me for but which might be useful.

          John Machin <sjmac...@lexicon.net> wrote:
          > william tanksley <wtanksle...@gmail.com> wrote:
          >> To ask another way: how do I convert from a file:// URL to a local
          >> path in a standard way, so that filepaths from two different sources
          >> will work the same way in a dictionary?
          >> The problems occur when the filenames have non-ascii characters in
          >> them -- I suspect that the URLs are having some encoding placed on
          >> them that Python's decoder doesn't know about.
          > # track_id = url2pathname(urlparse(track_id).path)
          > print repr(track_id)
          > parse_result = urlparse(track_id).path
          > print repr(parse_result)
          > track_id_replacement = url2pathname(parse_result)
          > print repr(track_id_replacement)

          The "important" value here is track_id_replacement; it contains the
          data that's throwing me. It appears that some UTF-8 characters are
          being read as multiple bytes by ElementTree rather than being decoded
          into Unicode. Could this be a bug in ElementTree's Unicode support? If
          so, can I work around it?

          Here's one example. The others are similar -- they have the same
          things that look like problems to me.

          "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

          Note some problems here:

          1. This isn't Unicode; it's missing the u"" (I printed using repr).
          2. It's got the UTF-8 bytes there in the middle.

          I tried doing track_id.encode("utf-8"), but it doesn't seem to make
          any difference at all.

          Of course, my ultimate goal is to compare the track_id to the track_id
          I get from iTunes' COM interface, including hashing to the same value
          for dict lookups.

          > and copy/paste the results into your next posting.

          In addition to the above results, while trying to get more diagnostic
          printouts I got the following warning from Python:

          C:\projects\podcasts\podstrand\podcast.py:280: UnicodeWarning: Unicode
          equal comparison failed to convert both arguments to Unicode -
          interpreting them as being unequal
            return track.databaseID == trackLocation

          The code that triggered this is as follows:

          if trackLocation in self.podcasts:
              track = self.podcasts[trackLocation]
              if trackRelease:
                  track.release_date = trackRelease
              elif track.is_podcast:
                  print "No release date:", repr(track.name)
          else:
              # For the sake of diagnostics, try to find the track.
              def track_has_location(track):
                  return track.databaseID == trackLocation
              fillers = filter(track_has_location, self.fillers)
              if len(fillers):
                  return
              disabled = filter(track_has_location, self.deferred)
              if len(disabled):
                  return
              print "Location not known:", repr(trackLocation)

          -Wm
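The failed lookup described in this post can be reproduced in a few lines. This is a Python 3 sketch with made-up paths, and it assumes (not confirmed anywhere in the thread) that the COM side yields text strings while the XML side ends up as UTF-8 bytes. Where Python 2 emits the UnicodeWarning quoted above, Python 3 simply treats a bytes object and a text string as unequal, so the dictionary lookup misses until both sides are the same type.

```python
# Assumed shape of the two values; both paths are invented for illustration.
com_path = "C:\\Podcasts\\Annual Shareholders\u00a0L.mp3"      # text string
xml_path = b"C:\\Podcasts\\Annual Shareholders\xc2\xa0L.mp3"   # UTF-8 bytes

podcasts = {com_path: "track object"}

# Bytes and text never compare equal, so the raw lookup fails.
direct_hit = xml_path in podcasts

# Decoding the bytes first gives a key of the same type as the dict's keys.
decoded_key = xml_path.decode("utf-8")
decoded_hit = decoded_key in podcasts

print(direct_hit, decoded_hit)
```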

          • Jerry Hill

            #6
            Re: Python parsing iTunes XML/COM

            On Wed, Jul 30, 2008 at 10:58 AM, william tanksley
            <wtanksleyjr@gmail.com> wrote:
            > Here's one example. The others are similar -- they have the same
            > things that look like problems to me.
            >
            > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
            >
            > Note some problems here:
            >
            > 1. This isn't Unicode; it's missing the u"" (I printed using repr).
            > 2. It's got the UTF-8 bytes there in the middle.
            >
            > I tried doing track_id.encode("utf-8"), but it doesn't seem to make
            > any difference at all.

            I don't have anything to say about your iTunes problems, but encode()
            is the wrong method to turn a byte string into a unicode string.
            Instead, use decode(), like this:

            >>> track_id = "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
            >>> utrack_id = track_id.decode('utf-8')
            >>> type(utrack_id)
            <type 'unicode'>
            >>> print utrack_id
            Buffett Time - Annual Shareholders L.mp3
            >>> print repr(utrack_id)
            u'Buffett Time - Annual Shareholders\xa0L.mp3'
            >>>
            --
            Jerry

            • Stefan Behnel

              #7
              Re: Python parsing iTunes XML/COM

              william tanksley wrote:
              > Okay, so you decode to go from raw
              > bytes into a given encoding, and you encode to go from a given encoding
              > to raw bytes.

              No, decoding goes from a byte sequence to a Unicode string and encoding goes
              from a Unicode string to a byte sequence.

              Unicode is not an encoding. A Unicode string is a character sequence, not a
              byte sequence.

              Stefan
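The two directions described in this post, as a Python 3 round trip. In Python 3 the types are named `bytes` and `str`, where the Python 2 of this thread has `str` and `unicode`. The byte string is the example quoted earlier in the thread; the variable names are only illustrative.

```python
# A byte sequence that happens to be UTF-8 encoded.
raw = b"Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

# decode: byte sequence -> character sequence
text = raw.decode("utf-8")

# encode: character sequence -> byte sequence
round_trip = text.encode("utf-8")

# The two-byte sequence C2 A0 decodes to one character, U+00A0,
# so the text is one item shorter than the byte string.
print(len(raw), len(text))
print(repr(text))
```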

              • Jerry Hill

                #8
                Re: Python parsing iTunes XML/COM

                On Wed, Jul 30, 2008 at 2:27 PM, william tanksley <wtanksleyjr@gmail.com> wrote:
                > Awesome... Thank you! I had my mental model of Python turned around
                > backwards. That's an odd feeling. Okay, so you decode to go from raw
                > bytes into a given encoding, and you encode to go from a given encoding
                > to raw bytes. Not what I thought it was, but that's cool, makes sense.

                That's not quite right. Decoding takes a byte string that is already
                in a particular encoding and transforms it to unicode. Unicode isn't
                an encoding of its own. Encoding takes a unicode string (which
                doesn't have any encoding associated with it), and gives you back a
                sequence of bytes in a particular encoding.

                This article isn't specific to Python, but it provides a good overview
                of unicode and character encodings that may be useful:
                Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in…


                --
                Jerry

                • william tanksley

                  #9
                  Re: Python parsing iTunes XML/COM

                  "Jerry Hill" <malaclyp...@gmail.com> wrote:
                  > On Wed, Jul 30, 2008 at 2:27 PM, william tanksley <wtanksle...@gmail.com> wrote:
                  >> Awesome... Thank you! I had my mental model of Python turned around
                  >> backwards. That's an odd feeling. Okay, so you decode to go from raw
                  >> bytes into a given encoding, and you encode to go from a given encoding
                  >> to raw bytes. Not what I thought it was, but that's cool, makes sense.
                  > That's not quite right. Decoding takes a byte string that is already
                  > in a particular encoding and transforms it to unicode. Unicode isn't
                  > an encoding of its own. Encoding takes a unicode string (which
                  > doesn't have any encoding associated with it), and gives you back a
                  > sequence of bytes in a particular encoding.

                  Okay, this is useful. Thank you for straightening out my mental model.
                  It makes sense to define strings as just naturally Unicode... and
                  anything else is in some ways not really a string, although it's
                  something that might have many of the same methods. I guess this
                  mental model is being implemented more thoroughly in Py3K... Anyhow,
                  it makes sense.

                  I'm still puzzled why I'm getting some non-Unicode out of an
                  ElementTree's text, though.
                  -Wm

                  • william tanksley

                    #10
                    Re: Python parsing iTunes XML/COM

                    william tanksley <wtanksle...@gmail.com> wrote:
                    > I'm still puzzled why I'm getting some non-Unicode out of an
                    > ElementTree's text, though.

                    Now I know.

                    Okay, my answer is that cElementTree (in Python 2.5) is simply
                    deranged when it comes to Unicode. It assumes everything's ASCII.

                    Reference: http://codespeak.net/lxml/compatibility.html

                    (Note that the lxml version also doesn't handle Unicode correctly; it
                    errors when XML declares its encoding.)

                    This is unpleasant, but at least now I know WHY it was driving me
                    insane.
                    -Wm

                    • Stefan Behnel

                      #11
                      Re: Python parsing iTunes XML/COM

                      william tanksley wrote:
                      > william tanksley <wtanksle...@gmail.com> wrote:
                      >> I'm still puzzled why I'm getting some non-Unicode out of an
                      >> ElementTree's text, though.
                      >
                      > Now I know.
                      >
                      > Okay, my answer is that cElementTree (in Python 2.5) is simply
                      > deranged when it comes to Unicode. It assumes everything's ASCII.

                      It does not "assume" that. It *requires* byte strings to be ASCII. If it
                      didn't enforce that, how could it possibly know what encoding they were using,
                      i.e. what they were supposed to mean at all? Read the Zen of Python: in the
                      face of ambiguity, ElementTree refuses the temptation to guess. Python 2.x does
                      exactly the same thing when it comes to implicit conversion between encoded
                      strings and Unicode strings.

                      If you want to pass plain ASCII strings, you can either pass a byte string or
                      a Unicode string (that's a plain convenience feature). If you want to pass
                      anything that's not ASCII, you *must* pass a Unicode string.

                      > Reference: http://codespeak.net/lxml/compatibility.html
                      >
                      > (Note that the lxml version also doesn't handle Unicode correctly; it
                      > errors when XML declares its encoding.)

                      It definitely does "handle Unicode correctly". Let me guess, you tried passing
                      XML as a Unicode string into the parser, and your XML declared itself as
                      having a byte encoding (<?xml encoding="..."?>). How can that *not* be an error?

                      > This is unpleasant, but at least now I know WHY it was driving me
                      > insane.

                      You should *really* read a bit about Unicode and byte encodings. Not
                      understanding a topic is not a good excuse for complaining about it being
                      broken for you.

                      Stefan

                      • John Machin

                        #12
                        Re: Python parsing iTunes XML/COM

                        On Jul 31, 12:58 am, william tanksley <wtanksle...@gmail.com> wrote:
                        > Thank you for the response. Here's some more info, including a little
                        > that you didn't ask me for but which might be useful.
                        >
                        > John Machin <sjmac...@lexicon.net> wrote:
                        >> william tanksley <wtanksle...@gmail.com> wrote:
                        >>> To ask another way: how do I convert from a file:// URL to a local
                        >>> path in a standard way, so that filepaths from two different sources
                        >>> will work the same way in a dictionary?
                        >>> The problems occur when the filenames have non-ascii characters in
                        >>> them -- I suspect that the URLs are having some encoding placed on
                        >>> them that Python's decoder doesn't know about.
                        >> # track_id = url2pathname(urlparse(track_id).path)
                        >> print repr(track_id)
                        >> parse_result = urlparse(track_id).path
                        >> print repr(parse_result)
                        >> track_id_replacement = url2pathname(parse_result)
                        >> print repr(track_id_replacement)
                        >
                        > The "important" value here is track_id_replacement; it contains the
                        > data that's throwing me. It appears that some UTF-8 characters are
                        > being read as multiple bytes by ElementTree rather than being decoded
                        > into Unicode.

                        Appearances can be deceptive. You present no evidence.

                        > Could this be a bug in ElementTree's Unicode support?

                        It could, yes, but the probability is extremely low.

                        > If
                        > so, can I work around it?
                        >
                        > Here's one example. The others are similar -- they have the same
                        > things that look like problems to me.
                        >
                        > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
                        >
                        > Note some problems here:

                        Where?

                        > 1. This isn't Unicode; it's missing the u"" (I printed using repr).
                        > 2. It's got the UTF-8 bytes there in the middle.
                        >
                        > I tried doing track_id.encode("utf-8"), but it doesn't seem to make
                        > any difference at all.
                        >
                        > Of course, my ultimate goal is to compare the track_id to the track_id
                        > I get from iTunes' COM interface, including hashing to the same value
                        > for dict lookups.
                        >
                        >> and copy/paste the results into your next posting.
                        >
                        > In addition to the above results,

                        *WHAT* results? I don't see any repr() output, just your
                        interpretation of what you think you saw!

                        • william tanksley

                          #13
                          Re: Python parsing iTunes XML/COM

                          John Machin <sjmac...@lexicon.net> wrote:
                          > william tanksley <wtanksle...@gmail.com> wrote:
                          >> "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
                          >> 1. This isn't Unicode; it's missing the u"" (I printed using repr).
                          >> 2. It's got the UTF-8 bytes there in the middle.
                          >> In addition to the above results,
                          > *WHAT* results? I don't see any repr() output, just your
                          > interpretation of what you think you saw!

                          That *is* the repr. I said it's the repr, and it IS. It's not an
                          interpretation; it's a screenscrape. Really, truly. If I paste it in
                          again it'll look the same.

                          What do you want? Can I post something that will convince you it's a
                          repr?

                          Oh well. You guys have been immensely helpful; my mental model of how
                          Python works was vastly backwards, so it's a relief to get it
                          corrected. Thanks to that, I was able to hack my code into working. I
                          wish I could get entirely correct behavior, but at this point the
                          miscommunication is too strong. I'll settle for the hack I've got now,
                          and hope iTunes doesn't ever change its XML encoding (hey, I think
                          I've got cause to be optimistic).

                          -Wm

                          • Stefan Behnel

                            #14
                            Re: Python parsing iTunes XML/COM

                            william tanksley wrote:
                            > I didn't
                            > pass a string. I passed a file. It didn't error out; instead, it
                            > produced bytestring-encoded output (not Unicode).

                            From my experience (and from the source code I have seen so far), ElementTree
                            does not return UTF-8 encoded strings at the API level. Can you produce any
                            evidence for your claims? Some code and an XML file that together produce the
                            result you are talking about? From what you have written so far, it seems far
                            more likely to me that your code is messed up than that you found a bug in
                            ElementTree.

                            Stefan
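A self-contained check along the lines requested here might look like the following Python 3 sketch. The XML fragment is invented in the style of a property list and is not real iTunes data. It illustrates two points: ElementTree decodes UTF-8 element text into a text string, and a percent-encoded URL arrives as plain ASCII with its `%C2%A0` escape untouched. That makes a later unquoting step, rather than the parser, the likely place where raw UTF-8 bytes would appear in Python 2, although the thread never confirms this.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; the element layout is an assumption for illustration.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<dict>"
    b"<key>Name</key><string>Caf\xc3\xa9 Talk</string>"
    b"<key>Location</key>"
    b"<string>file://localhost/C:/Podcasts/Annual%20Shareholders%C2%A0L.mp3</string>"
    b"</dict>"
)

root = ET.fromstring(xml_bytes)
name, location = [element.text for element in root.iter("string")]

# Both values are text strings; the URL still carries its percent-escapes.
print(type(name), repr(name))
print(type(location), repr(location))
```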

                            • John Machin

                              #15
                              Re: Python parsing iTunes XML/COM

                              On Jul 31, 11:54 pm, william tanksley <wtanksle...@gmail.com> wrote:
                              > John Machin <sjmac...@lexicon.net> wrote:
                              >> william tanksley <wtanksle...@gmail.com> wrote:
                              >>> "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
                              >>> 1. This isn't Unicode; it's missing the u"" (I printed using repr).
                              >>> 2. It's got the UTF-8 bytes there in the middle.
                              >>> In addition to the above results,
                              >> *WHAT* results? I don't see any repr() output, just your
                              >> interpretation of what you think you saw!
                              >
                              > That *is* the repr. I said it's the repr, and it IS. It's not an
                              > interpretation; it's a screenscrape. Really, truly. If I paste it in
                              > again it'll look the same.
                              >
                              > What do you want? Can I post something that will convince you it's a
                              > repr?

                              Let's try again:

                              >> # track_id = url2pathname(urlparse(track_id).path)
                              >> print repr(track_id)
                              >> parse_result = urlparse(track_id).path
                              >> print repr(parse_result)
                              >> track_id_replacement = url2pathname(parse_result)
                              >> print repr(track_id_replacement)
                              > The "important" value here is track_id_replacement; it contains the
                              > data that's throwing me. It appears that some UTF-8 characters are
                              > being read as multiple bytes by ElementTree rather than being decoded
                              > into Unicode.
                              > Here's one example. The others are similar -- they have the same
                              > things that look like problems to me.
                              > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"

                              ROTFL! I thought the Buffett thing was a Windows filename! What I was
                              expecting was THREE lots of repr() output, and I'm quite unused to
                              seeing repr() output with quotes around it instead of apostrophes; how
                              did you achieve that?

                              So you're saying that track_id_replacement contains utf8 characters.
                              It is obtained by track_id_replacement = url2pathname(parse_result).
                              You don't show us what is in parse_result. url2pathname() is nothing
                              to do with ElementTree. urlparse() is nothing to do with ElementTree.
                              You have provided no evidence that ElementTree is doing what you
                              accuse it of.

                              Please try again. Backtrack in your code to where you are pulling the
                              url out of an element. Do print repr(some_element.some_attribute).
                              Show us.
