Python for Vcard Parsing in UTF16

  • R Wood

    Python for Vcard Parsing in UTF16

    Greetings -

    A recent Perl experiment hasn't turned out so well, which has piqued my
    interest in Python. The project is this: take a Vcard file exported from
    Apple's Addressbook and use a language that is good at parsing text to convert
    it into a mutt alias file. There are better ways to use Mutt with Mac's
    addressbook, but I want to be able to periodically convert my working
    addressbook file into an alias file I can then transfer across all my different
    machines - two Macs, two Linux, and one FreeBSD. It's basically a couple of
    regexes that look for FN: followed by a name and convert all the words of the
    name into a single structure separated by underscores, followed by the email
    addresses. You would wind up with

    alias Linus_Torvalds Linus Torvalds <lt@linux.com>

    To me this was a natural task for Perl. Turns out however, there's a catch.
    Apple exports the file in UTF-16 to ensure anyone with Chinese characters in
    their addressbook gets a legitimate Vcard file. And of course Perl somewhat
    chokes on UTF. I've found several ways to do it that involve complicated
    downloads and installations of Perl modules, but that defeats the purpose of
    making it simple. In an ideal world you should be able to say "try this cool
    script" and be done with it. Once you have to say "go to CPAN, download and
    compile this module, then ..." it gets less exciting.

    I know nothing about Python except that it interests me and has interested me
    since I first learned the Rekall database frontend (Linux) runs on it. I just
    ordered Learning Python and if that works out satisfactorily I'm going to go
    back for Programming Python. In the meantime, I thought I would pose the
    question to this newsgroup: would Python be useful for a parsing exercise like
    this one?
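
For illustration, the conversion described above might look roughly like this in Python 3, once the file has already been decoded to text. This is a minimal sketch under assumptions, not a complete vCard parser: the helper names `parse_property` and `vcard_to_aliases` are hypothetical, the handling of a group prefix such as `item1.` before `EMAIL` is an assumption about the export, joining several addresses with commas is one possible output choice, and folded (continued) lines are not handled.

```python
def parse_property(line):
    """Split a vCard line into (NAME, value); returns (None, None) if no colon."""
    head, sep, value = line.partition(":")
    if not sep:
        return None, None
    # Drop parameters (";type=...") and any group prefix ("item1.").
    name = head.split(";", 1)[0].rsplit(".", 1)[-1].upper()
    return name, value.strip()


def vcard_to_aliases(text):
    """Build one mutt alias line per card that has both FN and EMAIL."""
    aliases = []
    name, addresses = None, []
    for line in text.splitlines():
        prop, value = parse_property(line)
        if prop == "BEGIN":
            name, addresses = None, []
        elif prop == "FN":
            name = value
        elif prop == "EMAIL" and value:
            addresses.append(value)
        elif prop == "END" and name and addresses:
            # "Linus Torvalds" becomes the key "Linus_Torvalds".
            key = "_".join(name.split())
            targets = ", ".join("%s <%s>" % (name, a) for a in addresses)
            aliases.append("alias %s %s" % (key, targets))
    return aliases
```

Given a card whose `FN:` line reads `Linus Torvalds` and whose only `EMAIL` value is `lt@linux.com`, this yields the single line `alias Linus_Torvalds Linus Torvalds <lt@linux.com>`.
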
  • Alex Martelli

    #2
    Re: Python for Vcard Parsing in UTF16

    R Wood <rwood@therandymon.com> wrote:
    > ...
    > alias Linus_Torvalds Linus Torvalds <lt@linux.com>
    >
    > To me this was a natural task for Perl. Turns out however, there's a catch.
    > Apple exports the file in UTF-16 to ensure anyone with Chinese characters in
    > their addressbook gets a legitimate Vcard file. And of course Perl somewhat
    > chokes on UTF. I've found several ways to do it that involve complicated
    > downloads and installations of Perl modules, but that defeats the purpose of
    > making it simple. In an ideal world you should be able to say "try this cool
    > script" and be done with it. Once you have to say "go to CPAN, download and
    > compile this module, then ..." it gets less exciting.
    >
    > I know nothing about Python except that it interests me and has interested me
    > since I first learned the Rekall database frontend (Linux) runs on it. I just
    > ordered Learning Python and if that works out satisfactorily I'm going to go
    > back for Programming Python. In the meantime, I thought I would pose the
    > question to this newsgroup: would Python be useful for a parsing exercise like
    > this one?

    Sure, Python and Perl (and Ruby) should be equally suitable for the
    task, so, if Python appears more suitable by having built-in unicode
    capabilities, go for it. I'm a bit uncertain about the UTF-16 export
    though; I know some applications do use it (e.g., Microsoft Entourage),
    but I thought Apple's Address Book didn't, and, having just tried a
    VCard export from mine, it looks quite ASCII to me. Maybe you've set
    some kind of preference, or...?


    Alex

    • R Wood

      #3
      Re: Python for Vcard Parsing in UTF16

      Alex Martelli wrote:
      > R Wood <rwood@therandymon.com> wrote:
      > ...
      >> alias Linus_Torvalds Linus Torvalds <lt@linux.com>
      >>
      >> To me this was a natural task for Perl. Turns out however, there's a
      >> catch. Apple exports the file in UTF-16 to ensure anyone with Chinese
      >> characters in their addressbook gets a legitimate Vcard file. And of
      >> course Perl somewhat chokes on UTF.
      >
      > Sure, Python and Perl (and Ruby) should be equally suitable for the
      > task, so, if Python appears more suitable by having built-in unicode
      > capabilities, go for it. I'm a bit uncertain about the UTF-16 export
      > though; I know some applications do use it (e.g., Microsoft Entourage),
      > but I thought Apple's Address Book didn't, and, having just tried a
      > VCard export from mine, it looks quite ASCII to me. Maybe you've set
      > some kind of preference, or...?
      >
      > Alex

      I did the same thing. Apple's clever. If your addressbook doesn't have any
      higher characters, i.e., nothing but ASCII, it will export your addressbook
      in ASCII. But if you have anything else (in my case, Spanish, French, and
      Italian) it goes for UTF-16. I first thought it was UTF-8, but realized that
      since Apple supports all sorts of Asian languages really well they need
      UTF-16 to deal with it, and importing the exported file into jEdit using
      UTF-16 encoding confirmed that's what it is.
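
The ASCII-or-UTF-16 behavior described above could be handled roughly as follows in Python 3. This is a sketch under a stated assumption: it supposes the UTF-16 export begins with a byte order mark (BOM), which the thread does not confirm. `decode_export` is a hypothetical helper name.

```python
import codecs


def decode_export(raw):
    """Decode exported bytes: UTF-16 if a BOM is present, otherwise ASCII."""
    # Assumption: a UTF-16 export starts with a byte order mark.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return raw.decode("utf-16")
    return raw.decode("ascii")
```

If the file turns out to have no BOM, the marker-based guessing shown later in the thread is the more robust route.
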

      • Adam Atlas

        #4
        Re: Python for Vcard Parsing in UTF16

        On Apr 21, 7:28 pm, R Wood <r...@therandymon.com> wrote:
        > I know nothing about Python except that it interests me and has interested me
        > since I first learned the Rekall database frontend (Linux) runs on it. I just
        > ordered Learning Python and if that works out satisfactorily I'm going to go
        > back for Programming Python. In the meantime, I thought I would pose the
        > question to this newsgroup: would Python be useful for a parsing exercise like
        > this one?

        Here's a little function that takes some `str`-type data (i.e. what
        you'd get from doing open(...).read()) and, assuming it's a Vcard,
        detects its encoding and converts it to a canonical `unicode` object.

        def fix_encoding(s):
            m = u'BEGIN:VCARD'
            for c in ('ascii', 'utf_16_be', 'utf_16_le', 'utf_8'):
                try:
                    u = unicode(s, c)
                except UnicodeDecodeError:
                    continue
                # Accept the first encoding in which the marker is readable.
                if m in u:
                    return u
            return None
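
On Python 3, where `unicode` no longer exists and file contents read in binary mode are `bytes`, the same idea might be sketched as below. This is an adaptation under assumptions, not the code as posted: the `marker` parameter and the stripping of a leading byte order mark are additions made for illustration.

```python
def fix_encoding(data, marker="BEGIN:VCARD"):
    """Guess the encoding of vCard bytes; return decoded text, or None."""
    for codec in ("ascii", "utf_16_be", "utf_16_le", "utf_8"):
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue
        # A wrong codec may still decode without error, so only accept the
        # result if the vCard marker is actually readable in it.
        if marker in text:
            # Illustrative addition: drop a byte order mark if one survived.
            return text.lstrip("\ufeff")
    return None
```

It would be called on raw bytes, for example the result of `open(path, "rb").read()`, where `path` is whatever file was exported.
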

        • Adam Atlas

          #5
          Re: Python for Vcard Parsing in UTF16

          On Apr 21, 7:28 pm, R Wood <r...@therandymon.com> wrote:
          > To me this was a natural task for Perl. Turns out however, there's a catch.
          > Apple exports the file in UTF-16 to ensure anyone with Chinese characters in
          > their addressbook gets a legitimate Vcard file.

          Here's a function that, given a `str` containing a vcard in some
          encoding, guesses the encoding and returns a canonical representation
          as a `unicode` object.

          def fix_encoding(s):
              m = u'BEGIN:VCARD'
              for c in ('ascii', 'utf_16_be', 'utf_16_le', 'utf_8'):
                  try:
                      u = unicode(s, c)
                  except UnicodeDecodeError:
                      continue
                  if m in u:
                      return u
              return None
