Python and decimal character entities over 128.

  • bsagert@gmail.com

    Python and decimal character entities over 128.

    Some web feeds use decimal character entities that seem to confuse
    Python (or me). For example, the string "doesn't" may be coded as
    "doesn&#8217;t" which should produce a right leaning apostrophe.
    Python hates decimal entities beyond 128 so it chokes unless you do
    something like string.encode('utf-8'). Even then, what should have
    been a right-leaning apostrophe ends up as "â€™". The following script
    does just that. Look for the string "The Canuck iPhone: Apple
    doesnâ€™t care" after running it.

    # coding: UTF-8
    import feedparser

    s = ''
    d = feedparser.parse('http://feeds.feedburner.com/Mathewingramcom/work')
    title = d.feed.title
    link = d.feed.link
    for i in range(0,4):
        title = d.entries[i].title
        link = d.entries[i].link
        s += title +'\n' + link + '\n'

    f = open('c:/x/test.txt', 'w')
    f.write(s.encode('utf-8'))
    f.close()

    This useless script is adapted from a "useful" script. Its only
    purpose is to ask the Python community how I can deal with decimal
    entities over 128. Thanks in advance, Bill


  • Marc 'BlackJack' Rintsch

    #2
    Re: Python and decimal character entities over 128.

    On Wed, 09 Jul 2008 16:39:24 -0700, bsagert wrote:
    > Some web feeds use decimal character entities that seem to confuse
    > Python (or me).

    I guess they confuse you. Python is fine.

    > For example, the string "doesn't" may be coded as "doesn&#8217;t" which
    > should produce a right leaning apostrophe. Python hates decimal entities
    > beyond 128 so it chokes unless you do something like
    > string.encode('utf-8').

    Python neither hates nor chokes on these entities. It just refuses to
    guess which encoding you want if you try to write *unicode* objects into
    a file. Files contain byte values, not characters.

    > Even then, what should have been a right-leaning apostrophe ends up as
    > "â€™". The following script does just that. Look for the string "The
    > Canuck iPhone: Apple doesnâ€™t care" after running it.

    Then you didn't tell the application you used to look at the result that
    the text is UTF-8 encoded. I guess you are using Windows and
    the application expects cp1252 encoded text, because a UTF-8 encoded
    apostrophe looks like 'â€™' in cp1252.
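
The mix-up described above can be reproduced in a few lines. This is only a minimal sketch, written in Python 3 syntax (the thread itself predates it), and the variable names are illustrative rather than anything from the original script.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the character behind &#8217;
quote = "\u2019"

# What ends up on disk after encoding to UTF-8: three bytes.
utf8_bytes = quote.encode("utf-8")

# What a viewer shows if it assumes cp1252 instead of UTF-8.
misread = utf8_bytes.decode("cp1252")

# Decoding the same bytes as UTF-8 gives the original character back.
roundtrip = utf8_bytes.decode("utf-8")

print(utf8_bytes, ascii(misread), ascii(roundtrip))
```

The file contents are correct in both cases; only the viewer's assumption about the encoding differs.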

    Choose the encoding you want the result to have and everything is fine,
    unless you stumble over a feed using characters which can't be encoded
    in the encoding of your choice. That's why UTF-8 might have been a good
    idea.
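
A short sketch of that advice, again in Python 3 syntax and under my own assumptions: the sample text and the temporary output path are made up for illustration. Naming the encoding when the file is opened lets Python do the encoding on write, and a narrower encoding such as latin-1 fails on U+2019.

```python
import os
import tempfile

text = "Apple doesn\u2019t care\n"

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# State the encoding explicitly instead of letting the platform pick one.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back with the same encoding restores the text exactly.
with open(path, encoding="utf-8") as f:
    restored = f.read()

# latin-1 has no byte for U+2019, so encoding to it raises an error.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(restored == text, latin1_ok)
```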

    Ciao,
    Marc 'BlackJack' Rintsch


    • Manuel Vazquez Acosta

      #3
      Re: Python and decimal character entities over 128.

      bsagert@gmail.com wrote:
      > Some web feeds use decimal character entities that seem to confuse
      > Python (or me). For example, the string "doesn't" may be coded as
      > "doesn&#8217;t" which should produce a right leaning apostrophe.
      > Python hates decimal entities beyond 128 so it chokes unless you do
      > something like string.encode('utf-8'). Even then, what should have
      > been a right-leaning apostrophe ends up as "â€™". The following script
      > does just that. Look for the string "The Canuck iPhone: Apple
      > doesnâ€™t care" after running it.
      >
      > # coding: UTF-8
      > import feedparser
      >
      > s = ''
      > d = feedparser.parse('http://feeds.feedburner.com/Mathewingramcom/work')
      > title = d.feed.title
      > link = d.feed.link
      > for i in range(0,4):
      >     title = d.entries[i].title
      >     link = d.entries[i].link
      >     s += title +'\n' + link + '\n'
      >
      > f = open('c:/x/test.txt', 'w')
      > f.write(s.encode('utf-8'))
      > f.close()
      >
      > This useless script is adapted from a "useful" script. Its only
      > purpose is to ask the Python community how I can deal with decimal
      > entities over 128. Thanks in advance, Bill

      This is a two-fold issue: encodings/charsets and entities. Encodings are
      a way to _encode_ charsets to a sequence of octets. Entities are a way
      to avoid a (harder) encoding/decoding process at the expense of
      readability: when you type &#8217; no one actually sees the intended
      character, but entities are easily encoded in ASCII.

      When dealing with multiple sources of information, as your script may
      be, I always include a normalization layer that converts everything to
      Python's unicode type. Web sites may use whatever encoding they please.

      The whole process is like this:
      1. Fetch the content
      2. Use whatever clue in the contents to guess the encoding used by the
      document, e.g. the Content-Type HTTP header; <meta http-equiv="content-type"
      ....>; <?xml version="1.0" encoding="utf-8"?>, and so on.
      3. If none are present, then use chardet to guess an acceptable decoder.
      4. Decode, ignoring those characters that cannot be decoded.
      5. The result is further processed to find entities and "decode" them to
      actual Unicode characters. (See below)
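
The steps above can be sketched roughly as follows, in Python 3 syntax and using only the standard library. The names `sniff_encoding` and `normalize` are hypothetical, the regular expressions are a simplification of real header and declaration parsing, and step 3 is replaced here by a plain UTF-8 default rather than an actual chardet call. Step 1 (fetching) is left out, so the functions start from raw bytes.

```python
import html
import re
from typing import Optional


def sniff_encoding(raw: bytes, http_charset: Optional[str] = None) -> str:
    """Step 2: use a declared encoding if one can be found.

    Falls back to UTF-8 when nothing is declared (an assumption of this
    sketch; the post suggests chardet for that case).
    """
    if http_charset:
        return http_charset
    # Declarations are ASCII, so a lossy ASCII view of the start is enough.
    head = raw[:1024].decode("ascii", errors="ignore")
    match = (re.search(r'encoding=["\']([\w.-]+)["\']', head)
             or re.search(r'charset=["\']?([\w.-]+)', head))
    return match.group(1) if match else "utf-8"


def normalize(raw: bytes, http_charset: Optional[str] = None) -> str:
    """Steps 4 and 5: decode leniently, then resolve entities."""
    encoding = sniff_encoding(raw, http_charset)
    try:
        text = raw.decode(encoding, errors="ignore")
    except LookupError:
        # Unknown encoding name: fall back to UTF-8.
        text = raw.decode("utf-8", errors="ignore")
    return html.unescape(text)


sample = b'<?xml version="1.0" encoding="iso-8859-1"?><t>doesn&#8217;t \xe9</t>'
print(ascii(normalize(sample)))
```

Everything downstream of `normalize` then works with text only, which is the point of the normalization step.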

      You may find these helpful:




      This is a function I have used to process entities:
      Code:
      from htmlentitydefs import name2codepoint

      def __processhtmlentities__(text):
          assert type(text) is unicode, "Non-normalized text"
          html = []
          (buffer, amp, text) = text.partition('&')
          while amp:
              html.append(buffer)
              (entity, semicolon, text) = text.partition(';')
              if entity[:1] != '#':
                  # Named entity such as &amp;; unknown names are dropped.
                  if entity in name2codepoint:
                      html.append(unichr(name2codepoint[entity]))
              else:
                  # Decimal entity such as &#8217;
                  html.append(unichr(int(entity[1:])))
              (buffer, amp, text) = text.partition('&')
          html.append(buffer)
          return u''.join(html)
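
The function above targets Python 2 (`htmlentitydefs`, `unichr`, `unicode`). For readers on Python 3, here is a rough equivalent as a sketch. It swaps the `partition` loop for a regular expression, which is my choice and not the author's method, and the name `process_html_entities` is hypothetical. Like the original it handles named and decimal entities only; unlike the original it leaves unknown names in place instead of dropping them. The standard library's `html.unescape` covers the same ground and more.

```python
import re
from html.entities import name2codepoint

# "&name;" or "&#digits;" (hexadecimal references are not handled here).
_ENTITY = re.compile(r"&(#\d+|\w+);")


def process_html_entities(text: str) -> str:
    def replace(match):
        entity = match.group(1)
        if entity.startswith("#"):
            # Decimal entity such as &#8217;
            return chr(int(entity[1:]))
        if entity in name2codepoint:
            # Named entity such as &amp; or &rsquo;
            return chr(name2codepoint[entity])
        # Unknown name: keep the original text untouched.
        return match.group(0)

    return _ENTITY.sub(replace, text)


print(ascii(process_html_entities("Apple doesn&#8217;t care")))
```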

      Best regards,
      Manuel.


      • Ben Finney

        #4
        Re: Python and decimal character entities over 128.

        I don't have an answer for why Python might be mis-handling the data,
        but wanted to make a factual correction:

        bsagert@gmail.com writes:
        > Some web feeds use decimal character entities that seem to confuse
        > Python (or me). For example, the string "doesn't" may be coded as
        > "doesn&#8217;t" which should produce a right leaning apostrophe.

        That character isn't a "right leaning apostrophe"; it has nothing to
        do with apostrophes. It is the character called "right single
        quotation mark" in <URL:http://www.w3.org/TR/html4/sgml/entities.html>
        and in Unicode (code point U+2019).

        It's a typographical error to use a quotation mark as an apostrophe.
        Use the apostrophe character (U+0027) where an apostrophe is intended,
        and quotation mark characters where those are intended.
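
The character names mentioned above can be checked against Python's own copy of the Unicode database. A small sketch in Python 3 syntax; the variable names are only illustrative.

```python
import unicodedata

# The code point behind &#8217; and the plain ASCII apostrophe.
chars = ["\u2019", "\u0027"]

# Look up the official Unicode name of each character.
names = [unicodedata.name(ch) for ch in chars]

for ch, name in zip(chars, names):
    print("U+%04X %s" % (ord(ch), name))
```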

        This is directed, of course, at the person generating that output.

        --
        \ “If you go to a costume party at your boss's house, wouldn't |
        `\ you think a good costume would be to dress up like the boss's |
        _o__) wife? Trust me, it's not.” —Jack Handey |
        Ben Finney
