xHTML/XML to Unicode (and back)

  • Robin Haswell

    xHTML/XML to Unicode (and back)

    Hey guys

    I'm currently screenscraping some Swedish site, and I need a method to
    convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
    I'm sure one of Python's myriad of XML processors can do this but I can't
    find which one.

    Can anyone make any suggestions?

    Thanks

    -Rob
  • Fredrik Lundh

    #2
    Re: xHTML/XML to Unicode (and back)

    Robin Haswell wrote:
    > I'm currently screenscraping some Swedish site, and I need a method to
    > convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
    > I'm sure one of Python's myriad of XML processors can do this but I can't
    > find which one.
    >
    > Can anyone make any suggestions?

    any decent html-aware screen scraper library should be able to do
    this for you.

    if you've already extracted the strings, the strip_html function on
    this page might be what you need:

    http://effbot.org/zone/re-sub.htm#strip-html

    </F>
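
For strings that have already been extracted, the entity-decoding step can be sketched with a regular expression substitution. This is only a minimal illustration, assuming Python 3 and the standard library's html.entities table; it is not the function from the linked page, and the names decode_entities and _ENTITY_RE are made up for this example.

```python
import re
from html.entities import name2codepoint

# Matches decimal (&#100;), hexadecimal (&#x64;) and named (&amp;) references.
_ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|\w+);")


def decode_entities(text):
    """Replace character references in text with Unicode characters."""

    def fixup(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)               # unknown reference: leave as-is

    return _ENTITY_RE.sub(fixup, text)


print(decode_entities("G&ouml;teborg &amp; Malm&#246;"))
```

Unknown names and out-of-range code points are left untouched rather than raising, which tends to be the safer choice for scraped text.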




    • Robin Haswell

      #3
      Re: xHTML/XML to Unicode (and back)

      On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
      > Robin Haswell wrote:
      >
      >> I'm currently screenscraping some Swedish site, and I need a method to
      >> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
      >> I'm sure one of Python's myriad of XML processors can do this but I can't
      >> find which one.
      >>
      >> Can anyone make any suggestions?
      >
      > any decent html-aware screen scraper library should be able to do
      > this for you.

      I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
      know the answer to this for when I do screenscraping with regular
      expressions :-)

      Thanks
      >
      > if you've already extracted the strings, the strip_html function on
      > this page might be what you need:
      >
      > http://effbot.org/zone/re-sub.htm#strip-html
      >
      > </F>


      • Paul Boddie

        #4
        Re: xHTML/XML to Unicode (and back)

        Robin Haswell wrote:
        > On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:
        >
        > > Robin Haswell wrote:
        > >
        > >> I'm currently screenscraping some Swedish site, and I need a method to
        > >> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
        > >> I'm sure one of Python's myriad of XML processors can do this but I can't
        > >> find which one.
        > >>
        > >> Can anyone make any suggestions?
        > >
        > > any decent html-aware screen scraper library should be able to do
        > > this for you.

        And if it's really XHTML/XML, why not just use an XML parser? ;-)
        > I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
        > know the answer to this for when I do screenscraping with regular
        > expressions :-)

        Anyway, on the subject of XML parsers, here's something to try out:

        import libxml2dom
        import urllib
        f = urllib.urlopen("http://www.sweden.se/") # some Swedish site!
        s = f.read()
        f.close()
        d = libxml2dom.parseString(s, html=1)

        Here, we assume that the site isn't well-formed XML and must be treated
        as HTML, which libxml2 seems to be fairly good at doing. Then...

        for a in d.xpath("//a"):
            print repr(a.getAttribute("href")), \
                  repr(a.getAttribute("title")), \
                  repr(a.nodeValue)

        Here, we print out some of the hyperlinks in the page using repr to
        show what the strings look like (and in a way that doesn't require you
        to encode them for your terminal). On the above Swedish site, you'll
        see some things like this:

        u'Fran\xe7ais'

        What's interesting is that in some cases such strings may have been
        encoded using entities (such as in the title attributes), whereas in
        other cases they may have been encoded using UTF-8 byte sequences (such
        as in the link texts). The nice thing is that libxml2 just works it out
        on your behalf.
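
The point that both encodings end up as the same Unicode string can be checked without libxml2. The sketch below assumes Python 3 and uses only the standard library (html.unescape, bytes.decode and xml.sax.saxutils.escape), so it says nothing about how libxml2 does this internally. It also covers the "and back" direction; to_ascii_markup is a hypothetical helper name.

```python
import html
from xml.sax.saxutils import escape

# The same word, once written with a named entity and once as UTF-8 bytes.
from_entity = html.unescape("Fran&ccedil;ais")
from_utf8 = b"Fran\xc3\xa7ais".decode("utf-8")
print(repr(from_entity), repr(from_utf8))


def to_ascii_markup(text):
    """Escape &, < and >, then write non-ASCII characters as &#NNN; references."""
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_markup(from_entity))
```

Escaping the markup characters before applying xmlcharrefreplace matters: done the other way round, the ampersands of the generated references would themselves be escaped.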

        So there's no compelling need for regular expressions, but I'm sure
        Fredrik will offer some alternative suggestions... and possibly some
        good Swedish links, too. ;-)

        Paul
