HTML Scraping using XmlTextReader

  • Daniel

    HTML Scraping using XmlTextReader

    Greetings.

    Just wondering if it is possible to use XmlTextReader to
    read off a html doc:

    e.g. XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");

    where test.xml contains the following:

    <table cellspacing="1" cellpadding="1" width="100%">
    <tr valign="top">
    <td class="head" width="20%">test heading1</td>
    <td class="head" width="10%">test heading2</td>
    </tr>
    <tr valign="top">
    <td class="content" width="20%">content1</td>
    <td class="content" width="10%">
    <table cellspacing="0" width="100%">
    <tr>
    <td align="left">test</td>
    <td nowarp align="right">
    <nobr>0.123456</nobr>
    </td>
    </tr>
    </table>
    </td>
    </tr>
    </table>

    It seems to work for the first few seconds, and then it
    crashes my Windows app when the XmlTextReader comes across
    certain content during a call to XmlTextReader.Read(). Is
    it to do with the well-formedness (is there such a word??) of
    this HTML doc? Also, is there a way to detect and convert
    &nbsp; to the #1390 (can't remember if this is right, but I
    am trying to say the equivalent special character) on the
    fly (i.e. without saving the HTML to disk)?

    Any thought will be appreciated.
  • Oleg Tkachenko

    #2
    Re: HTML Scraping using XmlTextReader

    Daniel wrote:
    > Just wondering if it is possible to use XmlTextReader to
    > read off a html doc:

    Not really, because HTML is not XML. Some HTML docs might be well-formed, so
    they can be read by XmlTextReader, but in general a single <br> tag or
    the &nbsp; entity, ubiquitous in HTML, will stop the reader.

    > e.g. XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");
    >
    > where test.xml contains the following:
    >
    > <table cellspacing="1" cellpadding="1" width="100%">
    > <tr valign="top">
    > <td class="head" width="20%">test heading1</td>
    > <td class="head" width="10%">test heading2</td>
    > </tr>
    > <tr valign="top">
    > <td class="content" width="20%">content1</td>
    > <td class="content" width="10%">
    > <table cellspacing="0" width="100%">
    > <tr>
    > <td align="left">test</td>
    > <td nowarp align="right">

    Watch nowrap: it's a so-called boolean attribute, and XML doesn't support those.

    Try SGMLReader instead of XmlTextReader.
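
As a rough illustration of the points above, the following C# sketch feeds a fragment to XmlTextReader from memory and reports whether it could be read to the end. The class and method names (WellFormednessDemo, CanRead) are made up for this example; only XmlTextReader, StringReader and XmlException come from the .NET class library.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class WellFormednessDemo
{
    // Walks every node of the markup with XmlTextReader.
    // Returns true if the reader reaches the end; otherwise returns false
    // and puts the parser's message in 'error'.
    public static bool CanRead(string markup, out string error)
    {
        error = string.Empty;
        XmlTextReader reader = new XmlTextReader(new StringReader(markup));
        try
        {
            while (reader.Read())
            {
                // Nothing to do per node: we only care whether parsing succeeds.
            }
            return true;
        }
        catch (XmlException ex)
        {
            // Thrown as soon as the reader hits markup that is not well-formed XML.
            error = ex.Message;
            return false;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

With this helper, `<td>a<br/>b</td>` should be read without complaint, while `<td>a<br>b</td>` (the br element is never closed) and `<td nowrap align="right">x</td>` (an attribute without a value) should each make CanRead return false. Catching XmlException around Read() in this way is also what keeps a malformed page from taking the whole application down.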

    --
    Oleg Tkachenko

    Multiconn Technologies, Israel


    • Daniel

      #3
      Re: HTML Scraping using XmlTextReader

      Thanks Oleg,

      The URL you provided looks very interesting, and judging by
      the replies SGMLReader has received, people are
      definitely finding it useful. I will definitely
      download it and have a play with it.

      However, I do want to learn more about reading HTML using
      the XmlTextReader. Do you (or anybody out there) know of a
      good URL to get me started?

      Cheers.
      >-----Original Message-----
      >Daniel wrote:
      >
      >> Just wondering if it is possible to use XmlTextReader to
      >> read off a html doc:
      >
      >Not really, because HTML is not XML. Some HTML docs might be
      >well-formed, so they can be read by XmlTextReader, but in general
      >a single <br> tag or the &nbsp; entity, ubiquitous in HTML, will
      >stop the reader.
      >
      >> e.g. XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");
      >>
      >> where test.xml contains the following:
      >>
      >> <table cellspacing="1" cellpadding="1" width="100%">
      >> <tr valign="top">
      >> <td class="head" width="20%">test heading1</td>
      >> <td class="head" width="10%">test heading2</td>
      >> </tr>
      >> <tr valign="top">
      >> <td class="content" width="20%">content1</td>
      >> <td class="content" width="10%">
      >> <table cellspacing="0" width="100%">
      >> <tr>
      >> <td align="left">test</td>
      >> <td nowarp align="right">
      >
      >Watch nowrap: it's a so-called boolean attribute, and XML doesn't support those.
      >
      >Try SGMLReader instead of XmlTextReader
      >http://www.gotdotnet.com/Community/U...es/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC
      >--
      >Oleg Tkachenko
      >http://www.tkachenko.com/blog
      >Multiconn Technologies, Israel


      • Oleg Tkachenko

        #4
        Re: HTML Scraping using XmlTextReader

        Daniel wrote:
        > However, I do want to learn more about reading html using
        > the XmlTextReader. Do you (or anybody out there) know of a
        > good url to get me started?

        Not really. It's just technically impossible to read HTML with XmlTextReader
        without some sort of preprocessing (i.e. converting the HTML to XML or
        XHTML). Tidy is often used for that too. Google for "HTML Tidy".
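
To make the preprocessing idea concrete for the &nbsp; case asked about earlier in the thread, here is a minimal C# sketch that rewrites the entity in memory and parses the result from a StringReader, so nothing is saved to disk. It assumes the page is already held in a string (for instance fetched with WebClient.DownloadString), and the names NbspPreprocessor, ReplaceNbsp and ReadText are invented for this example. The numeric character reference for a non-breaking space is &#160;.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class NbspPreprocessor
{
    // Replaces the HTML-only entity &nbsp; with the numeric character
    // reference &#160;, which an XML parser accepts without a DTD.
    public static string ReplaceNbsp(string html)
    {
        return html.Replace("&nbsp;", "&#160;");
    }

    // Parses the preprocessed markup from memory and concatenates
    // the values of all text nodes it contains.
    public static string ReadText(string html)
    {
        StringBuilder text = new StringBuilder();
        XmlTextReader reader = new XmlTextReader(new StringReader(ReplaceNbsp(html)));
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```

This only deals with the one entity. Unclosed tags and attributes without values would still need a real HTML-to-XML converter such as SGMLReader or HTML Tidy, as suggested in the replies.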
        --
        Oleg Tkachenko

        Multiconn Technologies, Israel
