HTML Scraping using XmlTextReader

**Oleg Tkachenko** · Nov 11 '05, 10:59 PM

Daniel wrote:
> Just wondering if it is possible to use XmlTextReader to
> read off a html doc:
Not really, because HTML is not XML. Some HTML documents happen to be
well-formed, so they can be read by XmlTextReader, but in general a single
<br> tag or the &nbsp; entity, ubiquitous in HTML, will stop the reader.
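
To illustrate the point about an unclosed <br>, here is a minimal sketch. The class name `WellFormedCheck`, the helper `TryRead`, and the sample strings are made up for illustration; only `XmlTextReader`, `StringReader`, and `XmlException` come from the framework.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormedCheck
{
    // Hypothetical helper: returns null if XmlTextReader reads the whole
    // input, otherwise the message of the XmlException it raised.
    public static string TryRead(string markup)
    {
        try
        {
            using (XmlTextReader reader = new XmlTextReader(new StringReader(markup)))
            {
                // Pull every node; a well-formedness error surfaces here.
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        string[] samples =
        {
            "<p>line one<br/>line two</p>", // XHTML style: empty element is closed
            "<p>line one<br>line two</p>"   // HTML style: <br> is never closed
        };

        foreach (string sample in samples)
        {
            string error = TryRead(sample);
            Console.WriteLine("{0} -> {1}", sample,
                error == null ? "read OK" : "XmlException: " + error);
        }
    }
}
```

The first sample should read to the end, while the second should fail once the reader meets </p> with <br> still open.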
> e.g. XmlTextReader tr = new XmlTextReader
> ("http://localhost/test.xml");
>
> where test.xml contains the following:
>
> <table cellspacing="1" cellpadding="1" width="100%">
> <tr valign="top">
> <td class="head" width="20%">test heading1</td>
> <td class="head" width="10%">test heading2</td>
> </tr>
> <tr valign="top">
> <td class="content" width="20%">content1</td>
> <td class="content" width="10%">
> <table cellspacing="0" width="100%">
> <tr>
> <td align="left">test</td>
> <td nowarp align="right">

Watch out for nowrap: it is a so-called boolean attribute (a name with no value), and XML doesn't support that.
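
A small sketch of the boolean-attribute problem, assuming nothing beyond the framework's `XmlTextReader`. The class `BooleanAttributeDemo` and the helper `CountAttributes` are hypothetical names used only for this example.

```csharp
using System;
using System.IO;
using System.Xml;

public static class BooleanAttributeDemo
{
    // Hypothetical helper: returns the number of attributes on the first
    // element, or -1 if the markup is not well-formed XML.
    public static int CountAttributes(string markup)
    {
        try
        {
            using (XmlTextReader reader = new XmlTextReader(new StringReader(markup)))
            {
                reader.MoveToContent();            // parses the start tag
                int count = reader.AttributeCount;
                while (reader.Read()) { }          // consume the rest of the input
                return count;
            }
        }
        catch (XmlException)
        {
            return -1;
        }
    }

    public static void Main()
    {
        // HTML style: attribute name without a value.
        Console.WriteLine(CountAttributes("<td nowrap align=\"right\">x</td>"));
        // XHTML style: the value repeats the attribute name.
        Console.WriteLine(CountAttributes("<td nowrap=\"nowrap\" align=\"right\">x</td>"));
    }
}
```

The HTML-style cell should yield -1, because XML requires every attribute to have a quoted value. The XHTML spelling `nowrap="nowrap"` should parse and report two attributes.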

Try SGMLReader instead of XmlTextReader:

http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

--
Oleg Tkachenko

Signs on the Sand

http://www.tkachenko.com/blog

Multiconn Technologies, Israel

**Daniel** · Nov 11 '05, 10:59 PM

Thanks Oleg,

The URL you provided looks very interesting, and judging by
the replies SGMLReader has received, people are definitely
finding it useful. I will download it and have a play with it.

However, I do want to learn more about reading HTML using
the XmlTextReader. Do you (or anybody out there) know of a
good URL to get me started?

Cheers.

**Oleg Tkachenko** · Nov 11 '05, 11:00 PM

Daniel wrote:
> However, I do want to learn more about reading html using
> the XmlTextReader. Do you (or anybody out there) know of a
> good url to get me started?
Not really. It is technically impossible to read arbitrary HTML with
XmlTextReader without some sort of preprocessing (that is, converting the HTML
to XML or XHTML). HTML Tidy is often used for that too; Google for "HTML Tidy".
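
Once the markup has been converted to well-formed XHTML, reading it with XmlTextReader is straightforward. The sketch below assumes the conversion has already happened: the input string is hand-written well-formed markup standing in for the output of a tool such as Tidy, and the names `CellReader` and `ReadCells` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class CellReader
{
    // Hypothetical helper: collects "class: text" for every <td> element
    // in already well-formed XHTML.
    public static List<string> ReadCells(string xhtml)
    {
        List<string> cells = new List<string>();
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "td")
                {
                    string cssClass = reader.GetAttribute("class");
                    // ReadString returns the text directly inside the element.
                    string text = reader.ReadString();
                    cells.Add(cssClass + ": " + text);
                }
            }
        }
        return cells;
    }

    public static void Main()
    {
        // Assumed input: a simplified, well-formed version of the table.
        string xhtml =
            "<table cellspacing=\"1\" cellpadding=\"1\" width=\"100%\">" +
            "<tr valign=\"top\">" +
            "<td class=\"head\" width=\"20%\">test heading1</td>" +
            "<td class=\"head\" width=\"10%\">test heading2</td>" +
            "</tr></table>";

        foreach (string cell in ReadCells(xhtml))
        {
            Console.WriteLine(cell);
        }
        // prints:
        // head: test heading1
        // head: test heading2
    }
}
```

The reader itself needs nothing special here; all the work is in making the input well-formed first.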
