Finding HTML tags in streaming HTML

  • jehugaleahsa@gmail.com

    Finding HTML tags in streaming HTML

    Hello:

    Currently, I have a system that uses Regex to find tags in a
    string of HTML. My company now needs me to read the HTML
    dynamically from a stream, so as to avoid long waits on large pages or
    slow servers.

    Does anyone know of a good way to do this? There is no guarantee that
    the pages are proper HTML, since this pulls from real web sites.

    How tolerant are the XmlReaders when it comes to bad HTML?

    Thanks,
    Travis
  • rossum

    #2
    Re: Finding HTML tags in streaming HTML

    On Sun, 6 Jul 2008 13:51:29 -0700 (PDT), "jehugaleahsa@gmail.com"
    <jehugaleahsa@gmail.com> wrote:
    >Hello:
    >
    >Currently, I have a system that uses Regex to find tags in a
    >string of HTML. My company now needs me to read the HTML
    >dynamically from a stream, so as to avoid long waits on large pages or
    >slow servers.
    You cannot be sure that you have seen everything on the page
    until it has all loaded. Reading from the input stream will not make
    a slow external server run any faster. Unless your processing is
    taking a long time, I suspect your bosses will be disappointed.

    Finding tags is a matter of looking for "<" and parsing the subsequent
    characters. Do you need all tags or just a subset of them?
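
    As a rough illustration of the "look for < and parse what follows"
    approach, here is a minimal sketch in Python (not .NET, which the
    thread implies). The function name iter_tags and the chunked input
    are assumptions made for the example. It carries an unfinished tag
    over from one chunk to the next, so a tag split across two reads is
    still found. It deliberately ignores comments, script blocks, and
    ">" characters inside attribute values, which real pages will contain.

```python
from typing import Iterable, Iterator


def iter_tags(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the raw text of each tag as soon as its closing '>' arrives.

    Hypothetical helper for illustration: `chunks` stands in for successive
    reads from a network stream that have already been decoded to text.
    """
    buffer = ""  # unfinished tag text carried over between chunks
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find("<")
            if start == -1:
                # Only plain text so far; nothing needs to be kept.
                buffer = ""
                break
            end = buffer.find(">", start)
            if end == -1:
                # The tag is incomplete; keep it and wait for more input.
                buffer = buffer[start:]
                break
            yield buffer[start:end + 1]
            buffer = buffer[end + 1:]


if __name__ == "__main__":
    # Simulated stream whose chunk boundaries fall in the middle of tags.
    stream = ["<html><bo", "dy><a hre", "f='x'>link</a", "></body>"]
    for tag in iter_tags(stream):
        print(tag)
```

    Because the generator yields each tag as soon as it is complete, the
    caller can act on early tags while later chunks are still arriving.
    The total download time is unchanged.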
    >
    >Does anyone know of a good way to do this? There is no guarantee that
    >the pages are proper HTML, since this pulls from real web sites.
    >
    >How tolerant are the XmlReaders when it comes to bad HTML?
    Not at all. Better to run the page through an HTML to XHTML
    translator first, that way the XML parser will not throw a wobbly.
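
    To make the "not at all" concrete, here is a small sketch in Python
    rather than .NET. It shows a strict XML parser rejecting markup
    with an unclosed <br>, while a tolerant HTML parser accepts the same
    markup fed in pieces. It does not perform the HTML to XHTML
    translation suggested above. The tolerant parser is Python's standard
    html.parser, used here as a stand-in, and the names xml_accepts,
    TagCollector, and collect_start_tags are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# <br> is never closed, so this is acceptable HTML but not well-formed XML.
BAD_HTML = "<p>one<br>two</p>"


def xml_accepts(markup: str) -> bool:
    """Return True if a strict XML parser can parse the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Tolerant parser that records the name of every start tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_start_tags(chunks):
    """Feed the chunks one at a time, as they might arrive from a stream."""
    parser = TagCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    print(xml_accepts(BAD_HTML))                       # strict XML parser
    print(collect_start_tags(["<p>one<b", "r>two</p>"]))  # tolerant HTML parser
```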

    rossum
    >
    >Thanks,
    >Travis
