Re: How can I programmatically validate html ?

  • Lars Eighner


    In our last episode, <004f629c$0$10265$c3e8da3@news.astraweb.com>, the
    lovely and talented mark4asp broadcast on
    comp.infosystems.www.authoring.html:
    I am importing text from a column of a database table to display as
    part of a web page in asp.net. There are about 7000 rows in the table.
    About 10% of the columns have their content as html and about 10% of
    those columns have badly broken html. When broken it generally uses <tr>
    and <td> content with no enclosing <table>.
    I have two alternatives.
    1. I could write a program to create a page from each occurrence of html
    content in a row and validate that against a html parser.
    Has anyone done this? If so, how could it be done? Which parser could be
    used?

    I have not done this, not even on TV.

    From what I read down thread, if you can automate this at all, your best bet
    probably is to pass stuff through tidy (which is not a validator, but which
    can fix many kinds of brokenness) and then through a real parser like
    nsgmls from the SP package (OpenSP installs it as onsgmls).

    You have to decide on a DOCTYPE because otherwise "validate" is meaningless.
    In your circumstance 4.01 loose seems reasonable. So it looks like this:

    1) Slap your doctype on the string from the database and a TITLE element. In
    html 4.01, open and close HTML, HEAD, and BODY tags are optional, but the
    TITLE element is required. You can also take this opportunity to groom the
    empty tags. Parsing HTML with regexes is in general a bad idea, but fixing
    the empty tags is a piece of cake. This might be a good place to look for
    markdown and common types of wikisms such as naïve users might have
    introduced and filter for them if you can determine which type they are.

    (An all-wiki diagnoser and filter would be a useful contribution to the
    world.)

    At this point you have something that purports to be an HTML document.
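
Step 1 might look something like the sketch below. It is only a sketch: I am reading "groom the empty tags" as turning XHTML-style `<br />` into plain `<br>`, and the names `wrap_fragment`, `groom_empty_tags`, and the placeholder title are made up for illustration.

```python
import re

# Assumption: HTML 4.01 Transitional ("loose"), as suggested in the post.
DOCTYPE = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
           '"http://www.w3.org/TR/html4/loose.dtd">')

# Hypothetical placeholder title; it gets stripped again after validation.
PLACEHOLDER_TITLE = "<title>fragment</title>"

# XHTML-style empty tags such as <br /> or <hr/>. The element list is an
# assumption; extend it to whatever actually turns up in the data.
_SELF_CLOSED = re.compile(
    r"<(br|hr|img|input|meta|link)\b([^<>]*?)\s*/>", re.IGNORECASE)


def groom_empty_tags(fragment: str) -> str:
    """Rewrite <br /> as <br>, keeping any attributes."""
    return _SELF_CLOSED.sub(r"<\1\2>", fragment)


def wrap_fragment(fragment: str) -> str:
    """Prefix the doctype and a TITLE so the fragment purports to be a document."""
    return f"{DOCTYPE}\n{PLACEHOLDER_TITLE}\n{groom_empty_tags(fragment)}\n"
```

Anything smarter than this (markdown or wiki-markup detection, say) would go into the same pre-filter.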

    2) Send it through tidy (with appropriate tidy configuration or arguments).
    Check to see if tidy died. If it did, write the unique ID to a list of
    things that need manual intervention and go to the next record. Tidy is
    very chatty, but you may want to save its error output anyway. You should
    take this opportunity to get tidy to close tags and use lowercase in tags
    and attribute names in case XHTML is your ultimate target or might ever
    become your target one day.

    3) (Tidy did not die) Send your tidified document through nsgmls. Discard
    the output and look at the errors. You are looking for zilch in the error
    file. If there are errors, record the unique ID as needing manual
    intervention. You probably want to save the nsgmls errors as nsgmls is not
    chatty and when it says something, it means it.

    4) (Passed validation) Put the stuff through a filter to remove the
    doctype and TITLE element and any extraneous stuff tidy may have added at
    this point (which will be HEAD, BODY, and HTML tags if you got tidy to add
    optional tags). If 4.01 loose is not your target, you may have to add stuff
    to the filter to conform to what you want (such as closing empty tags if you
    are going for XHTML). Write the now-valid fragment back to the database.
    (You are not crazy enough to do any of this without extensive testing and
    backing up the database first, right?)

    5) Examine the exceptions. You may find enough commonality of some failures
    to devise stuff that will fix most of them in the filter at step 1. Tidy is
    the weak link: when it does what you want, it's great. When it doesn't,
    perhaps you can convince it with stuff in step 1.

    6) Don't sue me if you screw it up.
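
For what it is worth, steps 2 through 4 could be strung together roughly as below. This is a sketch under assumptions, not a tested tool: it assumes `tidy` and `nsgmls` are on the PATH, that nsgmls can find the HTML 4.01 DTD (you will likely need a `-c` catalog argument, and OpenSP calls the binary `onsgmls`), and that a tidy exit status of 2 or more counts as "tidy died". The function names `run_tool`, `unwrap`, and `process` are made up for illustration.

```python
import re
import subprocess

TIDY_CMD = ["tidy", "-q"]        # -q: suppress nonessential output
NSGMLS_CMD = ["nsgmls", "-s"]    # -s: suppress output, report errors only

# Wrapper material to strip in step 4: doctype, TITLE, tidy's generator
# META, and the HTML/HEAD/BODY tags tidy adds.
_WRAPPER = re.compile(
    r"<!DOCTYPE[^>]*>|<title\b[^>]*>.*?</title>|<meta\b[^>]*>"
    r"|</?(?:html|head|body)\b[^>]*>",
    re.IGNORECASE | re.DOTALL)


def run_tool(cmd, text):
    """Pipe text through an external command; return (exit code, stdout, stderr)."""
    proc = subprocess.run(cmd, input=text, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def unwrap(document):
    """Step 4: strip the wrapper and return the bare fragment."""
    return _WRAPPER.sub("", document).strip()


def process(document, run=run_tool):
    """Return (fragment, "") on success, or (None, error text) for manual review."""
    # Step 2: tidy. Exit status 0 = clean, 1 = warnings, 2 = errors.
    code, tidied, tidy_messages = run(TIDY_CMD, document)
    if code >= 2:
        return None, tidy_messages
    # Step 3: nsgmls. Output is discarded; any error text means failure.
    _, _, sgml_errors = run(NSGMLS_CMD, tidied)
    if sgml_errors.strip():
        return None, sgml_errors
    return unwrap(tidied), ""
```

The caller would loop over the rows, write back the fragment when one comes back, and otherwise log the unique ID together with the error text for the step 5 post-mortem. The `run` parameter is there only so the logic can be tried out without the external tools installed.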


    --
    Lars Eighner <http://larseighner.com/> usenet@larseighner.com
    War on Terrorism: Bad News from the Sanity Front
    "There's one thing ... that I do like about Rumsfeld, he's just a little bit
    crazy, OK"? --Thomas Friedman, _The New York Times_