Re: How can I programmatically validate html ?

**Lars Eighner** · Aug 1 '08, 04:35 PM

In our last episode, <004f629c$0$10265$c3e8da3@news.astraweb.com>, the
lovely and talented mark4asp broadcast on
comp.infosystems.www.authoring.html:

> I am importing text from a column of a database table to display as
> part of a web page in asp.net. There are about 7000 rows in the table.
>
> About 10% of the columns have their content as html and about 10% of
> those columns have badly broken html. When broken, it generally uses <tr>
> and <td> with no enclosing <table>.
>
> I have two alternatives.
>
> 1. I could write a program to create a page from each occurrence of html
> content in a row and validate that against an html parser.
>
> Has anyone done this? If so, how could it be done? Which parser could be
> used?

I have not done this, not even on TV.

From what I read down thread, if you can automate this at all, your best bet
probably is to pass stuff through tidy --- which is not a validator, but
which can fix many kinds of brokenness --- and then through a real parser like
nsgmls, either from the SP package or from OpenSP (where it is installed as
onsgmls).

You have to decide on a DOCTYPE because otherwise "validate" is meaningless.
In your circumstance HTML 4.01 loose (Transitional) seems reasonable. So it
looks like this:

1) Slap your doctype and a TITLE element on the string from the database. In
HTML 4.01, open and close HTML, HEAD, and BODY tags are optional, but the
TITLE element is required. You can also take this opportunity to groom the
empty tags. Parsing HTML with regexes is in general a bad idea, but fixing
the empty tags is a piece of cake. This might be a good place to look for
markdown and common types of wikisms such as naïve users might have
introduced and filter for them if you can determine which type they are.

(An all-wiki diagnoser and filter would be a useful contribution to the
world.)

At this point you have something that purports to be an HTML document.
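
A minimal Python sketch of step 1, under some loud assumptions: "groom the empty tags" is read here as rewriting XHTML-style `<br />` into the plain `<br>` that HTML 4.01 expects, which is only a guess at the intent. The function names `groom_empty_tags` and `wrap_fragment` and the placeholder title text are made up for illustration.

```python
import re

# The HTML 4.01 "loose" (Transitional) document type declaration.
DOCTYPE = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
           '"http://www.w3.org/TR/html4/loose.dtd">')

# Elements declared EMPTY in HTML 4.01 loose that commonly turn up self-closed.
EMPTY_ELEMENTS = "br|hr|img|input|meta|link|area|base|col|param"
_SELF_CLOSED = re.compile(r"<(%s)\b([^<>]*?)\s*/>" % EMPTY_ELEMENTS,
                          re.IGNORECASE)


def groom_empty_tags(fragment):
    """Rewrite <br /> style tags as <br>, keeping any attributes."""
    return _SELF_CLOSED.sub(r"<\1\2>", fragment)


def wrap_fragment(fragment, title="fragment"):
    """Prefix the doctype and the required TITLE element (hypothetical helper).

    The HTML, HEAD, and BODY tags are left out because they are optional.
    """
    return "%s\n<title>%s</title>\n%s\n" % (
        DOCTYPE, title, groom_empty_tags(fragment))
```

Calling `wrap_fragment('<tr><td>x<br /></td></tr>')` gives back a three-line document: the doctype, the TITLE element, and the groomed fragment.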

2) Send it through tidy (with appropriate tidy configuration or arguments).
Check to see if tidy died. If it did, write the unique ID to a list of
things that need manual intervention and go to the next record. Tidy is
very chatty, but you may want to save its error output anyway. You should
take this opportunity to get tidy to close tags and use lowercase in tags
and attribute names in case XHTML is your ultimate target or might ever
become your target one day.
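
Step 2 might be sketched in Python like this. It assumes a `tidy` binary on the PATH, UTF-8 text in the database, and that "tidy died" corresponds to tidy's exit status 2 (0 means clean, 1 means warnings only). The names `run_tidy` and `classify_tidy_exit` are hypothetical, and the `run` parameter exists only so the call can be swapped out when testing.

```python
import subprocess

# Quiet mode, UTF-8 in and out, lowercase tag and attribute names, no wrapping.
TIDY_ARGS = ["-q", "-utf8", "--uppercase-tags", "no",
             "--uppercase-attributes", "no", "--wrap", "0"]


def classify_tidy_exit(code):
    """Map tidy's exit status to a label: 0 clean, 1 warnings, else died."""
    if code == 0:
        return "clean"
    if code == 1:
        return "warnings"
    return "died"


def run_tidy(document, tidy_cmd="tidy", run=subprocess.run):
    """Pipe a document through tidy on stdin.

    Returns (status, tidied_text, error_text). The error text is handed
    back so the caller can save it, chatty or not.
    """
    proc = run([tidy_cmd] + TIDY_ARGS, input=document,
               capture_output=True, text=True, encoding="utf-8")
    return classify_tidy_exit(proc.returncode), proc.stdout, proc.stderr
```

A record whose status comes back as "died" would go on the manual-intervention list, with the error text saved next to its unique ID.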

3) (Tidy did not die) Send your tidified document through nsgmls. Discard
the output and look at the errors. You are looking for zilch in the error
file. If there are errors, record the unique ID as needing manual
intervention. You probably want to save the nsgmls errors as nsgmls is not
chatty and when it says something, it means it.
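
A hedged Python sketch of step 3. It assumes `nsgmls` (or `onsgmls`) is on the PATH and that you have an SGML catalog that resolves the HTML 4.01 DTD and its SGML declaration; the catalog path is whatever your installation provides. The `-s` flag suppresses the normal output and `-c` names the catalog. The helper name `nsgmls_errors` is invented here, and the `run` parameter is only there for testing.

```python
import subprocess


def nsgmls_errors(document, catalog, nsgmls_cmd="nsgmls", run=subprocess.run):
    """Validate a document read from stdin and return nsgmls's error lines.

    An empty list is the "zilch" you are hoping for.
    """
    proc = run([nsgmls_cmd, "-s", "-c", catalog], input=document,
               capture_output=True, text=True)
    # Errors arrive on stderr; drop blank lines so an empty list means valid.
    return [line for line in proc.stderr.splitlines() if line.strip()]
```

Any record for which the returned list is not empty gets its unique ID and the list itself written to the manual-intervention file.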

4) (Passed validation). Put the stuff through a filter to remove the
doctype and TITLE element and any extraneous stuff tidy may have added
(which will be HEAD, BODY, and HTML tags if you got tidy to add optional
tags). If 4.01 loose is not your target you may have to add stuff
to the filter to conform to what you want (such as closing empty tags if you
are going for XHTML). Write the now valid fragment back to the database
(you are not crazy enough to do any of this without extensive testing and
backing up the database first, right?)
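
The filter in step 4 could look something like the Python below. Regexes are tolerable here only because the wrapper being removed is machine-generated and predictable. The name `unwrap_fragment` is made up, and stripping META tags is an assumption based on tidy's habit of adding a generator META; if your fragments can legitimately contain META or TITLE elements, this is the wrong filter.

```python
import re

_WRAPPER = re.compile(
    r"<!DOCTYPE[^>]*>"                  # the doctype declaration
    r"|<title\b[^>]*>.*?</title\s*>"    # the whole TITLE element
    r"|<meta\b[^>]*>"                   # e.g. tidy's generator META
    r"|</?(?:html|head|body)\b[^>]*>",  # optional structural tags
    re.IGNORECASE | re.DOTALL)


def unwrap_fragment(document):
    """Strip the scaffolding and return the bare fragment (hypothetical helper)."""
    return _WRAPPER.sub("", document).strip()
```

Whatever comes out is what would be written back to the database, so it is worth eyeballing a sample of results before trusting it with all the rows.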

5) Examine the exceptions. You may find enough commonality of some failures
to devise stuff that will fix most of them in the filter at step 1. Tidy is
the weak link --- when it does what you want, it's great. When it doesn't,
perhaps you can convince it with stuff in step 1.

6) Don't sue me if you screw it up.

--
Lars Eighner <http://larseighner.com/> usenet@larseighner.com
War on Terrorism: Bad News from the Sanity Front
"There's one thing ... that I do like about Rumsfeld, he's just a little bit
crazy, OK"? --Thomas Friedman, _The New York Times_