Help with regular expressions

**Sybren Stuvel** · Jul 18 '05, 02:01 AM

Re: Help with regular expressions

dmbkiwi enlightened us with:
> A couple of other people have contributed code to this project,
> particularly relating to the parsing of the websites.
> Unfortunately, it is not parsing one particular part of the website
> properly. This is because it is expecting the data to be in a
> certain form, and occasionally it is in a different form.
> Unfortunately this causes the entire script to fail to run.

You seem to expect old HTML. Why not use XHTML only ('tidy' can
convert between them) and use a regular XML parser? Much, much, much
easier! And you won't have to be afraid of messing up your regular
expressions ;-)

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?

**dmbkiwi** · Jul 18 '05, 02:01 AM

Re: Help with regular expressions

On Tue, 26 Aug 2003 08:47:33 +0000, Sybren Stuvel wrote:
> dmbkiwi enlightened us with:
>> A couple of other people have contributed code to this project,
>> particularly relating to the parsing of the websites.
>> Unfortunately, it is not parsing one particular part of the website
>> properly. This is because it is expecting the data to be in a
>> certain form, and occasionally it is in a different form.
>> Unfortunately this causes the entire script to fail to run.
>
> You seem to expect old HTML. Why not use XHTML only ('tidy' can
> convert between them) and use a regular XML parser? Much, much, much
> easier! And you won't have to be afraid of messing up your regular
> expressions ;-)
>
> Sybren

XML would be nice, but unfortunately I have no choice as to the markup
language used by the site. It's a website on the world wide web, not a
site over which I have any control. My regular expressions are at the
mercy of the developers of that site.

Any other suggestions?

Matt

**John J. Lee** · Jul 18 '05, 02:03 AM

Re: Help with regular expressions

dmbkiwi <dmbkiwi@yahoo.com> writes:
> On Tue, 26 Aug 2003 08:47:33 +0000, Sybren Stuvel wrote:
[...]
> > You seem to expect old HTML. Why not use XHTML only ('tidy' can
> > convert between them) and use a regular XML parser? Much, much, much
> > easier! And you won't have to be afraid of messing up your regular
> > expressions ;-)
> >
> > Sybren
>
> XML would be nice, but unfortunately I have no choice as to the markup
> language used by the site. It's a website on the world wide web, not a
> site over which I have any control. My regular expressions are at the
> mercy of the developers of that site.

You misunderstand. HTMLTidy (or its descendant, tidylib) reads ugly,
non-conformant HTML and spits out clean, conformant XHTML (or HTML).

uTidylib is a ctypes wrapper of tidylib.

import tidy
from cStringIO import StringIO
tidydoc = tidy.parseString(html)  # html: the raw page source as a string
s = StringIO()
tidydoc.write(s)  # write the cleaned-up document into the buffer
tidied_html = s.getvalue()

mxTidy is a wrapper of a shared-library-ized HTMLTidy.

from mx.Tidy import tidy
tidied_html = tidy(html)[2]
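
Once the markup is well-formed, the regular expressions can give way to an ordinary XML parser, which is the point of the suggestion upthread. Here is a minimal sketch of that step, using the standard library's xml.etree.ElementTree. The sample page, the "forecast" class and the extract_rows helper are all made up for illustration; the hard-coded string only stands in for whatever tidy returns. It assumes tidy was asked for XHTML output, so the elements live in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Stand-in for tidied_html: well-formed XHTML of the kind tidy is expected
# to emit. The page content and the "forecast" class are hypothetical.
tidied_html = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Weather</title></head>
<body>
<table class="forecast">
<tr><td>Mon</td><td>18</td></tr>
<tr><td>Tue</td></tr>
</table>
</body>
</html>"""

# XHTML elements are namespaced, so tag names carry this prefix.
XHTML = "{http://www.w3.org/1999/xhtml}"


def extract_rows(xhtml):
    """Return the cell texts of every table row, one list per row."""
    root = ET.fromstring(xhtml)
    rows = []
    for tr in root.iter(XHTML + "tr"):
        # A missing or empty cell yields "" instead of raising an error.
        rows.append([(td.text or "").strip() for td in tr.iter(XHTML + "td")])
    return rows


rows = extract_rows(tidied_html)
```

A row with fewer cells than expected simply comes back as a shorter list, so the script can skip or special-case it rather than fail outright. One caveat: real pages often contain named entities such as &nbsp;, which ElementTree rejects as undefined; tidy's numeric-entities option is meant to help with that, but check the documentation of your tidy version.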

John
