HTMLParser problems.

**Terry Reedy** · Jul 18 '05, 04:39 AM

Re: HTMLParser problems.

"Sean Cody" <sean@-[NOSPAMPLEASE]-tfh.ca> wrote in message
news:kwfob.10197$f7.552358@localhost...
> I could try:
>     def handle_entityref(self, entity):
>         if self.in_td == 1:
>             if entity == "nbsp":
>                 self.row.append(-1)
>
> But that seems ugly... (comments?).

Does this work? For me, that comes first.

tjr

**Sean Cody** · Jul 18 '05, 04:40 AM

Re: HTMLParser problems.

> > I could try:
> >     def handle_entityref(self, entity):
> >         if self.in_td == 1:
> >             if entity == "nbsp":
> >                 self.row.append(-1)
> >
> > But that seems ugly... (comments?).
>
> Does this work? For me, that comes first.
>
Actually yes it does.
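
For anyone trying this on Python 3, here is a minimal sketch of the handle_entityref() idea. It assumes the standard html.parser module with convert_charrefs=False (with the default of True, entities are converted to text and handle_entityref() is not called). The RowParser name and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collects the cells of each <tr> into a list; &nbsp; cells become -1."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref() in play on Python 3.
        super().__init__(convert_charrefs=False)
        self.rows = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        data = data.strip()
        if self.in_td and data:
            self.rows[-1].append(data)

    def handle_entityref(self, name):
        # Called once per named entity such as &nbsp; inside the document.
        if self.in_td and name == "nbsp":
            self.rows[-1].append(-1)


parser = RowParser()
parser.feed("<table><tr><td>7:05</td><td>&nbsp;</td><td>7:35</td></tr></table>")
parser.close()
print(parser.rows)  # [['7:05', -1, '7:35']]
```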

I wonder if there is a better way as I'm just stumbling through the
HTMLParser class.
The best thing about Python is that stumbling through getting things done is
not as painful as it would be in other languages.

I use a lot of member variables. Is there a way to not have to reference
members by self.member? Back in the day in pascal you could do stuff like
"with self begin do_stuff(member_variable); end;" which was extremely useful
for large 'records.'

--
Sean

**Peter Otten** · Jul 18 '05, 04:40 AM

Re: HTMLParser problems.

Sean Cody wrote:
> I'm trying to take a webpage that has a nxn table of entries (bus times)
> and convert it to a 2D array (list of lists). Initially this was simple but I
> need to be able to access whole 'columns' of data so the 2D array cannot
> be sparse but in the HTML file I'm parsing there can be sparse entries
> which are represented in the table as &nbsp entities. The sparse output
> breaks my ability to use entire columns and have entries correspond properly.
>
> Is there a simple way to tell the parser whenever you see a &nbsp in table
> data return say... "-1" or "NaN"?
> The HTMLParser documentation is a bit.... terse. I was considering using
> the handle_entityref() method but I would assume the data has already been
> parsed at that point.
>
> I could try:
>     def handle_entityref(self, entity):
>         if self.in_td == 1:
>             if entity == "nbsp":
>                 self.row.append(-1)
>
> But that seems ugly... (comments?).
>
> As an example here is some code I'm using and partial output:

[...]
> parser.feed(socket.read())

The simplest solution is to replace the above line with

    parser.feed(socket.read().replace("&nbsp;", "NaN"))

Below is an only slightly more robust solution. It implements a rudimentary
"what table are we in?" check and can handle table cells with multiple data
chunks.

    import htmllib, os, string, urllib
    from HTMLParser import HTMLParser

    class foo(HTMLParser):
        def __init__(self):
            self.matrix = []
            self.row = None
            self.cell = None
            self.in_table = 0
            self.empty = "NaN"
            self.reset()

        def handle_starttag(self, tag, attrs):
            if tag == "table":
                self.in_table += 1
            elif self.in_table == 2:
                if tag == "td":
                    assert self.cell is None
                    self.cell = []
                elif tag == "tr":
                    self.row = []
                    self.matrix.append(self.row)

        def handle_data(self, data):
            if self.in_table == 2:
                if self.cell is not None:
                    data = string.strip(data)
                    if data or True:
                        self.cell.append(data)

        def handle_endtag(self, tag):
            if tag == "table":
                self.in_table -= 1
            elif self.in_table == 2:
                if tag == "td":
                    s = " ".join(self.cell).replace("\n", " ")
                    if s == "":
                        s = self.empty
                    self.row.append(s)
                    self.cell = None
                elif tag == "tr":
                    self.row = None

    parser = foo()
    if 0:
        instream = urllib.urlopen(
            "http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
    else:
        instream = file("105413bottom.html")
    data = instream.read()
    parser.feed(data)
    instream.close()
    parser.close()
    for row in parser.matrix:
        assert len(row) == 4
        print row

I've replaced the urlopen() call with access to a local file as you might
want to run your tests with a local copy of the time table, too.
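
A rough Python 3 port of the parser above, as a sketch rather than a drop-in replacement. The `depth` argument (which nesting level of `<table>` to read), the TableParser name, and the sample HTML are assumptions for illustration; the original hard-codes a depth of 2 for that particular page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Builds a dense list of rows; empty or &nbsp;-only cells become `empty`."""

    def __init__(self, depth=1, empty="NaN"):
        # Default convert_charrefs=True: &nbsp; reaches handle_data() as "\xa0".
        super().__init__()
        self.matrix = []
        self.row = None
        self.cell = None
        self.in_table = 0
        self.depth = depth      # hypothetical knob, not in the original code
        self.empty = empty

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table += 1
        elif self.in_table == self.depth:
            if tag == "td":
                self.cell = []
            elif tag == "tr":
                self.row = []
                self.matrix.append(self.row)

    def handle_data(self, data):
        if self.in_table == self.depth and self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table -= 1
        elif self.in_table == self.depth:
            if tag == "td" and self.cell is not None and self.row is not None:
                # str.split() with no argument also treats "\xa0" as whitespace.
                text = " ".join("".join(self.cell).split())
                self.row.append(text or self.empty)
                self.cell = None
            elif tag == "tr":
                self.row = None


html = """
<table>
<tr><td>7:05</td><td>&nbsp;</td><td>7:35</td></tr>
<tr><td>8:05</td><td>8:20</td><td>&nbsp;</td></tr>
</table>
"""
parser = TableParser()
parser.feed(html)
parser.close()
columns = list(zip(*parser.matrix))   # transpose rows into columns
print(parser.matrix)
print(columns[1])
```

zip(*parser.matrix) gives the whole columns the original question asked about, and it lines up correctly only because every row ends up the same length.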

Peter

**John J. Lee** · Jul 18 '05, 04:40 AM

Re: HTMLParser problems.

"Sean Cody" <sean@-[NOSPAMPLEASE]-tfh.ca> writes:
> > > I could try:
> > >     def handle_entityref(self, entity):
> > >         if self.in_td == 1:
> > >             if entity == "nbsp":
> > >                 self.row.append(-1)
> > >
> > > But that seems ugly... (comments?).
> >
> > Does this work? For me, that comes first.
> >
> Actually yes it does.
>
> I wonder if there is a better way as I'm just stumbling through the
> HTMLParser class.
[...]

Seems OK to me.

> I use a lot of member variables. Is there a way to not have to reference
> members by self.member. Back in the day in pascal you could do stuff like
> "with self begin do_stuff(member_variable); end;" which was extremely useful
> for large 'records.'

Well, obviously, there's:

    mv = self.member_variable
    do_stuff(mv)

or if you have lots of names that are annoying you, things like:

    for name in "foo", "bar", "baz":
        do_stuff(getattr(self, name))

can help.
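
Both tricks in one small, self-contained sketch. The Stop class and its attributes are hypothetical, chosen only to have something to look up.

```python
class Stop:
    """Hypothetical record with several attributes, for illustration only."""

    def __init__(self):
        self.name = "Main St"
        self.route = 11
        self.times = ["7:05", "7:35"]

    def describe(self):
        # Bind the attribute to a short local name once, then reuse it.
        times = self.times
        first, last = times[0], times[-1]
        # getattr() fetches attributes by name when there are many of them.
        fields = [str(getattr(self, attr)) for attr in ("name", "route")]
        return " ".join(fields + [first, last])


print(Stop().describe())  # Main St 11 7:05 7:35
```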

John

**John J. Lee** · Jul 18 '05, 04:40 AM

Re: HTMLParser problems.

Peter Otten <__peter__@web.de> writes:

> Sean Cody wrote:
[...]
> The simplest solution is to replace the above line with
>
>     parser.feed(socket.read().replace("&nbsp;", "NaN"))
[...]

That's platform-dependent, if you're relying on float("NaN").
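
On current Python 3, float("nan") is accepted everywhere, so a sketch of using it as the gap marker might look like the following. The column data, the "H:MM" cell format, and the to_minutes() helper are assumptions for illustration, not part of the original code.

```python
import math

# Hypothetical column pulled out of the parsed table; "NaN" marks a gap.
column = ["7:05", "NaN", "7:35"]


def to_minutes(cell):
    """Turn 'H:MM' into minutes after midnight, or float('nan') for a gap."""
    if cell == "NaN":
        return float("nan")
    hours, minutes = cell.split(":")
    return int(hours) * 60 + int(minutes)


values = [to_minutes(cell) for cell in column]
# NaN never compares equal to anything, so test for it with math.isnan().
present = [v for v in values if not math.isnan(v)]
print(present)  # [425, 455]
```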

John

**Paul Clinch** · Jul 18 '05, 04:40 AM

Re: HTMLParser problems.

You could patch it with:

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = 1
            self.row.append("")
        elif tag == "tr":
            self.in_tr = 1

    def handle_data(self, data):
        if self.in_td == 1:
            data = string.lstrip(data)
            if data != "":
                self.row[-1] = data

i.e. create the element and then later possibly replace it.

BTW True can be used as 1, an empty string is false, strings have
methods, and "if self.in_td:" is OK, so:

    def handle_data(self, data):
        if self.in_td:
            data = data.lstrip()
            if data:
                self.row[-1] = data

is equivalent.
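
A runnable Python 3 sketch of the same placeholder-then-replace idea, assuming the standard html.parser module. The PlaceholderParser name, the `empty` argument, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class PlaceholderParser(HTMLParser):
    """Every <td> gets a slot up front; real text overwrites it later."""

    def __init__(self, empty=""):
        super().__init__()
        self.rows = []
        self.in_td = False
        self.empty = empty

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.in_td = True
            self.rows[-1].append(self.empty)  # placeholder for this cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            # On Python 3, str.strip() also removes "\xa0" (a decoded &nbsp;).
            data = data.strip()
            if data:
                self.rows[-1][-1] = data


parser = PlaceholderParser()
parser.feed("<tr><td>7:05</td><td>&nbsp;</td><td></td></tr>")
parser.close()
print(parser.rows)  # [['7:05', '', '']]
```

Because the slot exists before any data arrives, a cell with no text at all and a cell holding only &nbsp; both end up as the placeholder, and the row keeps its full length.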

Regards, Paul Clinch

**Terry Reedy** · Jul 18 '05, 04:41 AM

Re: HTMLParser problems.

"John J. Lee" <jjl@pobox.co m> wrote in message
news:873cd9m6mo.fsf@pobox.com...
> "Sean Cody" <sean@-[NOSPAMPLEASE]-tfh.ca> writes:
> > I use a lot of member variables. Is there a way to not have
> > to reference members by self.member.

1. call the parameter s instead of self; then it is s.member.
But best not to post code with that, lest you upset some readers ;-).
> > Back in the day in pascal you could do stuff like
> > "with self begin do_stuff(member_variable); end;"
> > which was extremely useful for large 'records.'

There have been proposals something like that, but they do not seem to
fit Python too well.

2.
> Well, obviously, there's:
>
>     mv = self.member_variable
>     do_stuff(mv)

In case you think this a hack, it is not. Copying things into the
local variable space (from builtins, globals, attributes) is a fairly
common idiom. When a value is used repeatedly (like in a loop), the
copying is paid for by faster repeated access.

Terry J. Reedy

**Peter Otten** · Jul 18 '05, 04:41 AM

Re: HTMLParser problems.

John J. Lee wrote:
> [...]
>> The simplest solution is to replace the above line with
>>
>>     parser.feed(socket.read().replace("&nbsp;", "NaN"))
> [...]
>
> That's platform-dependent, if you're relying on float("NaN").

Actually, I'm not, any non-empty string would have done as well, given the
original poster's parser implementation.

Peter

**Peter Otten** · Jul 18 '05, 04:41 AM

Re: HTMLParser problems.

Peter Otten wrote:
> Actually, I'm not, any non-empty string would have done as well, given the
> original poster's parser implementation.

Nitpicking myself: any string containing at least one non-white character.

Peter

**Alex Martelli** · Jul 18 '05, 04:41 AM

Re: HTMLParser problems.

Terry Reedy wrote:
>> > Back in the day in pascal you could do stuff like
>> > "with self begin do_stuff(member_variable); end;"
>> > which was extremely useful for large 'records.'
>
> There have been proposals something like that, but they do not seem to
> fit Python too well.

No, but, for the record: just last week in python-dev Guido rejected
a syntax proposal using a leading dot to strop a variablename in some
circumstances (writing '.var' rather than 'var' in those cases) for
the stated reason that, and I quote:
"I want to reserve .var for the "with" statement (a la VB)."

So, something like "with self: dostuff(.member_variable)" MIGHT be
in Python's future (the leading dot, like in VB, at least does make
things more explicit than leaving it implied like Pascal does).

>> mv = self.member_variable
>> do_stuff(mv)
>
> In case you think this a hack, it is not. Copying things into the
> local variable space (from builtins, globals, attributes) is a fairly
> common idiom. When a value is used repeatedly (like in a loop), the
> copying is paid for by faster repeated access.

Sure, good point. It _is_ a hack by some definitions of the word,
but that's not necessarily a bad thing. A quibble: the optimization
may be worth it when the NAME is used repeatedly -- repeated uses
of the VALUE, e.g. within the body of do_stuff, accessing the value
through another name [e.g. the parametername for do_stuff] do not
count, because what you're optimizing is specifically name lookup.

E.g., one silly example:

    [alex@lancelot bo]$ timeit.py -c -s'x=range(999)' 'for i in range(999): x[i]=id(x[i])'
    1000 loops, best of 3: 590 usec per loop

    [alex@lancelot bo]$ timeit.py -c -s'x=range(999)' -s'lid=id' 'for i in range(999): x[i]=lid(x[i])'
    1000 loops, best of 3: 490 usec per loop

the repeated lookups of builtin name 'id' in the first case accounted
for almost 17% of the CPU time, so the simple optimization leading to
the second case may be worth it if this code is in a bottleneck. In a
way, this is a special case of the general principle that Python does
NOT get into the thorny business of hoisting constant subexpressions
(requiring it to prove that something IS constant...), so, when you're
looking at a major bottleneck, you have to consider doing such hoisting
yourself manually. Name lookup for anything but local bare names IS
"a subexpression", so if you KNOW it's a constant subex. in some case
where every cycle matters, you can hoist it.
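
The same manual hoisting, written as a small script for a current Python 3 and timed with the standard timeit module. The function names and test data are made up for illustration, and no timings are quoted here because the size of the gain depends heavily on the interpreter version, so measure on your own setup.

```python
import timeit


def without_hoist(items):
    out = []
    for item in items:
        out.append(len(item))   # looks up `len` and `out.append` on each pass
    return out


def with_hoist(items):
    out = []
    append = out.append          # hoisted attribute lookup
    length = len                 # hoisted builtin lookup
    for item in items:
        append(length(item))
    return out


items = ["x" * n for n in range(999)]
# Hoisting must not change the result, only how the names are looked up.
assert without_hoist(items) == with_hoist(items)

for func in (without_hoist, with_hoist):
    seconds = timeit.timeit(lambda: func(items), number=200)
    print(func.__name__, round(seconds, 4))
```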

(Or, you can use psyco, which among many other things can do the
hoisting on your behalf...:

    [alex@lancelot bo]$ timeit.py -c -s'x=range(999)' -s'import psyco; psyco.full()' 'for i in range(999): x[i]=id(x[i])'
    10000 loops, best of 3: 43 usec per loop

    [alex@lancelot bo]$ timeit.py -c -s'x=range(999); lid=id' -s'import psyco; psyco.full()' 'for i in range(999): x[i]=lid(x[i])'
    10000 loops, best of 3: 43 usec per loop

as you can see, with psyco, this manual hoisting gives no further
benefit -- so you can use whatever construct you find clearer and
not worry about performance effects, just enjoying the order-of-
magnitude speedup that psyco achieves in this case either way:-).

Alex

**John J. Lee** · Jul 18 '05, 04:41 AM

Re: HTMLParser problems.

Peter Otten <__peter__@web.de> writes:

> John J. Lee wrote:
>
> > [...]
> >> The simplest solution is to replace the above line with
> >>
> >>     parser.feed(socket.read().replace("&nbsp;", "NaN"))
> > [...]
> >
> > That's platform-dependent, if you're relying on float("NaN").
>
> Actually, I'm not, any non-empty string would have done as well, given the
> original poster's parser implementation.

I was referring to the fact that (despite what he said), his code did
.append(-1), not .append("-1"). But only the OP knows what he really
meant to do.

John
