HTMLParser problems.

  • Sean Cody

    HTMLParser problems.

    I'm trying to take a webpage that has an n x n table of entries (bus times) and
    convert it to a 2D array (list of lists). Initially this was simple, but I
    need to be able to access whole 'columns' of data, so the 2D array cannot be
    sparse. In the HTML file I'm parsing, however, there can be empty cells, which
    are represented in the table as &nbsp; entities. The sparse output breaks my
    ability to use entire columns and have entries correspond properly.

    Is there a simple way to tell the parser, whenever it sees a &nbsp; in table
    data, to return, say, "-1" or "NaN"?
    The HTMLParser documentation is a bit... terse. I was considering using
    the handle_entityref() method, but I would assume the data has already been
    parsed at that point.

    I could try:

        def handle_entityref(self, entity):
            if self.in_td == 1:
                if entity == "nbsp":
                    self.row.append(-1)

    But that seems ugly... (comments?).

    As an example here is some code I'm using and partial output:

        #!/usr/local/bin/python
        import htmllib, os, string, urllib
        from HTMLParser import HTMLParser

        class foo(HTMLParser):
            def __init__(self):
                self.in_td = 0
                self.in_tr = 0
                self.matrix = []
                self.row = []
                self.reset()

            def handle_starttag(self, tag, attrs):
                if tag == "td":
                    self.in_td = 1
                elif tag == "tr":
                    self.in_tr = 1

            def handle_data(self, data):
                if self.in_td == 1:
                    data = string.lstrip(data)
                    if data != "":
                        self.row.append(data)

            def handle_endtag(self, tag):
                if tag == "td":
                    self.in_td = 0
                elif tag == "tr":
                    self.in_tr = 0
                    if self.row != []:
                        self.matrix.append(self.row)
                        self.row = []

        parser = foo()
        socket = urllib.urlopen("http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
        parser.feed(socket.read())
        socket.close()
        parser.close()
        for row in parser.matrix:
            print row

    A partial output of the above code is:
    ['5:12 C', '5:52 W']
    ['5:34 C']
    ['5:50 P']
    ['6:01 P', '6:10 G', '6:09 S', '6:59 U']
    ['6:10 P', '6:26 G', '6:23 C']
    ['6:23 P', '6:42 G', '6:35 W']
    ['6:34 P', '6:54 G', '6:47 S']
    ['6:46 P', '6:59 C']

    Any tips or suggestions or comments would be greatly appreciated,

    --
    Sean
    p.s. If I already answered my question that's great but it would be nice to
    have this in the groups archive for people with similar problems in the
    future.
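
For readers on Python 3, the idea above can be sketched roughly as follows. This is a minimal sketch, not the original poster's code: the class name `DenseTableParser`, the `EMPTY` sentinel, and the sample HTML are assumptions made for illustration. One thing to be aware of is that `html.parser.HTMLParser` converts character references to text by default, so `handle_entityref()` is only called when the parser is created with `convert_charrefs=False`.

```python
from html.parser import HTMLParser

EMPTY = -1  # hypothetical sentinel for an empty cell


class DenseTableParser(HTMLParser):
    """Collect table cells into rows, recording &nbsp; cells as EMPTY."""

    def __init__(self):
        # convert_charrefs=False keeps entities out of handle_data(),
        # so handle_entityref() is called for &nbsp;.
        super().__init__(convert_charrefs=False)
        self.in_td = False
        self.matrix = []
        self.row = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_data(self, data):
        data = data.strip()
        if self.in_td and data:
            self.row.append(data)

    def handle_entityref(self, name):
        if self.in_td and name == "nbsp":
            self.row.append(EMPTY)

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr" and self.row:
            self.matrix.append(self.row)
            self.row = []


# Made-up sample input standing in for the timetable page.
SAMPLE = (
    "<table>"
    "<tr><td>5:12 C</td><td>&nbsp;</td><td>5:52 W</td></tr>"
    "<tr><td>&nbsp;</td><td>5:34 C</td><td>&nbsp;</td></tr>"
    "</table>"
)

parser = DenseTableParser()
parser.feed(SAMPLE)
parser.close()
for row in parser.matrix:
    print(row)
```

Every row now has one entry per cell, so column positions line up across rows.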


  • Terry Reedy

    #2
    Re: HTMLParser problems.


    "Sean Cody" <sean@-[NOSPAMPLEASE]-tfh.ca> wrote in message
    news:kwfob.10197$f7.552358@localhost...
    > I could try:
    > def handle_entityref(self, entity):
    >     if self.in_td == 1:
    >         if entity == "nbsp":
    >             self.row.append(-1)
    >
    > But that seems ugly... (comments?).

    Does this work? For me, that comes first.

    tjr



    • Sean Cody

      #3
      Re: HTMLParser problems.

      > > I could try:
      > > def handle_entityref(self, entity):
      > >     if self.in_td == 1:
      > >         if entity == "nbsp":
      > >             self.row.append(-1)
      > >
      > > But that seems ugly... (comments?).
      >
      > Does this work? For me, that comes first.
      >
      Actually yes it does.

      I wonder if there is a better way, as I'm just stumbling through the
      HTMLParser class.
      The best thing about Python is that stumbling through getting things done is
      not as painful as it would be in other languages.

      I use a lot of member variables. Is there a way to not have to reference
      members by self.member? Back in the day in Pascal you could do stuff like
      "with self begin do_stuff(member_variable); end;" which was extremely useful
      for large 'records.'

      --
      Sean



      • Peter Otten

        #4
        Re: HTMLParser problems.

        Sean Cody wrote:

        > I'm trying to take a webpage that has a nxn table of entries (bus times)
        > and convert it to a 2D array (list of lists). Initially this was simple but I
        > need to be able to access whole 'columns' of data so the 2D array cannot
        > be sparse but in the HTML file I'm parsing there can be sparse entries
        > which are represented in the table as &nbsp; entities. The sparse output
        > breaks my ability to use entire columns and have entries correspond properly.
        >
        > Is there a simple way to tell the parser whenever you see a &nbsp; in table
        > data return say... "-1" or "NaN"?
        > The HTMLParser documentation is a bit.... terse. I was considering using
        > the handle_entityref() method but I would assume the data has already been
        > parsed at that point.
        >
        > I could try:
        > def handle_entityref(self, entity):
        >     if self.in_td == 1:
        >         if entity == "nbsp":
        >             self.row.append(-1)
        >
        > But that seems ugly... (comments?).
        >
        > As an example here is some code I'm using and partial output:

        [...]

        > parser.feed(socket.read())

        The simplest solution is to replace the above line with

            parser.feed(socket.read().replace("&nbsp;", "NaN"))

        Below is an only slightly more robust solution. It implements a rudimentary
        "what table are we in?" check and can handle table cells with multiple data
        chunks.

            import htmllib, os, string, urllib
            from HTMLParser import HTMLParser

            class foo(HTMLParser):
                def __init__(self):
                    self.matrix = []
                    self.row = None
                    self.cell = None
                    self.in_table = 0
                    self.empty = "NaN"
                    self.reset()

                def handle_starttag(self, tag, attrs):
                    if tag == "table":
                        self.in_table += 1
                    elif self.in_table == 2:
                        if tag == "td":
                            assert self.cell is None
                            self.cell = []
                        elif tag == "tr":
                            self.row = []
                            self.matrix.append(self.row)

                def handle_data(self, data):
                    if self.in_table == 2:
                        if self.cell is not None:
                            data = string.strip(data)
                            if data or True:
                                self.cell.append(data)

                def handle_endtag(self, tag):
                    if tag == "table":
                        self.in_table -= 1
                    elif self.in_table == 2:
                        if tag == "td":
                            s = " ".join(self.cell).replace("\n", " ")
                            if s == "":
                                s = self.empty
                            self.row.append(s)
                            self.cell = None
                        elif tag == "tr":
                            self.row = None

            parser = foo()
            if 0:
                instream = urllib.urlopen(
                    "http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
            else:
                instream = file("105413bottom.html")
            data = instream.read()
            parser.feed(data)
            instream.close()
            parser.close()
            for row in parser.matrix:
                assert len(row) == 4
                print row

        I've replaced the urlopen() call with access to a local file as you might
        want to run your tests with a local copy of the time table, too.

        Peter
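
A rough Python 3 rendering of the cell-buffer idea above, for readers who cannot run the Python 2 code. It is a sketch under stated assumptions: `CellBufferParser` and the sample HTML are made up for illustration, and the nested-table depth check from the post is left out for brevity. With the default `convert_charrefs=True`, `&nbsp;` reaches `handle_data()` as the character `"\xa0"`, which `str.split()` treats as whitespace. The last line shows the column access the original question was after.

```python
from html.parser import HTMLParser


class CellBufferParser(HTMLParser):
    """Buffer the text chunks of each <td>; empty cells become `empty`."""

    def __init__(self, empty="NaN"):
        super().__init__()  # convert_charrefs=True: &nbsp; arrives as "\xa0"
        self.empty = empty
        self.matrix = []
        self.row = None
        self.cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
            self.matrix.append(self.row)
        elif tag == "td" and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            # split() with no arguments also treats "\xa0" as whitespace,
            # so a cell holding only &nbsp; collapses to an empty string.
            text = " ".join("".join(self.cell).split())
            self.row.append(text or self.empty)
            self.cell = None
        elif tag == "tr":
            self.row = None


# Made-up sample input standing in for the timetable page.
SAMPLE = (
    "<table>"
    "<tr><td>6:01 P</td><td>6:10 G</td><td>&nbsp;</td></tr>"
    "<tr><td>6:10 P</td><td>&nbsp;</td><td>6:23 C</td></tr>"
    "</table>"
)

parser = CellBufferParser()
parser.feed(SAMPLE)
parser.close()

# With dense rows, zip(*rows) transposes the matrix into columns.
columns = [list(col) for col in zip(*parser.matrix)]
```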


        • John J. Lee

          #5
          Re: HTMLParser problems.

          "Sean Cody" <sean@-[NOSPAMPLEASE]-tfh.ca> writes:

          > > > I could try:
          > > > def handle_entityref(self, entity):
          > > >     if self.in_td == 1:
          > > >         if entity == "nbsp":
          > > >             self.row.append(-1)
          > > >
          > > > But that seems ugly... (comments?).
          > >
          > > Does this work? For me, that comes first.
          > >
          > Actually yes it does.
          >
          > I wonder if there is a better way as I'm just stumbling through the
          > HTMLParser class.
          [...]

          Seems OK to me.

          > I use a lot of member variables. Is there a way to not have to reference
          > members by self.member. Back in the day in pascal you could do stuff like
          > "with self begin do_stuff(member_variable); end;" which was extremely useful
          > for large 'records.'

          Well, obviously, there's:

              mv = self.member_variable
              do_stuff(mv)

          or if you have lots of names that are annoying you, things like:

              for name in "foo", "bar", "baz":
                  do_stuff(getattr(self, name))

          can help.


          John


          • John J. Lee

            #6
            Re: HTMLParser problems.

            Peter Otten <__peter__@web.de> writes:

            > Sean Cody wrote:
            [...]
            > The simplest solution is to replace the above line with
            >
            > parser.feed(socket.read().replace("&nbsp;", "NaN"))
            [...]

            That's platform-dependent, if you're relying on float("NaN").


            John
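
To make the distinction in this post concrete: the string "NaN" is only a placeholder until something calls `float()` on it. A small sketch, assuming current Python 3, where `float("nan")` is accepted regardless of platform (the platform dependence mentioned above applied to the Python of the time). The variable names are illustrative only.

```python
import math

cell = "NaN"         # placeholder string, as produced by the replace() call
value = float(cell)  # only this conversion creates a floating-point NaN

# NaN compares unequal to everything, including itself,
# so test for it with math.isnan() rather than ==.
is_nan = math.isnan(value)
equals_itself = value == value
```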


            • Paul Clinch

              #7
              Re: HTMLParser problems.

              You could patch it with:

                  def handle_starttag(self, tag, attrs):
                      if tag == "td":
                          self.in_td = 1
                          self.row.append("")
                      elif tag == "tr":
                          self.in_tr = 1

                  def handle_data(self, data):
                      if self.in_td == 1:
                          data = string.lstrip(data)
                          if data != "":
                              self.row[-1] = data

              i.e. create the element and then later possibly replace it.

              BTW True can be used as 1, an empty string is false, strings have
              methods, and "if self.in_td:" is OK, so:

                  def handle_data(self, data):
                      if self.in_td:
                          data = data.lstrip()
                          if data:
                              self.row[-1] = data

              is equivalent.

              Regards, Paul Clinch


              • Terry Reedy

                #8
                Re: HTMLParser problems.


                "John J. Lee" <jjl@pobox.com> wrote in message
                news:873cd9m6mo.fsf@pobox.com...
                > "Sean Cody" <sean@-[NOSPAMPLEASE]-tfh.ca> writes:
                > > I use a lot of member variables. Is there a way to not have
                > > to reference members by self.member.

                1. call the parameter s instead of self; then it is s.member.
                But best not to post code with that, lest you upset some readers ;-).

                > > Back in the day in pascal you could do stuff like
                > > "with self begin do_stuff(member_variable); end;"
                > > which was extremely useful for large 'records.'

                There have been proposals for something like that, but they do not seem to
                fit Python too well.

                2.

                > Well, obviously, there's:
                >
                > mv = self.member_variable
                > do_stuff(mv)

                In case you think this is a hack, it is not. Copying things into the
                local variable space (from builtins, globals, attributes) is a fairly
                common idiom. When a value is used repeatedly (like in a loop), the
                copying is paid for by faster repeated access.

                Terry J. Reedy



                • Peter Otten

                  #9
                  Re: HTMLParser problems.

                  John J. Lee wrote:

                  > [...]
                  >> The simplest solution is to replace the above line with
                  >>
                  >> parser.feed(socket.read().replace("&nbsp;", "NaN"))
                  > [...]
                  >
                  > That's platform-dependent, if you're relying on float("NaN").

                  Actually, I'm not; any non-empty string would have done as well, given the
                  original poster's parser implementation.

                  Peter


                  • Peter Otten

                    #10
                    Re: HTMLParser problems.

                    Peter Otten wrote:

                    > Actually, I'm not, any non-empty string would have done as well, given the
                    > original poster's parser implementation.

                    Nitpicking myself: any string containing at least one non-white character.

                    Peter


                    • Alex Martelli

                      #11
                      Re: HTMLParser problems.

                      Terry Reedy wrote:

                      >> > Back in the day in pascal you could do stuff like
                      >> > "with self begin do_stuff(member_variable); end;"
                      >> > which was extremely useful for large 'records.'
                      >
                      > There have been proposals something like that, but they do not seem to
                      > fit Python too well.

                      No, but, for the record: just last week in python-dev Guido rejected
                      a syntax proposal using a leading dot to strop a variablename in some
                      circumstances (writing '.var' rather than 'var' in those cases) for
                      the stated reason that, and I quote:
                      "I want to reserve .var for the "with" statement (a la VB)."

                      So, something like "with self: dostuff(.member_variable)" MIGHT be
                      in Python's future (the leading dot, like in VB, at least does make
                      things more explicit than leaving it implied like Pascal does).


                      >> mv = self.member_variable
                      >> do_stuff(mv)
                      >
                      > In case you think this a hack, it is not. Copying things into the
                      > local variable space (from builtins, globals, attributes) is a fairly
                      > common idiom. When a value is used repeatedly (like in a loop), the
                      > copying is paid for by faster repeated access.

                      Sure, good point. It _is_ a hack by some definitions of the word,
                      but that's not necessarily a bad thing. A quibble: the optimization
                      may be worth it when the NAME is used repeatedly -- repeated uses
                      of the VALUE, e.g. within the body of do_stuff, accessing the value
                      through another name [e.g. the parametername for do_stuff] do not
                      count, because what you're optimizing is specifically name lookup.

                      E.g., one silly example:

                      [alex@lancelot bo]$ timeit.py -c -s'x=range(999)' 'for i in range(999): x[i]=id(x[i])'
                      1000 loops, best of 3: 590 usec per loop

                      [alex@lancelot bo]$ timeit.py -c -s'x=range(999)' -s'lid=id' 'for i in range(999): x[i]=lid(x[i])'
                      1000 loops, best of 3: 490 usec per loop

                      the repeated lookups of builtin name 'id' in the first case accounted
                      for almost 17% of the CPU time, so the simple optimization leading to
                      the second case may be worth it if this code is in a bottleneck. In a
                      way, this is a special case of the general principle that Python does
                      NOT get into the thorny business of hoisting constant subexpressions
                      (requiring it to prove that something IS constant...), so, when you're
                      looking at a major bottleneck, you have to consider doing such hoisting
                      yourself manually. Name lookup for anything but local bare names IS
                      "a subexpression", so if you KNOW it's a constant subex. in some case
                      where every cycle matters, you can hoist it.

                      (Or, you can use psyco, which among many other things can do the
                      hoisting on your behalf...:

                      [alex@lancelot bo]$ timeit.py -c -s'x=range(999)' -s'import psyco; psyco.full()' 'for i in range(999): x[i]=id(x[i])'
                      10000 loops, best of 3: 43 usec per loop

                      [alex@lancelot bo]$ timeit.py -c -s'x=range(999); lid=id' -s'import psyco; psyco.full()' 'for i in range(999): x[i]=lid(x[i])'
                      10000 loops, best of 3: 43 usec per loop

                      as you can see, with psyco, this manual hoisting gives no further
                      benefit -- so you can use whatever construct you find clearer and
                      not worry about performance effects, just enjoying the order-of-
                      magnitude speedup that psyco achieves in this case either way:-).


                      Alex
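
The first comparison above (without psyco) can be repeated from inside a script with the `timeit` module. This is only a sketch: the statement strings mirror the command lines in the post, adapted to Python 3 (`list(range(999))` instead of `range(999)`), and `number=200` is an arbitrary choice. Timings depend heavily on the machine and interpreter version, so no figures are given here, and the gap may be smaller or larger than the one reported in the post.

```python
import timeit

# Looks up the builtin name `id` on every iteration.
t_builtin = timeit.timeit(
    "for i in range(999): x[i] = id(x[i])",
    setup="x = list(range(999))",
    number=200,
)

# Hoists the lookup: `lid` is bound once in the setup code,
# so the loop only needs a local name lookup.
t_hoisted = timeit.timeit(
    "for i in range(999): x[i] = lid(x[i])",
    setup="x = list(range(999)); lid = id",
    number=200,
)

print(f"builtin lookup: {t_builtin:.4f}s  hoisted: {t_hoisted:.4f}s")
```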


                      • John J. Lee

                        #12
                        Re: HTMLParser problems.

                        Peter Otten <__peter__@web.de> writes:

                        > John J. Lee wrote:
                        >
                        > > [...]
                        > >> The simplest solution is to replace the above line with
                        > >>
                        > >> parser.feed(socket.read().replace("&nbsp;", "NaN"))
                        > > [...]
                        > >
                        > > That's platform-dependent, if you're relying on float("NaN").
                        >
                        > Actually, I'm not, any non-empty string would have done as well, given the
                        > original poster's parser implementation.

                        I was referring to the fact that (despite what he said), his code did
                        .append(-1), not .append("-1"). But only the OP knows what he really
                        meant to do.


                        John
