Hands-on HTML Table Parser/Matrix?

This topic is closed.
  • robert

    Hands-on HTML Table Parser/Matrix?

    Often I want to extract the contents of some web tables. The formats
    are mostly static, with simple text & numbers in the cells; other tags
    should be stripped. So a simple & fast approach would be fine.

    Which of the available modules is the easiest to use, stable, and
    up to date, and offers iterator access or, better, matrix access
    (without needing callback functions, classes, etc. for basic tasks)?


    Robert
  • Tim Cook

    #2
    Re: Hands-on HTML Table Parser/Matrix?

    There are a couple of HTML examples using Pyparsing here:




    --Tim

    On Sun, 2008-07-06 at 14:40 +0200, robert wrote:
    > Often I want to extract some web table contents. Formats are
    > mostly static, simple text & numbers in it, other tags to be
    > stripped off. So a simple & fast approach would be ok.
    >
    > What of the different modules around is most easy to use, stable,
    > up-to-date, iterator access or best matrix-access (without need
    > for callback functions, classes.. for basic tasks)?
    >
    > Robert
    --
    Timothy Cook, MSc
    Health Informatics Research & Development Services
    LinkedIn Profile:http://www.linkedin.com/in/timothywaynecook
    Skype ID == timothy.cook


    • robert

      #3
      Re: Hands-on HTML Table Parser/Matrix?

      Tim Cook wrote:
      > On Sun, 2008-07-06 at 14:40 +0200, robert wrote:
      >> Often I want to extract some web table contents. Formats are
      >> mostly static, simple text & numbers in it, other tags to be
      >> stripped off. So a simple & fast approach would be ok.
      >>
      >> What of the different modules around is most easy to use, stable,
      >> up-to-date, iterator access or best matrix-access (without need
      >> for callback functions, classes.. for basic tasks)?
      >
      > There are a couple of HTML examples using Pyparsing here:

      hm - nothing special with HTML tables.

      Meanwhile:

      I dislike "ClientTable" (file-centric, too many parsing errors in
      the real world).

      "TableParse" works. It is a very simple & fast 70-liner that goes
      regexp->matrix and does strip/clean/HTML-entities conversion. Fast
      hands-on success. It deliberately doesn't separate nested tables or
      handle similar complexities, but it works for simple hands-on tasks
      in the real world.
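
For readers who want to see the regexp->matrix idea concretely, here is a minimal Python 3 sketch. It is not the TableParse source; the names `ROW_RE`, `CELL_RE`, `TAG_RE`, `clean_cell` and `table_to_matrix` are made up for illustration, and like the approach described above it assumes flat, reasonably well-formed tables (no nesting, closing tags present).

```python
import html
import re

# Hypothetical names for illustration; this is not the TableParse code.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr\s*>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean_cell(fragment):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub("", fragment)
    text = html.unescape(text)
    return " ".join(text.split())


def table_to_matrix(markup):
    """Return a list of rows, each a list of cleaned cell strings."""
    return [
        [clean_cell(cell) for cell in CELL_RE.findall(row)]
        for row in ROW_RE.findall(markup)
    ]


sample = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td><b>Tea &amp; milk</b></td><td> 2.50 </td></tr>
</table>
"""
matrix = table_to_matrix(sample)
print(matrix)
```

Cells are then reachable as `matrix[row][column]`, with no callbacks or classes involved. The non-greedy patterns pair each opening tag with the nearest closing tag, which is exactly why nested tables are not handled.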


      Robert


      • Sebastian "lunar" Wiesner

        #4
        Re: Hands-on HTML Table Parser/Matrix?

        robert <no-spam@no-spam-no-spam.invalid>:
        > Often I want to extract some web table contents. Formats are
        > mostly static, simple text & numbers in it, other tags to be
        > stripped off. So a simple & fast approach would be ok.
        >
        > What of the different modules around is most easy to use, stable,
        > up-to-date, iterator access or best matrix-access (without need
        > for callback functions, classes.. for basic tasks)?

        It takes no more than a handful of lines with lxml.html:

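        def htmltable2matrix(table):
            """Converts a html table to a matrix.

            :param table: The html table element
            :type table: An lxml element
            """
            matrix = []
            # Each child of the table is taken as a row, each child of a row as a cell.
            for row in table:
                # text_content() returns the cell's text with nested tags stripped.
                matrix.append([e.text_content() for e in row])
            return matrix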
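
To show how the function might be called, here is a small usage sketch. The sample markup, the `page` variable and the `//table` XPath lookup are assumptions added for illustration; only the function body comes from the post. It requires the third-party lxml package.

```python
import lxml.html


def htmltable2matrix(table):
    """Same logic as the function in the post above."""
    matrix = []
    for row in table:
        matrix.append([e.text_content() for e in row])
    return matrix


# Made-up sample page, used only to exercise the function.
page = """
<html><body>
<table id="prices">
<tr><th>Item</th><th>Price</th></tr>
<tr><td><b>Tea</b></td><td>2.50</td></tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(page)
table = doc.xpath("//table")[0]  # first table in the document
matrix = htmltable2matrix(table)
print(matrix)
```

One caveat: the function iterates over the direct children of the table, so if the markup wraps its rows in `<tbody>` or `<thead>`, those wrappers are what it sees as "rows". Iterating `table.iter("tr")` instead is one possible workaround, though that also descends into nested tables.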



        --
        Freedom is always the freedom of dissenters.
        (Rosa Luxemburg)
