Help Parsing an HTML File

  • egonslokar@gmail.com

    Help Parsing an HTML File

    Hello Python Community,

    It'd be great if someone could provide guidance or sample code for
    accomplishing the following:

    I have a single Unicode file that describes hundreds of
    objects. The file roughly resembles the HTML-EXAMPLE pasted below.

    I need to parse the file in such a way to extract data out of the html
    and to come up with a tab separated file that would look like OUTPUT-
    FILE below.

    Any tips, advice and guidance are greatly appreciated.

    Thanks,

    Egon




    =====OUTPUT-FILE=====
    /please note that the first line of the file contains column headers/
    ------Tab Separated Output File Begin------
    H1 H2 DIV Segment1 Segment2 Segment3
    RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1 RoséSegmentDIV3-1
    PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
    BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
    YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4 YellowSegmentDIV2-4 No-Value
    ------Tab Separated Output File End------



    =====HTML-EXAMPLE=====
    ------HTML Example Begin------
    <html>

    <h1>RoséH1-1</h1>
    <h2>RoséH2-1</h2>
    <div>RoséDIV-1</div>
    <div "segment1">RoséSegmentDIV1-1</div><br>
    <div "segment2">RoséSegmentDIV2-1</div><br>
    <div "segment3">RoséSegmentDIV3-1</div><br>
    <br>
    <br>

    <h1>PinkH1-2</h1>
    <h2>PinkH2-2</h2>
    <div>PinkDIV2-2</div>
    <div "segment1">PinkSegmentDIV1-2</div><br>
    <br>
    <comment></comment>

    <h1>BlackH1-3</h1>
    <h2>BlackH2-3</h2>
    <div>BlackDIV2-3</div>
    <div "segment1">BlackSegmentDIV1-3</div><br>

    <h1>YellowH1-4</h1>
    <h2>YellowH2-4</h2>
    <div>YellowDIV2-4</div>
    <div "segment1">YellowSegmentDIV1-4</div><br>
    <div "segment2">YellowSegmentDIV2-4</div><br>

    </html>
    ------HTML Example End------
  • Tim Chase

    #2
    Re: Help Parsing an HTML File

    > I need to parse the file in such a way to extract data out of the html
    > and to come up with a tab separated file that would look like
    > OUTPUT-FILE below.

    BeautifulSoup[1]. Your one-stop-shop for all your HTML parsing
    needs.

    What you do with the parsed data is an exercise left to the
    reader, but the parse tree is easy to navigate.
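
Once the fields have been extracted, writing the tab-separated file is the easy part. The following is only a minimal Python 3 sketch using the standard `csv` module; the function name `write_tsv`, the sample row, and the file name in the comment are hypothetical, and the header simply follows the OUTPUT-FILE example from the question. It assumes every row already has six values.

```python
import csv
import io

# Column headers taken from the OUTPUT-FILE example in the question.
HEADER = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

def write_tsv(rows, out):
    """Write the header and the given six-column rows as tab-separated text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)

# Demo on an in-memory buffer. For a real file, something like
# open("output.tsv", "w", encoding="utf-8", newline="") keeps the "é" intact.
buf = io.StringIO()
write_tsv([["RoséH1-1", "RoséH2-1", "RoséDIV-1",
            "RoséSegmentDIV1-1", "RoséSegmentDIV2-1", "RoséSegmentDIV3-1"]], buf)
print(buf.getvalue(), end="")
```

Passing an explicit encoding when opening the output file matters here because the data contains non-ASCII characters such as "é".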

    -tkc

    [1] http://www.crummy.com/software/BeautifulSoup/





    • Mike Driscoll

      #3
      Re: Help Parsing an HTML File

      On Feb 15, 3:28 pm, egonslo...@gmail.com wrote:
      Hello Python Community,
      >
      It'd be great if someone could provide guidance or sample code for
      accomplishing the following:
      >
      I have a single unicode file that has descriptions of hundreds of
      objects. The file fairly resembles HTML-EXAMPLE pasted below.
      >
      I need to parse the file in such a way to extract data out of the html
      and to come up with a tab separated file that would look like OUTPUT-
      FILE below.
      >
      Any tips, advice and guidance is greatly appreciated.
      >
      Thanks,
      >
      Egon
      >
      =====OUTPUT-FILE=====
      /please note that the first line of the file contains column headers/
      ------Tab Separated Output File Begin------
      H1 H2 DIV Segment1 Segment2 Segment3
      RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1 RoséSegmentDIV3-1
      PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
      BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
      YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4 YellowSegmentDIV2-4 No-Value
      ------Tab Separated Output File End------
      >
      =====HTML-EXAMPLE=====
      ------HTML Example Begin------
      <html>
      >
      <h1>RoséH1-1</h1>
      <h2>RoséH2-1</h2>
      <div>RoséDIV-1</div>
      <div "segment1">RoséSegmentDIV1-1</div><br>
      <div "segment2">RoséSegmentDIV2-1</div><br>
      <div "segment3">RoséSegmentDIV3-1</div><br>
      <br>
      <br>
      >
      <h1>PinkH1-2</h1>
      <h2>PinkH2-2</h2>
      <div>PinkDIV2-2</div>
      <div "segment1">PinkSegmentDIV1-2</div><br>
      <br>
      <comment></comment>
      >
      <h1>BlackH1-3</h1>
      <h2>BlackH2-3</h2>
      <div>BlackDIV2-3</div>
      <div "segment1">BlackSegmentDIV1-3</div><br>
      >
      <h1>YellowH1-4</h1>
      <h2>YellowH2-4</h2>
      <div>YellowDIV2-4</div>
      <div "segment1">YellowSegmentDIV1-4</div><br>
      <div "segment2">YellowSegmentDIV2-4</div><br>
      >
      </html>
      ------HTML Example End------
      Pyparsing, ElementTree and lxml are all good candidates as well.
      BeautifulSoup takes care of malformed html though.





      Mike


      • Stefan Behnel

        #4
        Re: Help Parsing an HTML File

        egonslokar@gmail.com wrote:
        I have a single unicode file that has descriptions of hundreds of
        objects. The file fairly resembles HTML-EXAMPLE pasted below.
        >
        I need to parse the file in such a way to extract data out of the html
        and to come up with a tab separated file that would look like OUTPUT-
        FILE below.
        >
        =====OUTPUT-FILE=====
        /please note that the first line of the file contains column headers/
        ------Tab Separated Output File Begin------
        H1 H2 DIV Segment1 Segment2 Segment3
        RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
        ------Tab Separated Output File End------
        >
        =====HTML-EXAMPLE=====
        ------HTML Example Begin------
        <html>
        >
        <h1>RoséH1-1</h1>
        <h2>RoséH2-1</h2>
        <div>RoséDIV-1</div>
        <div "segment1">RoséSegmentDIV1-1</div><br>
        <div "segment2">RoséSegmentDIV2-1</div><br>
        <div "segment3">RoséSegmentDIV3-1</div><br>
        <br>
        <br>
        >
        </html>
        ------HTML Example End------
        Now, what ugly markup is that? You will never get any HTML-compliant
        parser to return the "segmentX" stuff in there. I think your best bet is
        really to go for pyparsing or regular expressions (and I actually
        recommend pyparsing here).
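
To illustrate the regular-expression route mentioned above (not the pyparsing one), here is a rough Python 3 sketch using only the standard `re` module. The names `FIELD_RE`, `parse_records` and `to_rows` are made up for this example, and it assumes the real file is as regular as the posted sample: no nested tags inside the `h1`/`h2`/`div` elements, and every record starts with an `<h1>`.

```python
import re

# Matches <h1>, <h2>, <div> and <div "segmentN">, capturing the bare quoted
# marker when present. Assumes no nested tags inside these elements.
FIELD_RE = re.compile(
    r'<(h1|h2|div)(?:\s+"(segment\d+)")?\s*>(.*?)</\1\s*>',
    re.IGNORECASE | re.DOTALL,
)
COLUMNS = ["h1", "h2", "div", "segment1", "segment2", "segment3"]

def parse_records(markup):
    """Return one dict per record; a new record starts at every <h1>."""
    records = []
    for tag, segment, text in FIELD_RE.findall(markup):
        key = (segment or tag).lower()  # "segment1" wins over plain "div"
        if key == "h1":
            records.append({})
        if records:  # ignore fields that appear before the first <h1>
            records[-1][key] = text.strip()
    return records

def to_rows(markup, missing="No-Value"):
    """Turn the records into six-column rows, filling gaps by column name."""
    return [[rec.get(col, missing) for col in COLUMNS]
            for rec in parse_records(markup)]

sample = '''<html>
<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>
</html>'''
rows = to_rows(sample)
print("\t".join(rows[0]))
```

Because each value is stored under its own key, a record that lacks a segment in the middle would still put the remaining segments in the right columns.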

        Stefan


        • Peter Otten

          #5
          Re: Help Parsing an HTML File

          Stefan Behnel wrote:
          > egonslokar@gmail.com wrote:
          >> I have a single unicode file that has descriptions of hundreds of
          >> objects. The file fairly resembles HTML-EXAMPLE pasted below.
          >>
          >> I need to parse the file in such a way to extract data out of the html
          >> and to come up with a tab separated file that would look like
          >> OUTPUT-FILE below.
          >>
          >> =====OUTPUT-FILE=====
          >> /please note that the first line of the file contains column headers/
          >> ------Tab Separated Output File Begin------
          >> H1 H2 DIV Segment1 Segment2 Segment3
          >> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
          >> ------Tab Separated Output File End------
          >>
          >> =====HTML-EXAMPLE=====
          >> ------HTML Example Begin------
          >> <html>
          >>
          >> <h1>RoséH1-1</h1>
          >> <h2>RoséH2-1</h2>
          >> <div>RoséDIV-1</div>
          >> <div "segment1">RoséSegmentDIV1-1</div><br>
          >> <div "segment2">RoséSegmentDIV2-1</div><br>
          >> <div "segment3">RoséSegmentDIV3-1</div><br>
          >> <br>
          >> <br>
          >>
          >> </html>
          >> ------HTML Example End------
          >
          > Now, what ugly markup is that? You will never manage to get any HTML
          > compliant parser return the "segmentX" stuff in there. I think your best
          > bet is really going for pyparsing or regular expressions (and I actually
          > recommend pyparsing here).
          >
          > Stefan

          In practice the following might be sufficient:

          from BeautifulSoup import BeautifulSoup

          def chunks(bs):
              # Group the tags into records; every <h1> starts a new record.
              chunk = []
              for tag in bs.findAll(["h1", "h2", "div"]):
                  if tag.name == "h1":
                      if chunk:
                          yield chunk
                      chunk = []
                  chunk.append(tag)
              if chunk:
                  yield chunk

          def process(filename):
              bs = BeautifulSoup(open(filename))
              for chunk in chunks(bs):
                  columns = [tag.string for tag in chunk]
                  # Pad short rows on the right up to six columns.
                  columns += ["No Value"] * (6 - len(columns))
                  print "\t".join(columns)

          if __name__ == "__main__":
              process("example.html")

          The biggest caveat is that only columns at the end of a row may be left out.
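
To make that caveat concrete, here is a tiny Python 3 sketch of the same right-padding step applied to a made-up record (the helper name `pad_right` and the sample values are hypothetical). The record has a "segment2" div but no "segment1" div, and because padding is purely positional its segment2 text ends up in the Segment1 column.

```python
def pad_right(columns, width=6, filler="No Value"):
    """Positional padding as in the script above: fill only on the right."""
    return columns + [filler] * (width - len(columns))

# Hypothetical record: h1, h2, div and a "segment2" div, but no "segment1".
found = ["H1-x", "H2-x", "DIV-x", "Segment2-x"]
row = pad_right(found)

# Index 3 is the Segment1 column, yet it now holds the segment2 text.
print(row[3])
```

Avoiding this would require placing each value by its "segmentN" marker rather than by its position in the record.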

          Peter
