Parse HTML file to EXCEL

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Amad Khan
    New Member
    • Jan 2011
    • 5

    Parse HTML file to EXCEL

    Hello Everbody,
    Hopefully you all are fine.

    Question:
    I have few html files (invoices) at a directory.
    What i required is read html contents (customer name,bill_no,du es etc) and store them in a .csv file...
    [i have attached file invoice.txt, plz open it in IE]

    I have an already existing parser (attached parser.sh) which is working fine.
    Please help me in converting my existing shell parser to python parser.

    Shell Parser Code is below:
    Code:
    #!/bin/bash
    
    echo "Script started \n"; 
    echo "\"BILL_NUMBER\",\"ACCOUNT_NUMBER\",\"USERNAME\",\"CUSTOMER_NAME\",\"CONTACT_NO\",\"EMAIL\",
    \"PACKAGE_PLAN\",\"TOTAL_AMOUNT_DUE_TOP\",\"PAYMENT_DUE_DATE\",\"TOTAL_AMOUNT_AFTER_DUE_DATE\",\"BILLING_PERIOD\",
    \"PERVIOUS_BALANCE\",\"PAYMENTS\",\"NET_PREVIOUS_BALANCE\",\"SUBORDINATE_AC_CHARGES\",\"INITIAL_CHARGES\",
    \"MONTHLY_LINE_RENT\",\"ANTIVIRUS_LINE_RENT\",\"PARENTAL_LINE_RENT\",\"EXTRA_USAGE\",\"Extra Usage-2GB_COUNT\",
    \"Extra Usage-2GB_AMOUNT\",\"Extra Usage-5GB_COUNT\",\"Extra Usage-5GB_AMOUNT\",\"SPEED_BOOST_COUNT\",
    \"SPEED_BOOST_AMOUNT\",\"HAPPY_DAYS_1_COUNT\",\"HAPPY_DAYS_1_AMOUNT\",\"HAPPY_DAYS_3_COUNT\",
    \"HAPPY_DAYS_3_AMOUNT\",\"StaticIP_COUNT\",\"StaticIP_AMOUNT\",\"PayAsYouGo_05Day_COUNT\",\"PayAsYouGo_05Day_AMOUNT\",
    \"PayAsYouGo_10Day_COUNT\",\"PayAsYouGo_10Day_AMOUNT\",\"PayAsYouGo_30Day_COUNT\",\"PayAsYouGo_30Day_AMOUNT\",
    \"PayAsYouGo_03Day_COUNT\",\"PayAsYouGo_03Day_AMOUNT\",\"PayAsYouGo_07Day_COUNT\",\"PayAsYouGo_07Day_AMOUNT\",
    \"PayAsYouGo_15Day_COUNT\",\"PayAsYouGo_15Day_AMOUNT\",\"Power Hours-30Days_COUNT\",\"Power Hours-30Day_AMOUNT\",
    \"DISCOUNTS\",\"ADJUSTMENTS\",\"DEVICE_CHANGE_CHARGES\",\"PLAN_CHANGE_CHARGES\",\"DEVICE_DAMAGE_CHARGES\",
    \"DEVICE_LOST_CHARGES\",\"ACCOUNT_FREEZE_CHARGES\",\"LATE_PAYMENT_CHARGES\",\"SUBTOTAL\",\"FEDERAL_EXCISE_DUTY\",
    \"ADVANCE_WITHHOLDING_TAX\",\"CURRENT_CHARGES\",\"ToTAL_DUE\"" > parsed.csv
     
    for file in *.html; do
     cat $file | grep '<!--B' |  awk 'BEGIN{ RS = ""; FS = "|"}  { for( i=1;i<=NF;i+=1)	{ split($i,a,"##"); printf "\"" a[2] "\","}  printf "\n"  }'  
      >> parsed.csv
    done
     echo "Script finished";


    Please if anyone can share it in python format.thnx
    Attached Files
    Last edited by bvdet; Dec 15 '11, 04:55 PM. Reason: Add code tags
  • Glenton
    Recognized Expert Contributor
    • Nov 2008
    • 391

    #2
    Hi

    There are three modules that you can use to help you.
    glob - to help you go through your files
    re - for regular expressions to extract your data
    csv - for writing csv files

    Below is a basic structure to get you started

    Code:
    import glob
    import re
    import csv
    
    headings=[*list of your headings*]
    output=csv.writer(open("parser.csv","w"))
    output.writerow(headings)
    
    for file in glob.glob("*.html"):
        inputFile=open(file[-1])
        data=[]
        for heading in headings:
            *code to extract data for heading
            data.append(extracted data)
        output.writerow(data)
        inputFile.close()
    The code to extract your data is obviously specific to your file. I haven't gone through to see what is the best way, but generally using a regular expression will sort you out.

    Good luck!

    Comment

    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #3
      To add to Glenton's information, Python module BeautifulSoup is ideal for parsing HTML files. Having never used it, I thought I would give it a shot.

      Here's what takes place:
      1. Read the file in it's entirety and create a BeautifulSoup object
      2. Find the text you want to parse
      3. Replace "=\n" with "" in the text
      4. Create a list of strings by splitting the text on "|"
      5. Create a file object for writing the csv data
      6. Create a csv.writer object
      7. Iterate on the list of strings, split each string on "##", and write each row
      8. Close the file object


      Now for the code:
      Code:
      import re
      from BeautifulSoup import BeautifulSoup
      
      fnIn = "invoice.htm"
      fnOut = "invoice.csv"
      
      soup = BeautifulSoup(open(fnIn).read())
      comments = soup.find(text=re.compile("BILL_NUMBER")).replace("=\n", "").split("|")
      
      f = open(fnOut, 'w')
      writer = csv.writer(f)
      for s in comments:
          writer.writerow(s.split("##"))
      f.close()
      Looks simple, doesn't it?

      The csv module automatically accounts for embedded commas in the text.

      Comment

      • Glenton
        Recognized Expert Contributor
        • Nov 2008
        • 391

        #4
        Wow! That's handy. I have an old script that I use to download share data - wish I'd known about BeautifulSoup then!

        Comment

        • Amad Khan
          New Member
          • Jan 2011
          • 5

          #5
          Thanks Glenton & bvdet,

          @bvdet: i am getting error "ImportErro r: No module named BeautifulSoup" on executing the provided code.
          Is it beacause of any missing plugin/utility.?
          Please guide.

          Comment

          • bvdet
            Recognized Expert Specialist
            • Oct 2006
            • 2851

            #6
            BeautifulSoup is not built into Python. You have to download and install it.

            Comment

            • Amad Khan
              New Member
              • Jan 2011
              • 5

              #7
              Dear Both,
              Thanks for your help, now it giving error on writer, as undefine..

              Anyways I have simplified my requirement as below, kindly provide me python code ......

              I have an HTML File (say myfile.html) having only 3 lines as shown below

              Hello EverBody.
              <!--FName##Adam|SNa me##John|PhoneN o##0987654321|G ender##Male|EMA IL##abbassalam@ yahoo.com-->
              Good Bye

              I want to write this file(myfile.htm l) contents to a csv file (say mycsv.csv) in such a way that....
              1)Program only read line 2 starting from "<!--" to "-->"
              2)Extract all strings between '##" and "|" and store it into a csv file as below

              Adam John 0987654321 Male abbassalam@yaho o.com

              Comment

              • bvdet
                Recognized Expert Specialist
                • Oct 2006
                • 2851

                #8
                Amad Khan,

                We are not here to write your code for you. You should be able to write your own from the examples we have provided. You can post the code you have attempted along with the error you received, including traceback, and we will be glad to assist in correcting your problem.

                bvdet
                Moderator

                Comment

                Working...