Parse HTML file to EXCEL

**Glenton** · Dec 15 '11, 09:43 AM

Hi

There are three modules that you can use to help you.
glob - to help you go through your files
re - for regular expressions to extract your data
csv - for writing csv files

Below is a basic structure to get you started

Code:

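import glob
import re
import csv

headings = []  # fill in the list of your headings

outputFile = open("parser.csv", "w")
output = csv.writer(outputFile)
output.writerow(headings)

for fileName in glob.glob("*.html"):
    inputFile = open(fileName)  # open the file itself, not fileName[-1]
    text = inputFile.read()
    inputFile.close()
    data = []
    for heading in headings:
        # code to extract the data for this heading from text goes here
        extracted = ""  # placeholder until the extraction code is written
        data.append(extracted)
    output.writerow(data)

outputFile.close()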

The code to extract your data is obviously specific to your file. I haven't gone through to see what is the best way, but generally using a regular expression will sort you out.
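As a rough illustration of what the extraction step might look like, here is a minimal sketch. The HTML fragment, the heading names, and the helper `extract_value` are all hypothetical (the real file layout is not shown in this thread), so the pattern would need adjusting to your own markup.

```python
import re

# Hypothetical HTML fragment; the real file layout is not shown in this thread.
SAMPLE_HTML = """
<table>
<tr><td>Name</td><td>Adam John</td></tr>
<tr><td>Phone</td><td>0987654321</td></tr>
</table>
"""

def extract_value(html, heading):
    """Return the text of the cell following the cell that holds `heading`, or ""."""
    # re.escape guards against headings that contain regex metacharacters.
    pattern = r"<td>\s*" + re.escape(heading) + r"\s*</td>\s*<td>(.*?)</td>"
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

headings = ["Name", "Phone", "Email"]
row = [extract_value(SAMPLE_HTML, h) for h in headings]
print(row)
```

A heading that is not found yields an empty string, so every row keeps the same number of columns as the heading row.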

Good luck!

**bvdet** · Dec 15 '11, 06:37 PM

To add to Glenton's information, Python module BeautifulSoup is ideal for parsing HTML files. Having never used it, I thought I would give it a shot.

Here's what takes place:

Read the file in its entirety and create a BeautifulSoup object
Find the text you want to parse
Replace "=\n" with "" in the text
Create a list of strings by splitting the text on "|"
Create a file object for writing the csv data
Create a csv.writer object
Iterate on the list of strings, split each string on "##", and write each row
Close the file object

Now for the code:

Code:

import csv
import re
from BeautifulSoup import BeautifulSoup

fnIn = "invoice.htm"
fnOut = "invoice.csv"

soup = BeautifulSoup(open(fnIn).read())
comments = soup.find(text=re.compile("BILL_NUMBER")).replace("=\n", "").split("|")

f = open(fnOut, 'w')
writer = csv.writer(f)
for s in comments:
    writer.writerow(s.split("##"))
f.close()

Looks simple, doesn't it?

The csv module automatically accounts for embedded commas in the text.
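To see that quoting behaviour in isolation, here is a small sketch that writes one row to an in-memory buffer and reads it back. The sample values are made up for illustration, and the code is written for Python 3 (`io.StringIO`).

```python
import csv
import io

# Write a row whose fields contain a comma and double quotes.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Smith, John", "42", 'said "hi"'])
line = buffer.getvalue()
print(line)

# Reading the line back recovers the original three fields.
row = next(csv.reader(io.StringIO(line)))
print(row)
```

The writer wraps only the fields that need it in double quotes and doubles any embedded quote characters, so the reader can split the line back into the same fields.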

**Glenton** · Dec 16 '11, 02:03 AM

Wow! That's handy. I have an old script that I use to download share data - wish I'd known about BeautifulSoup then!

**Amad Khan** · Dec 22 '11, 06:04 PM

Thanks Glenton & bvdet,

@bvdet: I am getting the error "ImportError: No module named BeautifulSoup" when executing the provided code.
Is it because of a missing plugin or utility?
Please guide.

**bvdet** · Dec 22 '11, 08:32 PM

BeautifulSoup is not built into Python. You have to download and install it.

**Amad Khan** · Jan 4 '12, 05:19 PM

Dear Both,
Thanks for your help. Now it is giving an error on writer, as undefined.

Anyway, I have simplified my requirement as below. Kindly provide me Python code.

I have an HTML file (say myfile.html) having only 3 lines, as shown below:

Hello EverBody.

Good Bye

I want to write the contents of this file (myfile.html) to a csv file (say mycsv.csv) in such a way that:
1) The program only reads line 2, starting from ""
2) It extracts all strings between "##" and "|" and stores them in a csv file as below

Adam John 0987654321 Male abbassalam@yahoo.com

**bvdet** · Jan 4 '12, 05:36 PM

Amad Khan,

We are not here to write your code for you. You should be able to write your own from the examples we have provided. You can post the code you have attempted along with the error you received, including traceback, and we will be glad to assist in correcting your problem.

bvdet
Moderator
