How to modify the XML structure internally so the program works?

**dwblas** · Jan 15 '16, 07:33 PM

What is the problem, and what do you want to extract? It would possibly be easier to process this as a plain text file and split or group by the <h1>, <h2>, and <span> tags, depending on what you need. I will post some code later tonight, time permitting.

**dwblas** · Jan 15 '16, 09:06 PM

This code should be self-explanatory. The combined records are printed, but you could also search for a string within each record, or write the records to a file.

Code:

def process_group(group_in):
    """Print one combined record; skip the empty group before the first tag."""
    if group_in:
        print(" ".join(group_in))

# A new record starts at any line beginning with one of these tags.
starters = ("<h1", "<h2", "<span", "</body")

with open("problem_or_working_html.txt", "r") as fp_in:
    this_group = []
    for rec in fp_in:
        rec = rec.strip()
        # str.startswith accepts a tuple of prefixes
        if rec.startswith(starters):
            process_group(this_group)
            this_group = []
        this_group.append(rec)

## process last group
process_group(this_group)
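
As a rough sketch of the "search for a string within the record" idea mentioned above, the grouping can be pulled into a generator so that each combined record can be tested before it is used. The names `group_records` and `matching_records` and the sample lines are made up for illustration; they are not part of the original code.

```python
def group_records(lines, starters=("<h1", "<h2", "<span", "</body")):
    """Yield lists of stripped lines; a new list starts at each starter tag."""
    group = []
    for rec in lines:
        rec = rec.strip()
        # Close the current group when a starter tag begins a new record.
        if rec.startswith(starters) and group:
            yield group
            group = []
        group.append(rec)
    if group:
        yield group

def matching_records(lines, needle):
    """Return the combined records that contain the string needle."""
    combined = (" ".join(group) for group in group_records(lines))
    return [record for record in combined if needle in record]

# Hypothetical sample input standing in for the lines of the HTML file.
sample = [
    "<body>",
    "<h1>Title</h1>",
    "<span>first",
    "line</span>",
    "<h2>Sub</h2>",
    "</body>",
]
found = matching_records(sample, "first")
print(found)
```

To write the records to a file instead, the same loop could call `fp_out.write(record + "\n")` on a file opened for writing.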

**amskape** · Jan 16 '16, 05:16 AM

Dear dwblas,
Thanks for your fantastic answer. It works fine with small indentation changes.

Code:

#!/usr/bin/env python
def process_group(group_in):
    print(" ".join(group_in))

with open("problem_html.txt", "r") as fp_in:
    starters = ["<h1", "<h2", "<span", "</body"]
    this_group = []
    for rec in fp_in:
        rec = rec.strip()
        for start_lit in starters:
            if rec.startswith(start_lit):
                process_group(this_group)
                # reset here, otherwise every record repeats all earlier lines
                this_group = []
        this_group.append(rec)

# process last group
process_group(this_group)  # function invoking...

But in my current situation the result is a DOM element, which a normal Python print shows as

Code:

[<DOM Element: body at 0xb199054c>]

So it is a NodeList element, and we cannot apply the strip() method to a NodeList. Please advise a solution in this case.

With lots of gratitude

Anes
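
One possible way around this, assuming the element comes from `xml.dom.minidom` (the `<DOM Element: ...>` output suggests so, but that is a guess), is to turn each node back into markup text with `toxml()` and then split it into lines. The resulting strings can be stripped and grouped as in the code above. The helper name `element_to_lines` and the sample document are made up for illustration.

```python
from xml.dom import minidom

def element_to_lines(node):
    """Serialise a minidom node to markup and return its stripped, non-empty lines."""
    return [line.strip() for line in node.toxml().splitlines() if line.strip()]

# Hypothetical sample document standing in for the real input.
doc = minidom.parseString(
    "<html><body>\n<h1>Title</h1>\n<span>some\ntext</span>\n</body></html>"
)

# getElementsByTagName returns a NodeList, so handle each element in turn.
bodies = doc.getElementsByTagName("body")
for body in bodies:
    lines = element_to_lines(body)
    print(lines)
```

Each entry in `lines` is an ordinary string, so `startswith()` and `strip()` work on it as they do on lines read from a file.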
