How to modify the xml structure internally to work the program?

  • amskape
    New Member
    • Apr 2010
    • 56


    Dear Friends,

    I have a Python application that takes an XML document as input. The XML document is supplied externally, and I cannot change its structure. However, there is a problem with how its elements are arranged. I am using xml.dom.minidom for parsing.

    A simple change of position would be enough, but I have no idea how to change the elements of the DOM, i.e. self.tree = MD.parse(ficher o). Please advise a good way ...

    Please refer to the problematic HTML and the normal HTML structure attached here.
    N.B. We have no option to edit the source HTML, because it may also come from a CD.



    Thanks
    Anes
  • dwblas
    Recognized Expert Contributor
    • May 2008
    • 626

    #2
    What is the problem, and what do you want to extract? It might be easier to process this as a plain text file and split/group by the <h1>, <h2>, and <span> tags as needed. I will post some code later tonight, time permitting.


    • dwblas
      Recognized Expert Contributor
      • May 2008
      • 626

      #3
      This code should be self-explanatory. The combined records are printed, but you could also search for a string within each record, or write the records to a file.
      Code:
      def process_group(group_in):
          # Print one combined record; skip the empty group before the first tag.
          if group_in:
              print(" ".join(group_in))
      
      with open("problem_or_working_html.txt", "r") as fp_in:
          # A line starting with any of these tags begins a new record.
          starters = ("<h1", "<h2", "<span", "</body")
          this_group = []
          for rec in fp_in:
              rec = rec.strip()
              if rec.startswith(starters):
                  process_group(this_group)
                  this_group = []
              this_group.append(rec)
      
      ## process last group
      process_group(this_group)
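
The "search within the record, or write them to a file" idea could be sketched roughly as below. This is only an illustration: the generator name `iter_groups` and the sample markup are made up for the example, and the real input would come from the opened file rather than an in-memory string.

```python
import io

# Tags that begin a new record (same list as in the code above).
STARTERS = ("<h1", "<h2", "<span", "</body")

def iter_groups(lines, starters=STARTERS):
    """Yield lists of stripped lines; a new list begins at each starter tag.

    Hypothetical helper, not part of any library.
    """
    group = []
    for line in lines:
        line = line.strip()
        if line.startswith(starters) and group:
            yield group
            group = []
        if line:
            group.append(line)
    if group:
        yield group

# Made-up sample input standing in for the real file.
sample = io.StringIO(
    u"<h1>Title</h1>\n<p>intro</p>\n<h2>Part</h2>\n<span>x</span>\n</body>\n"
)

# Combine each group into one record, then search the records for a string.
groups = [" ".join(g) for g in iter_groups(sample)]
matches = [g for g in groups if "Part" in g]
```

Because the records end up in a plain list, writing them out is one `fp_out.write(record + "\n")` call per record inside a `with open(..., "w") as fp_out:` block.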


      • amskape
        New Member
        • Apr 2010
        • 56

        #4
        Dear dwblas,
        Thanks for your fantastic answer. It works fine with small indentation changes.
        Code:
        #!/usr/bin/env python
        def process_group(group_in):
            print(" ".join(group_in))
        with open("problem_html.txt", "r") as fp_in:
            starters = ["<h1", "<h2", "<span", "</body"]
            this_group = []
            for rec in fp_in:
                rec = rec.strip()
                for start_lit in starters:
                    if rec.startswith(start_lit):
                        process_group(this_group)
                        # reset, otherwise every record repeats the earlier ones
                        this_group = []
                this_group.append(rec)
        
        # process last group
        process_group(this_group) #function invoking...
        But in my actual situation I get the result as a DOM element; a normal Python print shows it as:
        Code:
        [<DOM Element: body at 0xb199054c>]
        So it is a NodeList of elements, and we cannot apply the strip() method to a NodeList. Please advise a solution for this case...

        With lots of gratitude

        Anes
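
One possible way around the NodeList problem, sketched with xml.dom.minidom: pick an element out of the list and either serialize it with `toxml()` so that string methods such as `strip()` apply, or move nodes inside the DOM with `insertBefore()`. The markup below is an invented stand-in, since the attached files are not available here, and splitting on `"><"` is a deliberately naive assumption about the layout.

```python
from xml.dom import minidom

# Hypothetical input; the real document would come from MD.parse(...).
doc = minidom.parseString(
    "<html><body><span>note</span><h1>Title</h1><p>text</p></body></html>"
)

# getElementsByTagName returns a NodeList, so take one element out of it first.
body = doc.getElementsByTagName("body")[0]

# Option 1: serialize the element to a string, then use string methods on it.
body_text = body.toxml()
records = [part.strip() for part in body_text.replace("><", ">\n<").splitlines()]

# Option 2: change the position of an element in the DOM itself.
# insertBefore detaches h1 from its current position before re-inserting it.
span = body.getElementsByTagName("span")[0]
h1 = body.getElementsByTagName("h1")[0]
body.insertBefore(h1, span)
reordered = body.toxml()
```

Option 1 feeds the line-based grouping code unchanged; Option 2 keeps everything as DOM nodes, which may suit a program that already expects a parsed tree.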
