Parsing tag names and values from XML files

  • haobijam
    New Member
    • Oct 2010
    • 16

    Parsing tag names and values from XML files

    Sir,

    Could you please help me write Python code to parse values from an XML file? I would like to extract Person, Email, Phone, Organization, Address, etc. from the file. I have written some code for it, but could you please correct it? Please find the attached MINiML.txt file.

    Code:
    #!/usr/bin/python
    print "Content-Type: text/plain\n"    
    print "<html><body>" 
    import xml.dom.minidom
    
    # Load the Contibutor collection
    MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )
    
    # Get a list of Contibutors
    Contibutors = MINiML.documentElement.getElementsByTagName( 'Contibutor' )
    
    # Loop through the Contibutors
    for Contibutor in Contibutors:
    
        #Print out the Contibutor's information
        print
        print 'Email:  ' + Contibutor.getElementsByTagName ( 'Email' )[0].childNodes [0].nodeValue
        print 'Phone: ' + Contibutor.getElementsByTagName ( 'Phone' )[0].childNodes [0].nodeValue
        print 'Laboratory:  ' + Contibutor.getElementsByTagName ( 'Laboratory' )[0].childNodes [0].nodeValue
        print 'Department:  ' + Contibutor.getElementsByTagName ( 'Department' ) [0].childNodes [0].nodeValue
        print 'Organization:  ' + Contibutor.getElementsByTagName ( 'Organization' )[0].childNodes [0].nodeValue
        print "</body></html>"
    Regards,
    Haobijam
    Last edited by bvdet; Nov 19 '10, 02:26 PM. Reason: Modified thread title
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    To begin with, you misspelled "Contributor", so getElementsByTagName() never returned any elements.
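
    A minimal sketch of why the misspelling fails silently, written in Python 3 syntax. The one-line SAMPLE document is a hypothetical stand-in for MINiML.xml, not the attached file:

```python
import xml.dom.minidom

# Hypothetical stand-in for MINiML.xml, used only for illustration.
SAMPLE = "<MINiML><Contributor><Email>a@example.org</Email></Contributor></MINiML>"
doc = xml.dom.minidom.parseString(SAMPLE)

# A tag name that matches nothing is not an error; the result is just empty.
misspelled = doc.documentElement.getElementsByTagName("Contibutor")
correct = doc.documentElement.getElementsByTagName("Contributor")

print(len(misspelled))  # 0, so a for loop over it never runs its body
print(len(correct))     # 1
```

    Because the result is empty rather than an exception, the original script printed nothing for any contributor.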

    Each "Contributor" can have a varying number of ELEMENT_NODE child nodes. Some of those child nodes can have ELEMENT_NODE child nodes of their own.

    Note that the first child node can be a Text node with no real text value:
    Code:
    >>> Contributor.childNodes[0]
    <DOM Text node "\n    ">
    >>>
    Code:
    >>> Contributor
    <DOM Element: Contributor at 0x1238ad0>
    >>> Contributor.childNodes[1].nodeName
    u'Organization'
    >>>

    Try the following and look at the output. Then decide the best way to get the data you need for printing.
    Code:
    for Contributor in Contributors:
        for elem in Contributor.childNodes:
            print repr(elem)
            if elem.hasChildNodes():
                for item in elem.childNodes:
                    print "   ", repr(item)
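
    The same inspection can be sketched in Python 3 syntax against a small inline document. The SAMPLE string is an assumed, trimmed-down Contributor element, and the "whitespace" label is only for illustration:

```python
import xml.dom.minidom
from xml.dom import Node

# Assumed, trimmed-down Contributor element (not the attached file).
SAMPLE = """<Contributor iid="contrib1">
    <Organization>Weill Cornell Medical College</Organization>
    <Email>yas2003@med.cornell.edu</Email>
</Contributor>"""

contributor = xml.dom.minidom.parseString(SAMPLE).documentElement

kinds = []
for child in contributor.childNodes:
    if child.nodeType == Node.TEXT_NODE and not child.nodeValue.strip():
        # Indentation and newlines between tags show up as Text nodes.
        kinds.append("whitespace")
    elif child.nodeType == Node.ELEMENT_NODE:
        kinds.append(child.nodeName)

print(kinds)
# ['whitespace', 'Organization', 'whitespace', 'Email', 'whitespace']
```

    Checking nodeType is what lets you skip the whitespace-only Text nodes and keep the elements.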

    • haobijam
      New Member
      • Oct 2010
      • 16

      #3
      Dear,

      Could you please tell me how I can parse the attributes and their values from the XML file (MINiML.txt)? I would like to print output like this:

      Contributor iid = "contrib1"
      Person Yael Strulovici-Barel
      Email yas2003@med.cornell.edu
      Phone 646-962-5560
      Laboratory Crystal
      Department Department of Genetic Medicine
      Organization Weill Cornell Medical College
      Line 1300 York Avenue
      City New York
      State NY
      Zip-Code 10021
      Country USA

      Regards,
      Haobijam

      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        I wrote some functions for an application of mine that you may find useful, or that may give you ideas on how to format the output for your application. The first one returns a list of the text found in the child nodes of a parent node. Whitespace-only text is ignored.
        Code:
        def getTextFromElem(parent):
            '''Return a list of text found in the child nodes of a
            parent node, discarding whitespace.'''
            textList = []
            for n in parent.childNodes:
                # TEXT_NODE - 3
                if n.nodeType == 3 and n.nodeValue.strip():
                    textList.append(str(n.nodeValue.strip()))
            return textList
        The second returns a list of element nodes below a parent node.
        Code:
        def getElemChildren(parent):
            # Return a list of element nodes below parent
            elements = []
            for obj in parent.childNodes:
                if obj.nodeType == obj.ELEMENT_NODE:
                    elements.append(obj)
            return elements
        The third returns a list of strings representing the node tree below a parent node, using recursion to reach nested levels.
        Code:
        def nodeTree(element, pad=0):
            # Return list of strings representing the node tree below element
            results = ["%s%s" % (pad*" ", str(element.nodeName))]
            nextElems = getElemChildren(element)
            if nextElems:
                for node in nextElems:
                    results.extend(nodeTree(node, pad+2))
            else:
                results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
            return results
        Using nodeTree() in your application:
        Code:
        >>> contributors = xmlDoc.documentElement.getElementsByTagName( 'Contributor' )
        >>> for contributor in contributors:
        ... 	print "\n".join(nodeTree(contributor))
        ... 	
        Contributor
          Person
            First
              Yael
            Last
              Strulovici-Barel
          Email
            yas2003@med.cornell.edu
          Phone
            646-962-5560
          Laboratory
            Crystal
          Department
            Department of Genetic Medicine
          Organization
            Weill Cornell Medical College
          Address
            Line
              1300 York Avenue
            City
              New York
            State
              NY
            Zip-Code
              10021
            Country
              USA
        Contributor
          Organization
            
          Email
            geo@ncbi.nlm.nih.gov, support@affymetrix.com
          Phone
            888-362-2447
          Organization
            Affymetrix, Inc.
          Address
            City
              Santa Clara
            State
              CA
            Zip-Code
              95051
            Country
              USA
          Web-Link
            http://www.affymetrix.com/index.affx
        Contributor
          Person
            First
              Brendan
            Last
              Carolan
        Contributor
          Person
            First
              Ben-Gary
            Last
              Harvey
        Contributor
          Person
            First
              Bishnu
            Middle
              P
            Last
              De
        Contributor
          Person
            First
              Holly
            Last
              Vanni
        Contributor
          Person
            First
              Ronald
            Middle
              G
            Last
              Crystal
        >>>
        Since we are not extracting attributes, I modified the title of this thread.
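
        If the flat layout and the iid attribute requested in post #3 are wanted after all, one possible sketch follows, in Python 3 syntax. The helper names (element_children, text_of, flat_lines) are hypothetical, the SAMPLE document is a trimmed copy of the first contributor shown above, and the special handling of Person is an assumption based on the requested output:

```python
import xml.dom.minidom
from xml.dom import Node

# Trimmed sample based on the first contributor in the output above.
SAMPLE = """<MINiML>
  <Contributor iid="contrib1">
    <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
    <Email>yas2003@med.cornell.edu</Email>
    <Address><City>New York</City><State>NY</State></Address>
  </Contributor>
</MINiML>"""

def element_children(parent):
    """Return the element nodes directly below parent."""
    return [n for n in parent.childNodes if n.nodeType == Node.ELEMENT_NODE]

def text_of(element):
    """Join the non-whitespace text found directly inside element."""
    parts = [n.nodeValue.strip() for n in element.childNodes
             if n.nodeType == Node.TEXT_NODE and n.nodeValue.strip()]
    return " ".join(parts)

def flat_lines(contributor):
    """Return 'Tag value' lines for one Contributor element."""
    lines = []
    # attributes.items() yields (name, value) pairs such as ('iid', 'contrib1').
    for name, value in contributor.attributes.items():
        lines.append('%s %s = "%s"' % (contributor.nodeName, name, value))
    for child in element_children(contributor):
        nested = element_children(child)
        if child.nodeName == "Person":
            # Assumption: name parts are joined on one line, as in post #3.
            lines.append("Person " + " ".join(text_of(n) for n in nested))
        elif nested:
            # Containers such as Address contribute one line per child.
            lines.extend("%s %s" % (n.nodeName, text_of(n)) for n in nested)
        else:
            lines.append("%s %s" % (child.nodeName, text_of(child)))
    return lines

doc = xml.dom.minidom.parseString(SAMPLE)
result = flat_lines(doc.getElementsByTagName("Contributor")[0])
print("\n".join(result))
```

        For the sample this prints the iid line first, then Person, Email, City and State, one per line.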

        BV - Moderator

        • haobijam
          New Member
          • Oct 2010
          • 16

          #5
          Hello,

          Thanks for your help. I have assembled and run the script, but it raises an error in the Platform section, at line 86 of the MINiML.xml file. When I remove that line and run the script, it prints the output we want. The error looks like this:

          >>>

          Traceback (most recent call last):
            File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 45, in <module>
              print "\n".join(nodeTree(contributor))
            File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 32, in nodeTree
              results.extend(nodeTree(node, pad+2))
            File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 34, in nodeTree
              results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
            File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 15, in getTextFromElem
              textList.append(str(n.nodeValue.strip()))
          UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 379: ordinal not in range(128)
          Code:
          #!/usr/bin/python
          import xml.dom.minidom
          
          # Load the Contributor collection
          MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )
          
          
          def getTextFromElem(parent):
              '''Return a list of text found in the child nodes of a
              parent node, discarding whitespace.'''
              textList = []
              for n in parent.childNodes:
                  # TEXT_NODE - 3
                  if n.nodeType == 3 and n.nodeValue.strip():
                      textList.append(str(n.nodeValue.strip()))
              return textList
          
          def getElemChildren(parent):
              # Return a list of element nodes below parent
              elements = []
              for obj in parent.childNodes:
                  if obj.nodeType == obj.ELEMENT_NODE:
                      elements.append(obj)
              return elements
          
          def nodeTree(element, pad=0):
              # Return list of strings representing the node tree below element
              results = ["%s%s" % (pad*" ", str(element.nodeName))]
              nextElems = getElemChildren(element)
              if nextElems:
                  for node in nextElems:
                      results.extend(nodeTree(node, pad+2))
              else:
                  results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
              return results
          
          contributors = MINiML.documentElement.getElementsByTagName( 'Contributor' )
          for contributor in contributors:
              print "\n".join(nodeTree(contributor))
          contributors = MINiML.documentElement.getElementsByTagName( 'Database' )
          for contributor in contributors:
              print "\n".join(nodeTree(contributor))
          contributors = MINiML.documentElement.getElementsByTagName( 'Platform' )
          for contributor in contributors:
              print "\n".join(nodeTree(contributor))
          contributors = MINiML.documentElement.getElementsByTagName( 'Sample' )
          for contributor in contributors:
              print "\n".join(nodeTree(contributor))
          contributors = MINiML.documentElement.getElementsByTagName( 'Series' )
          for contributor in contributors:
              print "\n".join(nodeTree(contributor))
          Regards,
          Haobijam

          • haobijam
            New Member
            • Oct 2010
            • 16

            #6
            Hello,

            The output of this Python code is attached here, but line 86 of the MINiML.xml file is not printed. This is an error. Please see the output.

            Regards,
            Haobijam

            • bvdet
              Recognized Expert Specialist
              • Oct 2006
              • 2851

              #7
              I think the word "GenBank\xae" is the problem. I'm not sure what to do about that. You might try ElementTree to parse the file.
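
              One likely explanation, going by the last frame of the traceback: in Python 2, str() on a unicode value encodes it with the ASCII codec, which cannot represent u'\xae' (the registered sign). A minimal sketch of two ways around that follows, in Python 3 syntax. The SAMPLE document and its Description element are hypothetical stand-ins for line 86 of MINiML.xml, which is not shown in this thread, and get_text_from_elem is an assumed variant of getTextFromElem without the str() call:

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Platform text containing U+00AE.
SAMPLE = u"<Platform><Description>GenBank\u00ae annotations</Description></Platform>"
XML_BYTES = SAMPLE.encode("utf-8")

def get_text_from_elem(parent):
    """Like getTextFromElem, but without str(), so the text stays Unicode."""
    return [n.nodeValue.strip() for n in parent.childNodes
            if n.nodeType == n.TEXT_NODE and n.nodeValue.strip()]

# Option 1: keep minidom, keep the text as Unicode, encode only on output.
doc = xml.dom.minidom.parseString(XML_BYTES)
texts = get_text_from_elem(doc.getElementsByTagName("Description")[0])

try:
    texts[0].encode("ascii")      # what str() attempted in Python 2
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False              # same failure as in the traceback

utf8_bytes = texts[0].encode("utf-8")  # an explicit encoding succeeds

# Option 2: ElementTree, as suggested above; element text is Unicode too.
root = ET.fromstring(XML_BYTES)
et_texts = [(el.tag, el.text.strip())
            for el in root.iter() if el.text and el.text.strip()]

print(ascii_ok)
print(repr(utf8_bytes))
print(ascii(et_texts))
```

              Either way, the point is to avoid an implicit ASCII conversion: under Python 2 that means dropping str() and calling .encode('utf-8') on the joined text before printing.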
