Parsing tag names and values from XML files

**bvdet** · Nov 18 '10, 03:12 PM

To begin with, you misspelled "Contributor", so you never got any elements.

Each "Contributor" can have a varying number of ELEMENT_NODE child nodes. Some of those child nodes can have ELEMENT_NODE child nodes of their own.

Note that the first child node can be a Text node with no real text value:

Code:

>>> Contributor.childNodes[0]
<DOM Text node "\n    ">
>>>

Code:

>>> Contributor
<DOM Element: Contributor at 0x1238ad0>
>>> Contributor.childNodes[1].nodeName
u'Organization'
>>>

Try the following and look at the output. Then decide the best way to get the data you need for printing.

Code:

for Contributor in Contributors:
    for elem in Contributor.childNodes:
        print repr(elem)
        if elem.hasChildNodes():
            for item in elem.childNodes:
                print "   ", repr(item)
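
To see the whitespace-only Text nodes without the real file, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2 print statements). The `SAMPLE` string and the `describe_children` helper are made up for illustration and are not part of MINiML or minidom.

```python
import xml.dom.minidom

# Hypothetical sample; the real MINiML file has many more fields.
SAMPLE = """<MINiML>
  <Contributor iid="contrib1">
    <Organization>Weill Cornell Medical College</Organization>
    <Email>yas2003@med.cornell.edu</Email>
  </Contributor>
</MINiML>"""


def describe_children(node):
    """Return (nodeName, is_whitespace_only_text) for each child of node."""
    rows = []
    for child in node.childNodes:
        is_blank = child.nodeType == child.TEXT_NODE and not child.data.strip()
        rows.append((child.nodeName, is_blank))
    return rows


doc = xml.dom.minidom.parseString(SAMPLE)
contributor = doc.getElementsByTagName("Contributor")[0]
rows = describe_children(contributor)
for name, is_blank in rows:
    print(name, "(whitespace only)" if is_blank else "")
```

Each element child is surrounded by `#text` nodes that hold only the indentation, which is why `childNodes[0]` above is not the first element.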

**haobijam** · Nov 19 '10, 10:02 AM

Dear,

Could you please tell me how I could parse the attributes and their values from the XML file (MINiML.txt)? I would like to print output like the following:

Contributor iid = "contrib1"
Person Yael Strulovici-Barel
Email yas2003@med.cornell.edu
Phone 646-962-5560
Laboratory Crystal
Department Department of Genetic Medicine
Organization Weill Cornell Medical College
Line 1300 York Avenue
City New York
State NY
Zip-Code 10021
Country USA

Regards,
Haobijam

**bvdet** · Nov 19 '10, 02:25 PM

I wrote some functions for an application of mine that you may find useful or give you ideas on how to format the output for your application. The first one returns a list of text found in the child nodes of a parent node. Whitespace is ignored.

Code:

def getTextFromElem(parent):
    '''Return a list of text found in the child nodes of a
    parent node, discarding whitespace.'''
    textList = []
    for n in parent.childNodes:
        # TEXT_NODE - 3
        if n.nodeType == 3 and n.nodeValue.strip():
            textList.append(str(n.nodeValue.strip()))
    return textList

The second returns a list of element nodes below a parent node.

Code:

def getElemChildren(parent):
    # Return a list of element nodes below parent
    elements = []
    for obj in parent.childNodes:
        if obj.nodeType == obj.ELEMENT_NODE:
            elements.append(obj)
    return elements

The third returns a list of strings representing the node tree below a parent node, using recursion to reach nested levels.

Code:

def nodeTree(element, pad=0):
    # Return list of strings representing the node tree below element
    results = ["%s%s" % (pad*" ", str(element.nodeName))]
    nextElems = getElemChildren(element)
    if nextElems:
        for node in nextElems:
            results.extend(nodeTree(node, pad+2))
    else:
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
    return results

Using nodeTree() in your application:

Code:

>>> contributors = xmlDoc.documentElement.getElementsByTagName( 'Contributor' )
>>> for contributor in contributors:
... 	print "\n".join(nodeTree(contributor))
... 	
Contributor
  Person
    First
      Yael
    Last
      Strulovici-Barel
  Email
    yas2003@med.cornell.edu
  Phone
    646-962-5560
  Laboratory
    Crystal
  Department
    Department of Genetic Medicine
  Organization
    Weill Cornell Medical College
  Address
    Line
      1300 York Avenue
    City
      New York
    State
      NY
    Zip-Code
      10021
    Country
      USA
Contributor
  Organization
    
  Email
    geo@ncbi.nlm.nih.gov, support@affymetrix.com
  Phone
    888-362-2447
  Organization
    Affymetrix, Inc.
  Address
    City
      Santa Clara
    State
      CA
    Zip-Code
      95051
    Country
      USA
  Web-Link
    http://www.affymetrix.com/index.affx
Contributor
  Person
    First
      Brendan
    Last
      Carolan
Contributor
  Person
    First
      Ben-Gary
    Last
      Harvey
Contributor
  Person
    First
      Bishnu
    Middle
      P
    Last
      De
Contributor
  Person
    First
      Holly
    Last
      Vanni
Contributor
  Person
    First
      Ronald
    Middle
      G
    Last
      Crystal
>>>

Since we are not extracting attributes, I modified the title of this thread.

BV - Moderator
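
If an attribute such as `iid` is wanted after all, minidom exposes attributes through `getAttribute()` and the `attributes` mapping. A minimal sketch, assuming Python 3 and a made-up `SAMPLE` string. The `attribute_line` helper is hypothetical, named here only for illustration.

```python
import xml.dom.minidom

# Hypothetical sample; only the iid attribute matters here.
SAMPLE = """<MINiML>
  <Contributor iid="contrib1">
    <Phone>646-962-5560</Phone>
  </Contributor>
</MINiML>"""


def attribute_line(element):
    """Format the element name followed by each attribute as name = "value"."""
    pairs = ['%s = "%s"' % (name, value)
             for name, value in element.attributes.items()]
    return " ".join([element.nodeName] + pairs)


doc = xml.dom.minidom.parseString(SAMPLE)
contributor = doc.getElementsByTagName("Contributor")[0]
print(attribute_line(contributor))
print(contributor.getAttribute("iid"))
```

The resulting line could be used in place of the bare node name that nodeTree() puts at the top of each block.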

**haobijam** · Nov 21 '10, 12:05 PM

Hello,

Thanks for your help. I assembled and ran the script, but there was an error in the Platform section, at line 86 of the MINiML.xml file. When I remove this line and run the script, it prints the output we want correctly. The error output looks like this:

>>>

Traceback (most recent call last):
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 45, in <module>
    print "\n".join(nodeTree(contributor))
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 32, in nodeTree
    results.extend(nodeTree(node, pad+2))
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 34, in nodeTree
    results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 15, in getTextFromElem
    textList.append(str(n.nodeValue.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 379: ordinal not in range(128)

Code:

#!/usr/bin/python
import xml.dom.minidom

# Load the Contributor collection
MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )


def getTextFromElem(parent):
    '''Return a list of text found in the child nodes of a
    parent node, discarding whitespace.'''
    textList = []
    for n in parent.childNodes:
        # TEXT_NODE - 3
        if n.nodeType == 3 and n.nodeValue.strip():
            textList.append(str(n.nodeValue.strip()))
    return textList

def getElemChildren(parent):
    # Return a list of element nodes below parent
    elements = []
    for obj in parent.childNodes:
        if obj.nodeType == obj.ELEMENT_NODE:
            elements.append(obj)
    return elements

def nodeTree(element, pad=0):
    # Return list of strings representing the node tree below element
    results = ["%s%s" % (pad*" ", str(element.nodeName))]
    nextElems = getElemChildren(element)
    if nextElems:
        for node in nextElems:
            results.extend(nodeTree(node, pad+2))
    else:
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
    return results

contributors = MINiML.documentElement.getElementsByTagName( 'Contributor' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Database' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Platform' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Sample' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Series' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))

Regards,
Haobijam
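
The traceback above ends in the `str()` call inside getTextFromElem. In Python 2, `str()` on a unicode string encodes it with the ASCII codec, which cannot represent u'\xae' (the registered-trademark sign). One possible fix is to drop `str()`, keep the text as Unicode, and encode to UTF-8 only when writing output. The sketch below assumes Python 3, where the explicit `.encode("ascii")` stands in for what Python 2's `str()` does implicitly. The `SAMPLE` string and the name `get_text_from_elem` are made up for illustration.

```python
import xml.dom.minidom

# Hypothetical sample containing the character from the traceback (U+00AE).
SAMPLE = "<Platform><Description>GenBank\u00ae release</Description></Platform>"


def get_text_from_elem(parent):
    """Like getTextFromElem above, but keeps the text as Unicode (no str() call)."""
    return [n.data.strip() for n in parent.childNodes
            if n.nodeType == n.TEXT_NODE and n.data.strip()]


doc = xml.dom.minidom.parseString(SAMPLE.encode("utf-8"))
description = doc.getElementsByTagName("Description")[0]
texts = get_text_from_elem(description)

# An explicit ASCII encode fails in the same way as the traceback.
try:
    texts[0].encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and keeps the character.
utf8_bytes = texts[0].encode("utf-8")
print(ascii_ok, utf8_bytes)
```

In the Python 2 script, the same idea would mean removing `str()` from getTextFromElem and nodeTree. The joined result may then still need `.encode('utf-8')` before it is printed, depending on the console.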

**haobijam** · Nov 21 '10, 12:13 PM

Hello,

The output of this Python code is attached here, but line 86 of the MINiML.xml file is not printed. This is an error. Please see the output.

Regards,
Haobijam

Attached Files

output.txt (199.8 KB, 364 views)

**bvdet** · Nov 21 '10, 05:11 PM

I think the word "GenBank\xae" is the problem. I'm not sure what to do about that. You might try ElementTree to parse the file.
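
For anyone who wants to try the ElementTree suggestion, here is a rough sketch of a nodeTree() equivalent. It assumes Python 3 and a made-up `SAMPLE` string, and `node_tree` is a hypothetical name. If the real MINiML file declares a default XML namespace, ElementTree would report tags in the form `{uri}Contributor`, and the `find()` call would need that prefix too.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; includes the non-ASCII character from the traceback.
SAMPLE = (
    "<MINiML><Contributor iid='contrib1'>"
    "<Organization>GenBank\u00ae</Organization>"
    "<Phone>888-362-2447</Phone>"
    "</Contributor></MINiML>"
)


def node_tree(element, pad=0):
    """Return indented lines for element: tag names, with text under leaf tags."""
    lines = [" " * pad + element.tag]
    children = list(element)
    if children:
        for child in children:
            lines.extend(node_tree(child, pad + 2))
    else:
        lines.append(" " * (pad + 2) + (element.text or "").strip())
    return lines


root = ET.fromstring(SAMPLE.encode("utf-8"))
lines = node_tree(root.find("Contributor"))
print("\n".join(lines))
```

Text stays Unicode throughout, so no ASCII conversion is attempted on the way to the output lines.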
