"Full" element tag listing possible with Elementtree?

  • jaime.dyson@gmail.com

    "Full" element tag listing possible with Elementtree?

    Hello all,

    I have the unenviable task of turning about 20K strangely formatted
    XML documents from different sources into something resembling a
    clean, standard, uniform format. I like Elementtree and have been
    using it to step through the documents to get a feel for their
    structure. .getiterator() gives me a depth-first traversal that
    eliminates the hierarchy of the elements. What I'd like is to be able
    to traverse elements while keeping track of ancestors, and print out
    the full structure of all of an ancestor's nodes as I arrive at each
    node. So, for example, if I had a document that looked like this:

    <a>
      <b att="atttag" content="b"> this is node b </b>
      <c> this is node c
        <d />
        <e> this is node e </e>
      </c>
      <f> this is node f </f>
    </a>

    I would want to print the following:

    <a>
    <a> <b>
    <a> <b> text: this is node b
    <a> <c>
    <a> <c> text: this is node c
    <a> <c> <d>
    <a> <c> <e>
    <a> <c> <e> text: this is node e
    <a> <f>
    <a> <f> this is node f


    Is there a simple way to do this? Any help would be appreciated.
    Thanks..

  • Fredrik Lundh

    #2
    Re: "Full" element tag listing possible with Elementtree?

    jaime.dyson@gmail.com wrote:
    > <a>
    >   <b att="atttag" content="b"> this is node b </b>
    >   <c> this is node c
    >     <d />
    >     <e> this is node e </e>
    >   </c>
    >   <f> this is node f </f>
    > </a>
    >
    > I would want to print the following:
    >
    > <a>
    > <a> <b>
    > <a> <b> text: this is node b
    > <a> <c>
    > <a> <c> text: this is node c
    > <a> <c> <d>
    > <a> <c> <e>
    > <a> <c> <e> text: this is node e
    > <a> <f>
    > <a> <f> this is node f
    >
    > Is there a simple way to do this? Any help would be appreciated.

    in stock ET, using a parent map is probably the easiest way to do this:

        http://effbot.org/zone/element.htm#accessing-parents

    that is, for a given ET structure "tree", you can do

    parent_map = dict((c, p) for p in tree.getiterator() for c in p)

    def get_parents(elem):
        parents = []
        while 1:
            elem = parent_map.get(elem)
            if elem is None:
                break
            parents.append(elem)
        return reversed(parents)

    for elem in tree.getiterator():
        print list(get_parents(elem)), elem

    </F>
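
For readers on a current Python: the code above is Python 2 era, and Element.getiterator() was removed in Python 3.9 in favour of Element.iter(). The following is a minimal sketch of the same parent-map idea for Python 3, not Fredrik's own code. The sample document is the one from the question with its tags repaired, and tag_path is a hypothetical helper added only for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (tags repaired for this sketch).
XML = """<a>
  <b att="atttag" content="b"> this is node b </b>
  <c> this is node c
    <d />
    <e> this is node e </e>
  </c>
  <f> this is node f </f>
</a>"""

root = ET.fromstring(XML)

# Map every child element to its parent; the root has no entry.
parent_map = {child: parent for parent in root.iter() for child in parent}


def get_parents(elem):
    """Return the ancestors of elem, outermost first."""
    parents = []
    while True:
        elem = parent_map.get(elem)
        if elem is None:
            break
        parents.append(elem)
    return parents[::-1]


def tag_path(elem):
    """Hypothetical helper: tags from the root down to elem."""
    return [p.tag for p in get_parents(elem)] + [elem.tag]


# root.iter() walks the tree depth-first, in document order.
for elem in root.iter():
    print(" ".join("<%s>" % tag for tag in tag_path(elem)))
```

The parent map is built once per tree, so each ancestor lookup costs only as many dictionary reads as the element is deep. It has to be rebuilt if the tree is modified.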


    • Stefan Behnel

      #3
      Re: "Full" element tag listing possible with Elementtree?

      jaime.dyson@gmail.com wrote:
      > I have the unenviable task of turning about 20K strangely formatted
      > XML documents from different sources into something resembling a
      > clean, standard, uniform format. I like Elementtree and have been
      > using it to step through the documents to get a feel for their
      > structure. .getiterator() gives me a depth-first traversal that
      > eliminates the hierarchy of the elements. What I'd like is to be able
      > to traverse elements while keeping track of ancestors, and print out
      > the full structure of all of an ancestor's nodes as I arrive at each
      > node.

      Try lxml.etree. It's an extended re-implementation of ElementTree based on
      libxml2. Amongst tons of other features, it provides its Elements with a
      getparent() method and allows you to iterate over their ancestors (and other
      XPath axes), or to iterate over a parsed document in an iterparse-like fashion
      (called iterwalk).



      Stefan
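
As a rough illustration of the lxml approach described above, here is a minimal sketch using getparent() and iterancestors(). It assumes the third-party lxml package is installed. The compact sample document and the tag_path helper are hypothetical and exist only for this example.

```python
from lxml import etree  # third-party package, assumed installed (pip install lxml)

# Compact version of the sample document from the question.
root = etree.fromstring(
    "<a><b>this is node b</b><c>this is node c<d/>"
    "<e>this is node e</e></c><f>this is node f</f></a>"
)


def tag_path(elem):
    """Hypothetical helper: tags from the root down to elem."""
    # iterancestors() yields the nearest ancestor first, so reverse the list.
    ancestors = [a.tag for a in elem.iterancestors()]
    return ancestors[::-1] + [elem.tag]


# This sample contains no comments or processing instructions,
# so every node yielded by iter() is a regular element.
for elem in root.iter():
    print(" ".join("<%s>" % tag for tag in tag_path(elem)))
```

Because lxml elements know their parent, no separate parent map is needed. getparent() returns None for the root element.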


      • jaime.dyson@gmail.com

        #4
        Re: "Full" element tag listing possible with Elementtree?

        On Sep 4, 11:43 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
        > jaime.dy...@gmail.com wrote:
        > > <a>
        > >   <b att="atttag" content="b"> this is node b </b>
        > >   <c> this is node c
        > >     <d />
        > >     <e> this is node e </e>
        > >   </c>
        > >   <f> this is node f </f>
        > > </a>
        > >
        > > I would want to print the following:
        > >
        > > <a>
        > > <a> <b>
        > > <a> <b> text: this is node b
        > > <a> <c>
        > > <a> <c> text: this is node c
        > > <a> <c> <d>
        > > <a> <c> <e>
        > > <a> <c> <e> text: this is node e
        > > <a> <f>
        > > <a> <f> this is node f
        > >
        > > Is there a simple way to do this?  Any help would be appreciated.
        >
        > in stock ET, using a parent map is probably the easiest way to do this:
        >
        >     http://effbot.org/zone/element.htm#accessing-parents
        >
        > that is, for a given ET structure "tree", you can do
        >
        > parent_map = dict((c, p) for p in tree.getiterator() for c in p)
        >
        > def get_parents(elem):
        >     parents = []
        >     while 1:
        >         elem = parent_map.get(elem)
        >         if elem is None:
        >             break
        >         parents.append(elem)
        >     return reversed(parents)
        >
        > for elem in tree.getiterator():
        >     print list(get_parents(elem)), elem
        >
        > </F>

        Fantastic. Thank you very much, Fredrik! And thanks for ET!


        • Carl Banks

          #5
          Re: "Full" element tag listing possible with Elementtree?

          On Sep 5, 2:27 am, jaime.dy...@gmail.com wrote:
          > So, for example, if I had a document that looked like this:
          >
          > <a>
          >   <b att="atttag" content="b"> this is node b </b>
          >   <c> this is node c
          >     <d />
          >     <e> this is node e </e>
          >   </c>
          >   <f> this is node f </f>
          > </a>
          >
          > I would want to print the following:
          >
          > <a>
          > <a> <b>
          > <a> <b> text: this is node b
          > <a> <c>
          > <a> <c> text: this is node c
          > <a> <c> <d>
          > <a> <c> <e>
          > <a> <c> <e> text: this is node e
          > <a> <f>
          > <a> <f> this is node f
          >
          > Is there a simple way to do this? Any help would be appreciated.
          > Thanks..

          Fredrik Lundh wrote ElementTree, so he'd know the best solution, but
          I'd like to point out that this is also trivially easy with recursion:


          def print_nodes(element, ancestors=[]):
              s = hierarchy = ancestors + ["<" + element.tag + ">"]
              if element.text is not None:
                  s = s + [element.text]
              print " ".join(s)
              for subelement in element:
                  print_nodes(subelement, hierarchy)



          Carl Banks
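
The recursive version above uses the Python 2 print statement. A minimal Python 3 sketch of the same idea follows. It is an adaptation, not Carl's original: the generator name walk is hypothetical, the function yields lines instead of printing them, it strips surrounding whitespace from the text, and it uses a tuple as the default for ancestors.

```python
import xml.etree.ElementTree as ET


def walk(element, ancestors=()):
    """Yield one line per element: ancestor tags, own tag, then stripped text."""
    hierarchy = ancestors + ("<%s>" % element.tag,)
    parts = list(hierarchy)
    text = (element.text or "").strip()
    if text:
        parts.append(text)
    yield " ".join(parts)
    # Each recursive call receives the path down to the current element.
    for child in element:
        yield from walk(child, hierarchy)


# Compact version of the sample document from the question.
root = ET.fromstring(
    "<a><b>this is node b</b><c>this is node c<d/>"
    "<e>this is node e</e></c><f>this is node f</f></a>"
)
for line in walk(root):
    print(line)
```

With the sample document this prints the tag path and text on a single line per element, for example "<a> <c> <e> this is node e".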
