Getting elements and text with lxml

  • J. Pablo Fernández

    Getting elements and text with lxml

    Hello,

    I have an XML file that starts with:

    <vortaro>
    <art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
    <kap>
    <ofc>*</ofc>-<rad>a</rad>
    </kap>

    out of it, I'd like to extract something like (I'm just showing one
    structure, any structure as long as all data is there is fine):

    [("ofc", "*"), "-", ("rad", "a")]

    How can I do it? I managed to get the content of both tags and the
    text up to the first tag ("\n "), but not the - (and in other XML
    files, there's more text outside the elements).

    Thanks.
  • Gabriel Genellina

    #2
    Re: Getting elements and text with lxml

    On Fri, 16 May 2008 18:53:03 -0300, J. Pablo Fernández <pupeno@pupeno.com>
    wrote:
    > Hello,
    >
    > I have an XML file that starts with:
    >
    > <vortaro>
    > <art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
    > <kap>
    > <ofc>*</ofc>-<rad>a</rad>
    > </kap>
    >
    > out of it, I'd like to extract something like (I'm just showing one
    > structure, any structure as long as all data is there is fine):
    >
    > [("ofc", "*"), "-", ("rad", "a")]
    >
    > How can I do it? I managed to get the content of both tags and the
    > text up to the first tag ("\n "), but not the - (and in other XML
    > files, there's more text outside the elements).

    Look for the "tail" attribute.

    --
    Gabriel Genellina
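
A minimal sketch of the "tail" hint above, using the fragment from the question. In the ElementTree model that lxml follows, text that comes after an element's closing tag is stored on that element as `tail`, so the "-" belongs to `<ofc>`, not to `<kap>`. The helper name `children_with_text` and the choice to drop whitespace-only strings are assumptions made for illustration; the import falls back to the standard library's `xml.etree.ElementTree` (same `text`/`tail` attributes) if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

XML = "<kap>\n  <ofc>*</ofc>-<rad>a</rad>\n</kap>"


def children_with_text(parent):
    """Collect (tag, text) tuples for direct children plus the text between them."""
    parts = []
    # Text before the first child lives on the parent itself.
    if parent.text and parent.text.strip():
        parts.append(parent.text.strip())
    for child in parent:
        parts.append((child.tag, child.text))
        # Text after this child's closing tag lives on the child as .tail.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


kap = etree.fromstring(XML)
result = children_with_text(kap)
print(result)  # [('ofc', '*'), '-', ('rad', 'a')]
```

Stripping is a convenience here; if leading or trailing spaces in the text matter, append `parent.text` and `child.tail` unchanged instead.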


    • J. Pablo Fernández

      #3
      Re: Getting elements and text with lxml

      On May 17, 2:19 am, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
      wrote:
      > On Fri, 16 May 2008 18:53:03 -0300, J. Pablo Fernández <pup...@pupeno.com>
      > wrote:
      >
      > > Hello,
      > >
      > > I have an XML file that starts with:
      > >
      > > <vortaro>
      > > <art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
      > > <kap>
      > >   <ofc>*</ofc>-<rad>a</rad>
      > > </kap>
      > >
      > > out of it, I'd like to extract something like (I'm just showing one
      > > structure, any structure as long as all data is there is fine):
      > >
      > > [("ofc", "*"), "-", ("rad", "a")]
      > >
      > > How can I do it? I managed to get the content of both tags and the
      > > text up to the first tag ("\n   "), but not the - (and in other XML
      > > files, there's more text outside the elements).
      >
      > Look for the "tail" attribute.

      That gives me the last part, but not the one in the middle:

      In : etree.tounicode(e)
      Out: u'<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>\n'

      In : e.text
      Out: '\n '

      In : e.tail
      Out: '\n'

      Thanks.


      • John Machin

        #4
        Re: Getting elements and text with lxml

        J. Pablo Fernández wrote:
        > On May 17, 2:19 am, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
        > wrote:
        > > On Fri, 16 May 2008 18:53:03 -0300, J. Pablo Fernández <pup...@pupeno.com>
        > > wrote:
        > >
        > > > Hello,
        > > > I have an XML file that starts with:
        > > > <vortaro>
        > > > <art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
        > > > <kap>
        > > >  <ofc>*</ofc>-<rad>a</rad>
        > > > </kap>
        > > > out of it, I'd like to extract something like (I'm just showing one
        > > > structure, any structure as long as all data is there is fine):
        > > > [("ofc", "*"), "-", ("rad", "a")]
        > > > How can I do it? I managed to get the content of both tags and the
        > > > text up to the first tag ("\n "), but not the - (and in other XML
        > > > files, there's more text outside the elements).
        > >
        > > Look for the "tail" attribute.
        >
        > That gives me the last part, but not the one in the middle:
        >
        > In : etree.tounicode(e)
        > Out: u'<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>\n'
        >
        > In : e.text
        > Out: '\n '
        >
        > In : e.tail
        > Out: '\n'

        You need the text content of your initial element's children, which
        needs that of their children, and so on.

        See http://effbot.org/zone/element-bits-and-pieces.htm

        HTH,
        John
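
The recursion described above (an element's content needs its children's content, which needs that of their children, and so on) could be sketched as follows. This is only an illustration: the function name `to_structure`, the `(tag, [content...])` output shape, and the simplified `<art>` element without its `mrk` attribute are assumptions, and the import falls back to the standard library's ElementTree when lxml is unavailable.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

XML = "<vortaro><art><kap>\n  <ofc>*</ofc>-<rad>a</rad>\n</kap></art></vortaro>"


def to_structure(el):
    """Return (tag, content), where content mixes strings and nested structures."""
    content = []
    if el.text and el.text.strip():
        content.append(el.text.strip())
    for child in el:
        # Recurse first, then pick up the text that follows the child.
        content.append(to_structure(child))
        if child.tail and child.tail.strip():
            content.append(child.tail.strip())
    return (el.tag, content)


root = etree.fromstring(XML)
tree = to_structure(root)
kap_structure = tree[1][0][1][0]  # vortaro -> art -> kap
print(kap_structure)  # ('kap', [('ofc', ['*']), '-', ('rad', ['a'])])
```

Because every level is handled by the same function, text outside the elements is kept at any depth, which covers the "other XML files" case mentioned in the question.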



        • Stefan Behnel

          #5
          Re: Getting elements and text with lxml

          J. Pablo Fernández wrote:
          > I have an XML file that starts with:
          >
          > <vortaro>
          > <art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
          > <kap>
          > <ofc>*</ofc>-<rad>a</rad>
          > </kap>
          >
          > out of it, I'd like to extract something like (I'm just showing one
          > structure, any structure as long as all data is there is fine):
          >
          > [("ofc", "*"), "-", ("rad", "a")]

          >>> root = etree.fromstring(xml)
          >>> l = []
          >>> for el in root.iter():    # or root.getiterator()
          ...     l.append((el.tag, el.text))
          ...     l.append(el.tail)    # text that follows el's closing tag

          or maybe this is enough:

          list(root.itertext())

          Stefan
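
To show what the `itertext()` suggestion above yields on the fragment from the question, here is a small sketch. It returns strings only, so the tag names are lost; whether that is "enough" depends on the use case. The whitespace filtering is an added assumption, and the import falls back to the standard library's ElementTree (which also provides `itertext()`) when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

XML = "<kap>\n  <ofc>*</ofc>-<rad>a</rad>\n</kap>"

kap = etree.fromstring(XML)

# Every text and tail string in the subtree, in document order.
pieces = list(kap.itertext())
print(pieces)  # ['\n  ', '*', '-', 'a', '\n']

# Optional: drop the whitespace-only strings that come from indentation.
stripped = [p for p in pieces if p.strip()]
print(stripped)  # ['*', '-', 'a']
```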


          • J. Pablo Fernández

            #6
            Re: Getting elements and text with lxml

            On May 17, 4:17 pm, Stefan Behnel <stefan...@behnel.de> wrote:
            > J. Pablo Fernández wrote:
            > > I have an XML file that starts with:
            > >
            > > <vortaro>
            > > <art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
            > > <kap>
            > >   <ofc>*</ofc>-<rad>a</rad>
            > > </kap>
            > >
            > > out of it, I'd like to extract something like (I'm just showing one
            > > structure, any structure as long as all data is there is fine):
            > >
            > > [("ofc", "*"), "-", ("rad", "a")]
            >
            >     >>> root = etree.fromstring(xml)
            >     >>> l = []
            >     >>> for el in root.iter():    # or root.getiterator()
            >     ...     l.append((el.tag, el.text))
            >     ...     l.append(el.tail)    # text that follows el's closing tag
            >
            > or maybe this is enough:
            >
            >     list(root.itertext())
            >
            > Stefan

            Hello,

            My object doesn't have iter() or itertext(); it only has
            iterancestors, iterchildren, iterdescendants and itersiblings.

            Thanks.
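
For an element object that lacks `iter()` and `itertext()`, as described above, the same traversal can be written by hand using only `.text`, `.tail` and plain iteration over an element's children. The generator name `iter_text` is hypothetical, and this is a sketch of the idea rather than a drop-in replacement for the library method; the import falls back to the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

XML = "<kap>\n  <ofc>*</ofc>-<rad>a</rad>\n</kap>"


def iter_text(el):
    """Yield all text inside el in document order, without iter()/itertext()."""
    if el.text:
        yield el.text
    for child in el:
        # Everything inside the child, then the text that follows it.
        yield from iter_text(child)
        if child.tail:
            yield child.tail


kap = etree.fromstring(XML)
pieces = list(iter_text(kap))
joined = "".join(pieces)
print(pieces)  # ['\n  ', '*', '-', 'a', '\n']
print(repr(joined))  # '\n  *-a\n'
```

The tail of the starting element itself is deliberately not yielded, since it lies outside that element.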
