XPath querying text node *including* <br/>

  • Sven

    XPath querying text node *including* <br/>

    Dear all,

    I'm trying to extract data from HTML using XPath in Java.
    Unfortunately the text contents of nodes may contain <br/> tags which
    are not correctly interpreted, at least not for me ;)

    A <p> node may contain this text:

    <p>
    Test1<br/>
    Test2<br/>
    Test3
    </p>

    Which is returned by the XPath query as "Test1Test2Test3" but I need
    it as "Test1\nTest2\nTest3" or "Test1 Test2 Test3".

    Here's example code (Java 6):

    import java.io.StringReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.xml.sax.InputSource;

    public class Example {
        private static final String html =
            "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";

        public static void main( String[] args ) throws Exception {
            final XPathFactory xPathFactory = XPathFactory.newInstance();

            // String value of the first <p>: all descendant text, <br/> adds nothing
            XPath xPath = xPathFactory.newXPath();
            String value = (String)xPath.evaluate(
                "//p",
                new InputSource( new StringReader( html ) ),
                XPathConstants.STRING );

            System.out.println( value );

            // Selects three text nodes, but STRING keeps only the first one
            xPath = xPathFactory.newXPath();
            value = (String)xPath.evaluate(
                "//p/text()",
                new InputSource( new StringReader( html ) ),
                XPathConstants.STRING );

            System.out.println( value );

            // Selects all child nodes; again only the first (a text node) is used
            xPath = xPathFactory.newXPath();
            value = (String)xPath.evaluate(
                "//p/node()",
                new InputSource( new StringReader( html ) ),
                XPathConstants.STRING );

            System.out.println( value );
        }
    }

    This code returns:

    Test1Test2Test3
    Test1
    Test1

    Is there any way (XPath function etc) which will return the contents
    as desired?

    Thank you!
  • Bjoern Hoehrmann

    #2
    Re: XPath querying text node *including* <br/>

    * Sven wrote in comp.text.xml:
    > I'm trying to extract data from HTML using XPath in Java.
    > Unfortunately the text contents of nodes may contain <br/> tags which
    > are not correctly interpreted, at least not for me ;)

    You have to convert them to line breaks yourself; with XPath 1.0 there
    is no way to transform them to line breaks with a simple expression. It
    would be easy to do with XSLT; otherwise you have to implement this in
    code. If you don't have other child elements you could simply iterate
    over the children of the element, append text to a buffer and if you
    have a br element instead, append a line break to the buffer.
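
A minimal sketch of that buffer approach, assuming the markup is well-formed XML and the element has no child elements other than br. The class name `BrToNewline` and the helper `textWithLineBreaks` are made up for illustration; the parser is the default, non-namespace-aware `DocumentBuilder`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class BrToNewline {

    /** Concatenates the direct children of an element, turning each br element into '\n'. */
    public static String textWithLineBreaks(Node element) {
        StringBuilder buffer = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                buffer.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE && "br".equals(child.getNodeName())) {
                buffer.append('\n');
            }
            // Any other child (comments, other elements) is skipped in this sketch.
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        // Fetch the element itself (NODE), not its string value.
        Node p = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//p", doc, XPathConstants.NODE);
        System.out.println(textWithLineBreaks(p));
    }
}
```

XPath is only used here to locate the element; the line breaks are produced by the loop over its children.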
    --
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
    68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

    • Joshua Cranmer

      #3
      Re: XPath querying text node *including* <br/>

      Sven wrote:
      > Is there any way (XPath function etc) which will return the contents
      > as desired?

      String sanitized = html.replaceAll("<br/>", "\n");
      and then replace your usages of `html' with `sanitized'.

      --
      Beware of bugs in the above code; I have only proved it correct, not
      tried it. -- Donald E. Knuth

      • Philippe Poulard

        #4
        Re: XPath querying text node *including* <br/>

        Joshua Cranmer wrote:
        > String sanitized = html.replaceAll("<br/>", "\n");
        > and then replace your usages of `html' with `sanitized'.

        Hi,

        This usually doesn't work, for a thousand different reasons. For example:

        <br></br>
        <!-- this <br/> isn't a line break -->
        <![CDATA[this <br/> isn't a line break]]>
        <br><?todo : buy some <br/>ead?></br>

        etc...

        This is the main reason why we have to use parsers: that way, one can
        process things for what they are rather than for what they look like.
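
To make two of those cases concrete, here is a small sketch of how the plain string replacement behaves. The class name `ReplaceAllPitfalls` and the method `naive` are hypothetical names used only for this illustration.

```java
public class ReplaceAllPitfalls {

    /** The string-level replacement suggested earlier in the thread. */
    public static String naive(String markup) {
        return markup.replaceAll("<br/>", "\n");
    }

    public static void main(String[] args) {
        // Equivalent spellings of an empty br element are left untouched.
        System.out.println(naive("a<br></br>b"));
        System.out.println(naive("a<br />b"));
        // Text inside a comment is rewritten although it is not an element.
        System.out.println(naive("<!-- this <br/> isn't a line break -->"));
    }
}
```

The replacement sees only characters, so it cannot tell an element from text that merely looks like one.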

        With a SAX filter the code is more verbose, but correct:

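        import org.xml.sax.Attributes;
        import org.xml.sax.SAXException;
        import org.xml.sax.helpers.XMLFilterImpl;

        public class LineBreakFilter extends XMLFilterImpl {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes atts) throws SAXException {
                if ( "br".equals(localName) ) {
                    // emit a newline instead of the br start tag
                    characters("\n".toCharArray(), 0, 1);
                } else {
                    super.startElement(uri, localName, qName, atts);
                }
            }
            @Override
            public void endElement(String uri, String localName, String qName)
                    throws SAXException {
                if ( ! "br".equals(localName) ) {
                    super.endElement(uri, localName, qName);
                } // else do nothing: the br end tag is dropped as well
            }
        }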

        You just have to plug it into a SAX parser (beware of namespaces if you
        have any).

        --
        Cordialement,

        ///
        (. .)
        --------ooO--(_)--Ooo--------
        | Philippe Poulard |
        -----------------------------

        Have the RefleX !

        • Philippe Poulard

          #5
          Re: XPath querying text node *including* <br/>

          Philippe Poulard wrote:
          > if ( "br".equals(localName) ) {

          I forgot to add in the test: && uri == null (or && uri.length() == 0, I
          don't remember what the SAX parser is supposed to give)
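
A sketch of how the filter might be plugged into a parser, with the namespace test added. It accepts both a null and an empty namespace URI, so it does not depend on which of the two the parser reports. `LineBreakFilterDemo` and `extractText` are hypothetical names, and the handler here simply collects all character data in the document rather than the content of one element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class LineBreakFilterDemo {

    /** Replaces every br element that is in no namespace with a newline character event. */
    public static class LineBreakFilter extends XMLFilterImpl {
        public LineBreakFilter(XMLReader parent) {
            super(parent);
        }

        private static boolean isBr(String uri, String localName) {
            return "br".equals(localName) && (uri == null || uri.length() == 0);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (isBr(uri, localName)) {
                characters(new char[] {'\n'}, 0, 1);
            } else {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (!isBr(uri, localName)) {
                super.endElement(uri, localName, qName);
            }
        }
    }

    /** Parses the markup through the filter and returns all character data it produces. */
    public static String extractText(String markup) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        XMLReader filter = new LineBreakFilter(factory.newSAXParser().getXMLReader());

        final StringBuilder text = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(markup)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                extractText("<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>"));
    }
}
```

Note that for real XHTML the br element sits in the XHTML namespace, so the test in `isBr` would have to compare against that namespace URI instead.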

          • Sven

            #6
            Re: XPath querying text node *including* <br/>

            On 28 Apr., 00:11, Joshua Cranmer <Pidgeo...@verizon.invalid> wrote:
            > String sanitized = html.replaceAll("<br/>", "\n");
            > and then replace your usages of `html' with `sanitized'.

            Thanks for the hint! Although Philippe noted that this may not work in
            all situations, it's sufficient for me at the moment.

            Now I have another problem with text nodes (damn text nodes *g*).
            Still the same scenario where I try to extract data from XHTML pages,
            let's assume we have nodes like this

            <div>
            Text1
            <a href="http://...">Link</a>
            Text2
            </div>

            Then \\div\text() will only return "Text1". It's basically the same
            problem where text nodes are interrupted by child nodes. Any way with
            pure XPath to fetch the whole text?

            Thanks!

            • Joseph J. Kesselman

              #7
              Re: XPath querying text node *including* <br/>

              > <div>
              > Text1
              > <a href="http://...">Link</a>
              > Text2
              > </div>
              >
              > Then \\div\text() will only return "Text1".

              //div/text() (note: FORWARD slashes in XPath!) will return two text
              nodes. Whatever you are doing with the result of that path may be
              operating on only the first node returned, but you didn't show us that
              which makes it hard to advise you.

              Alternatively, you could retrieve the text value of the <div> element,
              but that would include "Link" as well, since it's defined as all
              contained text.

              • Martin Honnen

                #8
                Re: XPath querying text node *including* <br/>

                Sven wrote:
                > <div>
                > Text1
                > <a href="http://...">Link</a>
                > Text2
                > </div>
                >
                > Then \\div\text() will only return "Text1". It's basically the same
                > problem where text nodes are interrupted by child nodes. Any way with
                > pure XPath to fetch the whole text?

                Well,
                string(/div)
                will give you the text contained in that element which is
                "
                Text1
                Link
                Text2
                "

                And
                /div/text()
                as an XPath 1.0 expression selects two text nodes over which you can
                iterate to extract
                "
                Text1
                "
                and
                "
                Text2
                "

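The difference between the two expressions can be sketched in Java as follows, assuming the div is the document element. `DivTextDemo` and its two helper methods are hypothetical names, and the text nodes are trimmed only to make the output easier to read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DivTextDemo {
    static final String XML = "<div>\nText1\n<a href=\"http://...\">Link</a>\nText2\n</div>";

    /** string(/div): all text contained in the element, including the link text. */
    public static String wholeText() throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate("string(/div)", new InputSource(new StringReader(XML)));
    }

    /** /div/text(): only the text nodes that are direct children of the div. */
    public static List<String> directTextNodes() throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate("/div/text()",
                new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(wholeText());
        System.out.println(directTextNodes());
    }
}
```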

                --

                Martin Honnen

                • Sven

                  #9
                  Re: XPath querying text node *including* <br/>

                  On 23 May, 19:33, "Joseph J. Kesselman" <keshlam-nos...@comcast.net>
                  wrote:
                  > //div/text() (note: FORWARD slashes in XPath!) will return two text
                  > nodes. Whatever you are doing with the result of that path may be
                  > operating on only the first node returned, but you didn't show us that
                  > which makes it hard to advise you.

                  Thanks, this was my bad! In Java I used XPath#evaluate( String
                  expression, InputSource source, QName returnType ) with returnType ==
                  XPathConstants.STRING, which doesn't concatenate multiple results: the
                  string value of a node-set is that of its first node only. I'm now
                  using XPathConstants.NODESET and everything works as expected. Great!
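
For reference, a sketch of that difference applied to the <p> example from the first post. `JoinTextNodes`, `firstOnly` and `joined` are hypothetical names; joining the selected nodes with a separator is one possible way to get the "Test1 Test2 Test3" form asked for originally.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JoinTextNodes {

    /** Evaluates with STRING: only the first selected node contributes. */
    public static String firstOnly(String xml, String expression) throws Exception {
        return (String) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.STRING);
    }

    /** Evaluates with NODESET and joins the text of every selected node. */
    public static String joined(String xml, String expression, String separator) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i > 0) {
                buffer.append(separator);
            }
            buffer.append(nodes.item(i).getTextContent());
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";
        System.out.println(firstOnly(html, "//p/text()"));   // Test1
        System.out.println(joined(html, "//p/text()", " ")); // Test1 Test2 Test3
    }
}
```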
