Getting the complete text content of a node...

  • Arancaytar

    Getting the complete text content of a node...

    (Note: I am a Javascript newbie. I can handle PHP and Java, but this
    is unfamiliar territory.)

    For a wordcount feature, I need to collect the complete text content
    of a 'div' element inside a variable. Because of the issues with
    paragraphs and markup, the content is split into different nodes in
    the DOM.

    For example:

    <div>Hello, <p>this text is <span style="font-style:italic">italic</span></p></div>

    This will cause (more or less) a DOM tree like this:

    [div]
    - Hello,
    - [p]
      - this text is
      - [span]
        - italic

    Now, my function has, as a starting point, the [div] node that is the
    top parent here.

    function aggregateTextNode(textNode) {
      ...
      return allText;
    }

    Since I don't know the depth of the nodes, I am trying to build this
    as a recursive function. "depth" is a parameter that ensures I can
    limit the recursion to a certain level.

    function aggregateTextNode(textNode, depth) {
      // get the text value of the current node
      var text = textNode.nodeValue;
      // recursion limit reached
      if (depth == 0) return text;
      // if the node has child nodes, aggregate these
      for (i = 0; i < textNode.childNodes.length; i++) {
        // append aggregated text
        text += aggregateTextNode(textNode.childNodes[i], depth - 1);
      }
      return text;
    }

    However, no matter where I set the recursion limit, the script
    invariably freezes Firefox until the timeout is reached and I can
    abort it - infinite loop, apparently.

    Can you see what's wrong with my code? It's very clearly the recursion
    that causes it, because if the node has no child nodes at all (say
    "<div>Just text</div>"), it succeeds. But if there is only a single
    child node, it hangs itself.

    Meanwhile, I've managed to do it with a very ugly nested loop that can
    go three levels deep, but I'd really rather use the recursive approach
    if at all possible.

  • Christoph Burschka

    #2
    Re: Getting the complete text content of a node...

    Arancaytar wrote:
    > However, no matter where I set the recursion limit, the script
    > invariably freezes Firefox until the timeout is reached and I can
    > abort it - infinite loop, apparently.

    I forgot to add: The Error Console shows no warnings or notices - I
    don't have any indication that the script is crashing apart from the
    freeze itself.

    Also, a more readable version of the function with shorter, non-wrapped
    lines:

    function aggregateTextNode(node, depth) {
      var text = node.nodeValue;
      if (depth == 0) return text;
      for (i = 0; i < node.childNodes.length; i++) {
        text += aggregateTextNode(node.childNodes[i], depth - 1);
      }
      return text;
    }
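
A likely cause of the freeze, which the replies do not spell out: the loop counter i is assigned without var, so it is an implicit global shared by every recursive call. For example, when the child at index 1 has no children of its own, its call resets i to 0, the caller's i++ brings it back to 1, and the same child is processed again without end. In addition, nodeValue is null for element nodes, so the concatenation picks up the string "null". The sketch below is one possible correction under those assumptions, not the poster's final code; textNode and elementNode are hypothetical stand-ins for DOM nodes so that it runs outside a browser.

```javascript
// Hypothetical stand-ins for DOM nodes (not part of the thread).
function textNode(value) {
  return { nodeType: 3, nodeValue: value, childNodes: [] };
}
function elementNode(children) {
  return { nodeType: 1, nodeValue: null, childNodes: children };
}

function aggregateTextNode(node, depth) {
  // Only text nodes (nodeType 3) carry text; elements have a null nodeValue.
  var text = (node.nodeType === 3) ? node.nodeValue : '';
  if (depth === 0) return text; // recursion limit reached
  // "var" keeps i local to this call instead of sharing one global counter.
  for (var i = 0; i < node.childNodes.length; i++) {
    text += aggregateTextNode(node.childNodes[i], depth - 1);
  }
  return text;
}

// Mirrors <div>Hello, <p>this text is <span>italic</span></p></div>
var div = elementNode([
  textNode('Hello, '),
  elementNode([textNode('this text is '), elementNode([textNode('italic')])])
]);
var result = aggregateTextNode(div, 5); // 'Hello, this text is italic'
```

In a browser the same function can be called with a real element, for instance one obtained from document.getElementById.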

    --
    CB


    • p.lepin@ctncorp.com

      #3
      Re: Getting the complete text content of a node...

      On Feb 1, 3:01 pm, "Arancaytar" <arancaytar.ilya...@gmail.com> wrote:
      > For a wordcount feature, I need to collect the complete
      > text content of a 'div' element inside a variable.
      > Because of the issues with paragraphs and markup, the
      > content is split into different nodes in the DOM.
      >
      > Since I don't know the depth of the nodes, I am trying to
      > build this as a recursive function. "depth" is a
      > parameter that ensures I can limit the recursion to a
      > certain level.

      [code snipped]

      > Meanwhile, I've managed to do it with a very ugly nested
      > loop that can go three levels deep, but I'd really rather
      > use the recursive approach if at all possible.

      I'm not sure what the problem with your code is, but I was
      a bit surprised you don't check the nodeType. Anyway, the
      following works for me in Firefox 1.5.0.7 and Konqueror
      3.5.4. I cannot check it with other browsers at the moment.

      <!DOCTYPE HTML PUBLIC
      "-//W3C//DTD HTML 4.01//EN"
      "http://www.w3.org/TR/html4/strict.dtd">
      <html>
      <head>
      <title></title>
      <script type="text/javascript">
      function grabText ( node , maxDepth )
      {
        if ( 3 == node . nodeType )
        {
          return node . nodeValue ;
        }
        else if
        (
          ( 1 == node . nodeType ) && ( 0 < maxDepth )
        )
        {
          var result = '' ;
          for
          (
            var i = 0 ;
            i < node . childNodes . length ;
            i ++
          )
          {
            result +=
              grabText
              (
                node . childNodes [ i ] , maxDepth - 1
              ) ;
          }
          return result ;
        }
        return '' ;
      }
      </script>
      </head>
      <body>
      <div onclick=" alert ( grabText ( this , 3 ) ) ; ">
      Some <i>stupid</i> text with
      <b><span>fa<i>n </i>cy</span> formatting</b>!
      </div>
      </body>
      </html>

      --
      Pavel Lepin


      • RobG

        #4
        Re: Getting the complete text content of a node...

        On Feb 1, 11:01 pm, "Arancaytar" <arancaytar.ilya...@gmail.com> wrote:
        > (Note: I am a Javascript newbie. I can handle PHP and Java, but this
        > is unfamiliar territory.)
        >
        > For a wordcount feature, I need to collect the complete text content
        > of a 'div' element inside a variable. Because of the issues with
        > paragraphs and markup, the content is split into different nodes in
        > the DOM.

        Here's one I prepared earlier...

        It tries the W3C-compliant textContent first; if that isn't
        supported, it tries IE's innerText. Finally, it tries innerHTML with
        a regular expression to strip HTML tags. That last method should only
        be needed in a very small number of browsers, and may fail in those in
        a few cases.

        If you really want a recursive function, that is included further
        down.

        // Using textContent || innerText || innerHTML + regex
        function getText(el) {
          if (el.textContent) { return el.textContent; }
          if (el.innerText) { return el.innerText; }
          if (typeof el.innerHTML == 'string') {
            return el.innerHTML.replace(/<[^<>]+>/g, '');
          }
        }


        // Using textContent || innerText || recursion
        function getText(el)
        {
          if (el.textContent) return el.textContent;
          if (el.innerText) return el.innerText;
          return getText2(el);

          function getText2(el) {
            var x = el.childNodes;
            var txt = '';
            for (var i = 0, len = x.length; i < len; ++i) {
              if (3 == x[i].nodeType) {
                txt += x[i].data;
              } else if (1 == x[i].nodeType) {
                txt += getText2(x[i]);
              }
            }
            return txt.replace(/\s+/g, ' ');
          }
        }
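
A small variation on the fallback chain above, offered as a sketch rather than as RobG's code: testing with typeof instead of truthiness means an empty string from textContent is returned directly instead of falling through to the later branches, and the function always returns a string. The name getTextSafe and the plain objects used to exercise it are hypothetical; in a browser the argument would be a real element.

```javascript
function getTextSafe(el) {
  // An empty string is a valid answer, so test the type, not the truthiness.
  if (typeof el.textContent == 'string') return el.textContent;
  if (typeof el.innerText == 'string') return el.innerText;
  if (typeof el.innerHTML == 'string') {
    // Last resort: strip anything that looks like a tag.
    return el.innerHTML.replace(/<[^<>]+>/g, '');
  }
  return ''; // nothing usable found
}

// Plain objects standing in for elements in browsers with different support.
var viaTextContent = getTextSafe({ textContent: '' });
var viaInnerText = getTextSafe({ innerText: 'plain text' });
var viaInnerHTML = getTextSafe({ innerHTML: 'Hello, <p>this is <b>bold</b></p>' });
```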

        --
        Rob


        • Christoph Burschka

          #5
          Re: Getting the complete text content of a node...

          RobG wrote:
          > If you really want a recursive function, that is included further
          > down.

          I don't want recursion at any cost - I just assumed it's necessary
          because of the way the DOM tree stores text. If it's possible to get the
          "flat" text content of the node in another way, that would be just great.

          I haven't tried out your code yet, but from what I see, I guess
          "textContent" and "innerText" can do just that without a need for a
          messy recursion.

          So by trying "textContent", "innerText" and stripped "innerHTML" in that
          order, I can support almost all browsers that matter?

          --
          CB


          • Elegie

            #6
            Re: Getting the complete text content of a node...

            RobG wrote:

            Hi Rob,

            <snip>
            > Finally, it tries innerHTML with
            > a regular expression to strip HTML tags. The final method should only
            > be used in a very small number of browsers, and may fail in those in a
            > few cases.

            I think so, too: innerHTML returns some text in which HTML entities
            should logically not be expanded (as it normally represents a valid HTML
            fragment). Therefore, if the code were to include some, those entities
            would appear "as is" in the returned text.

            For that very reason, while I'd admit the innerHTML approach is
            definitely appealing, I think I'd prefer to stick to the recursion model
            as the third fall back technique.
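
The entity problem can be seen without a browser by applying the same tag-stripping expression to a string. This is only an illustration: stripTags and decodeBasicEntities are hypothetical helper names, and the decoder handles just four common named entities, so it is not a general solution.

```javascript
// Same regular expression as in the innerHTML fallback.
function stripTags(html) {
  return html.replace(/<[^<>]+>/g, '');
}

// Hypothetical, deliberately incomplete decoder: four named entities only.
// &amp; is handled last so that "&amp;lt;" decodes once, to "&lt;".
function decodeBasicEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');
}

var stripped = stripTags('<p>Fish &amp; chips</p>'); // 'Fish &amp; chips'
var decoded = decodeBasicEntities(stripped);         // 'Fish & chips'
```

The recursive text-node approach does not need this step, because text nodes already hold the decoded characters.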


            Kind regards,
            Elegie.


            • RobG

              #7
              Re: Getting the complete text content of a node...

              Christoph Burschka wrote:
              > RobG wrote:
              >> If you really want a recursive function, that is included further
              >> down.
              >
              > I don't want recursion at any cost - I just assumed it's necessary
              > because of the way the DOM tree stores text. If it's possible to get the
              > "flat" text content of the node in another way, that would be just great.
              >
              > I haven't tried out your code yet, but from what I see, I guess
              > "textContent" and "innerText" can do just that without a need for a
              > messy recursion.
              >
              > So by trying "textContent", "innerText" and stripped "innerHTML" in that
              > order, I can support almost all browsers that matter?

              Yes.

              I don't know of any recent browser that doesn't support either
              textContent or innerHTML; maybe there are some mobile browsers in that
              category. If you keep the tag content simple (no '<' or '>' characters
              in attribute values) then the fall-back to innerHTML should be pretty
              solid too.
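
To make the caveat about attribute values concrete, here is the tag-stripping expression applied to two strings, one simple and one with a '>' inside an attribute. The sample markup is made up for illustration.

```javascript
var tagPattern = /<[^<>]+>/g;

// Simple markup: every tag is removed cleanly.
var simple = '<a href="page.html">link</a>'.replace(tagPattern, '');

// A '>' inside the attribute value ends the match too early,
// so part of the attribute leaks into the result.
var broken = '<a title="a > b">link</a>'.replace(tagPattern, '');
```

Here simple is 'link', while broken is ' b">link', which would distort a word count slightly.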


              --
              Rob


              • Christoph Burschka

                #8
                Re: Getting the complete text content of a node...

                RobG wrote:
                > Christoph Burschka wrote:
                >> RobG wrote:
                >>> If you really want a recursive function, that is included further
                >>> down.
                >>
                >> I don't want recursion at any cost - I just assumed it's necessary
                >> because of the way the DOM tree stores text. If it's possible to get
                >> the "flat" text content of the node in another way, that would be just
                >> great.
                >>
                >> I haven't tried out your code yet, but from what I see, I guess
                >> "textContent" and "innerText" can do just that without a need for a
                >> messy recursion.
                >>
                >> So by trying "textContent", "innerText" and stripped "innerHTML" in
                >> that order, I can support almost all browsers that matter?
                >
                > Yes.
                >
                > I don't know of any recent browser that doesn't support either
                > textContent or innerHTML, maybe there are some mobile browsers in that
                > category. If you keep the tag content simple (no '<' or '>' characters
                > in attribute values) then the fall-back to innerHTML should be pretty
                > solid too.

                Well, since the wordcount is a cosmetic feature, it won't break the page
                if by some chance the browser doesn't support it.

                Anyway, I've replaced my current nested loop with this function, and it
                works perfectly. Thanks a lot!

                --
                CB
