Remove Empty Tags on page

  • David

    Remove Empty Tags on page

    Hi All,

    I am working on a script that is theoretically simple, but I can not get it to
    work completely. I am dealing with a page spit out by .NET that leaves empty
    tags in the markup. I need a JavaScript solution to go behind and do a clean
    up after the page loads.

    The .NET will leave behind any combination of nested tags. Here is an
    example below. Even though the <span> tags are not empty, as they contain
    <em> tags, they also need to be removed.

    <p>
    <span><em></em></span>
    <span><em></em></span>
    <span><em></em></span>
    </p>

    Here is a simple test page of what I have done so far. It does remove some
    of the tags but always leaves behind some empty tags...
    http://mysite.verizon.net/res8xvny/removeTags.html


    Any ideas of the best way to approach this?

    David



  • David

    #2
    Re: Remove Empty Tags on page

    "David" <none@none.com> wrote in message
    news:HeA_j.8105$9H6.7786@trnddc04...
    > Hi All,
    >
    > I am working on a script that is theoretically simple but I can not get it
    > to work completely. I am dealing with a page spit out by .NET that leaves
    > empty tags in the markup. I need a javascript solution to go behind and do
    > a clean up after the page loads.
    > David

    For anyone who looks at the page, you will see the script only loops
    through a certain set of tags...

    var tagArray = ["em", "span", "p", "a", "li", "ul"];

    Using all tags, el = document.getElementsByTagName("*"), would be the
    preferred method, but I found myself needing several loops and several
    node.parentNode.removeChild() calls, and it still didn't work correctly.

    David
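One likely reason several loops seemed necessary (an assumption, since the script on the test page is not reproduced in this thread): getElementsByTagName returns a live collection, so removing an entry while counting upward moves the next entry into the slot just visited, and that entry is skipped. The sketch below models this with a plain array standing in for the collection. The names `removeForward`, `removeBackward`, and `isEmpty` are hypothetical, and the model is simplified: in a real document, removing an element also drops its descendants from the collection.

```javascript
// Simplified model: a plain array stands in for a live element collection,
// and splice() stands in for parentNode.removeChild().

// Counting upward: after a removal the next item slides into index i,
// and the loop's i++ steps over it.
function removeForward(list, shouldRemove) {
  for (let i = 0; i < list.length; i++) {
    if (shouldRemove(list[i])) {
      list.splice(i, 1);
    }
  }
  return list;
}

// Counting downward: a removal only shifts items that were already visited.
function removeBackward(list, shouldRemove) {
  for (let i = list.length - 1; i >= 0; i--) {
    if (shouldRemove(list[i])) {
      list.splice(i, 1);
    }
  }
  return list;
}

// Hypothetical predicate: an "empty tag" is modeled as an empty string.
const isEmpty = (item) => item === "";

const forwardResult = removeForward(["", "", "x", "", ""], isEmpty);
const backwardResult = removeBackward(["", "", "x", "", ""], isEmpty);
```

Here the upward loop leaves two empty entries behind, while the downward loop leaves only "x". Backing the index up after each removal, or looping from the end, avoids the skip.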



    • Doug Miller

      #3
      Re: Remove Empty Tags on page

      In article <HeA_j.8105$9H6.7786@trnddc04>, "David" <none@none.com> wrote:
      >Hi All,
      >
      >I am working on a script that is theoretically simple but I can not get it to
      >work completely. I am dealing with a page spit out by .NET that leaves empty
      >tags in the markup. I need a javascript solution to go behind and do a clean
      >up after the page loads.
      >
      >The .NET will leave behind any combination of nested tags. Here is an
      >example below. Even though the <span> tags are not empty, as they contain
      ><em> tags, they also need to be removed.
      >
      ><p>
      ><span><em></em></span>
      ><span><em></em></span>
      ><span><em></em></span>
      ></p>
      >
      >Here is a simple test page of what I have done so far. It does remove some
      >of the tags but always leaves behind some empty tags...
      >http://mysite.verizon.net/res8xvny/removeTags.html
      >
      >Any ideas of the best way to approach this?

      Any reason you can't just use the search-and-replace function in your favorite
      text editor? If you have shell access to a Unix machine, this is pretty
      trivial.


      • David

        #4
        Re: Remove Empty Tags on page

        "Doug Miller" <spambait@milmac.com> wrote in message
        news:Z%A_j.2849$Q57.1079@nlpi065.nbdc.sbc.com...
        > Any reason you can't just use the search-and-replace function in your
        > favorite text editor? If you have shell access to a Unix machine, this is
        > pretty trivial.

        Yes, the reason is that .NET is rendering this HTML live. This has to
        be done to the actual rendered page on the fly, after it has been loaded.

        David





        • Doug Miller

          #5
          Re: Remove Empty Tags on page

          In article <75B_j.10498$3j.2456@trnddc05>, "David" <none@none.com> wrote:
          >
          >"Doug Miller" <spambait@milmac.com> wrote in message
          >news:Z%A_j.2849$Q57.1079@nlpi065.nbdc.sbc.com...
          >>Any reason you can't just use the search-and-replace function in your
          >>favorite text editor? If you have shell access to a Unix machine, this is
          >>pretty trivial.
          >
          >Yes, the reason is because the .NET is rendering this HTML live. This has to
          >be done to the actual rendered page on the fly, after it has been loaded.

          Well, you could still do it with a Unix shell script... might be easier.


          • RobG

            #6
            Re: Remove Empty Tags on page

            On May 27, 12:58 am, "David" <n...@none.com> wrote:
            > Hi All,
            >
            > I am working on a script that is theoretically simple but I can not get it to
            > work completely. I am dealing with a page spit out by .NET that leaves empty
            > tags in the markup. I need a javascript solution to go behind and do a clean
            > up after the page loads.
            >
            > The .NET will leave behind any combination of nested tags. Here is an
            > example below. Even though the <span> tags are not empty, as they contain
            > <em> tags, they also need to be removed.
            >
            > <p>
            > <span><em></em></span>
            > <span><em></em></span>
            > <span><em></em></span>
            > </p>
            >
            > Here is a simple test page of what I have done so far. It does remove some
            > of the tags but always leaves behind some empty tags...
            > http://mysite.verizon.net/res8xvny/removeTags.html

            Have you considered going down the DOM and removing any in-line element
            whose textContent or innerText is empty? That way you don't have to
            go down nested empty nodes; they will be removed as soon as you reach
            the highest ancestor.

            function getText(el)
            {
              if (typeof el == 'string') el = document.getElementById(el);

              // Try DOM 3 textContent property first
              if (typeof el.textContent == 'string') {return el.textContent;}

              // Try MS innerText property
              if (typeof el.innerText == 'string') {return el.innerText;}
              return rec(el);

              // Recurse over child nodes
              function rec(el) {
                var n, x = el.childNodes;
                var txt = [];
                for (var i=0, len=x.length; i<len; ++i){
                  n = x[i];

                  // Use TEXT_NODE and ELEMENT_NODE as apparently IE 8 will
                  // "not support enumeration of nodeType constant values"
                  // G. Talbert clj
                  if (n.TEXT_NODE == n.nodeType) {
                    txt.push(n.data);
                  } else if (n.ELEMENT_NODE == n.nodeType) {
                    txt.push(rec(n));
                  }
                }
                return txt.join('').replace(/\s+/g,' ');
              }
            }


            function removeEmptyNodes() {
              var node, nodes = document.getElementsByTagName('*');

              // These nodes are allowed to be empty
              var allowedEmpty = 'base basefont body br col hr html img '
                               + 'input isindex link meta param title';
              var re;

              // Collection is live, so as nodes are removed, length gets shorter
              for (var i=0; i<nodes.length; i++) {
                node = nodes[i];
                re = new RegExp('\\b'+node.tagName+'\\b','i');

                // Only removes nodes where textContent is '', but could be
                // extended to remove any node whose textContent matches ^\s*$
                if (!re.test(allowedEmpty) && getText(node) == '') {
                  node.parentNode.removeChild(node);

                  // Node i was removed, so back up one index
                  --i;
                }
              }
            }



            --
            Rob




            • Bjoern Hoehrmann

              #7
              Re: Remove Empty Tags on page

              * RobG wrote in comp.lang.javascript:
              >Have you considered going down the DOM and remove any in-line element
              >whose textContent or innerText is empty? That way you don't have to
              >go down nested empty nodes, they will be removed as soon as you reach
              >the highest ancestor.

              But then you'll remove e.g. <span><img/></span>.
              --
              Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
              Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
              68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


              • David

                #8
                Re: Remove Empty Tags on page

                "RobG" <rgqld@iinet.net.au> wrote in message
                news:2aa46234-e9c1-44db-ad98-dd458da7776b@y22g2000prd.googlegroups.com...
                > Have you considered going down the DOM and remove any in-line element
                > whose textContent or innerText is empty? That way you don't have to
                > go down nested empty nodes, they will be removed as soon as you reach
                > the highest ancestor.

                I tried it and it does work, but it leaves the <p></p> in the page in
                this scenario...

                <p>
                <span><em></em></span>
                <span><em></em></span>
                <span><em></em></span>
                </p>

                David
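A probable explanation for the leftover paragraph: the line breaks between the spans are text nodes, so once the spans are gone the paragraph's text is whitespace rather than an empty string, and a comparison against '' fails. The sketch below shows the difference on a plain string; `isBlank` is a hypothetical helper, and `leftover` is an assumed stand-in for the paragraph's text in the sample above.

```javascript
// Hypothetical helper (not from the thread): true when a string contains
// nothing but whitespace, which includes the empty string.
function isBlank(text) {
  return /^\s*$/.test(text);
}

// Assumed text content of the sample <p> after its spans are removed:
// only the line breaks that separated the spans remain.
const leftover = "\n\n\n\n";

// Comparing against '' treats the paragraph as non-empty.
const strictlyEmpty = leftover === "";

// A whitespace-tolerant test treats it as empty.
const blank = isBlank(leftover);
```

With this input, `strictlyEmpty` is false and `blank` is true, so switching the emptiness test to a whitespace-tolerant one should let the paragraph be removed as well.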



                • RobG

                  #9
                  Re: Remove Empty Tags on page

                  On May 27, 10:00 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
                  > * RobG wrote in comp.lang.javascript:
                  >> Have you considered going down the DOM and remove any in-line element
                  >> whose textContent or innerText is empty? That way you don't have to
                  >> go down nested empty nodes, they will be removed as soon as you reach
                  >> the highest ancestor.
                  >
                  > But then you'll remove e.g. <span><img/></span>.

                  Oops. That can be fixed by going over the child nodes to see if any
                  contain "allowed to be empty" nodes, though the approach then loses its
                  appeal. I think Lasse's recursive DOM walk is best, as it also allows
                  empty #text nodes to be removed along the way.

                  FWIW, here's the fixed function (also removes nodes where the content
                  is only whitespace):

                  function removeEmptyNodes() {
                    var node, nodes = document.getElementsByTagName('*');
                    var kids, skip = false;

                    // These nodes are allowed to be empty
                    var allowedEmpty = 'base basefont body br col hr html img '
                                     + 'input isindex link meta param title';
                    var re0 = /^\s*$/;
                    var re1, re2;

                    // Collection is live, so length shrinks as nodes are removed
                    for (var i=0; i<nodes.length; i++) {
                      node = nodes[i];
                      re1 = new RegExp('\\b'+node.tagName+'\\b','i');

                      if (!re1.test(allowedEmpty) && re0.test(getText(node))) {
                        kids = node.getElementsByTagName('*');

                        // Keep the node if any descendant is allowed to be empty
                        for (var j=0, jlen=kids.length; j<jlen; j++) {
                          re2 = new RegExp('\\b'+kids[j].tagName+'\\b','i');

                          if (re2.test(allowedEmpty)) {
                            skip = true;
                            break;
                          }
                        }

                        if (!skip) {
                          node.parentNode.removeChild(node);
                          --i;
                        }
                        skip = false;
                      }
                    }
                  }
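
The recursive DOM walk mentioned above is not reproduced in this thread. As a rough sketch of that idea (not Lasse's code), the function below visits children before their parent, so nested empty elements collapse in a single pass and an element holding an `img` is kept. The names `pruneEmpty` and `KEEP_EMPTY` are hypothetical. The first line of the keep list comes from the thread; the second line (script, style, iframe, textarea, object, embed, td, th) is an added assumption and should be adjusted for the page in question. Unlike the description above, this sketch leaves whitespace text nodes in place, since removing them between in-line elements could change the rendered spacing.

```javascript
// Elements that may legitimately be empty (lower-case tag names).
const KEEP_EMPTY = new Set((
  "base basefont body br col hr html img input isindex link meta param title " +
  "script style iframe textarea object embed td th"
).split(" "));

// Post-order walk: prune the children first, then report whether `element`
// itself has no meaningful content and may be removed by its parent.
function pruneEmpty(element) {
  let hasContent = false;

  // Snapshot the child list so removals do not disturb the iteration.
  const children = Array.from(element.childNodes);

  for (const child of children) {
    if (child.nodeType === 1) {            // element node
      if (pruneEmpty(child)) {
        element.removeChild(child);
      } else {
        hasContent = true;
      }
    } else if (child.nodeType === 3) {     // text node
      if (!/^\s*$/.test(child.data)) {
        hasContent = true;
      }
    } else {
      // Comments and other node types are conservatively treated as content.
      hasContent = true;
    }
  }

  return !hasContent && !KEEP_EMPTY.has(element.tagName.toLowerCase());
}

// In a browser this might be run once the page has loaded, for example:
//   window.onload = function () { pruneEmpty(document.body); };
```

Because `body` is in the keep list, the call on `document.body` never reports the root itself as removable; only empty descendants are detached.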


                  --
                  Rob


                  • David

                    #10
                    Re: Remove Empty Tags on page

                    "RobG" <rgqld@iinet.net.au> wrote in message
                    news:07a93d93-7c01-4eea-b230-5e89f302f5b2@d19g2000prm.googlegroups.com...
                    > On May 27, 10:00 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
                    >> * RobG wrote in comp.lang.javascript:
                    >>> Have you considered going down the DOM and remove any in-line element
                    >>> whose textContent or innerText is empty? That way you don't have to
                    >>> go down nested empty nodes, they will be removed as soon as you reach
                    >>> the highest ancestor.
                    >>
                    >> But then you'll remove e.g. <span><img/></span>.
                    >
                    > Ooops. That can be fixed by going over the child nodes to see if any
                    > contain "allowed to be empty" nodes - and hence losing its appeal. I
                    > think Lasse's recursive DOM walk is best, as it also allows empty
                    > #text nodes to be removed along the way.
                    >
                    > FWIW, here's the fixed function (also removes nodes where the content
                    > is only whitespace):

                    Yep, that works as well. I really appreciate your help on this.

                    David



                    • Henry

                      #11
                      Re: Remove Empty Tags on page

                      On May 26, 3:58 pm, David wrote:
                      > I am working on a script that is theoretically simple but I
                      > can not get it to work completely. I am dealing with a page
                      > spit out by .NET that leaves empty tags in the markup.

                      No matter how bad .NET may be, it is not so bad that it would be
                      randomly inserting mark-up into its output. If there are empty
                      elements in the mark-up then it is almost certain that they are there
                      because .NET has been instructed to put them there. So the obvious
                      solution is to fix the server-side code so that it does not output
                      anything but what you want it to output (i.e. take control of what you
                      are doing).

                      > I need a javascript solution to go behind and do a clean
                      > up after the page loads.

                      That would be the worst possible approach to the problem.


                      • David

                        #12
                        Re: Remove Empty Tags on page

                        "Henry" <rcornford@raindrop.co.uk> wrote in message
                        news:3c52a157-b5f0-4af3-927c-6e026cd055b0@56g2000hsm.googlegroups.com...
                        > On May 26, 3:58 pm, David wrote:
                        >> I am working on a script that is theoretically simple but I
                        >> can not get it to work completely. I am dealing with a page
                        >> spit out by .NET that leaves empty tags in the markup.
                        >
                        > No matter how bad .NET may be, it is not so bad that it would be
                        > randomly inserting mark-up into its output. If there are empty
                        > elements in the mark-up then it is almost certain that they are there
                        > because .NET has been instructed to put them there. So the obvious
                        > solution is to fix the server-side code so that it does not output
                        > anything but what you want it to output (i.e. take control of what you
                        > are doing).
                        >
                        >> I need a javascript solution to go behind and do a clean
                        >> up after the page loads.
                        >
                        > That would be the worst possible approach to the problem.

                        Henry,

                        I completely agree with you, and I told our developers and the powers
                        that be just this, but I do not make the decisions and have to deal with
                        them.

                        David

