removing content between specified tokens using java script

  • rajarao

    removing content between specified tokens using java script

    hi
    I want to remove the content embedded in <script> and </script> tags
    submitted via text box.
    My java script should remove the content embedded between <script> and
    </script> tag.
    my current code is

    function RemoveHTMLScript(strText)
    {
      var regEx = /<script\w*<\/script>/g;
      return strText.replace(regEx, "");
    }
    let us say,
    strText = "Hi <script> .... .... ..... </script> How are u";
    the expected output is "Hi How are u"

    Regular expression solution is preferred
    thanks and regards
    Raja rao

  • Lasse Reichstein Nielsen

    #2
    Re: removing content between specified tokens using java script

    "rajarao" <rajaraob@yahoo .com> writes:
    > I want to remove the content embedded in <script> and </script> tags
    > submitted via text box.
    > My java script should remove the content embedded between <script> and
    > </script> tag.
    > my current code is
    >
    > function RemoveHTMLScript(strText)
    > {
    > var regEx = /<script\w*<\/script>/g

    This matches "<script" followed by zero or more "word
    characters", followed directly by "</script>". Word characters do not
    include ">", so this is unlikely to work.
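The failure is easy to reproduce. A minimal check, assuming any reasonably modern JavaScript engine; the variable names are only illustrative:

```javascript
// The pattern from the question: "<script", then word characters only,
// then "</script>". The ">" that closes the opening tag is not a word
// character, so the match can never get past "<script".
const original = /<script\w*<\/script>/g;
const sample = "Hi <script> .... .... ..... </script> How are u";

const result = sample.replace(original, "");
console.log(result === sample); // true: nothing was removed
```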
    > return strText.replace(regEx, "");
    > }
    > let us say,
    > strText = "Hi <script> .... .... ..... </script> How are u";
    > the expected output is "Hi How are u"

    More likely "Hi  How are u" (with two spaces), if one needs to be pedantic,
    as evidently I do :)
    > Regular expression solution is preferred

    First thing to consider is what to do if the text is:

    "abc<script>...</script>def<script>...</script>ghi"

    You would probably want this to be simplified to "abcdefghi". However,
    if you use a simple regular expression matching from <script> to
    </script>, it will match from the first <script> to the last </script>,
    returning only "abcghi".

    To avoid this, you need non-greedy matching in the regular
    expression, something only available in recent browsers. You don't say
    whether this code should be executed on a web page or on a server,
    but if it is on a server, you control the version of JavaScript, and
    can rely on non-greedy matching if available.

    Try this RegExp then:
    /<\s*script.+?<\/\s*script\s*>/ig
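A short comparison of greedy and non-greedy matching on the two-script example, as a sketch rather than a finished sanitizer. The greedy variant and the `[\s\S]` variant are additions for illustration; `.` does not match line terminators, so a script spread over several lines needs something like `[\s\S]`:

```javascript
const text = "abc<script>...</script>def<script>...</script>ghi";

// Greedy: ".+" runs to the last "</script>", swallowing "def".
const greedy = /<\s*script.+<\/\s*script\s*>/gi;
// Non-greedy: ".+?" stops at the first "</script>" after each opening tag.
const lazy = /<\s*script.+?<\/\s*script\s*>/gi;
// Same idea, but [\s\S] also matches line breaks inside the script.
const lazyMultiline = /<\s*script[\s\S]+?<\/\s*script\s*>/gi;

const greedyResult = text.replace(greedy, "");
const lazyResult = text.replace(lazy, "");

const multiline = "a<script>\nx\n</script>b";
const dotResult = multiline.replace(lazy, "");
const anyCharResult = multiline.replace(lazyMultiline, "");

console.log(greedyResult);            // abcghi
console.log(lazyResult);              // abcdefghi
console.log(dotResult === multiline); // true: "." stopped at the line break
console.log(anyCharResult);           // ab
```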

    If non-greedy regular expressions are not available, you can find the
    instances manually using indexOf. It's not very effective, though,
    since it doesn't ignore case and whitespace. It can be made to work,
    but it's not nearly as much fun :)
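For engines without non-greedy matching, the manual search could look roughly like this. It is only a sketch: `indexOfIgnoreCase` and `stripScriptsByIndexOf` are hypothetical helper names, tag names are assumed to be ASCII, and any tag whose name merely starts with "script" is treated as a script tag:

```javascript
// Case-insensitive indexOf for an ASCII, lower-case needle.
function indexOfIgnoreCase(text, needle, from) {
  for (let i = from; i + needle.length <= text.length; i++) {
    if (text.slice(i, i + needle.length).toLowerCase() === needle) return i;
  }
  return -1;
}

// Copies everything outside "<script ... </script ... >" ranges.
function stripScriptsByIndexOf(text) {
  let out = "";
  let pos = 0;
  for (;;) {
    const open = indexOfIgnoreCase(text, "<script", pos);
    if (open === -1) break;
    const close = indexOfIgnoreCase(text, "</script", open);
    if (close === -1) break; // unterminated script: leave the rest alone
    const end = text.indexOf(">", close);
    if (end === -1) break;
    out += text.slice(pos, open);
    pos = end + 1;
  }
  return out + text.slice(pos);
}

const stripped = stripScriptsByIndexOf(
  "abc<SCRIPT>x</Script >def<script>y</script>ghi"
);
console.log(stripped); // abcdefghi
```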


    /L
    --
    Lasse Reichstein Nielsen - lrn@hotpop.com
    DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
    'Faith without judgement merely degrades the spirit divine.'


    • Thomas 'PointedEars' Lahn

      #3
      Re: removing content between specified tokens using java script

      Lasse Reichstein Nielsen wrote:
      > "rajarao" <rajaraob@yahoo.com> writes:
      >> Regular expression solution is preferred
      >
      > First thing to consider is what to do if the text is:
      >
      > "abc<script>...</script>def<script>...</script>ghi"
      >
      > You would probably want this to be simplified to "abcdefghi". However,
      > if you use a simple regular expression matching from <script> to
      > </script>, it will match from the first <script> to the last </script>,
      > returning only "abcghi".
      >
      > To avoid this, you need a non-greedy matching by the regular
      > expression, something only available in recent browsers. You don't say
      > whether this code should be executed on a web page or on a server,
      > but if it is on a server, you control the version of Javascript, and
      > can rely on non-greedy matching if available.
      >
      > Try this RegExp then:
      > /<\s*script.+?<\/\s*script\s*>/ig

      Is there really a UA out there that is so b0rken to parse "< script>" as
      "<script>" and "</ script>" as "</script>"? The SGML declaration of HTML
      clearly forbids that for all elements. "<" is STAGO (Start Tag Open) and
      "</" is ETAGO (End Tag Open) where both must not be followed by white
      space.
      > If non-greedy regular expressions are not available, you can find the
      > instances manually using indexOf. It's not very effective, though,
      > since it doesn't ignore case and whitespace. It can be made to work,
      > but it's not nearly as much fun :)

      That is why one wants to use

      /<script[^>]*>[^<>]*<\/script>/ig

      then. Since this is not the first time I encountered the problem,
      I am going to extend my stripTags() method[1] so that you can strip
      only specific tags and also their content if you want.


      PointedEars
      ___________
      [1] <http://pointedears.de.vu/scripts/string.js>


      • Lasse Reichstein Nielsen

        #4
        Re: removing content between specified tokens using java script

        Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:

        > Is there really a UA out there that is so b0rken to parse "< script>" as
        > "<script>" and "</ script>" as "</script>"?

        Probably :) But I don't know of any.

        > That is why one wants to use
        >
        > /<script[^>]*>[^<>]*<\/script>/ig

        That rules out:
        ---
        <script type="text/javascript">
        if (screen.innerWidth < 1000) { alert("your resolution sucks");}
        </script>
        ---
        since it contains a "<" inside the script.
        You should match up to "</" for correctness, or up to "</script"
        for compliance with browsers.
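The objection can be checked directly. In this sketch the second pattern is one possible reading of "match up to </script", not a pattern taken from the thread, and the sample page is made up:

```javascript
const page =
  'a<script type="text/javascript">\nif (w < 1000) { alert("small"); }\n</script>b';

// Content may not contain "<" or ">", so the comparison operator breaks it.
const noAngleBrackets = /<script[^>]*>[^<>]*<\/script>/gi;
// Content is anything up to the first "</script", then an optional space and ">".
const upToEndTag = /<script[^>]*>[\s\S]*?<\/script\s*>/gi;

const strictResult = page.replace(noAngleBrackets, "");
const endTagResult = page.replace(upToEndTag, "");

console.log(strictResult === page); // true: the script was not removed
console.log(endTagResult);          // ab
```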

        /L

        • Thomas 'PointedEars' Lahn

          #5
          Re: removing content between specified tokens using java script

          Lasse Reichstein Nielsen wrote:
          > Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
          >> That is why one wants to use
          >>
          >> /<script[^>]*>[^<>]*<\/script>/ig
          >
          > That rules out:
          > ---
          > <script type="text/javascript">
          > if (screen.innerWidth < 1000) { alert("your resolution sucks");}
          > </script>
          > ---
          > since it contains a "<" inside the script.

          True.
          > You should match up to "</" for correctness, or up to "</script"
          > for compliance with browsers.

          You mean

          /<script[^>]*>.*(?!<\/script>).*<\/script>/ig

          and the like?

          The problem is that such matches would require negative lookahead
          (/(?!...)/), which would require ECMAScript 3 support, and I wanted to avoid
          this since my solution was meant as a backwards-compatible alternative to
          yours. But even if I used that and thus lost backwards compatibility,
          I think it could still fail if someone uses "</" or "</script" or
          "<\/script" within script code for some reason.

          Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
          if someone uses "</script>" or even "<\/script>" within the script code. So
          neither the OP nor anyone "can rely on non-greedy matching if available".

          Alas, until someone proves the opposite, it remains an intrinsic property of
          nested expressions and languages created by such expressions like markup
          languages that successful parsing of them using Regular Expressions is just
          impossible in general. There are cases where RegExp parsing of such context
          can be successful, though; the more detailed/strict its structure/syntax is
          defined and the less nested its subexpressions are, the higher is the
          statistical probability of successful RegExp parsing of it. Remember we
          already had this discussion here a few months before.


          PointedEars


          • Lasse Reichstein Nielsen

            #6
            Re: removing content between specified tokens using java script

            Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:

            > Lasse Reichstein Nielsen wrote:
            >> You should match up to "</" for correctness, or up to "</script"
            >> for compliance with browsers.
            >
            > You mean
            >
            > /<script[^>]*>.*(?!<\/script>).*<\/script>/ig
            >
            > and the like?

            > The problem is that such matches would require negative lookahead
            > (/(?!...)/)

            If it is to be easy, it requires either negative lookahead or
            non-greedy matching:
            /<script.*?>.*?<\/script\s*>/ig

            However, neither gives any power to regular expressions that they
            didn't have already, so you can make a regular expression without either
            that matches the same expression. It's just likely to be huge.

            A non-greedy match until the string abcd (/.*?abcd/) can be written as

            [^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

            [^a]*a                   until the first a
            (a|ba|bca)*              the next a comes before bcd: restart
            ([^ba]|b[^ca]|bc[^da])   neither bcd nor a: either [^ba], or b[^ca], or bc[^da]
            [^a]*a                   then find the next a and restart
            bcd                      or bcd => finished

            A similar non-greedy match for ".*?</script" would be:

            [^<]*<((<|\/<|\/s<|\/sc<|\/scr<|\/scri<|\/scrip<)*
            ([^\/<]|\/[^s<]|\/s[^c<]|\/sc[^r<]|\/scr[^i<]|\/scri[^p<]|\/scrip[^t<])
            [^<]*<)*\/script

            The structure is simple, so you can generate it automatically (provided
            the string doesn't contain repeats of the first character!):

            function reEscape(string) {
              // Backslash-escape characters that are special in a RegExp.
              return string.replace(/([[+*?.(){\\\/])/g, "\\$1"); // did I miss any?
            }

            function matchUntilRE(string) {
              if (string.length == 0) { return; }
              if (string.length == 1) {
                return "[^" + reEscape(string) + "]*" + reEscape(string);
              }
              var buf = []; // StringBuffer
              var firstChar = reEscape(string.charAt(0));
              // Skip to the first occurrence of the first character.
              buf.push("[^", firstChar, "]*", firstChar);
              buf.push("((");
              // Prefixes of the target followed by the first character: restart.
              for (var i = 0; i < string.length - 1; i++) {
                if (i > 0) { buf.push("|"); }
                buf.push(reEscape(string.substring(1, i + 1)), firstChar);
              }
              buf.push(")*(");
              // Prefixes of the target followed by a character that breaks the match.
              for (var i = 0; i < string.length - 1; i++) {
                if (i > 0) { buf.push("|"); }
                buf.push(reEscape(string.substring(1, i + 1)),
                         "[^", reEscape(string.charAt(i + 1)), firstChar, "]");
              }
              buf.push(")");
              // Then skip to the next occurrence of the first character.
              buf.push("[^", firstChar, "]*", firstChar);
              buf.push(")*");
              buf.push(reEscape(string.substring(1)));
              return buf.join("");
            }

            (Yey, it gives me exactly the same as the one I created manually :)

            I don't see how a non-greedy match until </script can fail.
            > Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
            > if someone uses "</script>" or even "<\/script>" within the script code.

            Fails how? The first is not permitted inside script code (it should
            end the script right there), the latter is, and should not be matched
            by a search for "</script".
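Both cases can be tried on page source held in a string. This is a sketch with a made-up pattern and made-up inputs; `String.raw` is used so the backslash stays in the text exactly as it would appear in the page:

```javascript
// Illustrative pattern: from "<script" to the first literal "</script...>".
const scriptElement = /<script[\s\S]*?<\/script\s*>/gi;

// Escaped form inside the script: "<\/script" is not the sequence "</script".
const escapedInside = String.raw`x<script>document.write("<\/script>");</script>y`;
// Literal form inside the script: the search stops here, inside the string.
const literalInside = 'x<script>var s = "</script>";</script>y';

const escapedResult = escapedInside.replace(scriptElement, "");
const literalResult = literalInside.replace(scriptElement, "");

console.log(escapedResult); // xy
console.log(literalResult); // x";</script>y
```

The second result shows the leftover text that appears when a literal "</script>" sits inside the script code, which is the case the two posters disagree about.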

            The only problem I see here is the decision whether to search for
            </ or </script. I'd go for the latter, for the same reason browsers
            do it: it is sufficient, and allows erroneous scripts without breaking.
            > Alas, until someone proves the opposite, it remains an intrinsic property of
            > nested expressions and languages created by such expressions like markup
            > languages that successful parsing of them using Regular Expressions is just
            > impossible in general.

            Yes, but we are not parsing the HTML here.
            > There are cases where RegExp parsing of such context
            > can be successful, though; the more detailed/strict its structure/syntax is
            > defined and the less nested its subexpressions are, the higher is the
            > statistical probability of successful RegExp parsing of it.

            Exactly. And the script element does not contain markup so it cannot
            be nested. It stops at the *first* following occurrence of "</script",
            which is something RE's can test for successfully.

            Likewise, you can use regexps to find all tags in a document, because
            tags are not nested (elements are).
            /L

            • Lasse Reichstein Nielsen

              #7
              Re: removing content between specified tokens using java script

              Lasse Reichstein Nielsen <lrn@hotpop.com> writes:

              [lookahead and non-greedy matching]
              > However, neither gives any power to regular expressions that they
              > didn't have already, so you can make a regular expression without either
              > that matches the same expression. It's just likely to be huge.

              I'm confusing two things here.

              It is correct that non-greedy matching doesn't allow regular
              expressions to match anything they couldn't without. They don't even
              need to be rewritten to match the same strings, just use the greedy
              operators instead. What non-greedy matching does is, when there is
              *more* than one way to match a string, the returned match will be the
              shortest possible.
              > A non-greedy match until the string abcd (/.*?abcd/) can be written as

              > [^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

              That is incorrect. This expression matches the string up to and including
              the first occurrence of abcd. That is not the same as a non-greedy .*?,
              which can match past the first occurrence if needed.

              Matching up to the first occurrence is what we need in this case, but
              it is not the same as non-greedy matching.
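The difference shows up once the match is forced to reach the end of the string. A small sketch using the pattern quoted above, written without whitespace; the input string is chosen for illustration:

```javascript
const firstOnly = "[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)";
const input = "abcdabcd";

// Unanchored, both stop at the first "abcd".
const lazyMatch = input.match(/.*?abcd/)[0];
const firstOnlyMatch = input.match(new RegExp(firstOnly))[0];

// Anchored at both ends, only the non-greedy form can extend past
// the first "abcd" to make the overall match succeed.
const lazyAnchored = /^.*?abcd$/.test(input);
const firstOnlyAnchored = new RegExp("^" + firstOnly + "$").test(input);

console.log(lazyMatch, firstOnlyMatch);       // abcd abcd
console.log(lazyAnchored, firstOnlyAnchored); // true false
```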

              /L

              • Thomas 'PointedEars' Lahn

                #8
                Re: removing content between specified tokens using java script

                Lasse Reichstein Nielsen wrote:
                > Matching up to the first occurrence is what we need in this case,

                No, it is not, as we are trying to parse a markup language, consisting of
                nested subexpressions. The first occurrence of the close tag after the open
                tag is not necessarily the correct one as I already pointed out.


                PointedEars


                • Thomas 'PointedEars' Lahn

                  #9
                  Re: removing content between specified tokens using java script

                  Lasse Reichstein Nielsen wrote:
                  > Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
                  >> Lasse Reichstein Nielsen wrote:
                  >>> You should match up to "</" for correctness, or up to "</script"
                  >
                  > [...]
                  > I don't see how a non-greedy match until </script can fail.
                  >
                  >> Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
                  >> if someone uses "</script>" or even "<\/script>" within the script code.
                  >
                  > Fails how? The first is not permitted inside script code (it should
                  > end the script right there), the latter is, and should not be matched
                  > by a search for "</script".

                  Note that although specified in SGML that ETAGO ends an element rather than
                  its entire end tag, not all UAs follow the spec in this regard so one could
                  use the non-conforming syntax and get away with it, e.g. placing malicious
                  code within a bulletin board posting viewed with IE. Such needs to be covered.
                  > [...]
                  >> Alas, until someone proves the opposite, it remains an intrinsic property of
                  >> nested expressions and languages created by such expressions like markup
                  >> languages that successful parsing of them using Regular Expressions is just
                  >> impossible in general.
                  >
                  > Yes, but we are not parsing the HTML here.

                  IBTD (I beg to differ).


                  PointedEars


                  • Lasse Reichstein Nielsen

                    #10
                    Re: removing content between specified tokens using java script

                    Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:

                    > Lasse Reichstein Nielsen wrote:
                    >
                    >> Matching up to the first occurrence is what we need in this case,
                    >
                    > No, it is not, as we are trying to parse a markup language, consisting of
                    > nested subexpressions.

                    But we are not. We are trying "to remove the content embedded in
                    <script> and </script> tags". Script tags have CDATA as content type,
                    so they do not contain nested HTML tags.

                    It is true that regular expressions cannot match recursive tree structures
                    (HTML is really a special case of the "matched parenthesis" problem, the
                    traditional non-recursive language).
                    > The first occurrence of the close tag after the open
                    > tag is not necessarily the correct one as I already pointed out.

                    Yes it is. In HTML, the script tag ends at the first occurrence of
                    "</". Browsers don't follow the HTML specification and end script tags
                    at the first occurrence of the literal character sequence "</script".
                    There is no way to include that literal sequence inside a script tag.

                    /L

                    • Lasse Reichstein Nielsen

                      #11
                      Re: removing content between specified tokens using java script

                      Lasse Reichstein Nielsen <lrn@hotpop.com> writes:

                      > (HTML is really a special case of the "matched parenthesis" problem, the
                      > traditional non-recursive language).

                      non-REGULAR, of course. It's definitely recursive :)

                      /L

                      • Thomas 'PointedEars' Lahn

                        #12
                        Re: removing content between specified tokens using java script

                        Lasse Reichstein Nielsen wrote:
                        > Thomas 'PointedEars' Lahn <PointedEars@web.de> writes:
                        >> Lasse Reichstein Nielsen wrote:
                        >>> Matching up to the first occurrence is what we need in this case,
                        >>
                        >> No, it is not, as we are trying to parse a markup language, consisting of
                        >> nested subexpressions.
                        >
                        > But we are not. We are trying "to remove the content embedded in
                        > <script> and </script> tags". Script tags have CDATA as content type,

                        True if you mean the content model of the HTML "script" element.
                        > so they are not containing nested HTML tags.

                        False. CDATA is content that is not parsed by an HTML UA and
                        thus it does not contribute to the parse tree. It can contain
                        (nested) <script type="text/javascript">
                        document.write('<strong><em>tags</em></strong>'); // [1]
                        </script> anyway.

                        [1] Yes, I know that this is invalid HTML but it works in
                        non-conforming UAs and this is for demo only, anyway.
                        >> The first occurrence of the close tag after the open
                        >> tag is not necessarily the correct one as I already pointed out.
                        >
                        > Yes it is. In HTML, the script tag ends at the first occurrence of
                        > "</".

                        True.
                        > Browsers don't follow the HTML specification and end script tags
                        > at the first occurrence of the literal character sequences "</script".

                        s/tags/elements/

                        ACK, my bad.
                        > There is no way to include that literal sequence inside a script tag.

                        Well, you *can* include it in a "script" element's content but it does
                        not *work* as intended (a script error due to incomplete code is highly
                        likely). Yet garbage content remains if scriptwise parsing/replacement
                        follows that misguided paradigm. That is clearly a Bad Thing.

                        So (again) no RegExp presented in this thread (incl. mine) is suitable to
                        solve the problem (which this discussion is about after all). Instead one
                        should write a markup parser prototype or use a (DOM) object that provides
                        such a functionality.


                        PointedEars
