removing content between specified tokens using java script

**Lasse Reichstein Nielsen** · Jul 23 '05, 10:25 AM

Re: removing content between specified tokens using java script

"rajarao" <rajaraob@yahoo .com> writes:

> I want to remove the content embedded in <script> and </script> tags
> submitted via text box.
> My java script should remove the content embedded between <script> and
> </script> tag.
> my current code is
>
> function RemoveHTMLScript(strText)
> {
> var regEx = /<script\w*<\/script>/g

This matches "<script" followed by zero or more "word characters" and
then, immediately, "</script>". Word characters don't include ">" (or
spaces), so this is unlikely to work.
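
A quick way to check this, as a minimal sketch (the variable names and the second sample string are made up for illustration): the pattern only matches when nothing but word characters sits between "<script" and "</script>".

```javascript
// The pattern from the original post: "<script", then only word
// characters, then "</script>".
var regEx = /<script\w*<\/script>/g;

// Sample input taken from the example in the question.
var sample = "Hi <script> .... .... ..... </script> How are u";

// ">" is not a word character, so nothing matches and the text is unchanged.
var unchanged = sample.replace(regEx, "");

// Hypothetical input that the pattern does match: only word characters
// ("foo") stand between "<script" and "</script>".
var odd = "a<scriptfoo</script>b".replace(regEx, "");
```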

> return strText.replace(regEx, "");
> }
> let us say,
> strText = "Hi <script> .... .... ..... </script> How are u";
> the expected out put is "Hi How are u"

More likely "Hi  How are u" (with two spaces), if one needs to be
pedantic, as evidently I do :)

> Regular expression solution is preferred

First thing to consider is what to do if the text is:

"abc<script>... </script>def<script>...</script>ghi"

You would probably want this to be simplified to "abcdefghi". However,
if you use a simple regular expression matching from <script> to
</script>, it will match from the first <script> to the last </script>,
returning only "abcghi".

To avoid this, you need non-greedy matching in the regular
expression, something only available in recent browsers. You don't say
whether this code should be executed on a web page or on a server,
but if it is on a server, you control the version of JavaScript, and
can rely on non-greedy matching if available.

Try this RegExp then:
/<\s*script.+?<\/\s*script\s*>/ig
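
The difference between greedy and non-greedy matching can be sketched as follows. The variable names are illustrative, and the last pattern is an assumption that goes beyond the post: "." does not match line breaks in JavaScript, so the sketch swaps in [\s\S] for scripts that span several lines.

```javascript
var text = "abc<script>... </script>def<script>...</script>ghi";

// Greedy: ".+" runs to the LAST "</script>", so "def" is lost as well.
var greedy = text.replace(/<\s*script.+<\/\s*script\s*>/ig, "");

// Non-greedy: ".+?" stops at the FIRST "</script>" after each start tag.
var nonGreedy = text.replace(/<\s*script.+?<\/\s*script\s*>/ig, "");

// Assumption beyond the post: use [\s\S] instead of "." so that the
// match can cross line breaks inside the script.
var multiLine = "abc<script>\nalert(1);\n</script>def";
var stripped = multiLine.replace(/<\s*script[\s\S]+?<\/\s*script\s*>/ig, "");
```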

If non-greedy regular expressions are not available, you can find the
instances manually using indexOf. It's not very effective, though,
since it doesn't ignore case and whitespace. It can be made to work,
but it's not nearly as much fun :)
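
For what it is worth, the indexOf route might look roughly like this. It is only a sketch: removeScriptsByIndexOf is a made-up name, case is ignored by searching a lowercased copy (which assumes lowercasing does not change the string length, true for ASCII markup), and any tag that merely starts with "<script" is treated as a script tag.

```javascript
// Hypothetical helper: removes each "<script ...>" ... "</script ...>"
// span without using non-greedy regular expressions.
function removeScriptsByIndexOf(text) {
  // Lowercased copy used only for searching; assumes same length as text.
  var lower = text.toLowerCase();
  var result = "";
  var pos = 0;
  while (true) {
    var start = lower.indexOf("<script", pos);
    if (start === -1) { break; }
    var end = lower.indexOf("</script", start);
    if (end === -1) { break; }            // unterminated: leave the rest untouched
    var close = lower.indexOf(">", end);
    if (close === -1) { break; }
    result += text.substring(pos, start); // keep the text before the element
    pos = close + 1;                      // continue after the end tag
  }
  return result + text.substring(pos);
}

var viaIndexOf = removeScriptsByIndexOf(
  "abc<script>... </script>def<SCRIPT>...</SCRIPT >ghi");
```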

/L
--
Lasse Reichstein Nielsen - lrn@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

**Thomas 'PointedEars' Lahn** · Jul 23 '05, 10:30 AM

Re: removing content between specified tokens using java script

Lasse Reichstein Nielsen wrote:

> "rajarao" <rajaraob@yahoo .com> writes:
>> Regular expression solution is preferred
>
> First thing to consider is what to do if the text is:
>
> "abc<script>... </script>def<script>...</script>ghi"
>
> You would probably want this to be simplified to "abcdefghi". However,
> if you use a simple regular expression matching from <script> to
> </script>, it will match from the first <script> to the last </script>,
> returning only "abcghi".
>
> To avoid this, you need a non-greedy matching by the regular
> expression, something only available in recent browsers. You don't say
> whether this code should be executed on a web page or on a server,
> but if it is on a server, you control the version of Javascript, and
> can rely on non-greedy matching if available.
>
> Try this RegExp then:
> /<\s*script.+?<\/\s*script\s*>/ig

Is there really a UA out there that is so b0rken as to parse "< script>" as
"<script>" and "</ script>" as "</script>"? The SGML declaration of HTML
clearly forbids that for all elements. "<" is STAGO (Start Tag Open) and
"</" is ETAGO (End Tag Open) where both must not be followed by white
space.

> If non-greedy regular expressions are not available, you can find the
> instances manually using indexOf. It's not very effective, though,
> since it doesn't ignore case and whitespace. It can be made to work,
> but it's not nearly as much fun :)

That is why one wants to use

/<script[^>]*>[^<>]*<\/script>/ig

then. Since this is not the first time I encountered the problem,
I am going to extend my stripTags() method[1] so that you can strip
only specific tags and also their content if you want.

PointedEars
___________
[1] <http://pointedears.de.vu/scripts/string.js>

**Lasse Reichstein Nielsen** · Jul 23 '05, 10:31 AM

Re: removing content between specified tokens using java script

Thomas 'PointedEars' Lahn <PointedEars@we b.de> writes:

> Is there really a UA out there that is so b0rken to parse "< script>" as
> "<script>" and "</ script>" as "</script>"?

Probably :) But I don't know of any.

> That is why one wants to use
>
> /<script[^>]*>[^<>]*<\/script>/ig

That rules out:
---
<script type="text/javascript">
if (screen.innerWidth < 1000) { alert("your resolution sucks");}
</script>
---
since it contains a "<" inside the script.
You should match up to "</" for correctness, or up to "</script"
for compliance with browsers.
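
A small sketch of the difference, with illustrative variable names. The second pattern is one possible reading of "match up to </script"; it uses a lazy [\s\S]*? (an assumption, chosen so that line breaks are covered) and is not meant as the definitive fix.

```javascript
var html = 'a<script type="text/javascript">\n' +
           'if (screen.innerWidth < 1000) { alert("your resolution sucks");}\n' +
           '</script>b';

// [^<>]* cannot cross the "<" of the comparison, so nothing is removed.
var strict = html.replace(/<script[^>]*>[^<>]*<\/script>/ig, "");

// Matching lazily up to the first "</script" removes the whole element.
var upToEndTag = html.replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, "");
```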

/L

**Thomas 'PointedEars' Lahn** · Jul 23 '05, 10:38 AM

Re: removing content between specified tokens using java script

Lasse Reichstein Nielsen wrote:

> Thomas 'PointedEars' Lahn <PointedEars@we b.de> writes:
>> That is why one wants to use
>>
>> /<script[^>]*>[^<>]*<\/script>/ig
>
> That rules out:
> ---
> <script type="text/javascript">
> if (screen.innerWidth < 1000) { alert("your resolution sucks");}
> </script>
> ---
> since it contains a "<" inside the script.

True.

> You should match up to "</" for correctness, or up to "</script"
> for compliance with browsers.

You mean

/<script[^>]*>.*(?!<\/script>).*<\/script>/ig

and the like?

The problem is that such matches would require negative lookahead
(/(?!...)/), which requires ECMAScript 3 support. I wanted to avoid
that, since my solution was meant as a backwards-compatible alternative to
yours. But even if I used it and thus lost backwards compatibility,
I think it could still fail if someone uses "</" or "</script" or
"<\/script" within script code for some reason.

Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
if someone uses "</script>" or even "<\/script>" within the script code. So
neither the OP nor anyone "can rely on non-greedy matching if available".

Alas, until someone proves the opposite, it remains an intrinsic property of
nested expressions and languages created by such expressions like markup
languages that successful parsing of them using Regular Expressions is just
impossible in general. There are cases where RegExp parsing of such context
can be successful, though; the more detailed/strict its structure/syntax is
defined and the less nested its subexpressions are, the higher is the
statistical probability of successful RegExp parsing of it. Remember we
already had this discussion here a few months ago.

PointedEars

**Lasse Reichstein Nielsen** · Jul 23 '05, 10:38 AM

Re: removing content between specified tokens using java script

Thomas 'PointedEars' Lahn <PointedEars@we b.de> writes:

> Lasse Reichstein Nielsen wrote:
>> You should match up to "</" for correctness, or up to "</script"
>> for compliance with browsers.
>
> You mean
>
> /<script[^>]*>.*(?!<\/script>).*<\/script>/ig
>
> and the like?

> The problem is that such matches would require negative lookahead
> (/(?!...)/)

If it is to be easy, it requires either negative lookahead or
non-greedy matching:
/<script.*?>.*?<\/script\s*>/ig

However, neither gives any power to regular expressions that they
didn't have already, so you can make a regular expression without either
that matches the same expression. It's just likely to be huge.

A non-greedy match until the string abcd (/.*?abcd/) can be written as
[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

  [^a]*a                   until the first a
  (a|ba|bca)*              the next a comes before bcd is complete: restart
  ([^ba]|b[^ca]|bc[^da])   not bcd and not a: either [^ba], or b[^ca], or bc[^da]
  [^a]*a                   then find the next a and restart
  bcd                      or bcd => finished

A similar non-greedy match for ".*?</script" would be:

[^<]*<((<|\/<|\/s<|\/sc<|\/scr<|\/scri<|\/scrip<)*
([^\/<]|\/[^s<]|\/s[^c<]|\/sc[^r<]|\/scr[^i<]|\/scri[^p<]|\/scrip[^t<])
[^<]*<)*\/script

The structure is simple, so you can generate it automatically (provided
the string doesn't contain repeats of the first character!):

// Escape characters that are special in a regular expression.
function reEscape(string) {
  return string.replace(/([\[\]+*?.(){}\\\/^$|\-])/g, "\\$1");
}

// Build a pattern that matches up to and including the first
// occurrence of string.
function matchUntilRE(string) {
  if (string.length == 0) { return; }
  if (string.length == 1) {
    return "[^" + reEscape(string) + "]*" + reEscape(string);
  }
  var buf = []; // StringBuffer
  var firstChar = reEscape(string.charAt(0));
  buf.push("[^", firstChar, "]*", firstChar);
  buf.push("((");
  for (var i = 0; i < string.length - 1; i++) {
    if (i > 0) { buf.push("|"); }
    buf.push(reEscape(string.substring(1, i + 1)), firstChar);
  }
  buf.push(")*(");
  for (var i = 0; i < string.length - 1; i++) {
    if (i > 0) { buf.push("|"); }
    buf.push(reEscape(string.substring(1, i + 1)),
             "[^", reEscape(string.charAt(i + 1)), firstChar, "]");
  }
  buf.push(")");
  buf.push("[^", firstChar, "]*", firstChar);
  buf.push(")*");
  buf.push(reEscape(string.substring(1)));
  return buf.join("");
}

(Yey, it gives me exactly the same as the one I created manually :)

I don't see how a non-greedy match until </script can fail.

> Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
> if someone uses "</script>" or even "<\/script>" within the script code.

Fails how? The first is not permitted inside script code (it should
end the script right there), the latter is, and should not be matched
by a search for "</script".
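
To make the "<\/script>" case concrete, here is a hedged sketch (the sample page and variable names are invented for illustration). The end tag inside the string literal is written with a backslash, so the character sequence "</script" first occurs at the real end tag.

```javascript
// Script code that writes another script element. In the page text the
// inner end tag reads <\/script>, i.e. "<", "\", "/", not "</".
var page = 'x<script>document.write("<script>alert(1)<\\/script>");</script>y';

// Lazy match from a start tag up to the first literal "</script".
// The escaped inner end tag is skipped; the real one ends the match.
var cleaned = page.replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, "");
```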

The only problem I see here is the decision whether to search for
</ or </script. I'd go for the latter, for the same reason browsers
do it: it is sufficient, and allows erroneous scripts without breaking.

> Alas, until someone proves the opposite, it remains an intrinsic property of
> nested expressions and languages created by such expressions like markup
> languages that successful parsing of them using Regular Expressions is just
> impossible in general.

Yes, but we are not parsing the HTML here.

> There are cases where RegExp parsing of such context
> can be successful, though; the more detailed/strict its structure/syntax is
> defined and the less nested its subexpressions are, the higher is the
> statistical probability of successful RegExp parsing of it.

Exactly. And the script element does not contain markup so it cannot
be nested. It stops at the *first* following occurrence of "</script",
which is something RE's can test for successfully.

Likewise, you can use regexps to find all tags in a document, because
tags are not nested (elements are).
/L

**Lasse Reichstein Nielsen** · Jul 23 '05, 10:38 AM

Re: removing content between specified tokens using java script

Lasse Reichstein Nielsen <lrn@hotpop.com > writes:

[lookahead and non-greedy matching]
> However, neither gives any power to regular expressions that they
> didn't have already, so you can make a regular expression without either
> that matches the same expression. It's just likely to be huge.

I'm confusing two things here.

It is correct that non-greedy matching doesn't allow regular
expressions to match anything they couldn't without. They don't even
need to be rewritten to match the same strings, just use the greedy
operators instead. What non-greedy matching does is, when there is
*more* than one way to match a string, the returned match will be the
shortest possible.

> A non-greedy match until the string abcd (/.*?abcd/) can be written as

> [^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

That is incorrect. This expression matches the string up to and including
the first occurrence of abcd. That is not the same as a non-greedy .*?,
which can match past the first occurrence if needed.

Matching up to the first occurrence is what we need in this case, but
it is not the same as non-greedy matching.
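
The distinction can be seen in a short sketch. The input string and the "^"/"$" anchors are additions for illustration only: they force each pattern to account for the whole input.

```javascript
var input = "abcdXabcd";

// Non-greedy ".*?" prefers the shortest match, but it may extend past the
// first "abcd" when the rest of the pattern (here "$") demands it.
var lazy = /^.*?abcd$/.exec(input);

// The hand-built expression stops at the first "abcd", so with "$"
// appended it cannot match this input at all.
var firstOnly =
  /^[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)$/.exec(input);
```

Here lazy matches the whole input, while firstOnly is null.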

/L

**Thomas 'PointedEars' Lahn** · Jul 23 '05, 10:38 AM

Re: removing content between specified tokens using java script

Lasse Reichstein Nielsen wrote:

> Matching up to the first occurrence is what we need in this case,

No, it is not, as we are trying to parse a markup language, consisting of
nested subexpressions. The first occurrence of the close tag after the open
tag is not necessarily the correct one as I already pointed out.

PointedEars

**Thomas 'PointedEars' Lahn** · Jul 23 '05, 10:38 AM

Re: removing content between specified tokens using java script

Lasse Reichstein Nielsen wrote:

> Thomas 'PointedEars' Lahn <PointedEars@we b.de> writes:
>> Lasse Reichstein Nielsen wrote:
>>> You should match up to "</" for correctness, or up to "</script"
>
> [...]
> I don't see how a non-greedy match until </script can fail.
>
>> Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
>> if someone uses "</script>" or even "<\/script>" within the script code.
>
> Fails how? The first is not permitted inside script code (it should
> end the script right there), the latter is, and should not be matched
> by a search for "</script".

Note that although SGML specifies that ETAGO, rather than the entire end
tag, ends an element, not all UAs follow the spec in this regard, so one
could use the non-conforming syntax and get away with it, e.g. by placing
malicious code within a bulletin board posting viewed with IE. Such cases
need to be covered.

> [...]
>> Alas, until someone proves the opposite, it remains an intrinsic property of
>> nested expressions and languages created by such expressions like markup
>> languages that successful parsing of them using Regular Expressions is just
>> impossible in general.
>
> Yes, but we are not parsing the HTML here.

IBTD.

PointedEars

**Lasse Reichstein Nielsen** · Jul 23 '05, 10:39 AM

Re: removing content between specified tokens using java script

Thomas 'PointedEars' Lahn <PointedEars@we b.de> writes:

> Lasse Reichstein Nielsen wrote:
>
>> Matching up to the first occurrence is what we need in this case,
>
> No, it is not, as we are trying to parse a markup language, consisting of
> nested subexpressions.

But we are not. We are trying "to remove the content embedded in
<script> and </script> tags". Script tags have CDATA as content type,
so they do not contain nested HTML tags.

It is true that regular expressions cannot match recursive tree structures
(HTML is really a special case of the "matched parenthesis" problem, the
traditional non-recursive language).

> The first occurrence of the close tag after the open
> tag is not necessarily the correct one as I already pointed out.

Yes it is. In HTML, the script tag ends at the first occurrence of
"</". Browsers don't follow the HTML specification and end script tags
at the first occurrence of the literal character sequence "</script".
There is no way to include that literal sequence inside a script tag.

/L

**Lasse Reichstein Nielsen** · Jul 23 '05, 10:39 AM

Re: removing content between specified tokens using java script

Lasse Reichstein Nielsen <lrn@hotpop.com > writes:

> (HTML is really a special case of the "matched parenthesis" problem, the
> traditional non-recursive language).

non-REGULAR, of course. It's definitely recursive :)

/L

**Thomas 'PointedEars' Lahn** · Jul 23 '05, 10:39 AM

Re: removing content between specified tokens using java script

Lasse Reichstein Nielsen wrote:

> Thomas 'PointedEars' Lahn <PointedEars@we b.de> writes:
>> Lasse Reichstein Nielsen wrote:
>>> Matching up to the first occurrence is what we need in this case,
>>
>> No, it is not, as we are trying to parse a markup language, consisting of
>> nested subexpressions.
>
> But we are not. We are trying "to remove the content embedded in
> <script> and </script> tags". Script tags have CDATA as content type,

True if you mean the content model of the HTML "script" element.

> so they are not containing nested HTML tags.

False. CDATA is content that is not parsed by an HTML UA and
thus it does not contribute to the parse tree. It can contain
(nested) <script type="text/javascript">
document.write('<strong><em>tags</em></strong>'); // [1]
</script> anyway.

[1] Yes, I know that this is invalid HTML but it works in
non-conforming UAs and this is for demo only, anyway.

>> The first occurrence of the close tag after the open
>> tag is not necessarily the correct one as I already pointed out.
>
> Yes it is. In HTML, the script tag ends at the first occurrence of
> "</".

True.

> Browsers don't follow the HTML specification and end script tags
> at the first occurrence of the literal character sequences "</script".

s/tags/elements/

ACK, my bad.

> There is no way to include that literal sequence inside a script tag.

Well, you *can* include it in a "script" element's content but it does
not *work* as intended (a script error due to incomplete code is highly
likely). Yet garbage content remains if scriptwise parsing/replacement
follows that misguided paradigm. That is clearly a Bad Thing.

So (again) no RegExp presented in this thread (incl. mine) is suitable to
solve the problem (which this discussion is about after all). Instead one
should write a markup parser prototype or use a (DOM) object that provides
such a functionality.

PointedEars
