String.replace(/</g,'&lt;');

  • higabe

    String.replace(/</g,'&lt;');

    Three questions

    1)

    I have a string function that works perfectly but according to W3C.org
    web site is syntactically flawed because it contains the characters </
    in sequence. So how am I supposed to write this function?

    String.replace(/</g,'&lt;');

    2)

    While I'm on the subject, anyone know why they implemented replace using
    a slash delimiter instead of quotes? I know it's how it's done in Perl
    but why is it done that way?

    3)

    One last regexp question:
    is it possible to do something like this:
    String.replace(/<(.*?)>(.*?)</$1>/ig,'&lt;$1&gt;$2&lt;/$1&gt;');
    This is just an example where a sub-match used in a regular expression
    must sub-match again exactly as it did the first time later in the same
    string. But I don't know how to do that in a regexp although it seems
    like it should be possible.
  • Lasse Reichstein Nielsen

    #2
    Re: String.replace(/</g,'&lt;');

    higabe <higabe@hotmail.com> writes:

    > Three questions
    >
    > 1)
    >
    > I have a string function that works perfectly but according to W3C.org
    > web site is syntactically flawed because it contains the characters </
    > in sequence. So how am I supposed to write this function?
    >
    > String.replace(/</g,'&lt;');

    Hmm, I can see that I have some of those too, the most recent of them
    written today. Bummer. I never noticed that there was a </-sequence in
    that.
    Try
    String.replace(/[<]/g,'&lt;');
    or
    String.replace( RegExp("<","g") ,'&lt;');
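
A quick sketch of the two suggestions above, wrapped in functions so they can be compared side by side. The function names and the sample string are made up for illustration; neither version puts the characters < and / next to each other in the source.

```javascript
// Two ways to match "<" without "<" being followed directly by "/"
// in the script source. Function names are hypothetical.
function escapeLtWithClass(text) {
  // "[<]" is a one-character class, so the source reads "[<]/g".
  return text.replace(/[<]/g, "&lt;");
}

function escapeLtWithConstructor(text) {
  // The pattern is an ordinary string, so no slash delimiter is involved.
  return text.replace(new RegExp("<", "g"), "&lt;");
}

var sample = "1 < 2 < 3";
console.log(escapeLtWithClass(sample));       // 1 &lt; 2 &lt; 3
console.log(escapeLtWithConstructor(sample)); // 1 &lt; 2 &lt; 3
```
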

    > 2)
    >
    > While I'm on the subject, anyone know why they implemented replace using
    > a slash delimiter instead of quotes? I know it's how it's done in Perl
    > but why is it done that way?

    They didn't implement "replace" with slash-delimiters. They
    implemented *regular expressions* with slash-delimiters. You can use
    regular expressions in many other ways than just string-replace.

    You could also write
    var myRegExp = /[<]/g;
    String.replace( myRegExp,'&lt;' );

    These are equivalent uses of regular expressions and strings:
    /a*b/i.exec("caabc")
    and
    "caabc".match(/a*b/i)

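A small check of that equivalence. It relies on the pattern having no "g" flag; with "g", match returns the list of all matches instead of an exec-style result. The variable names are illustrative only.

```javascript
// Same pattern, called from the regexp side and from the string side.
var viaExec = /a*b/i.exec("caabc");
var viaMatch = "caabc".match(/a*b/i);

// Both report the matched text and where it starts.
console.log(viaExec[0], viaExec.index);   // aab 1
console.log(viaMatch[0], viaMatch.index); // aab 1
```
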
    > 3)
    >
    > One last regexp question:
    > is it possible to do something like this:
    > String.replace(/<(.*?)>(.*?)</$1>/ig,'&lt;$1&gt;$2&lt;/$1&gt;');

    Yes, but you need to escape the slash in "</" and it's "\1" instead of
    "$1". Also you will only want to match the tag name, not attributes,
    and you have no letters, so the "i" flag is not necessary. And don't
    call a variable "String", since it conflicts with the global variable
    holding the constructor of String objects.

    So, this should do what you wanted:

    string.replace(/<\s*(\w+)\b(.*?)>(.*?)<\/\1>/g,
                   '&lt;$1$2&gt;$3&lt;/$1&gt;');

    It gets confused if ">" occurs inside an attribute value, e.g.
    <tag attr="foo>bar">. Just don't do that :)
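
To illustrate what the backreference buys you, here is a minimal sketch. The helper name and the sample strings are assumptions made for this example: the first input has matching open and close tags, the second does not, so "\1" refuses to match it.

```javascript
// Hypothetical helper wrapping the replace call above.
function escapeElement(text) {
  // \1 must repeat exactly what group 1 (the tag name) matched.
  return text.replace(/<\s*(\w+)\b(.*?)>(.*?)<\/\1>/g,
                      "&lt;$1$2&gt;$3&lt;/$1&gt;");
}

// "\/" in a string literal is just "/"; written this way the sample
// strings do not put "<" and "/" next to each other either.
var matched = escapeElement('<b class="x">bold<\/b> text');
var mismatched = escapeElement("<b>bold<\/i>");

console.log(matched);    // &lt;b class="x"&gt;bold&lt;/b&gt; text
console.log(mismatched); // <b>bold</i>  (closing tag differs, so no match)
```
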

    It doesn't handle nested tags either. That is still outside the power
    of regular expressions, even with backreference.

    There are ways around that, though, using a function as second argument
    of replace, allowing us to use recursion:

    function tagify(string) {
      return string.replace(/<\s*(\w+)\b(.*?)>(.*?)<\/\1>/g,
        function(match, sub1, sub2, sub3) {
          return "&lt;" + sub1 + sub2 + "&gt;" +
                 tagify(sub3) +
                 "&lt;/" + sub1 + "&gt;";
        });
    }
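
For comparison, a sketch that runs the plain-string replacement and the recursive one on the same nested input. Both functions are renamed copies made for this example (so the block runs on its own); the input string is also an assumption.

```javascript
// Plain-string replacement: only the outermost element is rewritten.
function escapeOuterOnly(text) {
  return text.replace(/<\s*(\w+)\b(.*?)>(.*?)<\/\1>/g,
                      "&lt;$1$2&gt;$3&lt;/$1&gt;");
}

// Function replacement: the element content is processed again,
// so elements nested inside it are rewritten too.
function escapeNested(text) {
  return text.replace(/<\s*(\w+)\b(.*?)>(.*?)<\/\1>/g,
    function (match, name, attrs, content) {
      return "&lt;" + name + attrs + "&gt;" +
             escapeNested(content) +
             "&lt;/" + name + "&gt;";
    });
}

var nested = "<b><i>x<\/i><\/b>"; // "\/" is just "/" in a string literal
console.log(escapeOuterOnly(nested)); // &lt;b&gt;<i>x</i>&lt;/b&gt;
console.log(escapeNested(nested));    // &lt;b&gt;&lt;i&gt;x&lt;/i&gt;&lt;/b&gt;
```

Note that an element nested inside another element of the same name still trips up the non-greedy match, since the first closing tag found ends the outer element.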

    This still fails for elements with no closing tag. It could probably
    be made to work for XHTML, where all tags have end tags (sometimes
    abbreviated to just end in "/>"):
    /<\s*(\w+)\b(|.*?[^/])(?:\/>|>(.*?)<\/\1>)/g
     ^start tag
      ^optional whitespace
         ^tagname
                ^optional attributes, not ending in /
                          ^either >content</tagname> or just />

    The XHTML parser would then be:

    function tagify(string) {
      return string.replace(
        /<\s*(\w+)\b(|.*?[^/])(?:\/>|>(.*?)<\/\1>)/g,
        function(match, sub1, sub2, sub3) {
          return "&lt;" + sub1 + " " + sub2 +
                 (sub3 !== undefined ?
                    "&gt;" + tagify(sub3) +
                    "&lt;/" + sub1 + "&gt;" :
                    "/>");
        });
    }


    Hmm. I feel stupid, considering the much larger parser for XHTML that
    I made some time ago. Oh well, at least it handled ">" inside
    attribute values :).

    > This is just an example where a sub-match used in a regular expression
    > must sub-match again exactly as it did the first time later in the same
    > string.

    It works in recent versions of Javascript/ECMAScript. Earlier ones didn't
    have non-greedy matches (*?) or backreferences (\1).

    > But I don't know how to do that in a regexp although it seems
    > like it should be possible.

    It is, and you were close.


    Adding backreferences to regular expressions gives them more power than
    "real" regular expressions, i.e., they can be used to match something that
    is not a regular language. Example:

    /^(11+)\1+$/

    This regular expression matches any string of 1's that can be written
    as two or more repetitions of two or more 1's. That is, unary representation
    of composite numbers.
    !/^(11+)\1+$/.test("--string of n 1's--")
    is a test for whether n is prime (but not a very efficient one).
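
As a sketch, the test can be wrapped in a function and run over small numbers. The function name and the guard for n below 2 are additions for this example (the bare regexp would report 0 and 1 as "not composite"), and n is assumed to be a non-negative integer.

```javascript
// Hypothetical wrapper around the regexp above.
function isPrimeUnary(n) {
  // Added guard: 0 and 1 are not prime, but they are not composite either,
  // so the regexp alone cannot tell them apart from primes.
  if (n < 2) return false;
  var ones = new Array(n + 1).join("1"); // a string of n 1's
  return !/^(11+)\1+$/.test(ones);
}

var primes = [];
for (var n = 0; n <= 20; n++) {
  if (isPrimeUnary(n)) primes.push(n);
}
console.log(primes.join(" ")); // 2 3 5 7 11 13 17 19
```
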

    /L
    --
    Lasse Reichstein Nielsen - lrn@hotpop.com
    DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
    'Faith without judgement merely degrades the spirit divine.'


    • Dr John Stockton

      #3
      Re: String.replace(/</g,'&lt;');

      JRS: In article <3FB9542F.99C495D7@hotmail.com>, seen in
      news:comp.lang.javascript, higabe <higabe@hotmail.com> posted at Mon, 17
      Nov 2003 23:03:18 :-
      >I have a string function that works perfectly but according to W3C.org
      >web site is syntactically flawed because it contains the characters </
      >in sequence. So how am I supposed to write this function?
      >
      >String.replace(/</g,'&lt;');

      String.replace(/\</g,'&lt;'); appears acceptable to MSIE 4. It may,
      however, be deprecated; you could try \x3c and \o74 to replace < .

      --
      © John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 IE 4 ©
      <URL:http://jibbering.com/faq/> Jim Ley's FAQ for news:comp.lang.javascript
      <URL:http://www.merlyn.demon.co.uk/js-index.htm> JS maths, dates, sources.
      <URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/JS/&c., FAQ topics, links.


      • Lasse Reichstein Nielsen

        #4
        Re: String.replace(/</g,'&lt;');

        Dr John Stockton <spam@merlyn.demon.co.uk> writes:

        > JRS: In article <3FB9542F.99C495D7@hotmail.com>, seen in
        > news:comp.lang.javascript, higabe <higabe@hotmail.com> posted at Mon, 17
        > Nov 2003 23:03:18 :-
        >>I have a string function that works perfectly but according to W3C.org
        >>web site is syntactically flawed because it contains the characters </
        >>in sequence. So how am I supposed to write this function?
        >>
        >>String.replace(/</g,'&lt;');
        >
        > String.replace(/\</g,'&lt;'); appears acceptable to MSIE 4.

        The original is also acceptable to all the browsers I have access to,
        and they are equally incorrect according to the HTML 4 specification.
        The problem is the character sequence "</", and it is still there.

        > It may, however, be deprecated; you could try \x3c and \o74 to
        > replace < .

        I haven't heard of \o??. Do you mean octal \074?
        It should work then (or \u003c). In either case it is a Javascript
        escape for the character, so the HTML parser won't see "</".
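
A minimal sketch of the escaped forms. Only the hex and Unicode escapes are shown; the variable names and the sample input are made up for illustration.

```javascript
// Both literals match "<", yet neither contains that character in its source.
var hexForm = /\x3c/g;
var unicodeForm = /\u003c/g;

var input = "1 < 2";
console.log(input.replace(hexForm, "&lt;"));     // 1 &lt; 2
console.log(input.replace(unicodeForm, "&lt;")); // 1 &lt; 2

// The pattern text keeps the escape, so no literal "<" appears in it.
console.log(hexForm.source.indexOf("<"));        // -1
```
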

        /L
        --
        Lasse Reichstein Nielsen - lrn@hotpop.com
        DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
        'Faith without judgement merely degrades the spirit divine.'


        • Dr John Stockton

          #5
          Re: String.replace(/</g,'&lt;');

          JRS: In article <smklmq00.fsf@hotpop.com>, seen in
          news:comp.lang.javascript, Lasse Reichstein Nielsen <lrn@hotpop.com>
          posted at Tue, 18 Nov 2003 20:32:47 :-
          >Dr John Stockton <spam@merlyn.demon.co.uk> writes:
          >
          >> JRS: In article <3FB9542F.99C495D7@hotmail.com>, seen in
          >> news:comp.lang.javascript, higabe <higabe@hotmail.com> posted at Mon, 17
          >> Nov 2003 23:03:18 :-
          >>>I have a string function that works perfectly but according to W3C.org
          >>>web site is syntactically flawed because it contains the characters </
          >>>in sequence. So how am I supposed to write this function?
          >>>
          >>>String.replace(/</g,'&lt;');
          >>
          >> String.replace(/\</g,'&lt;'); appears acceptable to MSIE 4.
          >
          >The original is also acceptable to all the browsers I have access to,
          >and they are equally incorrect according to the HTML 4 specification.
          >The problem is the character sequence "</", and it is still there.

          Oops.

          >> It may, however, be deprecated; you could try \x3c and \o74 to
          >> replace < .
          >
          >I haven't heard of \o??. Do you mean octal \074?

          It (\o??) is/was in the NS reference page for RegExp, "Last Updated:
          05/28/99 12:00:15"; the little o can be replaced, in my browser, by big
          o or by zero. There, using \0 (zero) is rather like using \1 or \2,
          which can be *either* octal or back-references.

          "Last Updated September 28, 2000" omits \o and says that \0 must not be
          followed by another digit (matches NUL character). "octal" does not
          occur in the page, nor in the RegExp region of ECMA-262 3rd Edn.

          Hence, while \o74 \O74 \074 may work, \x3c should be used instead of
          them.

          >It should work then (or \u003c). In either case it is a Javascript
          >escape for the character, so the HTML parser won't see "</".

          --
          © John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 MIME. ©
          Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
          Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
          Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)
