Regular expression help

  • Christoph Boget

    Regular expression help

    I'm trying to get a regular expression to work in JS. It appears to be
    working everywhere else I'm testing it (an app called Regex Coach and php)
    but I can't seem to get it to work in JS. What the regex is supposed to do
    is:

    <p></p>
    OR
    <br>
    OR
    <br/>
    OR
    <br />
    OR
    <p></p>
    OR
    <p><br></p>
    OR
    <p><br/></p>
    OR
    <p><br /></p>
    OR
    <p>[multiple of any of the above br's]</p>

    taking into account any number of interlaced spaces. The regex I came up
    with is:

    ^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$

    which, as I said, seems to work elsewhere. However, no matter how I try to
    use it in JS, using the test() method against it returns true against text
    when it should be returning false. The things I've tried are as follows:

    var re = /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/;
    re.test( MyStringValue );

    var re = /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/gim;
    re.test( MyStringValue );

    var re = new RegExp(
    '^\<p\s*\>[\<br\s*\/{0,1}\>|\s]*\<\/p\>|[\<br\s*\/{0,1}\>|\s]*$' );
    re.test( MyStringValue );

    var re = new RegExp(
    '^\<p\s*\>[\<br\s*\/{0,1}\>|\s]*\<\/p\>|[\<br\s*\/{0,1}\>|\s]*$', 'gim' );
    re.test( MyStringValue );

    But it's failing the test (returning true) on things like

    <p>
    <br>
    <br/>lskadfakjsdf;l ja <br>
    </p>

    for example. What gives? Am I doing something wrong? It seems like it's
    working elsewhere, just not in JS...

    thnx,
    Christoph

  • Thomas 'PointedEars' Lahn

    #2
    Re: Regular expression help

    Christoph Boget wrote:
    > I'm trying to get a regular expression to work in JS. It appears to be
    > working everywhere else I'm testing it (an app called Regex Coach and php)

    I recommend using QuickREx instead:

    <http://www.bastian-bergerhoff.com/eclipse/features/web/QuickREx/toc.html>

    > but I can't seem to get it to work in JS. What the regex is supposed to do
    > is:
    >
    > <p></p>
      ^^^^^^^
    > OR
    > <br>
    > OR
    > <br/>
    > OR
    > <br />
    > OR
    > <p></p>
      ^^^^^^^
    ISTM there is a duplicate.

    > OR
    > <p><br></p>
    > OR
    > <p><br/></p>
    > OR
    > <p><br /></p>
    > OR
    > <p>[multiple of any of the above br's]</p>
    >
    > taking into account any number of interlaced spaces.

    You probably mean in-between *whitespace* instead, as that is what \s matches.

    > The regex I came up with is:
    >
    > ^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$
    >
    > which, as I said, seems to work elsewhere.

    It will not work in JScript before version 5.5 (by default: MSHTML before
    version 5.5) because of the non-capturing parentheses.

    > However, no matter how I try to use it in JS, using the test() method
    > against it returns true against text when it should be returning false.
    > The things I've tried are as follows:
    >
    > (1)
    > var re = /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/;
    > re.test( MyStringValue );

    It should be `myStringValue' as the identifier does not denote a constructor.

    > (2)
    > var re = /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/gim;
    > re.test( MyStringValue );

    The `m' modifier means that `^' matches at the start of each line rather
    than only at the start of input, and `$' matches at the end of each line
    rather than only at the end of input.
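
To make the effect of the `m' modifier concrete, here is a minimal sketch. The input string and the variable names are made up for illustration; they are not from the thread.

```javascript
// Hypothetical input: junk on the first line, a lone <br> on the second.
const input = 'some text\n<br>';

// Without m, ^ and $ anchor to the start and end of the whole input.
const wholeInput = /^<br>$/;
// With m, ^ and $ also match at the start and end of every line.
const perLine = /^<br>$/m;

console.log(wholeInput.test(input)); // false
console.log(perLine.test(input));    // true
```

So a pattern meant to validate the whole string can succeed on a single well-formed line once `m' is added.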

    > (3)
    > var re = new RegExp(
    > '^\<p\s*\>[\<br\s*\/{0,1}\>|\s]*\<\/p\>|[\<br\s*\/{0,1}\>|\s]*$' );
    > re.test( MyStringValue );
    > (4)
    > var re = new RegExp(
    > '^\<p\s*\>[\<br\s*\/{0,1}\>|\s]*\<\/p\>|[\<br\s*\/{0,1}\>|\s]*$', 'gim' );
    > re.test( MyStringValue );

    You don't need to escape `<' or `>' in string literals. Instead, you must
    escape all backslashes in the expression, as currently you are passing the
    equivalent of

    "^<ps*>[<brs*/{0,1}>|s]*</p>|[<brs*/{0,1}>|s]*$"

    Unlike in PHP, single-quoted and double-quoted strings do not differ in
    ECMAScript implementations. And unlike PHP, ECMAScript implementations do
    not support Perl-Compatible Regular Expressions (PCRE), although I am pretty
    sure `[' and `]' are not used for capturing in PCRE either.
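
The backslash point can be checked directly by looking at the `source' of the resulting RegExp objects. This is a small sketch with a shortened, made-up pattern; the variable names are illustrative only.

```javascript
// In a string literal, '\s' is an unknown escape and collapses to plain 's'.
const unescaped = new RegExp('^<p\s*>$');
// Doubling the backslash passes a real \s through to the regex engine.
const escaped = new RegExp('^<p\\s*>$');

console.log(unescaped.source); // ^<ps*>$
console.log(escaped.source);   // ^<p\s*>$

console.log(unescaped.test('<p >'));  // false: a space is not the letter "s"
console.log(unescaped.test('<pss>')); // true: matches literal "s" characters
console.log(escaped.test('<p >'));    // true
```

A regex literal (/^<p\s*>$/) avoids the double escaping altogether.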

    > But it's failing the test (returning true) on things like
    >
    > <p>
    > <br>
    > <br/>lskadfakjsdf;l ja <br>
    > </p>
    >
    > for example. What gives?

    (1) /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/
    matches the empty word at the end of input
    with /(?:<br\s*\/?>|\s)*\s*$/

    (2) /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/gim
    first matches "<br>" in line 2, then " <br>" at the end of line 3,
    with /(?:<br\s*\/?>|\s)*\s*$/gim

    (3) /^<ps*>[<brs*/{0,1}>|s]*</p>|[<brs*/{0,1}>|s]*$/
    matches ">" at the end of input (in the last line),
    with /[<brs*/{0,1}>|s]*$/

    (4) /^<ps*>[<brs*/{0,1}>|s]*</p>|[<brs*/{0,1}>|s]*$/gim
    first matches ">" in the first line (containing "<p>"),
    with /[<brs*/{0,1}>|s]*$/gim

    > Am I doing something wrong?

    Obviously.

    > It seems like it's working elsewhere, just not in JS...

    That's highly unlikely. You have not considered how greedy matching works
    (/x*/ matches the empty word, maybe you were looking for /x+/ instead) and
    you have been using fantasy syntax (`[...]' denotes a character class
    instead). Furthermore, you failed to observe that anchors are part of the
    operand of the alternation: /^ab|cd$/ matches either "ab" at the beginning
    of input or "cd" at the end of input, not an input consisting of either "ab"
    or "cd" (that would have to be /^(ab|cd)$/ or /^(?:ab|cd)$/).
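
The two points above (anchors belonging to one alternative only, and /x*/ matching the empty word) can be tried out with a few throwaway patterns. The names `loose' and `grouped' and the test strings are illustrative assumptions.

```javascript
// Anchors bind tighter than |, so each anchor belongs to one alternative only.
const loose = /^ab|cd$/;        // "ab" at the start OR "cd" at the end
const grouped = /^(?:ab|cd)$/;  // the whole input is "ab" or "cd"

console.log(loose.test('abXYZ'));   // true
console.log(loose.test('XYZcd'));   // true
console.log(grouped.test('abXYZ')); // false
console.log(grouped.test('cd'));    // true

// x* is satisfied by zero occurrences, so it matches the empty word
// at the end of any input; x+ needs at least one "x" there.
console.log(/x*$/.test('abc')); // true
console.log(/x+$/.test('abc')); // false
```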

    <http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Reference:Global_Objects:RegExp>


    HTH

    PointedEars
    --
    Anyone who slaps a 'this page is best viewed with Browser X' label on
    a Web page appears to be yearning for the bad old days, before the Web,
    when you had very little chance of reading a document written on another
    computer, another word processor, or another network. -- Tim Berners-Lee


    • Thomas 'PointedEars' Lahn

      #3
      Re: Regular expression help

      Thomas 'PointedEars' Lahn wrote:
      > Christoph Boget wrote:
      >> But it's failing the test (returning true) on things like
      >>
      >> <p>
      >> <br>
      >> <br/>lskadfakjsdf;l ja <br>
      >> </p>
      >>
      >> for example. What gives?

      [...]

      > (2) /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/gim
      > first matches "<br>" in line 2, then " <br>" at the end of line 3,
      > with /(?:<br\s*\/?>|\s)*\s*$/gim

      Correction: It matches the substring consisting of the newline before "<br>"
      (because of the /(...|\s)*/gim alternation) followed by "<br>" followed by
      the newline after "<br>", and it matches the newline after "<br>" again
      because of /(...)*\s*$/gim.

      > (3) /^<ps*>[<brs*/{0,1}>|s]*</p>|[<brs*/{0,1}>|s]*$/
      > matches ">" at the end of input (in the last line),
      > with /[<brs*/{0,1}>|s]*$/

      It also matches the empty word at the end of input because of /[...]*$/.

      > (4) /^<ps*>[<brs*/{0,1}>|s]*</p>|[<brs*/{0,1}>|s]*$/gim
      > first matches ">" in the first line (containing "<p>"),
      > with /[<brs*/{0,1}>|s]*$/gim

      It also matches:

      - the empty word after "<p>" because of /[...]*$/
      - "<br>" in line 2 because of /[...<br...>...]*$/
      - the empty word after "<br>" in line 2 because of /[...>...]*$/
      - "<br>" in line 3 because of /[<br...>]*$/
      - the empty word after "<br>" in line 3 because of /[...>...]*$/
      - ">" in line 4 because of /[...>...]*$/
      - the empty word after "</p>" in line 4 because of /[...]*$/
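
The list of matches for attempt (4) can be reproduced with a short script. This is only a sketch: it assumes an engine that provides String.prototype.matchAll (not available when this thread was written), and it assumes the sample text contains no trailing whitespace.

```javascript
// Attempt (4) exactly as posted: single backslashes inside a string literal,
// so the engine sees ^<ps*>[<brs*/{0,1}>|s]*</p>|[<brs*/{0,1}>|s]*$
const re = new RegExp(
  '^\<p\s*\>[\<br\s*\/{0,1}\>|\s]*\<\/p\>|[\<br\s*\/{0,1}\>|\s]*$', 'gim');

// The sample input from the original post, assuming no trailing whitespace.
const text = '<p>\n<br>\n<br/>lskadfakjsdf;l ja <br>\n</p>';

// matchAll needs the g flag and steps past zero-length matches on its own.
const matches = [...text.matchAll(re)].map((m) => m[0]);

console.log(JSON.stringify(matches));
// [">","","<br>","","<br>","",">",""]
```

Every match comes from the second alternative, where the bracketed part acts as a character class repeated zero or more times before an end-of-line.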

      (Did I already mention QuickREx rules? :))


      PointedEars


      • Lasse Reichstein Nielsen

        #4
        Re: Regular expression help

        "Christoph Boget" <jcboget@yahoo.com> writes:
        > I'm trying to get a regular expression to work in JS. It appears to
        > be working everywhere else I'm testing it (an app called Regex Coach
        > and php) but I can't seem to get it to work in JS. What the regex is
        > supposed to do is:

        If I understand correctly:
        either a <br>, optionally with whitespace after the "br" and/or a slash
        before the ">",
        or zero or more of these br's interleaved with optional whitespace
        and optionally flanked by <p> and </p> with optional surrounding whitespace.

        From that I construct the following:

        /^\s*(?:<p\s*>\s*(?:<br\s*\/?>\s*)*<\/p\s*>\s*|(?:<br\s*\/?>\s*)*)$/i
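
A quick way to check a grouped pattern like this is to run it over the cases from the original post. The sketch below assumes optional whitespace (`\s*') after each <br> in both alternatives; the names `emptyMarkup' and `samples' are hypothetical.

```javascript
// One group around both alternatives, so ^\s* and $ apply to each of them.
// Assumption: whitespace after every <br> is optional (\s*) in both branches.
const emptyMarkup =
  /^\s*(?:<p\s*>\s*(?:<br\s*\/?>\s*)*<\/p\s*>\s*|(?:<br\s*\/?>\s*)*)$/i;

// Inputs that should count as "empty" markup.
const samples = ['<p></p>', '<br>', '<br/>', '<br />', '<p><br /></p>',
                 ' <p> <br> <br /> </p> '];
for (const s of samples) {
  console.log(JSON.stringify(s), emptyMarkup.test(s)); // true for each sample
}

// The counterexample from the original post contains real text.
const withText = '<p>\n<br>\n<br/>lskadfakjsdf;l ja <br>\n</p>';
console.log(emptyMarkup.test(withText)); // false
```

Note that under this reading the empty string is accepted as well, since the second alternative allows zero <br> elements.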

        > The regex I came up with is:
        >
        > ^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$

        The middle "|" separates the entire regexp into two parts. That means
        that the ^ and $ anchors, as well as their adjacent whitespace, are in
        separate alternatives. You probably want to group the alternatives
        so that the anchors and whitespace apply to both.

        .....

        > But it's failing the test (returning true) on things like
        >
        > <p>
        > <br>
        > <br/>lskadfakjsdf;l ja <br>
        > </p>

        That's because its second alternative, anchored only at the
        end, allows matching the empty string.

        If you change "test" to "exec", you can see what was matched.
        In the above case, it's a zero-length string (unless there
        is some whitespace hidden after the </p>).

        If you also provide the "g" flag, you can check the regexp's
        "lastIndex" property after the match, to see the index of the
        first character after the matched string. In the above case,
        it's at the end of the string.

        I.e., it matches the zero-length string at the end.
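
The exec/lastIndex check described above might look like the following sketch. It uses attempt (1) with the "g" flag added and assumes the sample text has no whitespace after the </p>.

```javascript
// The sample from the original post, assuming nothing follows the </p>.
const text = '<p>\n<br>\n<br/>lskadfakjsdf;l ja <br>\n</p>';

// Attempt (1) with the g flag, so that lastIndex is updated by exec.
const re = /^\s*<p\s*>(?:<br\s*\/?>|\s)*<\/p>|(?:<br\s*\/?>|\s)*\s*$/g;

const match = re.exec(text);

console.log(JSON.stringify(match[0]));     // "" (a zero-length match)
console.log(match.index === text.length);  // true: it starts at the very end
console.log(re.lastIndex === text.length); // true: and ends there as well
```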

        > for example. What gives? Am I doing something wrong? It seems like
        > it's working elsewhere, just not in JS...

        "seems" like it's working "elsewhere". To channel Chandler Bing:
        Could it BE any less precise? :)

        Your Javascript is working according to specification.

        /L
        --
        Lasse Reichstein Nielsen
        DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
        'Faith without judgement merely degrades the spirit divine.'
