Regexp: Case-insensitive matching | N factorial

  • gentsquash@gmail.com

    Regexp: Case-insensitive matching | N factorial

    In a setting where I can specify only a JS regular
    expression, but not the JS code that will use it, I seek
    a regexp component that matches a string of letters,
    ignoring case. E.g., for "cat" I'd like the effect of

    ([Cc][Aa][Tt])

    but without having to have many occurrences of [Xx].


    Secondly, what is an efficient regexp that matches a
    string exactly when ALL words in a certain list occur in
    the string? I'd like the effect of

    (cat.*nip|nip.*cat)

    except that there are N words rather than just the two
    words "cat" and "nip". (I can assume that no word in the
    list is a prefix of any other.) Naturally, I'm looking for
    a regexp-solution that does not involve listing all
    N factorial
    many orderings.

    --Jonathan LF King, Mathematics dept, Univ. of Florida
  • Thomas 'PointedEars' Lahn

    #2
    Re: Regexp: Case-insensitive matching | N factorial

    RobG wrote:
    >If you want to match the word cat exactly, then:
    >
    > var reA = /\bcat\b/i;

    That depends on how you define a word. If you define a word as a sequence
    of word characters as specified in the ECMAScript Language Specification,
    Ed. 3 Final, section 15.10.2.6 (i.e. those matching /[0-9A-Za-z_]/), you are
    right.

    However, for example "Menü" is a word in German, and

    var reA = /\bmen\b/i;

    will (only) match the "Men" in "Menü" there, because `ü' is not considered
    a word character per the Specification, and so the empty word ε between "n"
    and "ü" constitutes a word boundary matched by /\b/ (as e.g.

    "Menü".match(/\bmen\b/i)

    shows).
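
The behaviour described above can be checked with a minimal sketch along these lines (the variable names are illustrative and not part of the post):

```javascript
// \b treats only [0-9A-Za-z_] as word characters, so the position
// between "n" and "ü" counts as a word boundary.
const menMatch = "Menü".match(/\bmen\b/i);

// match() returns null when nothing matches, so guard before indexing.
const menText = menMatch ? menMatch[0] : null;

console.log(menText); // prints Men
```

The match succeeds even though "Men" is only a fragment of the German word, which is the mismatch the post is pointing at.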

    So for matching Unicode words in strings, you have to use

    var reA = /(^|\s)cat(\s|$)/i;

    instead; that is, a character sequence (here: without whitespace in-between)
    bounded by whitespace, or one or two input boundaries.


    PointedEars
    --
    Anyone who slaps a 'this page is best viewed with Browser X' label on
    a Web page appears to be yearning for the bad old days, before the Web,
    when you had very little chance of reading a document written on another
    computer, another word processor, or another network. -- Tim Berners-Lee

    • RobG

      #3
      Re: Regexp: Case-insensitive matching | N factorial

      On Jun 26, 4:17 pm, Thomas 'PointedEars' Lahn <PointedE...@web.de>
      wrote:
      >RobG wrote:
      >>If you want to match the word cat exactly, then:
      >>
      >> var reA = /\bcat\b/i;
      >
      >That depends on how you define a word. If you define a word as a sequence
      >of word characters as specified in the ECMAScript Language Specification,
      >Ed. 3 Final, section 15.10.2.6 (i.e. those matching /[0-9A-Za-z_]/), you are
      >right.
      >
      >However, for example "Menü" is a word in German, and
      >
      > var reA = /\bmen\b/i;
      >
      >will (only) match the "Men" in "Menü" there. Because `ü' is not considered
      >a word character per the Specification,

      Hence I included the sentence "Also, the regular expression's idea of
      a word boundary might be different to what you expect."

      >and so the empty word ε between "n"
      >and "ü" constitutes a word boundary matched by /\b/ (as e.g.
      >
      > "Menü".match(/\bmen\b/i)
      >
      >shows).
      >
      >So for matching Unicode words in strings, you have to use
      >
      > var reA = /(^|\s)cat(\s|$)/i;

      That expression is commonly used for matching values in the HTML class
      attribute where the separator is specified as being whitespace. It is
      not sufficient for matching words in general where they may be
      followed by punctuation marks such as commas, semi-colons, colons,
      dashes, periods and so on.
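
A minimal sketch of that limitation, using the whitespace-delimited pattern quoted above (the sample strings and variable names are made up for illustration):

```javascript
// The pattern from the quoted post: "cat" delimited by whitespace
// or by the start/end of the input.
const reSpaceDelimited = /(^|\s)cat(\s|$)/i;

// Whitespace on both sides: matches.
const plainResult = reSpaceDelimited.test("my cat sleeps");

// A comma directly after the word is neither whitespace nor end of input.
const commaResult = reSpaceDelimited.test("my cat, asleep");

// A class-attribute-style list of space-separated tokens: matches.
const classResult = reSpaceDelimited.test("dog cat bird");

console.log(plainResult, commaResult, classResult); // prints true false true
```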


      --
      Rob

      • Thomas 'PointedEars' Lahn

        #4
        Re: Regexp: Case-insensitive matching | N factorial

        RobG wrote:
        >Thomas 'PointedEars' Lahn wrote:
        >>RobG wrote:
        >>>If you want to match the word cat exactly, then:
        >>> var reA = /\bcat\b/i;
        >>That depends on how you define a word. If you define a word as a sequence
        >>of word characters as specified in the ECMAScript Language Specification,
        >>Ed. 3 Final, section 15.10.2.6 (i.e. those matching /[0-9A-Za-z_]/), you are
        >>right.
        >>
        >>However, for example "Menü" is a word in German, and
        >>
        >> var reA = /\bmen\b/i;
        >>
        >>will (only) match the "Men" in "Menü" there. Because `ü' is not considered
        >>a word character per the Specification,
        >
        >Hence I included the sentence "Also, the regular expression's idea of
        >a word boundary might be different to what you expect."

        It was easy to overlook and provides no explanation as to what should be
        expected instead.

        >>and so the empty word ε between "n"
        >>and "ü" constitutes a word boundary matched by /\b/ (as e.g.
        >>
        >> "Menü".match(/\bmen\b/i)
        >>
        >>shows).
        >>
        >>So for matching Unicode words in strings, you have to use
        >>
        >> var reA = /(^|\s)cat(\s|$)/i;
        >
        >That expression is commonly used for matching values in the HTML class
        >attribute where the separator is specified as being whitespace. It is
        >not sufficient for matching words in general where they may be
        >followed by punctuation marks such as commas, semi-colons, colons,
        >dashes, periods and so on.

        Good point. However, a character class can take care of that. For example,
        in Unicode text that uses only ASCII and Latin-1 punctuation:

        var reA = /(^|[\s,;:.-])cat([\s,;:.-]|$)/i;

        But whether a punctuation mark really delimits a word is a matter of
        language, interpretation, and personal taste. For example, the HYPHEN-MINUS
        character ("-") may have been used as a hyphen in compounds.

        An alternative would be to use the \w escape sequence to build your own
        character class:

        var reA = /(^|[^\wäöü])cat([^\wäöü]|$)/i;
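
To see how such a custom class differs from \b, here is a minimal sketch that swaps "men" in for "cat" so that the earlier "Menü" example applies. That substitution, the sample strings and the variable names are assumptions made for illustration, not part of the post:

```javascript
// ASCII-only word boundary, as in the earlier example.
const reMenAscii = /\bmen\b/i;

// Hypothetical variant of the pattern above, with "men" in place of "cat":
// a delimiter is anything that is neither \w nor one of ä, ö, ü.
const reMenCustom = /(^|[^\wäöü])men([^\wäöü]|$)/i;

// \b sees a boundary before "ü", so this reports a match.
const asciiHit = reMenAscii.test("Menü");

// "ü" is excluded from the delimiter class, so "Men" is not taken as a word.
const customHit = reMenCustom.test("Menü");

// A comma still delimits the word.
const punctHit = reMenCustom.test("Men, women");

console.log(asciiHit, customHit, punctHit); // prints true false true
```

The class only covers the letters listed in it; any other non-ASCII letter would need to be added by hand.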


        PointedEars

        • Dr J R Stockton

          #5
          Re: Regexp: Case-insensitive matching | N factorial

          In comp.lang.javascript message
          <6aa0c1c4-b785-4da1-9107-b681df097261@c58g2000hsc.googlegroups.com>,
          Wed, 25 Jun 2008 15:31:37, gentsquash@gmail.com posted:
          >In a setting where I can specify only a JS regular
          >expression, but not the JS code that will use it, I seek
          >a regexp component that matches a string of letters,
          >ignoring case. E.g., for "cat" I'd like the effect of
          >
          > ([Cc][Aa][Tt])
          >
          >but without having to have many occurrences of [Xx].

          If all else fails, read the manual. There are links in
          <URL:http://www.merlyn.demon.co.uk/js-valid.htm>.


          Note that the average intellectual level of those who post with @gmail
          addresses is so low that readers may kill-file it /in toto/.

          >Secondly, what is an efficient regexp that matches a
          >string exactly when ALL words in a certain list occur in
          >the string? I'd like the effect of
          >
          > (cat.*nip|nip.*cat)
          >
          >except that there are N words rather than just the two
          >words "cat" and "nip". (I can assume that no word in the
          >list is a prefix of any other.) Naturally, I'm looking for
          >a regexp-solution that does not involve listing all
          >N factorial
          >many orderings.

          I doubt whether one exists to do a direct match, at least if it is to be
          compatible with any user agent that knows RegExps.
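
One candidate for a direct match, not proposed anywhere in this thread, is a sequence of lookahead assertions, one per word. Lookaheads are part of ECMAScript Ed. 3, but how well the user agents of the time supported them is not something the thread settles, so the following is only a sketch under that assumption (the names and sample strings are illustrative):

```javascript
// Each (?=...) asserts, starting from the beginning of the input, that the
// word occurs somewhere later, without consuming any characters. The
// pattern therefore grows linearly with the number of words.
// [\s\S] is used instead of "." so that line breaks are skipped too.
const reAllWords = /^(?=[\s\S]*cat)(?=[\s\S]*nip)/i;

const bothPresent = reAllWords.test("Nip the CAT"); // order does not matter
const oneMissing = reAllWords.test("the cat sat");  // "nip" is absent

console.log(bothPresent, oneMissing); // prints true false
```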

          But one could use S2 = S1.replace(/cat|nip/gi, "") and see whether the
          difference of the lengths matches the total length of the strings,
          provided that no string can occur more than once and matchable strings
          cannot overlap.
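
The length-difference idea might be sketched as follows. The helper name containsAllWords is hypothetical, and the sketch additionally assumes that the words contain no regexp metacharacters:

```javascript
// Hypothetical helper. It relies on the provisos stated above: each word
// occurs at most once in the text and matches cannot overlap. It also
// assumes the words contain no regexp metacharacters.
function containsAllWords(text, words) {
  // Build /cat|nip/gi from the word list.
  const pattern = new RegExp(words.join("|"), "gi");

  // Remove every occurrence of any listed word.
  const stripped = text.replace(pattern, "");

  // If every word was present exactly once, the text shrinks by the
  // combined length of all the words.
  const expectedRemoved = words.join("").length;
  return text.length - stripped.length === expectedRemoved;
}

const allFound = containsAllWords("Nip the CAT", ["cat", "nip"]);   // 11 - 5 === 6
const notAllFound = containsAllWords("the cat sat", ["cat", "nip"]); // 11 - 8 !== 6

console.log(allFound, notAllFound); // prints true false
```

If a word may repeat, the check can be fooled: "cat cat" also shrinks by six characters, which is why the proviso matters.
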
          >--Jonathan LF King, Mathematics dept, Univ. of Florida

          DSS.

          --
          (c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
          Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
          Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
          Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)

          • RobG

            #6
            Re: Regexp: Case-insensitive matching | N factorial

            On Jun 26, 10:52 pm, Dr J R Stockton <j...@merlyn.demon.co.uk> wrote:
            [...]
            >Note that the average intellectual level of those who post with @gmail
            >addresses is so low that readers may kill-file it /in toto/.

            Bad day? My Google Groups profile has a non-gmail address that is
            easily discovered by those who care to do so.

            <URL: http://www.prejudicenoway.com.au/activities/2156.html >


            --
            Rob
