Regular expressions (multiple match problem)

  • mikko.n

    Regular expressions (multiple match problem)

    I have recently been experimenting with the GNU C library regular
    expression functions and noticed a problem with pattern matching. It
    seems to recognize only the first match while ignoring the rest of them.
    An example:

    mikko.c:
    -----

    #include <stdio.h>
    #include <regex.h>
    #include <sys/types.h>

    int main(int argc, char *argv[]) {
        regex_t p;
        regmatch_t pm[2];
        regcomp(&p, "k", 0);
        regexec(&p, "mikko", 2, pm, 0);
        printf("start=%d end=%d\n", pm[0].rm_so, pm[0].rm_eo);
        printf("start=%d end=%d\n", pm[1].rm_so, pm[1].rm_eo);
        regfree(&p);
        return 0;
    }

    -----

    This is intended to match the regular expression 'k' against the string
    'mikko' and return the start and end of the first two matches in the
    array pm of regmatch_t values. The output is, however:

    $ ./mikko
    start=2 end=3
    start=-1 end=-1

    instead of the expected

    start=2 end=3
    start=3 end=4

    Is this a bug in the GNU library or have I overlooked something? I have
    not found any examples on the Internet of multiple subexpression
    matching with <regex.h> either.
    With more complicated regular expressions it usually seems to return
    only the first match, as here, and with wildcards the largest match,
    but in either case only one of them.

    Thanks,

    Mikko Nummelin
  • Walter Roberson

    #2
    Re: Regular expressions (multiple match problem)

    In article <baa852e0-4b27-4dcd-bf1a-4c091e14875a@x41g2000hsb.googlegroups.com>,
    mikko.n <mnummeli@gmail.com> wrote:
    >I have recently been experimenting with GNU C library regular
    >expression functions and noticed a problem with pattern matching.
    Then you should ask in a GNU newsgroup. Regular expressions are
    not part of the C standard, so the proper usage of
    any particular regular expression library should be discussed
    in the appropriate forum for that library.
    --
    "They called it golf because all the other four letter words
    were taken." -- Walter Hagen


    • Antoninus Twink

      #3
      Re: Regular expressions (multiple match problem)

      On 2 Apr 2008 at 6:20, mikko.n wrote:
      > I have recently been experimenting with GNU C library regular
      > expression functions and noticed a problem with pattern matching. It
      > seems to recognize only the first match but ignoring the rest of them.
      <snip>

      The problem is that you misunderstand what a match is.

      If the regex matches, then pm[0] contains the offsets of the (first)
      match for the whole regex. But pm[1], ... don't contain the offsets of
      subsequent matches of the whole regex; they contain the offsets of
      any parenthesized subexpressions that matched (within the match
      recorded in pm[0]).

      For example, try:

      #include <stdio.h>
      #include <regex.h>
      #include <sys/types.h>

      int main(void)
      {
          regex_t p;
          regmatch_t pm[2];
          /* Basic regex syntax: \( and \) delimit a subexpression. */
          regcomp(&p, "k\\(.\\)", 0);
          regexec(&p, "mikko", 2, pm, 0);
          /* pm[0] is the whole match, pm[1] the first subexpression. */
          printf("start=%d end=%d\n", pm[0].rm_so, pm[0].rm_eo);
          printf("start=%d end=%d\n", pm[1].rm_so, pm[1].rm_eo);
          regfree(&p);
          return 0;
      }


      $ ./a
      start=2 end=4
      start=3 end=4


      • mikko.n

        #4
        Re: Regular expressions (multiple match problem)

        On 2 Apr, 11:01, Antoninus Twink <nos...@nospam.invalid> wrote:
        > The problem is that you misunderstand what a match is.
        >
        > If the regex matches, then pm[0] contains the offsets of the (first)
        > match for the whole regex. But pm[1], ... don't contain the offsets of
        > subsequent matches of the whole regex, but rather contain the offsets of
        > any parenthesized subexpressions that matched (in the match recorded in
        > pm[0]).
        <snip>

        Is there then a simple alternative which would work so that it returns
        all the matches of the original regexp in the text?

        Mikko Nummelin


        • Flash Gordon

          #5
          Re: Regular expressions (multiple match problem)

          mikko.n wrote, On 02/04/08 09:37:
          > On 2 Apr, 11:01, Antoninus Twink <nos...@nospam.invalid> wrote:
          >> On 2 Apr 2008 at 6:20, mikko.n wrote:
          <snip>
          > Is there then a simple alternative which would work so that it returns
          > all the matches of the original regexp in the text?

          As Walter suggested, ask in a GNU group or mailing list where your
          question would be topical (there is one specifically for regexp) instead
          of comp.lang.c where it is not.

          I note that this time you have added a cross post to
          comp.unix.programmer where your question might be topical, but why
          continue posting where it is not?
          --
          Flash Gordon


          • Antoninus Twink

            #6
            Re: Regular expressions (multiple match problem)

            On 2 Apr 2008 at 8:37, mikko.n wrote:
            > Is there then a simple alternative which would work so that it returns
            > all the matches of the original regexp in the text?

            Just use a loop, like this:


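            #include <stdio.h>
            #include <regex.h>
            #include <sys/types.h>

            int main(void)
            {
                regex_t p;
                regmatch_t pm;
                const char *s = "mikko mikko";
                regoff_t last_match = 0;  /* offset in s where the next search starts */

                regcomp(&p, "k", 0);
                /* Each call searches the rest of the string. The offsets in pm are
                   relative to s + last_match, so add last_match back when printing. */
                while (regexec(&p, s + last_match, 1, &pm, 0) == 0) {
                    printf("start=%d end=%d\n",
                           (int)(pm.rm_so + last_match), (int)(pm.rm_eo + last_match));
                    /* Resume one character past the start of this match. */
                    last_match += pm.rm_so + 1;
                }
                regfree(&p);
                return 0;
            }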
