Need help with a Regular Expression

  • sangith
    New Member
    • Jun 2007
    • 25

    Need help with a Regular Expression

    Hi,
    I am trying to understand a concept in Perl regular expressions: how do I write a regex so that the * metacharacter is not greedy?

    Here is my code:-
    [CODE=perl]
    #!/usr/bin/perl
    use strict;
    my $sentence = "Perl is a dynamic programming language created by Larry Wall and first released in 1987,
    Perl is based on the brace-delimited block style of AWK and C,
    and was widely adopted for its strengths in text processing
    and lack of the arbitrary limitations
    of many scripting languages at the time.";

    my $b;
    if ($sentence =~ /and(.*)\./s)
    {
    $b = $1;
    print "The following is the output:-\n";
    print "$b\n";
    }
    [/CODE]
    #Output:-
    The following is the output:-
    first released in 1987,
    Perl is based on the brace-delimited block style of AWK and C,
    and was widely adopted for its strengths in text processing
    and lack of the arbitrary limitations
    of many scripting languages at the time

    The * operator is greedy, which is why I get that output.
    I want the output to run only from the last occurrence of "and" up to the ".", like the following:-

    lack of the arbitrary limitations
    of many scripting languages at the time

    So how do I achieve that? I tried using the repetition quantifier {} after "and", but that does not work either.
    I would appreciate it if you could help me with this.

    Thanks in advance,
    Sangith
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    Regular expressions are probably one of the more complicated things about Perl (and many other languages) that the casual Perl coder will have to learn. A significant thing to note is that a regular expression matches as early in the string as it can. The word "and" occurs several times in your string, and Perl matches at the first occurrence, just after Larry Wall: "Larry Wall and". Making the * non-greedy would not change that; it only affects where the match ends.
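
    A minimal sketch of that leftmost-match behaviour, using a shortened string made up for this illustration (not the original sentence). The variable names are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened sample string, invented for this illustration.
my $text = "one and two and three. four.";

# Greedy: the match starts at the FIRST "and" and runs to the LAST period.
my ($greedy) = $text =~ /and(.*)\./s;

# Non-greedy: the match still starts at the FIRST "and",
# but now stops at the first period it can reach.
my ($lazy) = $text =~ /and(.*?)\./s;

print "greedy: [$greedy]\n";   # greedy: [ two and three. four]
print "lazy:   [$lazy]\n";     # lazy:   [ two and three]
```

    In both cases the capture begins right after the first "and"; the quantifier only changes where it ends.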

    In order to match the last occurrence you actually want to use greedy matching:

    /.*and (.*)\./

    The first '.*' will match as much as it can and then back off to the last occurrence of "and " (and-space). So you have to learn how to take advantage of greedy matching, and when to use it and when not to. Your problem is further complicated because the string spans multiple lines (at least it looks that way in your post). By default "." does not match a newline, so you add the "s" modifier at the end of the regexp. This tells Perl to let "." match newlines as well, so the pattern treats the string as one long line.
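
    As a rough illustration of what the "s" modifier changes, here is a small sketch on a two-line string invented for the purpose (the variable names are hypothetical). Without "s", the "." cannot cross the newline, so the pattern finds no period to finish on:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two-line sample string, invented for this illustration.
my $text = "first line and more\nsecond line.";

# Without /s, "." stops at the newline, so no period is reachable.
my $without = ($text =~ /and (.*)\./)  ? $1 : undef;

# With /s, "." also matches the newline, so the capture spans both lines.
my $with    = ($text =~ /and (.*)\./s) ? $1 : undef;

print defined $without ? "without /s: [$without]\n" : "without /s: no match\n";
print defined $with    ? "with /s: [$with]\n"       : "with /s: no match\n";
```

    The first line printed reports no match; the second shows a capture that contains the embedded newline.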

    This is one way it could be done:

    [CODE=perl]#!/usr/bin/perl
    use strict;
    my $sentence = "Perl is a dynamic programming language created by Larry Wall and first released in 1987,
    Perl is based on the brace-delimited block style of AWK and C,
    and was widely adopted for its strengths in text processing
    and lack of the arbitrary limitations
    of many scripting languages at the time.";
    my $r;
    if ($sentence =~ /.*and (.*)\./s)
    {
    $r = $1;
    print "The following is the output:-\n";
    print "$r\n";
    }[/CODE]

    This is a bit contrived to fit the string you posted. The pattern you want to match appears to start at the beginning of a line within the string. If you did not know where the pattern started in the string, you would probably have to use a different search pattern, for example one using the \b word-boundary anchor, to avoid substring matches like "land" or "sand".
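
    One possible way to guard against those substring matches is the \b word-boundary anchor. The sketch below uses a sample string made up for this illustration, with hypothetical variable names; it is only one option, not the only way to do it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample string, invented for this illustration: "and " also ends "land ".
my $text = "sea and sand and dry land now.";

# Plain pattern: the last "and " is the tail of "land ", so that is used.
my ($plain) = $text =~ /.*and (.*)\./s;

# With \b on both sides, only the standalone word "and" can match.
my ($bounded) = $text =~ /.*\band\b (.*)\./s;

print "plain:   [$plain]\n";     # plain:   [now]
print "bounded: [$bounded]\n";   # bounded: [dry land now]
```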

    Here is a link that might help you:



    Take it a little at a time if it's confusing.


    • sangith
      New Member
      • Jun 2007
      • 25

      #3
      Hi Kevin,
      Thank you so much for your help! Your approach works just great!
      I am using this Perl code to parse my text file. The string that I am searching for in the file is fixed and will not occur as part of any other string, so this approach is the best one for me.

      Thanks again,
      Sangith

