Need help with a Regular Expression

**KevinADC** · Jan 10 '08, 10:57 PM

Originally posted by sangith

Hi,
I am trying to understand a regex concept in Perl: how do I write a regex so that the * metacharacter is not greedy?

Here is my code:-
[CODE=perl]
#!/usr/bin/perl
use strict;
my $sentence = "Perl is a dynamic programming language created by Larry Wall and first released in 1987,
Perl is based on the brace-delimited block style of AWK and C,
and was widely adopted for its strengths in text processing
and lack of the arbitrary limitations
of many scripting languages at the time.";

my $b;
if ($sentence =~ /and(.*)\./s)
{
$b = $1;
print "The following is the output:-\n";
print "$b\n";
}
[/CODE]
#Output:-
The following is the output:-
first released in 1987,
Perl is based on the brace-delimited block style of AWK and C,
and was widely adopted for its strengths in text processing
and lack of the arbitrary limitations
of many scripting languages at the time

The * operator is very greedy, so I get the output above.
I want the output to run only from the last occurrence of "and" up to the ".", like the following:-

lack of the arbitrary limitations
of many scripting languages at the time

So how do I achieve that? I tried using the repetition modifier {} after "and", but that does not work either.
I would appreciate it if you could help me with this.

Thanks in advance,
Sangith

Regular expressions are probably one of the more complicated things about Perl (and many other languages) that the casual Perl coder will have to learn. A significant thing to note is that the regex engine tries to match a pattern as early in the string as it can. The word "and" occurs several times in the string, so Perl matches the first occurrence, just after Larry Wall: "Larry Wall and". Making the * non-greedy would not change that starting point; it only changes where the match ends.
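To make the "as early as it can" point concrete, here is a small sketch. The sample string is made up for illustration (it is not the paragraph from the post), and the variable names are just placeholders. It compares the greedy `.*` with the non-greedy `.*?`:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, shortened for illustration.
my $text = "cats and dogs and birds. The end.";

# Greedy: starts at the FIRST "and" and runs to the LAST ".".
my ($greedy) = $text =~ /and(.*)\./;

# Non-greedy: still starts at the FIRST "and";
# only the right-hand end of the capture moves (to the first ".").
my ($lazy) = $text =~ /and(.*?)\./;

print "greedy: [$greedy]\n";   # greedy: [ dogs and birds. The end]
print "lazy:   [$lazy]\n";     # lazy:   [ dogs and birds]
```

Both captures begin right after the first "and", which is why a non-greedy quantifier alone does not get you to the last one.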

In order to match the last occurrence you actually want to use greedy matching:

/.*and (.*)\./

The first '.*' is greedy: it grabs as much of the string as it can and backs off only as far as the last occurrence of "and " (and-space). So you have to learn how to take advantage of greedy matching, and when to use it and when not to. Your problem is further complicated because the string spans multiple lines (at least it looks that way in your post). By default '.' does not match a newline, so '.*' stops at the end of a line. The "s" modifier at the end of the regexp changes that: it lets '.' match newlines as well, so the pattern can treat the string as one long line.
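A minimal sketch of what the "s" modifier changes, using a made-up three-line string (the text and variable names are assumptions for illustration only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical multi-line sample; the only "." is on the last line.
my $text = "one and two\nand three and four\nfive.";

# Without /s, "." cannot cross a newline, so no "and " is on the
# same line as the final "." and the match fails.
my ($no_s) = $text =~ /.*and (.*)\./;

# With /s, "." also matches newlines, so the leading greedy .* can
# run across lines and back off to the last "and ".
my ($with_s) = $text =~ /.*and (.*)\./s;

print defined $no_s ? "without /s: [$no_s]\n" : "without /s: no match\n";
print "with /s:    [$with_s]\n";
```

With this input the first attempt reports no match, and the second captures "four", a newline, and "five".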

This is one way it could be done:

[CODE=perl]#!/usr/bin/perl
use strict;
my $sentence = "Perl is a dynamic programming language created by Larry Wall and first released in 1987,
Perl is based on the brace-delimited block style of AWK and C,
and was widely adopted for its strengths in text processing
and lack of the arbitrary limitations
of many scripting languages at the time.";
my $r;
if ($sentence =~ /.*and (.*)\./s)
{
$r = $1;
print "The following is the output:-\n";
print "$r\n";
}[/CODE]

This is a bit contrived to fit the string you posted. The pattern you want to match appears to start at the beginning of a line within the string. But if you did not know where the pattern started in the string, you would probably have to use a different search pattern to avoid substring matches like "land" or "sand".
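One possible way to guard against those substring matches is the word-boundary anchor `\b`. This is only a sketch under assumptions: the sample sentence is invented, and `\s+` after the word is my choice so that a newline after "and" would also be accepted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample where "sand" comes after the real word "and".
my $text = "We sat in the shade and then walked on the sand dunes.";

# Plain "and " also matches the tail of "sand ".
my ($loose) = $text =~ /.*and (.*)\./s;

# \band\b only matches "and" as a whole word.
my ($bounded) = $text =~ /.*\band\b\s+(.*)\./s;

print "loose:   [$loose]\n";     # loose:   [dunes]
print "bounded: [$bounded]\n";   # bounded: [then walked on the sand dunes]
```

If "and" never appears inside another word in your data, the simpler pattern is enough.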

Here is a link that might help you:

perlretut - Perl regular expressions tutorial - Perldoc Browser

http://perldoc.perl.org/perlretut.html

Take it a little at a time if it's confusing.

**sangith** · Jan 10 '08, 11:40 PM

Hi Kevin,
Thank you so much for your help! Your approach works just great!
I am using this perl code for parsing my text file. The string that I am searching for in the file is a fixed one and will not occur as a part of any other string, so this approach is the best one for me.

Thanks again,
Sangith

