Regexp issue . . .

**Eric J. Roode** · Jul 19 '05, 04:49 AM

Re: Regexp issue . . .

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

"MichaelC" <mickyc@NOshaSPAMw.ca> wrote in
news:d9Dwb.492453$9l5.241927@pd7tw2no:

> Hi all. I am having a particularly difficult time with a perl script
> that I am writing. The problem area is a place where I need to strip
> some newlines out of a file.
>
> My source data is text which is in paragraph form, but has line breaks
> within the paragraphs. I need to do as much processing as possible in
> order to minimise the amount of manual changes that I have to make.

You don't say what you mean by "paragraph form". If you're using that
term in the usual sense, then you mean that the paragraphs have double
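newlines between them (that is, a blank line separates them). Is that so?
If so, Perl can read a paragraph at a time for you when the input record
separator $/ is set to the empty string: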

$/ = '';
$paragraph = <>;
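
For reference, here is a minimal sketch of paragraph mode, assuming the input really does separate paragraphs with blank lines. The sample text and the helper name `read_paragraphs` are made up for illustration, and the in-memory filehandle only stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the paragraphs of a string, using Perl's
# paragraph mode ($/ = '') on an in-memory filehandle.
sub read_paragraphs {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";

    local $/ = '';            # paragraph mode: records end at blank lines
    my @paragraphs = <$fh>;   # one list element per paragraph
    close($fh);

    chomp @paragraphs;        # in paragraph mode, chomp strips all trailing newlines
    return @paragraphs;
}

my @paras = read_paragraphs("First line\nsecond line\n\nNext paragraph\n");
print scalar(@paras), " paragraphs read\n";
```

Using `local` keeps the change to `$/` confined to the helper, so the rest of a program still reads line by line.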

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBP8NO2mPeouIeTNHoEQKl7wCgwhaYGGLKl2VuQu4P7cXtQv9C8ZQAn0K0
9YlaoVGjDaBonogRTFfOnn5h
=h9Av
-----END PGP SIGNATURE-----

**MichaelC** · Jul 19 '05, 04:49 AM

Re: Regexp issue . . .

"Eric J. Roode" <REMOVEsdnCAPS@comcast.net> wrote in message
news:Xns943E4EE1E1E8Dsdn.comcast@216.196.97.136...
> "MichaelC" <mickyc@NOshaSPAMw.ca> wrote in
> news:d9Dwb.492453$9l5.241927@pd7tw2no:
>
> > Hi all. I am having a particularly difficult time with a perl script
> > that I am writing. The problem area is a place where I need to strip
> > some newlines out of a file.
> >
> > My source data is text which is in paragraph form, but has line breaks
> > within the paragraphs. I need to do as much processing as possible in
> > order to minimise the amount of manual changes that I have to make.
>
> You don't say what you mean by "paragraph form". If you're using that
> term in the usual sense, then you mean that the paragraphs have double
> newlines between them. Is that so? If so, Perl can read paragraph-at-a-
> time for you:
>
> $/ = '';
> $paragraph = <>;
>

Sorry, I thought that I had defined my problem in
enough detail. My problem is that the text that I am
processing does NOT have double line breaks
between paragraphs, and the text has been presented
wrapped to 72 character width. I do not have access
to the original, as it was lost. That is the reason for
my current problem.
That said, statistically, in the text that I am processing,
the vast majority of lines that start with the set [A-Z"]
will start a new paragraph. The converse is also true,
in that lines that start [a-z,.!?] are definitely part of a
logical paragraph. In that sense, I am not using the
term "paragraph" in the way that you normally assume.

As a concrete example, the explanation above is a reasonable simulation of
the problem that I am facing. Logically, the manually broken text is two
paragraphs with no extra line breaks between them. I neither require nor
desire double line breaks between paragraphs. What I do need, though, is
each paragraph on a single line with a single line break at the end, and
ONLY there.

For example, I need to strip all but two line breaks out of the example that
I have provided, so that the text is contiguous from "Sorry, I" to "current
problem." and from "That said, " to "normally assume." After some thought,
I found a solution:

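#!/usr/bin/perl
use strict;
use warnings;

open(my $in,  '<', 'in.txt')  or die "Cannot open in.txt: $!";
open(my $out, '>', 'out.txt') or die "Cannot open out.txt: $!";

while ( my $x = <$in> ) {

    # A line starting with an uppercase letter or a double quote begins a
    # new paragraph, so end the previous one first. The $. > 1 test avoids
    # a blank line at the very top of the output.
    if ( $. > 1 && $x =~ m!^[A-Z"]! ) { print {$out} "\n"; }

    # Replace the trailing newline of every non-empty line with a space.
    $x =~ s!^(.+)\n!$1 !;

    print {$out} $x;
}

# The loop never terminates the last paragraph, so do it here.
print {$out} "\n";

close($in);
close($out) or die "Cannot close out.txt: $!";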
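
The same line-by-line rule can also be tried without any files. This is only a sketch under the assumptions stated in this post: a line starting with `[A-Z"]` opens a new paragraph and everything else continues the current one. The helper name `join_wrapped_lines` and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: join hard-wrapped lines into one string per paragraph.
# A line starting with an uppercase letter or a double quote is assumed to
# start a new paragraph; any other line continues the current one.
sub join_wrapped_lines {
    my @lines = @_;
    my @paragraphs;
    for my $line (@lines) {
        chomp $line;
        if ( !@paragraphs || $line =~ /^[A-Z"]/ ) {
            push @paragraphs, $line;        # start a new paragraph
        }
        else {
            $paragraphs[-1] .= " $line";    # continue the current paragraph
        }
    }
    return @paragraphs;
}

my @sample = (
    "Sorry, I thought that I had defined my problem in\n",
    "enough detail. My problem is that the text that I am\n",
    "processing does NOT have double line breaks\n",
    "That said, statistically, in the text that I am processing,\n",
    "the vast majority of lines start a new paragraph.\n",
);
print "$_\n" for join_wrapped_lines(@sample);
```

Because the paragraphs are collected in a list, nothing is printed until the rule has been applied, which makes it easy to inspect the lines the heuristic gets wrong before writing the output file.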

Thanks,

Michael

**Eric J. Roode** · Jul 19 '05, 04:49 AM

Re: Regexp issue . . .

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

"MichaelC" <mickyc@NOshaSPAMw.ca> wrote in
news:s5Vwb.496786$pl3.155625@pd7tw3no:

> Sorry, I thought that I had defined my problem in
> enough detail.

I would say not. :-)

> My problem is that the text that I am
> processing does NOT have double line breaks
> between paragraphs, and the text has been presented
> wrapped to 72 character width. I do not have access
> to the original, as it was lost. That is the reason for
> my current problem.
> That said, statistically, in the text that I am processing,
> the vast majority of lines that start with the set [A-Z"]
> will start a new paragraph. The converse is also true,
> in that lines that start [a-z,.!?] are definitely part of a
> logical paragraph. In that sense, I am not using the
> term "paragraph" in the way that you normally assume.

It sounds like you want to remove all newlines, except where the newline
is followed by an uppercase character. Is that correct?

If so, I'd suggest reading the entire file into memory, and doing a
simple substitution on it:

$/ = undef;
$content = <FILE>;
$content =~ s/\n(?![[:upper:]])//g;
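
One thing to watch with the substitution as written: it deletes each newline outright, so the last word of one line runs into the first word of the next, and the final newline of the file goes too. A variation, sketched here under the assumption that the `[A-Z"]` rule from the earlier post is what's wanted, replaces the newline with a space and leaves the last one alone. The helper name `unwrap_text` and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: unwrap hard-wrapped text held in a single string.
sub unwrap_text {
    my ($content) = @_;

    # Replace a newline with a space unless it is followed by an uppercase
    # letter, a double quote, or the end of the text.
    $content =~ s/\n(?![[:upper:]"]|\z)/ /g;

    return $content;
}

my $wrapped = "Sorry, I thought\nthat I had defined\nmy problem.\n"
            . "That said, the\ntext is wrapped.\n";
print unwrap_text($wrapped);
```

The negative lookahead only inspects the character after the newline and never consumes it, so the first character of each paragraph is left untouched.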

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBP8SeSmPeouIeTNHoEQKoVQCfdSokT7bnrjmUOkqt4NVFOnp9A48An3t1
xj9Z1HMNOPOnq8PJ6NJF1KvR
=1T1p
-----END PGP SIGNATURE-----
