Perl script to mimic uniq

  • Martin Foster

    Perl script to mimic uniq

    Hi.

    I would like to be able to mimic the unix tool 'uniq' within a Perl script.

    I have a file with entries that look like this

    4 10 21 37 58 83 111 145 184 226
    4 12 24 42 64 92 124 162 204 252
    4 11 23 44 67 95 134 168 215 271
    ..
    ..
    ..

    There are many number sequences. I would like to analyze the file to find out how often
    each sequence occurs throughout the file.

    I've begun writing a script:

    #!/usr/bin/perl
    # Perl script to find most common CS
    use strict;

    my @line;
    my $infile = "/home/martin/DATABASE/large.txt";
    open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    my @array = <INFILE>;
    my $no_lines = $#array;
    print "There are ", $no_lines+1, " lines in the large array\n";

    my (@table);
    foreach my $array (@array) {
    push(@table, [split(/\s/, $array) ]);
    }

    my $no_cells = $#{$table[$no_lines]};

    for (my $k =0; $k<=$no_lines; $k++) {
    print "[$k] occurs ";
    my $match=0;
    my $matched=0;
    for (my $h =0; $h<=$no_lines; $h++) {
    for (my $j =3; $j<=12; $j++ ) {
    if ($table[$k][$j] == $table[$h][$j]){
    $match++;
    }
    }
    if ($match==10) {
    $matched++;
    }
    }
    print "$matched times\n";
    } # end of large loop

    Does anyone know a better, quicker method of doing this?

    Many thanks in advance for any suggestions.
  • nobull@mail.com

    #2
    Re: Perl script to mimic uniq

    mdfoster44@netscape.net (Martin Foster) wrote in message news:<6a20f90a.0401291652.5fae2f4a@posting.google.com>...
    > I would like to be able to mimic the unix tool 'uniq' within a Perl script.

    There are Perl implementations of the Unix tools "out there". (Doing a
    web search to find them is left as an exercise for the reader).
    [color=blue]
    > I have a file with entries that look like this
    >
    > 4 10 21 37 58 83 111 145 184 226
    > 4 12 24 42 64 92 124 162 204 252
    > 4 11 23 44 67 95 134 168 215 271
    > .
    > .
    > .
    >
    > Many number sequences, I would like to analyze the file to tell me how often a
    > sequence occurs throughout the file.[/color]

    That is not what Unix uniq does. 'uniq' compares adjacent lines.

    Always reduce your problems to their simplest form. The fact that the
    lines of the file happen to be sequences of numbers is not part of
    your problem's simplest form.

    I shall assume that you really want to count the number of times each
    distinct line appears in a file.

    The canonical Perl one-liner to do this is:

    perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'

    Or as a script:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %count;

    $count{$_}++ while <>;

    print "$count{$_} $_" for keys %count;
    __END__
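
    For readers wondering how the counting idiom works, here is a longer sketch of the same idea. It is only an illustration: the sample lines stand in for the input file, and the sort by descending count is an addition that the one-liner does not perform.

    ```perl
    use strict;
    use warnings;

    # Sample lines standing in for the input file (values taken from the question).
    my @lines = (
        "4 10 21 37 58 83 111 145 184 226\n",
        "4 12 24 42 64 92 124 162 204 252\n",
        "4 10 21 37 58 83 111 145 184 226\n",
    );

    my %count;
    for my $line (@lines) {
        # The line itself is the hash key. A key that has not been seen yet
        # has no value, and ++ treats that as 0 before incrementing.
        $count{$line}++;
    }

    # Most frequent lines first; ties are broken by comparing the line text.
    for my $line ( sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count ) {
        print "$count{$line} $line";
    }
    ```

    Because each distinct line is used as a hash key, the input is read only once, however many duplicates it contains.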

    > I've begun writing a script:

    Good. We don't like helping people who don't show what they've tried.
    As a reward I'll give you some general Perl programming tips!
    [color=blue]
    > #!/usr/bin/perl
    > # Perl script to find most common CS[/color]

    That comment does not describe what the script does.
    Wrong comments are worse than no comments.
    [color=blue]
    > use strict;[/color]

    Get as much help as you can: use warnings too!

    > my @line;

    You never use this variable.
    [color=blue]
    > my $infile = "/home/martin/DATABASE/large.txt";
    > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
    > my @array = <INFILE>;
    > my $no_lines = $#array;[/color]

    Variable names should reflect what's in the variable.

    There's no point having a variable that's just a copy of $#array
    since you can always just use $#array.
    [color=blue]
    > print "There are ", $no_lines+1, " lines in the large array\n";[/color]

    It would be more idiomatic to use scalar(@array) rather than $#array+1.
    [color=blue]
    > my (@table);
    > foreach my $array (@array) {
    > push(@table, [split(/\s/, $array) ]);
    > }[/color]

    For really simple for/push loops like this consider using map:

    my @table = map { [ split ] } @array;
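
    A small sketch of what that map builds, using two shortened sample lines as stand-ins for the file contents. Note one difference from the original loop: a bare split breaks $_ on runs of whitespace, whereas split(/\s/, ...) splits on every single whitespace character, so two spaces in a row would produce an empty field.

    ```perl
    use strict;
    use warnings;

    # Two shortened sample lines standing in for the file contents.
    my @array = ( "810 141-2_1_2 4 10 21\n", "811 141-2_1_6 4 12 24\n" );

    # Each element of @table is a reference to an array holding one line's fields.
    my @table = map { [ split ] } @array;

    print "line 0, field 1: $table[0][1]\n";
    print "fields in line 1: ", scalar @{ $table[1] }, "\n";
    ```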
    [color=blue]
    > my $no_cells = $#{$table[$no_lines]};[/color]

    Variable names should reflect what's in the variable.

    Anyhow you never use that variable.
    [color=blue]
    >
    > for (my $k =0; $k<=$no_lines; $k++) {[/color]

    Don't use C-style for in Perl unless you need to.

    for my $k ( 0 .. $no_lines ) {
    [color=blue]
    > print "[$k] occurs ";[/color]

    Hang on, $k is the line number (minus one) not the content of the
    line.
    I suspect there's more to your original problem than you are telling
    us.
    [color=blue]
    > my $match=0;
    > my $matched=0;
    > for (my $h =0; $h<=$no_lines; $h++) {
    > for (my $j =3; $j<=12; $j++ ) {[/color]

    Where did those 3 and 12 come from? I suspect there's more to your
    original problem than you are telling us.
    [color=blue]
    > if ($table[$k][$j] == $table[$h][$j]){
    > $match++;
    > }
    > }
    > if ($match==10) {
    > $matched++;
    > }[/color]

    Rather than counting matches and checking you have 10 it would be
    better to count mismatches and check you have 0. That way if the 12
    ever had to become 13 you wouldn't have to change 10 to 11.
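
    The mismatch-counting suggestion could look something like this. The helper name same_columns and the sample rows are made up for the illustration; the column range is passed in, so no match total has to be kept in step with it.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: true if two rows agree in every column $first .. $last.
    sub same_columns {
        my ( $row_a, $row_b, $first, $last ) = @_;
        my $mismatches = 0;
        for my $j ( $first .. $last ) {
            $mismatches++ if $row_a->[$j] != $row_b->[$j];
        }
        return $mismatches == 0;
    }

    # Column 0 is an identifier; columns 1 to 3 are the numbers to compare.
    my @x = ( 810, 4, 10, 21 );
    my @y = ( 879, 4, 10, 21 );
    my @z = ( 811, 4, 12, 24 );

    print "x and y match\n"  if same_columns( \@x, \@y, 1, 3 );
    print "x and z differ\n" unless same_columns( \@x, \@z, 1, 3 );
    ```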
    [color=blue]
    > }
    > print "$matched times\n";
    > } # end of large loop
    >
    > Does anyone know a better, quicker method of doing this?[/color]

    Doing what? You've moved the goal-posts several times.
    [color=blue]
    > Many thanks in advance for any suggestions.[/color]

    I suggest that you get clear in your mind what you are asking before
    you ask it.

    I also suggest you post to newsgroups that still exist (this one
    doesn't, see FAQ). Your post will then be seen by many more people.


    • Martin Foster

      #3
      Re: Perl script to mimic uniq

      nobull@mail.com wrote in message news:<4dafc536.0401301107.1d2f7cc9@posting.google.com>...[color=blue]
      > mdfoster44@netscape.net (Martin Foster) wrote in message news:<6a20f90a.0401291652.5fae2f4a@posting.google.com>...[color=green]
      > > I would like to be able to mimic the unix tool 'uniq' within a Perl script.[/color]
      >
      > There are Perl implementations of the Unix tools "out there". (Doing
      > web search to find them is left as an exercise for the reader).
      >[color=green]
      > > I have a file with entries that look like this
      > >
      > > 4 10 21 37 58 83 111 145 184 226
      > > 4 12 24 42 64 92 124 162 204 252
      > > 4 11 23 44 67 95 134 168 215 271
      > > .
      > > .
      > > .
      > >
      > > Many number sequences, I would like to analyze the file to tell me how often a
      > > sequence occurs throughout the file.[/color]
      >
      > That is not what Unix uniq does. 'uniq' compares adjacent lines.[/color]

      I know, I can sort lines to be adjacent and then use uniq.
      [color=blue]
      >
      > Always reduce your problems to their simplest form. The fact that the
      > lines of the file happen to be sequences of numbers is not part of
      > your problem's simplest form.
      >
      > I shall assume that you really want to count the number of times each
      > distinct line appears in a file.
      >
      > The canonical Perl one-liner to do this is:
      >
      > perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'
      >
      > Or as a script:
      >
      > #!/usr/bin/perl
      > use strict;
      > use warnings;
      >
      > my %count;
      >
      > $count{$_}++ while <>;
      >
      > print "$count{$_} $_" for keys %count;
      > __END__
      >[/color]
      This is amazing. I don't understand how it works, but it's very
      powerful.
      Can I use this script to compare the n columns of a file, not the entire
      file?
      [color=blue]
      >[color=green]
      > > I've began writing a script:[/color]
      >
      > Good. We don't like helping people who don't show what they've tried.
      > As a reward I'll give you some general Perl programming tips!
      >[color=green]
      > > #!/usr/bin/perl
      > > # Perl script to find most common CS[/color]
      >
      > That comment does not describe what the script does.
      > Wrong comments are worse than no comments.
      >[color=green]
      > > use strict;[/color]
      >
      > Get as much help as you can, use warnings too![color=green]
      > >
      > > my @line;[/color]
      >
      > You never use this variable.
      >[color=green]
      > > my $infile = "/home/martin/DATABASE/large.txt";
      > > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
      > > my @array = <INFILE>;
      > > my $no_lines = $#array;[/color]
      >
      > Variable names should reflect what's in the variable.
      >
      > There's no point having a variable that's just a copy of $#array
      > since you can always just use $#array.
      >[color=green]
      > > print "There are ", $no_lines+1, " lines in the large array\n";[/color]
      >
      > It would be more idiomatic to use scalar(@array) rather than $#array+1.
      >[color=green]
      > > my (@table);
      > > foreach my $array (@array) {
      > > push(@table, [split(/\s/, $array) ]);
      > > }[/color]
      >
      > For really simple for/push loops like this consider using map:
      >
      > my @table = map { [ split ] } @array;[/color]

      Ok. Thanks, I've not used map before, just beginning to learn.
      [color=blue]
      >[color=green]
      > > my $no_cells = $#{$table[$no_lines]};[/color]
      >
      > Variable names should reflect what's in the variable.
      >
      > Anyhow you never use that variable.
      >[color=green]
      > >
      > > for (my $k =0; $k<=$no_lines; $k++) {[/color]
      >
      > Don't use C-style for in Perl unless you need to.
      >
      > for my $k ( 0 .. $no_lines ) {
      >[color=green]
      > > print "[$k] occurs ";[/color]
      >
      > Hang on, $k is the line number (minus one) not the content of the
      > line.
      > I suspect there's more to your original problem than you are telling
      > us.
      >[color=green]
      > > my $match=0;
      > > my $matched=0;
      > > for (my $h =0; $h<=$no_lines; $h++) {
      > > for (my $j =3; $j<=12; $j++ ) {[/color]
      >
      > Where did those 3 and 12 come from. I suspect there's more to your
      > original problem than you are telling us.[/color]

      I've got an identifier for each line at the beginning, for example

      1666237 4 10 23 16 and so on. The identifier is an id to link to
      something else and so on. I just want to compare the 10 columns with
      the numbers.
      [color=blue]
      >[color=green]
      > > if ($table[$k][$j] == $table[$h][$j]){
      > > $match++;
      > > }
      > > }
      > > if ($match==10) {
      > > $matched++;
      > > }[/color]
      >
      > Rather than counting matches and checking you have 10 it would be
      > better to count mismatches and check you have 0. That way if the 12
      > ever had to become 13 you wouldn't have to change 10 to 11.
      >[color=green]
      > > }[/color]
      > print "$matched times\n";[color=green]
      > > } # end of large loop
      > >
      > > Does anyone know a better, quicker method of doing this?[/color]
      >
      > Doing what? You've moved the goal-posts several times.
      >[color=green]
      > > Many thanks in advance for any suggestions.[/color]
      >
      > I suggest that you get clear in your mind what you are asking before
      > you ask it.
      >
      > I also suggest you post to newsgroups that still exist (this one
      > doesn't, see FAQ). Your post will then be seen by many more people.[/color]
      BTW, where is the FAQ that says this newsgroup no longer exists?


      • Jürgen Exner

        #4
        Re: Perl script to mimic uniq

        Martin Foster wrote:
        > I would like to be able to mimic the unix tool 'uniq' within a Perl
        > script.

        Unfortunately the FAQ entry is worded the opposite way:
        perldoc -q duplicate:
        "How can I remove duplicate elements from a list or array?"

        jue



        • nobull@mail.com

          #5
          Re: Perl script to mimic uniq

          mdfoster44@netscape.net (Martin Foster) wrote:

          > nobull@mail.com wrote:
          >
          > > I shall assume that you really want to count the number of times each
          > > distinct line appears in a file.

          > > perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'

          > > Or as a script:

          > > $count{$_}++ while <>;

          > This is amazing, I don't understand how it works but it's very
          > powerful.

          If you look in the newsgroup that replaced this one when this one was
          deleted, you'll find every couple of months someone posts a script
          substantially like the one above and says "I found this - how does it
          work?".

          You could look at one of those threads.

          I believe it is also an example that is used in most Perl tutorials.

          > Can I use this script to compare the n columns of a file, not the entire
          > file?

          No you can't use this _script_. But you can use the technique.

          Rather than keying %count on the whole line you can use some sort of
          string manipulation to extract just part of the line to consider. The
          most normal way to manipulate strings in Perl is the m// and s///
          operators.

          > I've got an identifier for each line at the beginning, for example
          >
          > 1666237 4 10 23 16 and so on. The identifier is an id to link to
          > something else and so on. I just want to compare the 10 columns with
          > the numbers.

          Well if, for example, we say the first 3 whitespace-delimited columns
          are the identifier you could remove them thus:

          s/^(\S+\s+){3}// and $count{$_}++ while <>;
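
          Spelled out, keying the hash on only part of each line might look like the sketch below. The sample lines and the choice of two identifier columns are assumptions made for the illustration; adjust $id_columns to match the real data.

          ```perl
          use strict;
          use warnings;

          # Shortened sample lines: two identifier columns, then the numbers.
          my @lines = (
              "810 141-2_1_2 4 10 21 37\n",
              "879 141_1_2 4 10 21 37\n",
              "811 141-2_1_6 4 12 24 42\n",
          );

          my $id_columns = 2;    # assumed number of leading identifier columns
          my %count;
          for my $line (@lines) {
              # Strip the identifier columns from a copy, leaving $line untouched.
              ( my $sequence = $line ) =~ s/^(?:\S+\s+){$id_columns}//
                  or die "Line has fewer than $id_columns leading columns: $line";
              $count{$sequence}++;
          }

          print "$count{$_} $_" for sort keys %count;
          ```

          Only the remainder of the line is used as the key, so lines that differ only in their identifiers are counted together.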

          > > I also suggest you post to newsgroups that still exist (this one
          > > doesn't, see FAQ). Your post will then be seen by many more people.

          > BTW where is the FAQ, which says this newsgroup no longer exists?

          The Perl FAQ is part of the standard Perl documentation that can be
          found on any computer on which Perl has been installed and also on
          various Perl-related web sites.


          • Martin Foster

            #6
            Re: Perl script to mimic uniq

            Thanks for your help.

            My script now looks like this:


            #!/usr/bin/perl
            # Perl script to find most common CS
            use strict;
            use warnings;

            my $infile = "/home/martin/DATABASE/large.txt";
            open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
            my %count;

            do {
            $_ =~ s/^(\S+\s+){2}//;
            $count{$_}++
            } while <INFILE>;

            print "$count{$_} $_" for keys %count;
            __END__

            So I'm feeding the file into the %count hash, removing the first two
            columns with the identifier information and then counting the keys.
            How can I still keep the identifier part of the line linked to the hash?
            The identifier is the part I'm really interested in.
            I can't keep the identifier in
            the %count hash, since this would screw up the "for keys" part.

            I checked perldoc -q and found how to remove duplicates but I don't think
            I can rewrite this to do what I want.

            The "for keys" method is brilliant but I'm losing the identifier.

            So I'm back to my original script, which looks like this:

            #!/usr/bin/perl
            # Perl script to find most common CS
            use strict;
            use warnings;


            my $infile = "/home/martin/DATABASE/large.txt";
            open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
            my @array = <INFILE>;
            print "There are ", $#array+1, " lines in the large array\n";

            my (@table);
            foreach my $array (@array) {
            push(@table, [split(/\s/, $array) ]);
            }

            for (my $k =0; $k<=$#array; $k++) {
            print "$table[$k][1] $table[$k][2] occurs ";
            my $matched=0;
            for (my $h =0; $h<=$#array; $h++) {
            my $match=0;
            for (my $j =2; $j<=11; $j++ ) {
            if ($table[$k][$j] == $table[$h][$j]){
            $match++;
            }
            }
            if ($match==10) {
            $matched++;
            }
            }
            print "$matched times\n";
            } # end of large loop


            But this sad-looking script is not very smart and is very slow. I don't want to
            run over each line. I would like the script to search the file and
            identify each sequence as unique. If there are duplicate sequences
            in that file, then print out how many and do not revisit a line
            if it has been counted as a duplicate.


            My data file looks like this (a small section only):


            810 141-2_1_2 4 10 21 37 58 83 111 145 184 226
            811 141-2_1_6 4 12 24 42 64 92 124 162 204 252
            812 141-2_1_7 4 11 23 44 67 95 134 168 215 271
            879 141_1_2 4 10 21 37 58 83 111 145 184 226
            880 141_1_6 4 12 24 42 64 92 124 162 204 252
            881 141_1_7 4 11 23 44 67 95 134 168 215 271
            882 152_1_15 4 12 26 44 72 104 138 178 228 282
            883 152_1_23 4 10 21 40 65 96 134 180 230 286
            884 152_1_24 4 10 21 40 65 96 134 180 230 286
            885 152_1_3 4 12 22 40 66 102 128 168 218 268

            Again many thanks for your help. I still don't get why you say
            this newsgroup has been deleted. What is the url for the replacement
            newsgroup?


            nobull@mail.com wrote in message news:<4dafc536.0401310603.76f762e0@posting.google.com>...[color=blue]
            > mdfoster44@netscape.net (Martin Foster) wrote:
            >[color=green]
            > > nobull@mail.com wrote:
            > >[color=darkred]
            > > > I shall assume that you really want to count the number of times each
            > > > distinct line appears in a file.[/color][/color]
            >[color=green][color=darkred]
            > > > perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'[/color][/color]
            >[color=green][color=darkred]
            > > > Or as a script:[/color][/color]
            >[color=green][color=darkred]
            > > > $count{$_}++ while <>;[/color][/color]
            >[color=green]
            > > This is amazing, I don't understand how it works but it's very
            > > powerful.[/color]
            >
            > If you look in the newsgroup that replaced this one when this one was
            > deleted, you'll find every couple of months someone posts a script
            > substantially like the one above and says "I found this - how does it
            > work?".
            >
            > You could look at one of those threads.
            >
            > I believe it is also an example that is used in most Perl tutorials.
            >[color=green]
            > > Can I use this script to compare the n columns of a file, not the entire
            > > file?[/color]
            >
            > No you can't use this _script_. But you can use the technique.
            >
            > Rather than keying %count on the whole line you can use some sort of
            > string manipulation to extract just part of the line to consider. The
            > most normal way to manipulate strings in Perl is the m// and s///
            > operators.
            >[color=green]
            > > I've got an identifier for each line at the beginning, for example
            > >
            > > 1666237 4 10 23 16 and so on. The identifier is an id to link to
            > > something else and so on. I just want to compare the 10 columns with
            > > the numbers.[/color]
            >
            > Well if, for example, we say the first 3 whitespace-delimited columns
            > are the identifier you could remove them thus:
            >
            > s/^(\S+\s+){3}// and $count{$_}++ while <>;
            >[color=green][color=darkred]
            > > > I also suggest you post to newsgroups that still exist (this one
            > > > doesn't, see FAQ). Your post will then be seen by many more people.[/color][/color]
            >[color=green]
            > > BTW where is the FAQ, which says this newsgroup no longer exists?[/color]
            >
            > The Perl FAQ is part of the standard Perl documentation that can be
            > found on any computer on which Perl has been installed and also on
            > various Perl-related web sites.[/color]


            • nobull@mail.com

              #7
              Re: Perl script to mimic uniq

              mdfoster44@netscape.net (Martin Foster) spits TOFU in my face:

              > Thanks for your help.

              Please, if you want to thank me, learn to quote properly. TOFU ((new)
              Text Over, Full-quote Under) is considered very rude.
              [color=blue]
              > My script now looks like this:
              >
              >
              > #!/usr/bin/perl
              > # Perl script to find most common CS
              > use strict;
              > use warnings;
              >
              > my $infile = "/home/martin/DATABASE/large.txt";
              > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
              > my %count;
              >
              > do {
              > $_ =~ s/^(\S+\s+){2}//;
              > $count{$_}++
              > } while <INFILE>;[/color]

              Please see perldoc perlsyn for how "do { BLOCK } while EXPR" is
              different from "while (EXPR) { BLOCK }". In this case you want the
              latter.

              Saying "$_ =~" i.e. "don't use $_, use $_ instead" is considered
              somewhat affected. Either use $_ (and don't mention it) or use
              something else instead.

              You are assuming the s/// always succeeds. Whenever you assume
              something like this will always succeed you should decorate it with
              "or die". This acts as a comment saying "I'm assuming this always
              succeeds". It also causes the program to crash out rather than carry on
              and do something weird if your assumption was wrong.
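
              A short sketch of the two points above, using an in-memory list as a stand-in for the INFILE handle. The first part shows that a do { BLOCK } while EXPR body runs once before the condition is ever checked; the second uses a while loop that tests first, and guards the substitution with "or die".

              ```perl
              use strict;
              use warnings;

              # do { BLOCK } while EXPR: the body runs once even though the
              # condition is false from the start.
              my $body_runs = 0;
              do { $body_runs++ } while (0);
              print "do-while body ran $body_runs time(s)\n";

              # while (EXPR) { BLOCK }: the test comes first, so every pass
              # through the body has a line to work on.
              my @queue = ( "810 141-2_1_2 4 10 21\n", "879 141_1_2 4 10 21\n" );
              my %count;
              while ( defined( my $line = shift @queue ) ) {
                  # Stop at once if a line does not have the expected layout.
                  $line =~ s/^(\S+\s+){2}// or die "Unexpected line format: $line";
                  $count{$line}++;
              }
              print "$count{$_} $_" for keys %count;
              ```

              In the do-while version of the script, the first pass through the body happens before any line has been read from the file.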

              > So I'm feeding the file into the %count array by removing the first two
              > columns with the identifier information and then counting the keys.
              > How can I still keep the identifier part of the line linked to the array?
              > Since this is the part which I'm really interested in.

              Ah, well you never mentioned that before. It helps to know what you
              want.

              > I can't keep the identifier in
              > the %count array, since this would screw up the "for keys" part.

              You can't keep it in the keys of %count, but you can keep it in the
              values.

              while (<INFILE>) {
                  s/^(\S+\s+){2}// or die;
                  push @{$count{$_}}, $1;
              }

              > I checked perldoc -q and found how to remove duplicates but I don't think
              > I can rewrite this to do what I want.

              Don't worry, I'm sure your programming skill will improve. You appear
              smart but inexperienced. You do, however, seem to have an unfortunate
              streak of defeatism.

              > The "for keys" method is brilliant but I'm losing the identifier.
              >
              > So I'm back to my original script which looks like this.

              Why? I showed you many ways to improve it independent of changing the
              algorithm.
              [color=blue]
              > #!/usr/bin/perl
              > # Perl script to find most common CS[/color]

              I still don't get how this comment relates to what your program does
              nor what you say you want it to do.
              [color=blue]
              > use strict;
              > use warnings;
              >
              >
              > my $infile = "/home/martin/DATABASE/large.txt";
              > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
              > my @array = <INFILE>;
              > print "There are ", $#array+1, " lines in the large array\n";
              >
              > my (@table);
              > foreach my $array (@array) {
              > push(@table, [split(/\s/, $array) ]);
              > }
              >
              > for (my $k =0; $k<=$#array; $k++) {
              > print "$table[$k][1] $table[$k][2] occurs ";
              > my $matched=0;
              > for (my $h =0; $h<=$#array; $h++) {
              > my $match=0;
              > for (my $j =2; $j<=11; $j++ ) {
              > if ($table[$k][$j] == $table[$h][$j]){
              > $match++;
              > }
              > }
              > if ($match==10) {
              > $matched++;
              > }
              > }
              > print "$matched times\n";
              > } # end of large loop
              >
              >
              > But this sad looking script is not very smart and very slow, I don't want to
              > run over each line. I would like the script to search the file,
              > identify a sequence as unique. If there are duplicate sequences
              > in that file then print out how many and do not revisit that line
              > if it has been counted as a duplicate.[/color]

              It's not clear what you are saying.

              Are you saying you want the first ID (only) and the number of
              occurrences of each distinct sequence?

              while (<INFILE>) {
                  s/^(\S+\s+){2}// or die;
                  push @{$count{$_}}, $1;
              }

              for ( values %count ) {
                  print "$_->[0]occurs ", scalar(@$_), " times\n";
              }
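
              For anyone who wants to try the idea without the data file, here is a self-contained sketch built on a few of the sample lines from the question. It is a variation rather than the code above: it uses split instead of s/// to separate the identifier from the numbers, keeps every identifier for a sequence, and sorts the most common sequence first. The name %ids_for is made up for the illustration.

              ```perl
              use strict;
              use warnings;

              # A few of the sample lines from the question.
              my @lines = (
                  "810 141-2_1_2 4 10 21 37 58 83 111 145 184 226\n",
                  "811 141-2_1_6 4 12 24 42 64 92 124 162 204 252\n",
                  "879 141_1_2 4 10 21 37 58 83 111 145 184 226\n",
                  "883 152_1_23 4 10 21 40 65 96 134 180 230 286\n",
                  "884 152_1_24 4 10 21 40 65 96 134 180 230 286\n",
              );

              my %ids_for;    # sequence text => list of identifiers that share it
              for my $line (@lines) {
                  # The first two columns are identifiers, the rest is the sequence.
                  my ( $number, $id, @sequence ) = split ' ', $line;
                  die "Malformed line: $line" unless @sequence;
                  push @{ $ids_for{"@sequence"} }, $id;
              }

              # Most common sequence first; ties are broken by the sequence text.
              for my $seq ( sort { @{ $ids_for{$b} } <=> @{ $ids_for{$a} } or $a cmp $b }
                            keys %ids_for ) {
                  my @ids = @{ $ids_for{$seq} };
                  print scalar(@ids), " x $seq (", join( ", ", @ids ), ")\n";
              }
              ```

              Each line is visited once, and the identifiers stay attached to the sequence they belong to.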

              > I still don't get why you say this newsgroup has been deleted.

              I say it because it is true, and because it will help people who
              didn't know this to reach a larger audience.

              > What is the url for the replacement newsgroup?

              What part of the answer to the Perl FAQ: "What are the Perl newsgroups
              on Usenet?" are you having trouble understanding?


              • Martin Foster

                #8
                Re: Perl script to mimic uniq

                nobull@mail.com wrote in message news:<4dafc536.0402030120.6236ac20@posting.google.com>...
                > mdfoster44@netscape.net (Martin Foster) spits TOFU in my face:
                >
                > > Thanks for your help.
                >
                > Please, if you want to thank me, learn to quote properly. TOFU ((new)
                > Text Over, Full-quote Under) is considered very rude.
                >
                I see.
                [color=blue][color=green]
                > > My script now looks like this:
                > >
                > >
                > > #!/usr/bin/perl
                > > # Perl script to find most common CS
                > > use strict;
                > > use warnings;
                > >
                > > my $infile = "/home/martin/DATABASE/large.txt";
                > > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
                > > my %count;
                > >
                > > do {
                > > $_ =~ s/^(\S+\s+){2}//;
                > > $count{$_}++
                > > } while <INFILE>;[/color]
                >
                > Please see perldoc perlsyn for how "do { BLOCK } while EXPR" is
                > different from "while (EXPR) { BLOCK }". In this case you want the
                > latter.
                >
                > Saying "$_ =~" i.e. "don't use $_, use $_ instead" is considered
                > somewhat affected. Either use $_ (and don't mention it) or use
                > something else instead.
                >
                > You are assuming the s/// always succeeds. Whenever you assume
                > something like this will always succeed you should decorate it with
                > "or die". This acts as a comment saying "I'm assuming this always
                > succeeds". It also causes the program to crash out rather than carry on
                > and do something weird if your assumption was wrong.
                >[/color]
                This is a good tip. I'll use this from now on.[color=blue][color=green]
                > > So I'm feeding the file into the %count array by removing the first two
                > > columns with the identifier information and then counting the keys.
                > > How can I still keep the identifier part of the line linked to the array?
                > > Since this is the part which I'm really interested in.[/color]
                >
                > Ah, well you never mentioned that before. It helps to know what you
                > want.
                >[color=green]
                > > I can't keep the identifier in
                > > the %count array, since this would screw up the "for keys" part.[/color]
                >
                > You can't keep it in the keys of %count, but you can keep it in the
                > values.
                >
                > while (<INFILE>) {
                > s/^(\S+\s+){2}// or die;
                > push @{$count{$_}}, $1;
                > };
                >
                >[color=green]
                > > I checked perldoc -q and found how to remove duplicates but I don't think
                > > I can rewrite this to do what I want.[/color]
                >
                > Don't worry I'm sure your programming skill will improve. You appear
                > smart but inexperienced. You do, however, seem to have an unfortunate
                > streak of defeatism.
                >[color=green]
                > > The "for keys" method is brilliant but I'm losing the identifier.
                > >
                > > So I'm back to my original script which looks like this.[/color]
                >
                > Why? I showed you many ways to improve it independent of changing the
                > algorithm.
                >[color=green]
                > > #!/usr/bin/perl
                > > # Perl script to find most common CS[/color]
                >
                > I still don't get how this comment relates to what your program does
                > nor what you say you want it to do.[/color]
                The data list is a sequence of numbers, which are called coordination
                sequences, CS for short. My program tries to find the most common CS
                in the data file.[color=blue]
                >[color=green]
                > > use strict;
                > > use warnings;
                > >
                > >
                > > my $infile = "/home/martin/DATABASE/large.txt";
                > > open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
                > > my @array = <INFILE>;
                > > print "There are ", $#array+1, " lines in the large array\n";
                > >
                > > my (@table);
                > > foreach my $array (@array) {
                > > push(@table, [split(/\s/, $array) ]);
                > > }
                > >
                > > for (my $k =0; $k<=$#array; $k++) {
                > > print "$table[$k][1] $table[$k][2] occurs ";
                > > my $matched=0;
                > > for (my $h =0; $h<=$#array; $h++) {
                > > my $match=0;
                > > for (my $j =2; $j<=11; $j++ ) {
                > > if ($table[$k][$j] == $table[$h][$j]){
                > > $match++;
                > > }
                > > }
                > > if ($match==10) {
                > > $matched++;
                > > }
                > > }[/color]
                > print "$matched times\n";[color=green]
                > > } # end of large loop
                > >
                > >
                > > But this sad looking script is not very smart and very slow, I don't want to
                > > run over each line. I would like the script to search the file,
                > > identify a sequence as unique. If there are duplicate sequences
                > > in that file then print out how many and do not revisit that line
                > > if it has been counted as a duplicate.[/color]
                >
                > It's not clear what you are saying.
                >[/color]
                There is a list of number sequences. Each sequence is labelled uniquely
                by an identifier. I want to sort through the list, so I start at the
                1st row and then my code loops through the list checking the
                sequences. If it finds a match, then that row does not need to be
                revisited later in the loop, since it has been identified as a
                match to the 1st row. I guess I need to keep an index of some sort
                while looping over the list. Then when I start at the 2nd row, I only
                loop over the sequences which are indexed as 'not yet matched'.
                I hope this makes more sense.

                [color=blue]
                > Are you saying you want the first ID (only) and the number of
                > occurrences of each distinct sequence?[/color]
                Yes. This is very helpful. '$_->[0]' looks like a pointer. So your
                piece of code maps the $1 column of the original line as a pointer
                to the values of the %count array. Then the "values" of %count are
                the unique "keys" of that array and "scalar" is counting the number
                of lines that are the same. Is that right?
                I'm trying to understand what your code does, since I want to use it.
                Perl is great, but it is so difficult to read if you don't have a clue.
                [color=blue]
                >
                > while (<INFILE>) {
                > s/^(\S+\s+){2}// or die;
                > push @{$count{$_}}, $1;
                > };
                >
                > for ( values %count ) {
                > print "$_->[0]occurs ",scalar(@$_)," times\n";
                > }
                >[color=green]
                > > I still don't get why you say this newsgroup has been deleted.[/color]
                >
                > I say it because it is true, and because it will help people who
                > didn't know this to reach a larger audience.
                >[color=green]
                > > What is the url for the replacement newsgroup?[/color]
                >
                > What part of the answer to the Perl FAQ: "What are the Perl newsgroups
                > on Usenet?" are you having trouble understanding?[/color]


                • nobull@mail.com

                  #9
                  Re: Perl script to mimic uniq

                  mdfoster44@netscape.net (Martin Foster) wrote in message news:<6a20f90a.0402041647.6920fd75@posting.google.com>...[color=blue]
                  > nobull@mail.com wrote in message news:<4dafc536.0402030120.6236ac20@posting.google.com>...[color=green]
                  > > mdfoster44@netscape.net (Martin Foster) spits TOFU in my face:
                  > >[color=darkred]
                  > > > # Perl script to find most common CS[/color]
                  > >
                  > > I still don't get how this comment relates to what your program does
                  > > nor what you say you want it to do.[/color]
                  >
                  > The data list is a sequence of numbers, which are called coordination
                  > sequences, CS for short. My program tries to find the most common CS
                  > in the data file.[/color]

                  I still don't see anything in your program that relates to finding the
                  most common CS. It looks to me like your program is printing out the
                  number of occurrences of each CS.
                  [color=blue][color=green][color=darkred]
                  > > > I would like the script to search the file,
                  > > > identify a sequence as unique. If there are duplicate sequences
                  > > > in that file then print out how many and do not revisit that line
                  > > > if it has been counted as a duplicate.[/color]
                  > >
                  > > It's not clear what you are saying.[/color]
                  >
                  > There is a list of number sequences. Each list is labelled uniquely
                  > by an identifier. I want to sort through the list, so I start at the
                  > 1st row and then my code loops through the list checking the
                  > sequences. If it finds a match, then that row does not need to be
                  > revisited again later in the loop, since it has been identified as a
                  > match to the 1st row. I guess I need to keep
                  > an index of some sort while looping the list. Then when I start at
                  > the 2nd row, I only loop over the sequences which are indexed as 'not
                  > yet matched'.[/color]

                  I think you are mixing up your definition of the problem you are
                  trying to solve with the implementation of a partial solution.
                  [color=blue]
                  > I hope this makes more sense.[/color]

                  Not much.
                  [color=blue][color=green]
                  > > Are you saying you want the first ID (only) and the number of
                  > > occurrences of each distinct sequence?[/color]
                  >
                  > Yes. This is very helpful.[/color]

                  Right. So what you want is one output line for each distinct CS,
                  in no particular order. You don't want to find the CS that appears
                  most often.

                  If you wanted the output sorted in order of frequency you would have
                  to put a sort in there somewhere.
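As a rough sketch of where such a sort could go, assuming a hash that maps each CS to a reference to the array of IDs sharing it. The name %ids_by_cs, the helper by_frequency and the sample data are all made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data: each key is a CS, each value a reference
# to the array of IDs that share it.
my %ids_by_cs = (
    "4 10 21 37\n" => [ 'A 1', 'C 3', 'D 4' ],
    "4 12 24 42\n" => [ 'B 2' ],
    "4 11 23 44\n" => [ 'E 5', 'F 6' ],
);

# Return the CS keys ordered from most to least frequent.
# Ties are broken by comparing the keys so the order is repeatable.
sub by_frequency {
    my ($h) = @_;
    return sort { @{ $h->{$b} } <=> @{ $h->{$a} } or $a cmp $b } keys %$h;
}

for my $cs ( by_frequency( \%ids_by_cs ) ) {
    my $ids = $ids_by_cs{$cs};
    print "$ids->[0] occurs ", scalar(@$ids), " times\n";
}
```

The numeric comparison puts each array in scalar context, so it compares the number of IDs per CS. The first line printed then belongs to the most common CS.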
                  [color=blue][color=green]
                  > >
                  > > while (<INFILE>) {
                  > > s/^(\S+\s+){2}// or die;
                  > > push @{$count{$_}}, $1;
                  > > };
                  > >
                  > > for ( values %count ) {
                  > > print "$_->[0]occurs ",scalar(@$_)," times\n";
                  > > }[/color][/color]
                  [color=blue]
                  > '$_->[0]' looks like a pointer.[/color]

                  This is no accident. The values of %count are references (pointers)
                  to arrays of IDs.
                  [color=blue]
                  > So your piece of code, maps the $1 column of the original
                  > line as a pointer to the values of the %count array.[/color]

                  $1 in Perl is not like it is in awk.

                  In Perl $1 is whatever was captured by the first () capture in the
                  most recent regex in the current scope.

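                  So in this case $1 was meant to be the first two columns (and the
                  following whitespace) of the original line. Strictly, because the
                  group is quantified with {2}, $1 holds only its last repetition,
                  that is the second column and its trailing whitespace. I believe,
                  from what you've said previously, that these two columns are some
                  sort of ID (identifier) and are not part of the CS.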

                  Actually you probably should throw away the whitespace between the ID
                  and the CS.

                  s/^(\S+\s+\S+)\s+// or die;
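A minimal sketch of what $1 ends up holding with each of the two substitutions, using a made-up input line (the ID "ABC 17" and the sub names are only illustrative). A quantified group such as (\S+\s+){2} leaves only its last repetition in $1, whereas the explicit form captures both columns:

```perl
use strict;
use warnings;

# Hypothetical input line: a two-column ID followed by the CS.
my $line = "ABC 17 4 10 21 37\n";

# Quantified group: $1 keeps only the last repetition.
sub capture_quantified {
    my ($s) = @_;
    $s =~ s/^(\S+\s+){2}// or die "no match";
    return ( $1, $s );
}

# Explicit two-column group: $1 holds both columns, without the
# trailing whitespace.
sub capture_explicit {
    my ($s) = @_;
    $s =~ s/^(\S+\s+\S+)\s+// or die "no match";
    return ( $1, $s );
}

my ( $id1, $rest1 ) = capture_quantified($line);
my ( $id2, $rest2 ) = capture_explicit($line);
print "quantified: [$id1]\n";    # the second column plus its whitespace
print "explicit:   [$id2]\n";    # both ID columns
```

Both forms strip the same prefix, so the remaining CS is identical. Only the captured ID differs.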

                  Also if you want to improve readability you could avoid $_ and $1 and
                  also rename %count to something more appropriate to its new role:

                  my ( $id, $cs ) = /^(\S+\s+\S+)\s+(.*)/ or die;
                  push @{$ids_by_cs{$cs}}, $id;
                  [color=blue]
                  > Then the "values" of
                  > %count are the unique "keys" of that array and "scalar" is counting
                  > the number of lines that are the same. Is that right?[/color]

                  There is nothing for "that array" to refer to in the previous
                  sentence.

                  The values of the hash %count (or %ids_by_cs) are (a list of) pointers
                  to arrays. Each array contains the series of IDs that correspond to a
                  single CS. The keys of the hash are the distinct CSs themselves.

                  As to the uniqueness of the IDs, nothing in the program either
                  ensures or cares that the IDs in the input data are unique.
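To make the shape of that data structure concrete, here is a small sketch that builds the hash from a few made-up lines. It assumes each line is a two-column ID followed by the CS. The sub name group_ids_by_cs and the sample lines are hypothetical:

```perl
use strict;
use warnings;

# Group IDs by CS. Each input line is assumed to be a two-column ID
# followed by the CS itself.
sub group_ids_by_cs {
    my @lines = @_;
    my %ids_by_cs;
    for my $line (@lines) {
        chomp( my $copy = $line );
        my ( $id, $cs ) = $copy =~ /^(\S+\s+\S+)\s+(.*)/
            or die "Unexpected line: $copy\n";
        push @{ $ids_by_cs{$cs} }, $id;    # autovivifies the array
    }
    return \%ids_by_cs;
}

# Made-up sample data, not taken from the real file.
my $groups = group_ids_by_cs(
    "A 1 4 10 21 37\n",
    "B 2 4 12 24 42\n",
    "C 3 4 10 21 37\n",
);

for my $cs ( sort keys %$groups ) {
    my $ids = $groups->{$cs};
    print "$cs: first ID $ids->[0], ", scalar(@$ids), " line(s)\n";
}
```

Each key is one distinct CS and each value is a reference to the array of IDs seen with it, so the element count of that array is the number of matching lines.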
                  [color=blue]
                  > "scalar" is counting the number of lines that are the same.[/color]

                  scalar is counting the number of elements in the array of IDs that
                  correspond to a single CS. So, yes, in effect this counts the number
                  of lines that were the same.
                  [color=blue]
                  > Perl is great, but it is so difficult to read if you don't have a clue.[/color]

                  Oh, you noticed that, did you? :-)


                  • Aaron Sherman

                    #10
                    Re: Perl script to mimic uniq

                    mdfoster44@netscape.net (Martin Foster) wrote in message news:<6a20f90a.0401291652.5fae2f4a@posting.google.com>...[color=blue]
                    > Hi.
                    >
                    > I would like to be able to mimic the unix tool 'uniq' within a Perl script.[/color]

                    I think you were not asking for uniq per se, so much as "uniq -c"
                    specifically.

                    Here's a simple stab.

                    Note that, like "uniq -c", this requires the data to be sorted.
                    Sorting lines in the file is left as an exercise for the reader.

                    while(<>) {
                        if (defined($prev) && $_ ne $prev) {
                            printf "%7d %s", $n, $prev;
                            $n = 0;
                        }
                    } continue {
                        $prev = $_;
                        $n++;
                    }
                    printf "%7d %s", $n, $prev if defined $prev;

                    If you actually want to do both the sorting and the unique line
                    counting at the same time, you need to keep everything in memory
                    (possibly quite expensive, and this is why uniq doesn't do that). Try
                    this code in that case:

                    while(<>) {
                        $lines{$_}++;
                    }
                    foreach $line (sort keys %lines) {
                        printf "%7d %s", $lines{$line}, $line;
                    }


                    All of this is typed in from my head, so make sure to check my syntax,
                    etc before using.
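For anyone who wants to check it, here is a sketch of the hash-based version written to run under strict and warnings. The sub name count_lines and the sample input are made up, and an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Count how often each line occurs; returns a reference to a hash
# mapping line => count.
sub count_lines {
    my ($fh) = @_;
    my %lines;
    while ( my $line = <$fh> ) {
        $lines{$line}++;
    }
    return \%lines;
}

# Made-up sample input, read from an in-memory filehandle.
my $data = "4 12 24\n4 10 21\n4 12 24\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $counts = count_lines($fh);
close $fh;

for my $line ( sort keys %$counts ) {
    printf "%7d %s", $counts->{$line}, $line;
}
```

Passing the filehandle in keeps the counting separate from where the data comes from, so the same sub should work on a file opened from disk.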
