Speed problem

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • john12345
    New Member
    • Feb 2008
    • 9

    Speed problem

    Hi, I am newbie to perl, below is the script i have worked on which gives me exactly what i want. The only problem i am facing is speed. This script is mapping the AB,BA,IS,RT,ET from file1.csv and putting them into columns. file2.txt contains first column of file1.csv which i have cut using unix command (this i would like to avoid and map it from the script itself). my question- Is there any way I can make this script faster as file1.csv has 1.6 millions of lines and grep command goes through them 6 times which is making the script real slow (takes 6 hours to map). but does gives the output what i want.
    Please help me

    Thanks in advance.
    John

    Code:
    file1.csv 
    112311,AB,312342
    112311,BA,LAN321
    112311,IS,LA3423
    321211,AB,342324
    432342,BA,LAN322
    432342,IS,LA3453
    432342,RT,324343
    432342,ET,ERTF
    file2.txt #has unique file1.csv column 1 data
    Code:
    112311
    321211
    432342
    output.txt
    Code:
    sc,ab_no bar_a,is_in,erto,ex_tk
    112311,312342,LAN321,LA3423,	,
    321211,342324,	,   ,   ,   ,
    432342,   ,LAN322,LA3453,324343,ERTF
    This is what i have,

    [CODE=perl]$file1 = "file1.csv" ;
    $file2 = "file2.txt" ;

    open(FILE2,"<$f ile2");

    while(<FILE2>) {
    chomp $_;
    $sc = $_;
    $ab = `grep "$sc,AB," $file1`;
    if($ab ne ""){
    @array = split(/,/, $ab);
    $ab_no = $array[2];
    } else {
    $ab_no = "\t";
    }
    $ba = `grep "$sc,BA,LAN " $file1`;
    if($ba ne ""){
    @array1 = split(/,/, $ba);
    $barra = $array1[2];
    } else {
    $bar_a = "\t";
    }
    $is = `grep "$sc,IS,LA" $file1`;
    if($is ne ""){
    @array2 = split(/,/, $is);
    $is_in = $array2[2];
    } else {
    $is_in = "\t";
    }
    $rt = `grep "$sc,RT,.*T O," $file1`;
    if($rt ne ""){
    @array3 = split(/,/, $rt);
    $erto = $array3[2];
    } else {
    $erto = "\t";
    }
    $et = `grep "$sc,ET," $file1`;
    if($et ne ""){
    @array4 = split(/,/, $et);
    $ex_tk = $array4[2];
    } else {
    $ex_tk = "\t";
    }
    print "$sc,$ab_no,$ba r_a,$is_in,$ert o,$ex_tk\n";
    }

    close(FILE1)[/CODE]
    Last edited by eWish; Feb 15 '08, 10:33 PM. Reason: Fixed Code Tags
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    Originally posted by john12345
    Hi, I am newbie to perl, below is the script i have worked on which gives me exactly what i want. The only problem i am facing is speed. This script is mapping the AB,BA,IS,RT,ET from file1.csv and putting them into columns. file2.txt contains first column of file1.csv which i have cut using unix command (this i would like to avoid and map it from the script itself). my question- Is there any way I can make this script faster as file1.csv has 1.6 millions of lines and grep command goes through them 6 times which is making the script real slow (takes 6 hours to map). but does gives the output what i want.
    Please help me

    Thanks in advance.
    John
    It looks like file1 and file2 have the same order:

    112311
    321211
    432342

    Is file1 already in the order you want the output to be in or does the first column need to be sorted somehow? I assume file2 is not important, you just use it temporarily to get the first column value unless file1 is not in the correct order to begin with.

    Do you have this question posted on other forums and do you already have a solution? Because if you do I don't want to waste my time figuring it out.

    Comment

    • john12345
      New Member
      • Feb 2008
      • 9

      #3
      Originally posted by KevinADC
      It looks like file1 and file2 have the same order:

      112311
      321211
      432342

      Is file1 already in the order you want the output to be in or does the first column need to be sorted somehow? I assume file2 is not important, you just use it temporarily to get the first column value unless file1 is not in the correct order to begin with.

      Do you have this question posted on other forums and do you already have a solution? Because if you do I don't want to waste my time figuring it out.
      Kevin Thanks for fast response, No i have not posted this question anywhere else, coz I saw in my search that this the only forum which is active and fast.
      About file2.txt, it contains unique columns from file1.csv and I am getting file2 using unix command manually ( I want avoid manual step and do it in perl). The output file as above has new columns which comes out of file1 with column as match. so in the output i dont want any duplicates plus I want to create new columns from rows in column 2.

      Yes this script does gives solution, but as i mention it takes 6 to 7 hours to generate results, as it goes through 1.6 millions lines. thats why i need help, coz i am stuck. Any help will be appreciated.

      Thank you in advancs.

      Comment

      • eWish
        Recognized Expert Contributor
        • Jul 2007
        • 973

        #4
        What are the relationships between the three columns? If you are only storing the unique data from column1, what are you doing with the data from columns 2 and 3?

        --Kevin

        Comment

        • john12345
          New Member
          • Feb 2008
          • 9

          #5
          Originally posted by eWish
          What are the relationships between the three columns? If you are only storing the unique data from column1, what are you doing with the data from columns 2 and 3?

          --Kevin
          Hi, thanx for response.
          In file1
          col1 is "ID" (which could be duplicate), col2 has rows which includes "AB,BA,IS,RT,ET " for these rows i want to create new columns in output as shown in output.txt, and col3 has values of col2. So basically i want output with unique values for all columns and place tab for the null as shown in output.txt,
          Please see the output file "output.txt ",
          output.txt,
          sc, ab_no, bar_a, is_in, erto,ex_tk
          112311,312342,L AN321,LA3423, ,
          321211,342324, , , , ,
          432342, ,LAN322,LA3453, 324343,ERTF

          The output file has output with column 1 is "sc" (unique), column 2(which is one row 'AB' from file1) is "ab_no", column 3 (which is another row 'BA' in file1) is "bar_a", column 4 (which is 'IS' in file1 ) is "is_in" and so on..., So new columns are created from rows in file1 based on first column.

          Thank you in advance.

          Comment

          • KevinADC
            Recognized Expert Specialist
            • Jan 2007
            • 4092

            #6
            You didn't answer my question concerning the order of column1 in file1. Assuming it is in order, this should work and should be faster than 6 hours. Change the path to the files to match your system.

            [CODE=perl]use strict;
            use warnings;
            my %fields = (
            AB => "\t",
            BA => "\t",
            IS => "\t",
            RT => "\t",
            ET => "\t"
            );

            my $file1 = "c:/perl_test/file1.csv";
            open(FILE1,"<$f ile1") or die "$!";
            open(OUT, '>>', 'path/to/output.csv') or die "$!";
            #get the first line of the file so we can start
            my ($sc_old,$v1,$v 2) = split(/,/, <FILE1>);
            chomp $v2;
            $fields{$v1}=$v 2;

            while (<FILE1>) {
            chomp;
            my ($sc_new,$v1,$v 2) = split(/,/);
            if ($sc_new eq $sc_old) {
            $fields{$v1}=$v 2;
            }
            elsif (($sc_new ne $sc_old)) {
            print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";
            %fields = (AB => "\t",BA => "\t",IS => "\t",RT => "\t",ET => "\t");
            $sc_old = $sc_new;
            $fields{$v1}=$v 2;
            }
            }
            close FILE1;
            # prints last set of records
            print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";[/CODE]

            It maybe could be done even faster by someone that is more experienced with perl.

            Comment

            • john12345
              New Member
              • Feb 2008
              • 9

              #7
              Originally posted by KevinADC
              You didn't answer my question concerning the order of column1 in file1. Assuming it is in order, this should work and should be faster than 6 hours. Change the path to the files to match your system.

              [CODE=perl]use strict;
              use warnings;
              my %fields = (
              AB => "\t",
              BA => "\t",
              IS => "\t",
              RT => "\t",
              ET => "\t"
              );

              my $file1 = "c:/perl_test/file1.csv";
              open(FILE1,"<$f ile1") or die "$!";
              open(OUT, '>>', 'path/to/output.csv') or die "$!";
              #get the first line of the file so we can start
              my ($sc_old,$v1,$v 2) = split(/,/, <FILE1>);
              chomp $v2;
              $fields{$v1}=$v 2;

              while (<FILE1>) {
              chomp;
              my ($sc_new,$v1,$v 2) = split(/,/);
              if ($sc_new eq $sc_old) {
              $fields{$v1}=$v 2;
              }
              elsif (($sc_new ne $sc_old)) {
              print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";
              %fields = (AB => "\t",BA => "\t",IS => "\t",RT => "\t",ET => "\t");
              $sc_old = $sc_new;
              $fields{$v1}=$v 2;
              }
              }
              close FILE1;
              # prints last set of records
              print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";[/CODE]

              It maybe could be done even faster by someone that is more experienced with perl.
              Kevin
              Thankx for the hard work, really appreciate your time and help, works like beauty. Thnx again Kevin. One more thing I have to apply some rules to get the data from file1.csv as shown in the script i posted above for exampl,
              $ba = `grep "$sc,BA,LAN " $file1`; # Here i am grabbing those rows which has BA,LAN only. Is there any way i can do this in this same script.

              Thankx in advance


              John

              Comment

              • KevinADC
                Recognized Expert Specialist
                • Jan 2007
                • 4092

                #8
                Originally posted by john12345
                Kevin
                Thankx for the hard work, really appreciate your time and help, works like beauty. Thnx again Kevin. One more thing I have to apply some rules to get the data from file1.csv as shown in the script i posted above for exampl,
                $ba = `grep "$sc,BA,LAN " $file1`; # Here i am grabbing those rows which has BA,LAN only. Is there any way i can do this in this same script.

                Thankx in advance


                John
                Do you mean grep all the lines in the file that satisfy the condition and only those lines?

                Comment

                • john12345
                  New Member
                  • Feb 2008
                  • 9

                  #9
                  Originally posted by KevinADC
                  Do you mean grep all the lines in the file that satisfy the condition and only those lines?
                  Kevin
                  I really appreciate your attention. Thanks.
                  yes thats right has to grep lines that satisfy the conditions for example,
                  $ba = `grep "$sc,BA,LAN " $file1`;
                  so first grep all the lines that has ",BA,LAN" which gives all the unique id's (ie columns1 from file1) so based on this id's satisfy rest of the information.
                  $is = `grep "$sc,IS,LA" $file1`;
                  $rt = `grep "$sc,RT,.*T O," $file1`;
                  output will look like the same as your script gives.

                  Thank you in advance Kevin
                  John

                  Comment

                  • KevinADC
                    Recognized Expert Specialist
                    • Jan 2007
                    • 4092

                    #10
                    Maybe if there were just a couple of rules you could do it all in one script, but if you want to get all the possible combinations of the various fields for each id it would just get crazy. Thats the time to start looking at using a real database to store and process your data.

                    Comment

                    • john12345
                      New Member
                      • Feb 2008
                      • 9

                      #11
                      Originally posted by KevinADC
                      Maybe if there were just a couple of rules you could do it all in one script, but if you want to get all the possible combinations of the various fields for each id it would just get crazy. Thats the time to start looking at using a real database to store and process your data.
                      Thanks for the quick response Kevin, Yaa just because of this rules my script is too slow but your script is amazingly fast. Could you please help me with just one rule say for $ba = `grep "$sc,BA,LAN " $file1`;
                      Then i will try to work on it.

                      Thanks in advance
                      John

                      Comment

                      • KevinADC
                        Recognized Expert Specialist
                        • Jan 2007
                        • 4092

                        #12
                        let me see if I understand, $sc is a unique set of numbers, an ID of some kind. $sc can have up to five different fields: AB, BA, IS, RT, ET. That seems clear enough.

                        The question is can $sc only ever have one set of five fields in the file? Some of the fields might have no value and some might have a value but $sc can never have more than one set of five fields associated with it? For example, you can't have this scenario:

                        112311,AB,31234 2
                        112311,AB,55432 5

                        $sc 112311 will only ever have one AB field or no AB field but not more than zero or one. Same for all the other ID and fields.

                        Comment

                        • john12345
                          New Member
                          • Feb 2008
                          • 9

                          #13
                          Originally posted by KevinADC
                          let me see if I understand, $sc is a unique set of numbers, an ID of some kind. $sc can have up to five different fields: AB, BA, IS, RT, ET. That seems clear enough.

                          The question is can $sc only ever have one set of five fields in the file? Some of the fields might have no value and some might have a value but $sc can never have more than one set of five fields associated with it? For example, you can't have this scenario:

                          112311,AB,31234 2
                          112311,AB,55432 5

                          $sc 112311 will only ever have one AB field or no AB field but not more than zero or one. Same for all the other ID and fields.
                          Kevin thanx for response, Regarding your example your right. $sc can never have more than one set of five fields associated with it. So $sc can have some thing like as below,
                          [HTML]112311,AB,34353 4
                          112311,BA,LAN43 5
                          112311,IS,CA345 33
                          232323,AB,54334 5
                          232323,BA,MAN65 4
                          232323,ET,RTER[/HTML]So what I am looking is grep all "ID(col1),BA,LA N*" first (I need only those IDs that begins with "LAN" in column 3 not "MAN", then use this ID(col1) to get result) rest is same. your script is very close to want i am looking for, specialy after open(FILE), the only thing i am really not good with hashes. so Is there any way I can get the ID based on the rule. I appreciate your help Kevin.

                          Thank you
                          John

                          Comment

                          • KevinADC
                            Recognized Expert Specialist
                            • Jan 2007
                            • 4092

                            #14
                            I will see what I can do later today.

                            Comment

                            • john12345
                              New Member
                              • Feb 2008
                              • 9

                              #15
                              Originally posted by KevinADC
                              I will see what I can do later today.
                              Kevin,
                              Thankx again for your response, I appreciate your help. Thanks.
                              John

                              Comment

                              Working...