Speed problem

**KevinADC** · Feb 15 '08, 11:44 PM

Originally posted by john12345

Hi, I am newbie to perl, below is the script i have worked on which gives me exactly what i want. The only problem i am facing is speed. This script is mapping the AB,BA,IS,RT,ET from file1.csv and putting them into columns. file2.txt contains first column of file1.csv which i have cut using unix command (this i would like to avoid and map it from the script itself). my question- Is there any way I can make this script faster as file1.csv has 1.6 millions of lines and grep command goes through them 6 times which is making the script real slow (takes 6 hours to map). but does gives the output what i want.
Please help me

Thanks in advance.
John

It looks like file1 and file2 have the same order:

112311
321211
432342

Is file1 already in the order you want the output to be in or does the first column need to be sorted somehow? I assume file2 is not important, you just use it temporarily to get the first column value unless file1 is not in the correct order to begin with.

Do you have this question posted on other forums and do you already have a solution? Because if you do I don't want to waste my time figuring it out.

**john12345** · Feb 16 '08, 12:03 AM

Originally posted by KevinADC

It looks like file1 and file2 have the same order:

112311
321211
432342

Is file1 already in the order you want the output to be in or does the first column need to be sorted somehow? I assume file2 is not important, you just use it temporarily to get the first column value unless file1 is not in the correct order to begin with.

Do you have this question posted on other forums and do you already have a solution? Because if you do I don't want to waste my time figuring it out.

Kevin Thanks for fast response, No i have not posted this question anywhere else, coz I saw in my search that this the only forum which is active and fast.
About file2.txt, it contains unique columns from file1.csv and I am getting file2 using unix command manually ( I want avoid manual step and do it in perl). The output file as above has new columns which comes out of file1 with column as match. so in the output i dont want any duplicates plus I want to create new columns from rows in column 2.

Yes this script does gives solution, but as i mention it takes 6 to 7 hours to generate results, as it goes through 1.6 millions lines. thats why i need help, coz i am stuck. Any help will be appreciated.

Thank you in advancs.

**eWish** · Feb 16 '08, 02:31 AM

What are the relationships between the three columns? If you are only storing the unique data from column1, what are you doing with the data from columns 2 and 3?

--Kevin

**john12345** · Feb 16 '08, 02:50 AM

Originally posted by eWish

What are the relationships between the three columns? If you are only storing the unique data from column1, what are you doing with the data from columns 2 and 3?

--Kevin

Hi, thanx for response.
In file1
col1 is "ID" (which could be duplicate), col2 has rows which includes "AB,BA,IS,RT,ET " for these rows i want to create new columns in output as shown in output.txt, and col3 has values of col2. So basically i want output with unique values for all columns and place tab for the null as shown in output.txt,
Please see the output file "output.txt ",
output.txt,
sc, ab_no, bar_a, is_in, erto,ex_tk
112311,312342,L AN321,LA3423, ,
321211,342324, , , , ,
432342, ,LAN322,LA3453, 324343,ERTF

The output file has output with column 1 is "sc" (unique), column 2(which is one row 'AB' from file1) is "ab_no", column 3 (which is another row 'BA' in file1) is "bar_a", column 4 (which is 'IS' in file1 ) is "is_in" and so on..., So new columns are created from rows in file1 based on first column.

Thank you in advance.

**KevinADC** · Feb 16 '08, 04:43 AM

You didn't answer my question concerning the order of column1 in file1. Assuming it is in order, this should work and should be faster than 6 hours. Change the path to the files to match your system.

[CODE=perl]use strict;
use warnings;
my %fields = (
AB => "\t",
BA => "\t",
IS => "\t",
RT => "\t",
ET => "\t"
);

my $file1 = "c:/perl_test/file1.csv";
open(FILE1,"<$f ile1") or die "$!";
open(OUT, '>>', 'path/to/output.csv') or die "$!";
#get the first line of the file so we can start
my ($sc_old,$v1,$v 2) = split(/,/, <FILE1>);
chomp $v2;
$fields{$v1}=$v 2;

while (<FILE1>) {
chomp;
my ($sc_new,$v1,$v 2) = split(/,/);
if ($sc_new eq $sc_old) {
$fields{$v1}=$v 2;
}
elsif (($sc_new ne $sc_old)) {
print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";
%fields = (AB => "\t",BA => "\t",IS => "\t",RT => "\t",ET => "\t");
$sc_old = $sc_new;
$fields{$v1}=$v 2;
}
}
close FILE1;
# prints last set of records
print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";[/CODE]

It maybe could be done even faster by someone that is more experienced with perl.

**john12345** · Feb 17 '08, 01:35 AM

Originally posted by KevinADC

You didn't answer my question concerning the order of column1 in file1. Assuming it is in order, this should work and should be faster than 6 hours. Change the path to the files to match your system.

[CODE=perl]use strict;
use warnings;
my %fields = (
AB => "\t",
BA => "\t",
IS => "\t",
RT => "\t",
ET => "\t"
);

my $file1 = "c:/perl_test/file1.csv";
open(FILE1,"<$f ile1") or die "$!";
open(OUT, '>>', 'path/to/output.csv') or die "$!";
#get the first line of the file so we can start
my ($sc_old,$v1,$v 2) = split(/,/, <FILE1>);
chomp $v2;
$fields{$v1}=$v 2;

while (<FILE1>) {
chomp;
my ($sc_new,$v1,$v 2) = split(/,/);
if ($sc_new eq $sc_old) {
$fields{$v1}=$v 2;
}
elsif (($sc_new ne $sc_old)) {
print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";
%fields = (AB => "\t",BA => "\t",IS => "\t",RT => "\t",ET => "\t");
$sc_old = $sc_new;
$fields{$v1}=$v 2;
}
}
close FILE1;
# prints last set of records
print OUT "$sc_old,$field s{AB},$fields{B A},$fields{IS}, $fields{RT},$fi elds{ET}\n";[/CODE]

It maybe could be done even faster by someone that is more experienced with perl.

Kevin
Thankx for the hard work, really appreciate your time and help, works like beauty. Thnx again Kevin. One more thing I have to apply some rules to get the data from file1.csv as shown in the script i posted above for exampl,
$ba = `grep "$sc,BA,LAN " $file1`; # Here i am grabbing those rows which has BA,LAN only. Is there any way i can do this in this same script.

Thankx in advance

John

**KevinADC** · Feb 17 '08, 03:28 AM

Originally posted by john12345

Kevin
Thankx for the hard work, really appreciate your time and help, works like beauty. Thnx again Kevin. One more thing I have to apply some rules to get the data from file1.csv as shown in the script i posted above for exampl,
$ba = `grep "$sc,BA,LAN " $file1`; # Here i am grabbing those rows which has BA,LAN only. Is there any way i can do this in this same script.

Thankx in advance

John

Do you mean grep all the lines in the file that satisfy the condition and only those lines?

**john12345** · Feb 17 '08, 06:10 AM

Originally posted by KevinADC

Do you mean grep all the lines in the file that satisfy the condition and only those lines?

Kevin
I really appreciate your attention. Thanks.
yes thats right has to grep lines that satisfy the conditions for example,
$ba = `grep "$sc,BA,LAN " $file1`;
so first grep all the lines that has ",BA,LAN" which gives all the unique id's (ie columns1 from file1) so based on this id's satisfy rest of the information.
$is = `grep "$sc,IS,LA" $file1`;
$rt = `grep "$sc,RT,.*T O," $file1`;
output will look like the same as your script gives.

Thank you in advance Kevin
John

**KevinADC** · Feb 17 '08, 06:31 AM

Maybe if there were just a couple of rules you could do it all in one script, but if you want to get all the possible combinations of the various fields for each id it would just get crazy. Thats the time to start looking at using a real database to store and process your data.

**john12345** · Feb 17 '08, 06:43 AM

Originally posted by KevinADC

Maybe if there were just a couple of rules you could do it all in one script, but if you want to get all the possible combinations of the various fields for each id it would just get crazy. Thats the time to start looking at using a real database to store and process your data.

Thanks for the quick response Kevin, Yaa just because of this rules my script is too slow but your script is amazingly fast. Could you please help me with just one rule say for $ba = `grep "$sc,BA,LAN " $file1`;
Then i will try to work on it.

Thanks in advance
John

**KevinADC** · Feb 17 '08, 10:23 PM

let me see if I understand, $sc is a unique set of numbers, an ID of some kind. $sc can have up to five different fields: AB, BA, IS, RT, ET. That seems clear enough.

The question is can $sc only ever have one set of five fields in the file? Some of the fields might have no value and some might have a value but $sc can never have more than one set of five fields associated with it? For example, you can't have this scenario:

112311,AB,31234 2
112311,AB,55432 5

$sc 112311 will only ever have one AB field or no AB field but not more than zero or one. Same for all the other ID and fields.

**john12345** · Feb 17 '08, 11:41 PM

Originally posted by KevinADC

let me see if I understand, $sc is a unique set of numbers, an ID of some kind. $sc can have up to five different fields: AB, BA, IS, RT, ET. That seems clear enough.

The question is can $sc only ever have one set of five fields in the file? Some of the fields might have no value and some might have a value but $sc can never have more than one set of five fields associated with it? For example, you can't have this scenario:

112311,AB,31234 2
112311,AB,55432 5

$sc 112311 will only ever have one AB field or no AB field but not more than zero or one. Same for all the other ID and fields.

Kevin thanx for response, Regarding your example your right. $sc can never have more than one set of five fields associated with it. So $sc can have some thing like as below,
[HTML]112311,AB,34353 4
112311,BA,LAN43 5
112311,IS,CA345 33
232323,AB,54334 5
232323,BA,MAN65 4
232323,ET,RTER[/HTML]So what I am looking is grep all "ID(col1),BA,LA N*" first (I need only those IDs that begins with "LAN" in column 3 not "MAN", then use this ID(col1) to get result) rest is same. your script is very close to want i am looking for, specialy after open(FILE), the only thing i am really not good with hashes. so Is there any way I can get the ID based on the rule. I appreciate your help Kevin.

Thank you
John

**KevinADC** · Feb 18 '08, 04:56 PM

I will see what I can do later today.

**john12345** · Feb 19 '08, 01:13 AM

Originally posted by KevinADC

I will see what I can do later today.

Kevin,
Thankx again for your response, I appreciate your help. Thanks.
John

Speed problem

Speed problem

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment