Hi everyone,
I am kind of stuck and would really appreciate some clues.
I have to run a script that compares two elements from two different files: a BLAST file and a CDF file.
I also need to keep the data structure.
For this I chose the following strategy:
-dumping the files into two arrays
-doing a pattern match between the two files
-if a line doesn't match, remove it
-if a line has a different structure, keep it
Here is the part of my script which takes the most time:
[CODE=perl]
foreach my $line (@CDF)
{
    my $wanted;
    # Capture the 12th tab-separated field of the CDF line
    if ($line =~ /^.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t(.*?)\t/)
    {
        print "repeat again\n";
        $wanted = $1;
        print $wanted."\n";
        # Every BLAST line is scanned again for each CDF line
        foreach my $lineB (@Blast)
        {
            if ($lineB =~ /^($wanted)\s/)
            {
                print $wanted."\n";
                print OUTPUTFILEHANDLE "$line";
            }
        }
    }
}
[/CODE]
It takes hours to run and produce my output file.
Here are my questions:
I am trying to use only subsets of the files instead of the complete 90 MB files.
I have tried to address elements by coordinate using an array, like this:
[CODE=perl]
my @array;
print $array[0];
[/CODE]
but it ends up printing the first line of the file, whereas I want the 12th element of the line to do the comparison.
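To make the goal concrete, here is a minimal sketch of the field access I am after, assuming the fields are tab-separated. The helper name `nth_field` and the sample line are made up for illustration:

```perl
use strict;
use warnings;

# Return the Nth (1-based) tab-separated field of a line,
# or undef if the line has fewer fields.
# (nth_field is a hypothetical helper name.)
sub nth_field {
    my ($line, $n) = @_;
    chomp $line;
    # The -1 limit keeps trailing empty fields instead of dropping them
    my @fields = split /\t/, $line, -1;
    return $fields[$n - 1];
}

# Made-up sample line with 13 tab-separated fields: f1 .. f13
my $sample = join("\t", map { "f$_" } 1 .. 13) . "\n";
print nth_field($sample, 12), "\n";    # prints "f12"
```

So `$array[0]` would be the whole first line, and the 12th element would only be reachable after splitting that line into its fields.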
I also tried to understand hashes.
So far I have read that it might be faster to use arrays than hashes, therefore:
Could anyone give me a clue about how to define my file as a grid, where I could use x,y coordinates to get my subsets and then do my comparison?
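A minimal sketch of what I mean by a grid, assuming tab-separated fields and that an array of array references is an acceptable representation. The name `lines_to_grid` and the sample data are made up:

```perl
use strict;
use warnings;

# Build a grid as an array of array references:
# one row per line, one column per tab-separated field.
# (lines_to_grid is a hypothetical helper name.)
sub lines_to_grid {
    my @lines = @_;
    my @grid;
    foreach my $line (@lines) {
        chomp $line;
        push @grid, [ split /\t/, $line, -1 ];
    }
    return \@grid;
}

# Made-up sample data: two lines of three fields each
my @lines = ("a1\ta2\ta3\n", "b1\tb2\tb3\n");
my $grid  = lines_to_grid(@lines);

# Row 1, column 2 (both 0-based)
print $grid->[1][2], "\n";    # prints "b3"
```

With this layout, the 12th element of the first line would be `$grid->[0][11]`.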
I also thought about using hashes to link keys to values, which would constitute the subsets I need, but I am stuck this way too.
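A minimal sketch of the hash idea, assuming the BLAST identifier is the first whitespace-delimited token of each BLAST line and the CDF value to compare is the 12th tab-separated field. `filter_cdf` and the sample lines are hypothetical. Unlike the nested loops in my script, it compares the strings exactly (no regex interpolation) and keeps each CDF line at most once:

```perl
use strict;
use warnings;

# Keep the CDF lines whose 12th field is the leading token of some BLAST line.
# (filter_cdf is a hypothetical helper name.)
sub filter_cdf {
    my ($cdf_lines, $blast_lines) = @_;

    # Index the first whitespace-delimited token of every BLAST line once
    my %seen;
    foreach my $line_b (@$blast_lines) {
        my ($id) = $line_b =~ /^(\S+)\s/;
        $seen{$id} = 1 if defined $id;
    }

    # One hash lookup per CDF line instead of a scan over all BLAST lines
    my @kept;
    foreach my $line (@$cdf_lines) {
        my @fields = split /\t/, $line, -1;
        my $wanted = $fields[11];    # 12th field, 0-based index 11
        push @kept, $line if defined $wanted && exists $seen{$wanted};
    }
    return @kept;
}

# Made-up sample data: 13 fields per CDF line, identifier in field 12
my @cdf = (
    join("\t", ("x") x 11, "gene1", "tail") . "\n",
    join("\t", ("x") x 11, "gene2", "tail") . "\n",
);
my @blast = ("gene1\t99.0\n", "gene3\t80.0\n");

my @kept = filter_cdf(\@cdf, \@blast);
print @kept;    # prints only the line containing gene1
```

The BLAST file is read through once to fill the hash, so the work grows with the sum of the two file sizes rather than their product.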
I know that I could use the object-oriented way, but after having a look at it I think it is even more difficult, so I would prefer to use one of the two previous methods.
Any help is very welcome as I've been stuck for a while on this...