2 questions for perl text manipulation

  • sessmurda
    New Member
    • Jul 2008
    • 14

    I've just started programming in Perl and have written a few successful scripts, but I have a quick question about how to do 2 things.

    First, here is a script I wrote recently. It does what it is supposed to do, but it is not quite what I want.

    Code:
    #!/usr/bin/perl
    
    $file_q = "x.txt";
    
    open(FILE, $file_q)||die "nope\n";
    while(<FILE>){
    
    @line = split(/\s+/, $_);
    
    if($line[0]=~/cere/){
    
    push(@wanted_lines,$line[2]);
    }}
    
    close (FILE);
    
    print "@wanted_lines\n";
    Basically, I need to extract the nth character of each line beginning with 'cere' and push it onto an array, then repeat that for some other strings. From there, I need to print only n characters per line, so that I can print, say, 100 cere characters, then 100 a characters, then 100 b characters, in a format similar to this:


    cere-xxxxxxxxxxxxxxx xxxxxxxxxxxxxxx xxxxxxxxxxxx
    aaaa-yyyyyyyyyyyyyyy yyyyyyyyyyyyyyy yyyyyyyyyyyy
    bbbb-zzzzzzzzzzzzzzz zzzzzzzzzzzzzzz zzzzzzzzzzzz

    Any help is greatly appreciated!
    Last edited by eWish; Jul 1 '08, 01:52 AM. Reason: Please use code tags
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    Hard to say without seeing your data, but here is something you can maybe chew on:

    Code:
    use strict;
    use warnings;
    
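    my $file_q = "x.txt";
    my @wanted_lines = (); # must be declared under the name used below ("use strict")
    open(my $fh, '<', $file_q) or die "nope: $!\n";
    while(<$fh>){
       if(/^cere/){ # line begins with cere
          push @wanted_lines,substr($_,5,100); # up to 100 characters starting at offset 5
       }
    }
    close ($fh);
    print "@wanted_lines\n";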
    Look up substr() and how to use it.
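
For reference, a minimal sketch of how substr() behaves. The sample line (a name/position/base layout) and the variable names are made up for illustration and are not taken from the original script:

```perl
use strict;
use warnings;

# Hypothetical sample line, assumed layout: name, position, base
my $line = "cere 662376 G";

# substr(EXPR, OFFSET, LENGTH): offsets are zero-based
my $name = substr($line, 0, 4);    # the first four characters
my $base = substr($line, -1);      # a negative offset counts from the end
my $rest = substr($line, 5, 100);  # asking for more than is left returns what is left

print "$name|$base|$rest\n";       # prints "cere|G|662376 G"
```

Note that a line read from a file still carries its trailing newline unless it is chomp()ed first, so substr() can return it as part of the result.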

    • sessmurda
      New Member
      • Jul 2008
      • 14

      #3
      Basically, my data is formatted like this, but the real file contains closer to 10,000 lines.

      cere 662376 G
      para 662376 C
      baya 662376 x
      cere 662375 C
      para 662375 G
      baya 662375 x
      cere 662374 G
      para 662374 C
      baya 662374 x
      cere 662373 C
      para 662373 A
      baya 662373 x
      cere 662372 A
      para 662372 A
      baya 662372 x
      cere 662371 T
      para 662371 C
      baya 662371 x
      cere 662370 G
      para 662370 G
      baya 662370 x
      cere 662369 C
      para 662369 A
      baya 662369 C
      cere 662368 A
      para 662368 A
      baya 662368 A
      cere 662367 T
      para 662367 C
      baya 662367 T
      cere 662366 C
      para 662366 C
      baya 662366 C
      cere 662365 G
      para 662365 C
      baya 662365 G
      cere 662364 A
      para 662364 G
      baya 662364 A
      cere 662363 C
      para 662363 C
      baya 662363 C
      cere 662362 G
      para 662362 G
      baya 662362 G
      cere 662361 T
      para 662361 T
      baya 662361 T
      cere 662360 C
      para 662360 A
      baya 662360 C
      cere 662359 A
      para 662359 T
      baya 662359 A
      cere 662358 C
      para 662358 G
      baya 662358 C

      I've been using substr(), but the main thing is that I want to align all the cere characters against all the para characters and all the baya characters, in a format similar to my first post, while printing only a certain number of characters per line, because 1) the sequences are so long, and 2) I have to do this for many different outputs. The problem I've had with substr() alone is that it lists all of the cere points, then all of the next set, whereas I want them aligned so that I can compare them.

      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        Just going by the sample data, I wrote this:

        Code:
        use strict;
        use warnings;
        my %data = ();
        my @genes = ();
        while (my $line=<DATA>) {
           $line =~ tr/ //d; # remove the spaces
           my ($var1, $var2, $var3) = unpack("A4A6A1",$line); # fixed-width fields: name, position, base
           push @genes, $var1 unless exists $data{$var1}; # record each name once, in first-seen order
           $data{$var1} .= $var3; # append this base to the string stored for that name
        }
        
        foreach my $g (@genes) {
           print "$g ", substr($data{$g},0,10), "\n";
        }
        
        __DATA__
        cere 662376 G
        para 662376 C
        baya 662376 x
        cere 662375 C
        para 662375 G
        baya 662375 x
        cere 662374 G
        para 662374 C
        baya 662374 x
        cere 662373 C
        para 662373 A
        baya 662373 x
        cere 662372 A
        para 662372 A
        baya 662372 x
        cere 662371 T
        para 662371 C
        baya 662371 x
        cere 662370 G
        para 662370 G
        baya 662370 x
        cere 662369 C
        para 662369 A
        baya 662369 C
        cere 662368 A
        para 662368 A
        baya 662368 A
        cere 662367 T
        para 662367 C
        baya 662367 T
        cere 662366 C
        para 662366 C
        baya 662366 C
        cere 662365 G
        para 662365 C
        baya 662365 G
        cere 662364 A
        para 662364 G
        baya 662364 A
        cere 662363 C
        para 662363 C
        baya 662363 C
        cere 662362 G
        para 662362 G
        baya 662362 G
        cere 662361 T
        para 662361 T
        baya 662361 T
        cere 662360 C
        para 662360 A
        baya 662360 C
        cere 662359 A
        para 662359 T
        baya 662359 A
        cere 662358 C
        para 662358 G
        baya 662358 C
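
One possible way to get the wrapped, aligned layout asked about earlier in the thread is to cut each name's string into fixed-width chunks and print the chunks block by block. This is only a sketch: `format_blocks` and its parameters are hypothetical names introduced here, not code from the thread, and the three strings are simply the first ten bases of each name in the sample data.

```perl
use strict;
use warnings;

# format_blocks is a hypothetical helper (not from the thread).
# $width: characters per line; $order: array ref of names in print order;
# $seq: hash ref mapping each name to its concatenated string.
sub format_blocks {
    my ($width, $order, $seq) = @_;

    # Find the longest string so every name is walked over the same range
    my $longest = 0;
    for my $name (@$order) {
        my $len = length $seq->{$name};
        $longest = $len if $len > $longest;
    }

    my $out = '';
    for (my $start = 0; $start < $longest; $start += $width) {
        for my $name (@$order) {
            my $s = $seq->{$name};
            # Guard against reading past the end of a shorter string
            my $chunk = $start < length($s) ? substr($s, $start, $width) : '';
            $out .= "$name-$chunk\n";
        }
        $out .= "\n";   # blank line between blocks
    }
    return $out;
}

# First ten bases per name, taken from the sample data above
my %data = (
    cere => 'GCGCATGCAT',
    para => 'CGCAACGAAC',
    baya => 'xxxxxxxCAT',
);
print format_blocks(4, [qw(cere para baya)], \%data);
```

With a width of 4 the first block reads `cere-GCGC`, `para-CGCA`, `baya-xxxx`, so the same positions sit directly under one another. A width of 100 would give the 100-characters-per-line layout described in the first post.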

        • sessmurda
          New Member
          • Jul 2008
          • 14

          #5
          Thanks! I've done a bit more manipulation to get it to do exactly what I want. Your help is greatly appreciated!
