perl regex

  • lilly07
    New Member
    • Jul 2008
    • 89

    perl regex

    I have a data file and 4th column looks like below: Some examples

    34899939-34899967
    34899939-34899967:34905554-34905559
    34899939-34899967:34905554-34905559:34905560-34905574


    I have to extract like below:
    For the first line:
    $start = 34899939
    $end = 34899967
    $block_size = 1

    For the 2nd line:
    $start = 34899939
    $end = 34905559
    $block_size = 2
    $n1=34899939
    $n2=34899967
    $n3=34905554
    $n4=34905559

    For the 3rd line:
    $start = 34899939
    $end = 34905574
    $block_size = 3
    $n1=34899939
    $n2=34899967
    $n3=34905554
    $n4=34905559
    $n5=34905560
    $n6=34905574

    I can tell a 1-block value from a 2-block value by the : character, and I found a solution for the 2nd line as below:
    Code:
    sub special {
            
            chomp $_;
            my @v = split(/\s+/,$_);        
            if($v[3] =~ /\:/) {
            $num1 = $`;
            $num2 = $';
                    if($num1 =~ /\-/) {
                            $n1 = $`;
                            $n2 = $';
                    }
                    if($num2 =~ /\-/) {
                            $n3 = $`;
                            $n4 = $';
                    }
            }
            $start = $n1;
            $end = $n4;
            print "$n1 \t $n2 \t $n3 \t $n4 \n";
           
    }
    But how do I generalise this to any number of blocks separated by : and generate the $n(i)? Thanks.
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    Your data is confusing. What do you mean by "the 4th column looks like this"? You have posted three separate lines of data. Are they part of a larger line of data? Why does your sub special() first split on spaces when there are no spaces in the data you posted?


    • lilly07
      New Member
      • Jul 2008
      • 89

      #3
      Hi Kevin, sorry for the confusion. Let me try to explain.

      My 4th column can contain data like the examples in my previous post. It may have no : at all, or it may have two, three or more sets of numbers separated by :

      So the 4th column can hold a different kind of value on each line.

      For every line, I split on whitespace to get the 4th column. Then I split that value on the : character, and split each piece further on -.

      If my 4th column is like the 1st example, then it is easy for me to split the numbers and put them into $n1 and $n2.

      If my 4th column is like the 2nd example, with one :, then my subroutine special does the job and assigns $n1, $n2, $n3 and $n4.

      But I want to write a generalised routine which can handle a value like line 3, or one with even more : separators.

      I hope I have explained it clearly.
      Regards


      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        That helped clear it up. I hope I am not doing your school work for you.

        Code:
        while(<DATA>) {
           special($_);
        }
        
        sub special {
           local ($_) = @_; 
           chomp $_;
           my $col4 = (split(/\s+/))[3];
           my @blocks = split(/:/,$col4);
           my @temp;
           for (@blocks) {
              push @temp, split(/-/);
           }
           print "start = $temp[0]\n";
           print "end = $temp[-1]\n";
           print 'blocks = ', scalar @blocks,"\n";
           for (@temp) {
              print "\t$_\n";
           }
           print "\n";
        }
        __DATA__
        dummy dummy dummy 34899939-34899967 dummy
        dummy dummy dummy 34899939-34899967:34905554-34905559 dummy 
        dummy dummy dummy 34899939-34899967:34905554-34905559:34905560-34905574 dummy
        Apply your own file I/O in place of DATA.
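
As a rough sketch of swapping DATA for a real file, the same split logic could be wrapped as below. The name summarize_file and the one-line summary format are assumptions made for illustration, not part of the code above. Note that split ' ' skips leading whitespace, which split(/\s+/) does not.

```perl
use strict;
use warnings;

# summarize_file is a hypothetical helper name: it reads a file line by
# line and returns one summary string per line that has a 4th column.
sub summarize_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my @out;
    while (my $line = <$fh>) {
        chomp $line;
        # split ' ' ignores leading whitespace and splits on runs of spaces/tabs
        my $col4 = (split ' ', $line)[3];
        next unless defined $col4;                 # skip short or blank lines
        my @blocks = split /:/, $col4;             # one element per start-end block
        my @n      = map { split /-/, $_ } @blocks; # flat list: n1, n2, n3, ...
        push @out, "start = $n[0], end = $n[-1], blocks = " . scalar(@blocks);
    }
    close $fh;
    return @out;
}

# Process every file named on the command line (prints nothing if none given).
print "$_\n" for map { summarize_file($_) } @ARGV;
```

Saved as a script, it would be run with the data file name as its argument.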


        • KevinADC
          Recognized Expert Specialist
          • Jan 2007
          • 4092

          #5
          output is:

          Code:
          start = 34899939
          end = 34899967
          blocks = 1
          	34899939
          	34899967
          
          start = 34899939
          end = 34905559
          blocks = 2
          	34899939
          	34899967
          	34905554
          	34905559
          
          start = 34899939
          end = 34905574
          blocks = 3
          	34899939
          	34899967
          	34905554
          	34905559
          	34905560
          	34905574
          Edit the output to display it however you need. I added "start", "end" and "blocks" just to make it easier to read.
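
If the values are needed in variables rather than printed, one possible variation is to return them in a hash, with the $n(i) numbers held in an array instead of separately named scalars. This is only a sketch: parse_blocks and the hash keys are hypothetical names chosen to mirror the $start, $end and $block_size in the original question.

```perl
use strict;
use warnings;

# parse_blocks is a hypothetical helper name. It takes the 4th-column
# string and returns a hash reference describing it.
sub parse_blocks {
    my ($col4) = @_;
    my @blocks = split /:/, $col4;              # "a-b:c-d" -> ("a-b", "c-d")
    my @n      = map { split /-/, $_ } @blocks; # -> (a, b, c, d)
    my %result = (
        start      => $n[0],
        end        => $n[-1],
        block_size => scalar @blocks,
        n          => \@n,   # $result{n}[0] plays the role of $n1, and so on
    );
    return \%result;
}

my $r = parse_blocks('34899939-34899967:34905554-34905559:34905560-34905574');
print "start=$r->{start} end=$r->{end} block_size=$r->{block_size}\n";
for my $i (0 .. $#{ $r->{n} }) {
    printf "n%d=%s\n", $i + 1, $r->{n}[$i];
}
```

An array avoids having to invent a new variable name for every extra block, so the same routine works for any number of : separators.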
