regular expression inquiry

This topic is closed.
  • vito

    regular expression inquiry

    I'm processing the following sequence, which is more than 100k characters long:

    1 cagatgctga taaaaaagtg tgttcctcat agcatttatt taattgaaat atttcaagaa
    61 cttgaatgta ctaaaaattg agacaaacag tagcaaatca taaaaaaaaa ttgaagtgaa
    121 ttttacaact ggattcatgt gcctaatatt ttcattggga agtggattca tgtttaacat
    181 ttccattggg <snippet>

    I wrote a program:

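    <?php
    session_start();

    // Accept the sequence from either a POST or a GET request.
    if (isset($_POST['seq'])) {
        $seq = $_POST['seq'];
    } else {
        $seq = $_GET['seq'];
    }

    // Strip whitespace, line breaks, and the position numbers.
    $seq = preg_replace("/[\s\r\n0-9]/", "", $seq);
    echo $seq;

    ?>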

    But the output is a fragmented sequence (i.e. the input is only partially
    processed). What is the problem?


  • Benjamin Esham

    #2
    Re: regular expression inquiry

    vito wrote:
    > i wrote a program
    >
    > $seq = preg_replace("/[\s\r\n0-9]/", "", $seq);
    >
    > but it generates an output of fragmented sequences (i.e. partially
    > processed result), what is the problem?

    Your regular expression may be giving unexpected results: since the RE is
    double-quoted, the \r and \n are converted to their respective special
    characters *before* being sent to preg_replace(), while the \s is left to be
    processed by preg_replace(). You might try

    $seq = preg_replace('/[[:space:]0-9]/', '', $seq);

    which sidesteps the issue entirely by using [:space:]. Personally, I would
    just use

    $seq = preg_replace('/[^acgt]/', '', $seq);

    which removes everything except for the characters [acgt]. This will work
    no matter what other stuff happens to be present in the input file.
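
    For comparison, here is a small sketch that runs all three patterns over
    the same input. The sample string is made up for illustration (two lines
    in the posted format, joined by a Windows line ending) and is not the real
    data. On this sample the three patterns should give the same string, so if
    the real output still looks fragmented, it may be worth checking what
    actually arrives in $seq.

```php
<?php
// Hypothetical sample: two lines in the posted format, joined by CR+LF.
$sample = "1 cagatgctga taaaaaagtg\r\n61 cttgaatgta ctaaaaattg";

// Double-quoted pattern: PHP turns \r and \n into real CR/LF characters
// before PCRE sees them; \s is not a PHP escape, so it reaches PCRE as-is.
$a = preg_replace("/[\s\r\n0-9]/", "", $sample);

// Single-quoted pattern using the POSIX class.
$b = preg_replace('/[[:space:]0-9]/', '', $sample);

// Whitelist: drop everything that is not a, c, g, or t.
$c = preg_replace('/[^acgt]/', '', $sample);

echo $a, "\n", $b, "\n", $c, "\n";
```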

    HTH,
    --
    Benjamin D. Esham
    bdesham@gmail.c om | AIM: bdesham128 | Jabber: same as e-mail
    Más sabe el diablo por viejo que por diablo. (Spanish proverb)


    • yawnmoth

      #3
      Re: regular expression inquiry

      vito wrote:
      > I'm processing the following sequence with length more than 100k
      >
      > 1 cagatgctga taaaaaagtg tgttcctcat agcatttatt taattgaaat atttcaagaa
      > 61 cttgaatgta ctaaaaattg agacaaacag tagcaaatca taaaaaaaaa ttgaagtgaa
      > 121 ttttacaact ggattcatgt gcctaatatt ttcattggga agtggattca tgtttaacat
      > 181 ttccattggg <snippet>
      >
      > i wrote a program
      >
      > <?php
      > session_start() ;
      >
      > if (isset ($_POST['seq']) )
      > $seq = $_POST['seq'];
      > else
      > $seq= $_GET['seq'];
      >
      > $seq = preg_replace("/[\s\r\n0-9]/", "", $seq);
      > echo $seq;
      >
      > ?>
      >
      > but it generates an output of fragmented sequences (i.e. partially processed
      > result), what is the problem?

      To quote from php.net's "Pattern Syntax" article:

      "By default, a whitespace character (eg. \s) is any character that the
      C library function isspace() recognizes, though it is possible to
      compile PCRE with alternative character type tables. Normally isspace()
      matches space, formfeed, newline, carriage return, horizontal tab, and
      vertical tab."

      Per that, including \r and \n in the class definition is redundant.
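
      A quick way to check this on a given installation is to test each
      character against \s on its own. This is only a sketch; the list of
      characters is my own selection and leaves out the vertical tab, whose
      handling may differ between PCRE builds.

```php
<?php
// Characters tested against \s: space, tab, newline, carriage return,
// and form feed. The labels are only used for the printed report.
$chars = array(
    'space'           => " ",
    'horizontal tab'  => "\t",
    'newline'         => "\n",
    'carriage return' => "\r",
    'form feed'       => "\f",
);

$results = array();
foreach ($chars as $name => $ch) {
    // preg_match() returns 1 when the pattern matches, 0 when it does not.
    $results[$name] = preg_match('/\s/', $ch);
    echo str_pad($name, 16), $results[$name] ? "matched" : "not matched", "\n";
}
```

      If every line reports "matched", then [\s0-9] and [\s\r\n0-9] should
      behave the same way on the input.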

      Anyway, what you should get with the code you wrote is an uninterrupted
      sequence of a's, g's, t's, and c's. What do you actually get when you
      provide the above sequence? And what output do you want?
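
      One way to answer that without relying on how the browser displays a
      very long string is to inspect the cleaned value directly. Below is a
      minimal sketch; describe_sequence() is a hypothetical helper of my own,
      and the sample input (with a stray "N" in it) is invented for
      illustration.

```php
<?php
// Hypothetical helper (not from the original post): reports the length of a
// cleaned sequence and counts any characters other than a, c, g, and t.
function describe_sequence($seq)
{
    // Collect every character that is not a, c, g, or t.
    preg_match_all('/[^acgt]/', $seq, $m);

    return array(
        'length'     => strlen($seq),
        'unexpected' => array_count_values($m[0]),
    );
}

// Invented sample: two lines in the posted format, one stray "N".
$raw     = "1 cagatgctga taaaaaagtg\r\n61 cttgaatgta ctaNaaattg";
$cleaned = preg_replace('/[\s0-9]/', '', $raw);

$info = describe_sequence($cleaned);
print_r($info);
```

      If the reported length is much smaller than what was submitted, the
      input was probably cut short before the regular expression ran. That is
      only a guess on my part, but it would be worth ruling out.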


      • Colin McKinnon

        #4
        Re: regular expression inquiry

        vito wrote:
        > I'm processing the following sequence with length more than 100k
        >
        > 1 cagatgctga taaaaaagtg tgttcctcat agcatttatt taattgaaat atttcaagaa
        > 61 cttgaatgta ctaaaaattg agacaaacag tagcaaatca taaaaaaaaa ttgaagtgaa
        > 121 ttttacaact ggattcatgt gcctaatatt ttcattggga agtggattca tgtttaacat
        > 181 ttccattggg <snippet>

        Nice genes.

        > i wrote a program
        >
        > <?php
        > session_start() ;
        >
        > if (isset ($_POST['seq']) )
        > $seq = $_POST['seq'];
        > else
        > $seq= $_GET['seq'];
        >
        > $seq = preg_replace("/[\s\r\n0-9]/", "", $seq);
        > echo $seq;
        >
        > ?>

        You provided an example of the input but not the output, so I'm not quite
        sure. But...

        Since you're using PCRE, why not:
        /[\s\r\n\d]/

        or even

        /[^ctag]/

        It's improbable that this matters here, but do you know that \s doesn't
        match ASCII chr(11) (vertical tab), whereas [:space:] does?
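
        This is easy to check on your own installation. The sketch below
        prints the PCRE version PHP was built against and tests both forms
        against chr(11). As far as I know, later PCRE releases changed \s to
        include the vertical tab, so the first answer may depend on the
        version shown.

```php
<?php
$vt = chr(11); // ASCII vertical tab

// PCRE_VERSION is a predefined PHP constant describing the PCRE library.
echo "PCRE version: ", PCRE_VERSION, "\n";

// preg_match() returns 1 on a match and 0 otherwise.
$backslash_s = preg_match('/\s/', $vt);
$posix_space = preg_match('/[[:space:]]/', $vt);

echo '\s matches chr(11):          ', $backslash_s ? "yes" : "no", "\n";
echo '[[:space:]] matches chr(11): ', $posix_space ? "yes" : "no", "\n";
```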

        HTH
        C.
