Sorting Massive amounts of Data

  • Advo

    Sorting Massive amounts of Data

    Hi there.

    I've got some data currently stored in an Excel table, or in a Notepad
    file, depending on which would be easier to work with. I'm assuming
    Notepad (.txt) so that the data can be read into PHP.

    Anyway, this data is majorly corrupted and contains stuff I don't need.
    For instance, a sample line is:

    MNs73dd78INFORMATION I NEED 32427 12759 39384425 495242 15.206412
    0.191214 44.164503 -93.993798

    As you can see, it's pretty messed up. I only need the "INFORMATION I
    NEED" part, not the 9 characters before it, nor any of the numbers or
    decimals afterwards.

    I can't do this manually, as there are tens of thousands of records.

    Any ideas how I could go about this, please?

    I'm assuming I would somehow remove the first 9 characters, and then
    somehow remove all the numbers and decimals, but I've got no idea
    how to go about it.

    Kind Regards

  • Moot

    #2
    Re: Sorting Massive amounts of Data

    Open the file
    Loop through each line with fgets [1]
    Use substr [2] to yank out from character 9 to whatever you need
    Dump data into an array or an output file as you see fit
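
The steps above can be sketched roughly as follows. This is a minimal illustration, not a tested solution for the real file: the helper names `dropPrefix` and `cleanLines` are made up for this example, and the prefix width of 9 is taken from the sample line in the first post. It only removes the leading junk; the trailing numbers are still there afterwards.

```php
<?php
// Hypothetical helper: drop a fixed-width junk prefix from one line.
// The default width of 9 is an assumption based on the sample line.
function dropPrefix(string $line, int $prefixLength = 9): string
{
    $line = rtrim($line, "\r\n");            // fgets() keeps the line ending
    if (strlen($line) <= $prefixLength) {
        return '';                            // nothing left after the prefix
    }
    return substr($line, $prefixLength);      // substr() offsets are zero-based
}

// Hypothetical helper: read every line from an open file handle
// and return the cleaned lines as an array.
function cleanLines($handle): array
{
    $cleaned = [];
    // fgets() returns false at end of file, which ends the loop
    while (($line = fgets($handle)) !== false) {
        $cleaned[] = dropPrefix($line);
    }
    return $cleaned;
}
```

With a real file, something like `cleanLines(fopen('data.txt', 'r'))` would return the array, which could then be written out with `fwrite` or `file_put_contents`.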

    Now, if the data in each line you need is variable length (you don't
    say), then it becomes trickier. You may have to tokenize the line with
    something like explode [3], then do some logic to get only the tokens
    you want. Or you may be able to figure out a regular expression to
    trim off the data to the right (I'll leave that to someone much wiser
    than I). There has to be some kind of standard format to the file
    which delimits each field (either that, or whoever set up this file
    royally screwed up).
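
One possible shape for the tokenizing idea, assuming the fields are separated by single spaces and the unwanted fields are always plain numbers at the end of the line (neither is confirmed by the original poster). The function name `dropTrailingNumbers` is hypothetical, and it expects the 9-character prefix to have been removed already.

```php
<?php
// Hypothetical helper: split a line into space-separated tokens and
// discard numeric tokens from the right-hand end until a non-numeric
// token is reached. Assumes single spaces between fields.
function dropTrailingNumbers(string $text): string
{
    $tokens = explode(' ', trim($text));

    // is_numeric() accepts integers, decimals and a leading minus sign,
    // which covers every trailing field in the sample line.
    while ($tokens !== [] && is_numeric(end($tokens))) {
        array_pop($tokens);
    }

    return implode(' ', $tokens);
}
```

Numbers that appear in the middle of the wanted text are left alone, because the loop stops at the first non-numeric token counted from the right.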

    [1] - http://us3.php.net/manual/en/function.fgets.php
    [2] - http://us3.php.net/manual/en/function.substr.php
    [3] - http://us3.php.net/manual/en/function.explode.php


    • Sanders Kaufman

      #3
      Re: Sorting Massive amounts of Data

      Moot wrote:
      > Advo wrote:
      >> For instance, a sample line is:
      >>
      >> MNs73dd78INFORMATION I NEED 32427 12759 39384425 495242 15.206412
      >> 0.191214 44.164503 -93.993798
      >>
      >> Im assuming I would some how remove the first 9 characters, and then
      >> some how remove all numbers and decimal places, but i've got no idea
      >> how to go about it.
      >>
      >> Kind Regards
      >
      > Open the file
      > Loop through each line with fgets [1]
      > Use substr [2] to yank out from character 9 to whatever you need
      > Dump data into an array or an output file as you see fit

      That's one way - but using RegEx would probably be
      better... if you can figure out a *pattern* in the
      corruption (e.g. first 9 characters, all numerals
      after the last alpha character).

      If you can't figure out a formula/pattern, you'll
      just have to do it manually. No way around it.
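
The example pattern mentioned above (skip the first 9 characters, keep everything up to the last alphabetic character) might look something like this. It is only a sketch under that assumption; `extractText` is a made-up name, and the pattern would cut real data short if the wanted text can end in a digit.

```php
<?php
// Hypothetical helper: skip the first 9 characters, then capture
// everything up to and including the last ASCII letter on the line.
// Returns null when the line does not fit the pattern.
function extractText(string $line): ?string
{
    // .{9}          the junk prefix (width assumed from the sample line)
    // (.*[A-Za-z])  greedy match, so it ends at the LAST letter
    if (preg_match('/^.{9}(.*[A-Za-z])/', $line, $m) === 1) {
        return $m[1];
    }
    return null;
}
```

A null return is a cheap way to flag the lines that do not follow the pattern, so only those need looking at by hand.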



      • lorento

        #4
        Re: Sorting Massive amounts of Data

        Advo wrote:
        >
        MNs73dd78INFORMATION I NEED 32427 12759 39384425 495242 15.206412
        0.191214 44.164503 -93.993798
        >
        As you can see, its pretty messed up. I only need the "INFORMATION I
        NEED" part, not the 9 characters before, nor any numbers or decimal
        places afterwards.
        >
        You can use regex maybe like this (not tested yet):

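        <?php

        $fr = fopen("data.txt", "r");
        $fw = fopen("data_clean.txt", "a");

        // fgets() returns false at end of file, which ends the loop cleanly
        while (($ln = fgets($fr)) !== false)
        {
            // Skip the 9-character prefix, lazily capture the text, and drop
            // the trailing run of digits, dots, minus signs and whitespace.
            // (A greedy (.*) followed by (\d+) would only strip the last digit.)
            $ln = preg_replace('/^.{9}(.*?)[\s\d.-]*$/', '$1', $ln);
            fwrite($fw, $ln . "\n");
        }
        fclose($fr);
        fclose($fw);

        ?>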


        • malatestapunk

          #5
          Re: Sorting Massive amounts of Data

          Or you could do something like:

          $matches = array();
          $fileText = file_get_contents('data.txt'); // PHP 4.3+
          if (preg_match_all('/^.{9}(SUBPATTERN_THAT_MATCHES_YOUR_DATA).*$/m',
                  $fileText, $matches, PREG_SET_ORDER)) {
              // With PREG_SET_ORDER, $matches holds one entry per matched line:
              // $matches[0][0] - the whole first matched line
              // $matches[0][1] - your data from the first matched line
              // $matches[1][0] - the whole second matched line
              // $matches[1][1] - your data from the second matched line
              // ... and so on.
          } else {
              // Your search failed - nothing matched.
          }

          Note that you'll have to replace SUBPATTERN_THAT_MATCHES_YOUR_DATA with
          a valid regular expression describing the data you'd like to extract.
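
As an illustration only, here is one way the placeholder could be filled in. It assumes the wanted text is followed by nothing but space-separated numbers, which matches the sample line but has not been confirmed for the whole file. The function name `extractRecords` is hypothetical, and `PREG_SET_ORDER` is passed so that each entry of `$matches` corresponds to one line.

```php
<?php
// Hypothetical helper: pull the text field out of every line of a
// multi-line string in a single preg_match_all() call.
function extractRecords(string $fileText): array
{
    // ^.{9}                        9-character junk prefix (assumed width)
    // (.*?)                        the wanted text, captured lazily
    // (?:[ \t]+-?\d+(?:\.\d+)?)*   any run of trailing numeric fields
    // [ \t]*\r?$                   optional padding and a Windows "\r"
    // The m modifier makes ^ and $ match at every line boundary.
    $pattern = '/^.{9}(.*?)(?:[ \t]+-?\d+(?:\.\d+)?)*[ \t]*\r?$/m';

    $records = [];
    if (preg_match_all($pattern, $fileText, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $records[] = $match[1];   // group 1 is the wanted text
        }
    }
    return $records;
}
```

It could be called as `extractRecords(file_get_contents('data.txt'))`. Lines shorter than 9 characters are skipped rather than reported, so comparing the result count against the line count is a quick sanity check.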

