processing raw logs faster

  • changereality@gmail.com

    processing raw logs faster

    I am trying to process raw IIS log files and insert them into a MySQL
    database. I have no problem accomplishing this, but the PHP code runs
    very slowly.

    Right now, it is processing 10,000 lines of a log file every ~300
    seconds, and I am dealing with daily log files of over 500,000 records.
    It takes hours to process a single day's file.

    <?php
    error_reporting(E_ALL);

    $server = "localhost:3306"; // db host
    $user   = "";               // db username
    $pswd   = "";               // db password
    $db     = "";               // db name

    $DB_link = mysql_connect($server, $user, $pswd) OR DIE(mysql_error());
    mysql_select_db($db, $DB_link) OR DIE(mysql_error());

    $start_time = time();

    $handle = fopen("/logs/ex050830.log", "r");
    $linecnt = 0;
    $totalcnt = 0;
    while (!feof($handle)) {

        $list = array();
        $buffer = fgets($handle, 20000);

        if (! preg_match("/^\s*?#/", $buffer) ){
            $linecnt++;
            $line = split(" ", $buffer);

            $stmt = "INSERT INTO logs ( `hit_date` , `hit_time` , `s-sitename` , `s-computername` , ".
                "`s-ip` , `cs-method` , `cs-uri-stem` , `cs-uri-query` , `s-port` , `cs-username` , `c-ip` , ".
                "`cs-version` , `User-Agent` , `Cookie` , `Referer` , `cs-host` , `sc-status` , `sc-substatus` , ".
                "`sc-win32-status` , `sc-bytes` , `cs-bytes` , `time_taken` ) ".
                " VALUES ( '".$line[0]."', '".$line[1]."', '".$line[2]."', '".$line[3]."', '".$line[4]."', ".
                "'".$line[5]."','".$line[6]."','".$line[7]."','".$line[8]."','".$line[9]."', ".
                "'".$line[10]."','".$line[11]."','".$line[12]."','".$line[13]."','".$line[14]."','".$line[15]."', ".
                "'".$line[16]."','".$line[17]."','".$line[18]."','".$line[19]."','".$line[20]."','".trim($line[21])."' )";

            @mysql_query($stmt);

            if( $linecnt >= 10000 ){
                $totalcnt += $linecnt;
                echo "[ ".$totalcnt." ( ".( time() - $start_time )." ) ]\t";
                $linecnt = 0;
            }
        }
    }
    fclose($handle);

    ?>

  • Andy Hassall

    #2
    Re: processing raw logs faster

    On 31 Aug 2005 12:43:15 -0700, "changereality@gmail.com"
    <changereality@gmail.com> wrote:

    >I am trying to process raw IIS log files and insert them into a MySQL
    >database. I have no problem accomplishing this, but the php code runs
    >very slow.

    Is it the PHP that's slow, or the database updates? Put some timing code in to
    get an idea of which bit takes longest, so you can optimise the right part.

    --
    Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
    http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
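
    A minimal sketch of that idea (the phase names and variables here are
    made up for illustration, and the database call is stubbed out): time
    each phase of the loop separately with microtime() and report the
    totals, so you know which part to optimise.

    ```php
    <?php
    // Hypothetical timing sketch: accumulate how long each phase of the
    // loop takes, to see whether parsing or the INSERT is the bottleneck.
    $parse_time  = 0.0;
    $insert_time = 0.0;

    $lines = array("a b c", "d e f"); // stand-in for fgets() over the log

    foreach ($lines as $buffer) {
        $t0 = microtime(true);
        $fields = explode(" ", $buffer);       // the "parse" phase
        $parse_time += microtime(true) - $t0;

        $t1 = microtime(true);
        // @mysql_query($stmt) would go here   // the "insert" phase
        $insert_time += microtime(true) - $t1;
    }

    printf("parse: %.4fs, insert: %.4fs\n", $parse_time, $insert_time);
    ?>
    ```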


    • Chung Leong

      #3
      Re: processing raw logs faster

      You should really rethink your approach. What information do you want
      to extract out of the log? Simply sticking every pagehit into the
      database will cause MySQL to blow up in no time.


      • changereality@gmail.com

        #4
        Re: processing raw logs faster

        >Is it the PHP that's slow, or the database updates? Put some timing
        >code in to get an idea of which bit takes longest, so you can optimise
        >the right part.

        I removed the SQL insert and the code ran faster, but not by much.

        >You should really rethink your approach. What information do you want
        >to extract out of the log? Simply sticking every pagehit into the
        >database will cause MySQL to blow up in no time.

        What would you suggest?


        • Mladen Gogala

          #5
          Re: processing raw logs faster

          On Wed, 31 Aug 2005 17:27:53 -0700, changereality@gmail.com wrote:

          > What would you suggest?

          You can get a decent database from http://www.postgresql.org. It's kinda
          better suited for heavy OLTP processing than MySQL. Not as good as Oracle
          RDBMS, but definitely getting there.

          --



          • Colin McKinnon

            #6
            Re: processing raw logs faster

            Mladen Gogala wrote:

            > On Wed, 31 Aug 2005 17:27:53 -0700, changereality@gmail.com wrote:
            >
            >> What would you suggest?
            >
            > You can get a decent database from http://www.postgresql.org. It's kinda
            > better suited for heavy OLTP processing than MySQL. Not as good as Oracle
            > RDBMS, but definitely getting there.

            No - he wanted it to go faster.

            changereality@gmail.com wrote:

            > $list = array();
            > $buffer = fgets($handle, 20000);
            >
            > if (! preg_match("/^\s*?#/", $buffer) ){

            Here's your first problem. Regexes are slow. My Perl REs are a bit
            rusty - but that one looks a bit suspect anyway. Try coding it without REs.
            > $stmt = "INSERT INTO logs ( `hit_date` , `hit_time` , `s-sitename` ,
            > `s-computername` , ".
            > "`s-ip` , `cs-method` , `cs-uri-stem` , `cs-uri-query` ,
            > `s-port` , `cs-username` , `c-ip` , ".
            > "`cs-version` , `User-Agent` , `Cookie` , `Referer` ,
            > `cs-host` , `sc-status` , `sc-substatus` , ".
            > "`sc-win32-status` , `sc-bytes` , `cs-bytes` , `time_taken` ) ".

            Join the strings together - OK it doesn't help the readability - but you
            will get some performance benefit. Actually it would be a lot better to
            move the invariant parts outside the loop:

            $stub = "INSERT INTO logs....VALUES( ";
            while (!feof($handle)) {
            ....
            $stmt = $stub . "'".$line[0]."', '".$line[1]."', '".$line[2]."',

            You could try a more functional approach to generating the VALUES clause -
            something like:

            $stmt = $stub . "'" . implode("','", $line) . "')";

            This could be more efficient:
            > if( $linecnt >= 10000 ){
            > $totalcnt += $linecnt;
            > echo "[ ".$totalcnt ." ( ". ( time() - $start_time) ." ) ]\t";
            > $linecnt = 0;
            > }

            Instead:

            if (!($linecnt % 10000)) {
            echo "[ ".$linecnt. " ( ". ( time() - $start_time) ." ) ]\t";
            }

            You should also get a boost by using INSERT DELAYED (assuming your MySQL
            version and your table's storage engine support it)

            $stub="INSERT DELAYED INTO logs....VALUES( ";

            HTH

            C.
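
            Putting those suggestions together, a rough sketch (the column list
            is abbreviated to two fields and the helper function name is
            invented for illustration):

            ```php
            <?php
            // Sketch combining the suggestions above: the invariant SQL prefix
            // is built once outside the loop, and implode() joins the already-
            // split fields into the VALUES list in one call.
            $stub = "INSERT DELAYED INTO logs (`hit_date`, `hit_time`) VALUES ('";

            function build_stmt($stub, $fields) {
                return $stub . implode("','", $fields) . "')";
            }

            $stmt = build_stmt($stub, array("2005-08-30", "23:59:59"));
            echo $stmt . "\n";
            // -> INSERT DELAYED INTO logs (`hit_date`, `hit_time`) VALUES ('2005-08-30','23:59:59')
            ?>
            ```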


            • NC

              #7
              Re: processing raw logs faster

              changereality@gmail.com wrote:

              > I am trying to process raw IIS log files and insert them into a MySQL
              > database. I have no problem accomplishing this, but the php code runs
              > very slow.

              Why do you need PHP code at all? If I remember correctly, IIS log
              files are delimited text files that can be imported directly
              into MySQL using the LOAD DATA INFILE command...

              Cheers,
              NC
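
              A sketch of what that could look like (untested; the delimiter
              and the IGNORE count are assumptions - IIS W3C logs are
              space-delimited by default and begin with a few #-prefixed
              header lines, so adjust both to match your format):

              ```php
              <?php
              // Hypothetical LOAD DATA INFILE statement; the mysql_query()
              // call is commented out since it needs a live connection and
              // the appropriate FILE privileges.
              $sql = "LOAD DATA LOCAL INFILE '/logs/ex050830.log' " .
                     "INTO TABLE logs " .
                     "FIELDS TERMINATED BY ' ' " .
                     "LINES TERMINATED BY '\\n' " .
                     "IGNORE 4 LINES";
              // @mysql_query($sql) or die(mysql_error());
              echo $sql . "\n";
              ?>
              ```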


              • Chung Leong

                #8
                Re: processing raw logs faster

                Colin McKinnon wrote:

                > Here's your first problem. Regexes are slow. My Perl REs are a bit
                > rusty - but that looks a bit suspect anyway. Try coding it without REs.

                The PCRE extension caches compiled regexes, so the overhead isn't
                high when you keep testing the same expression. It's highly
                unlikely that you could write anything in plain PHP that is more
                efficient than the expression /^\s*#/.

                The call to split() definitely should be replaced by explode(),
                however. The Posix regex functions do recompile the expression each
                time.
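
                For what it's worth, the swap is one line, since the delimiter
                here is a literal space (sample line invented for illustration):

                ```php
                <?php
                // explode() splits on a literal string, so no regular
                // expression is compiled at all.
                $buffer = "2005-08-30 23:59:59 W3SVC1";
                $line = explode(" ", $buffer);   // replaces split(" ", $buffer)
                print_r($line);
                ?>
                ```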


                • Mladen Gogala

                  #9
                  Re: processing raw logs faster

                  On Thu, 01 Sep 2005 09:49:27 +0100, Colin McKinnon wrote:
                  >> You can get a decent database from http://www.postgresql.org. It's
                  >> kinda better suited for heavy OLTP processing than MySQL. Not as good
                  >> as Oracle RDBMS, but definitely getting there.
                  >
                  > No - he wanted it to go faster.

                  Well, that is precisely what he will get, unless he's using MySQL in
                  single-user mode. Being an Oracle DBA for 15+ years and working for large
                  companies (my largest database was > 1TB, with 3000 concurrent users), I
                  forgot that there are such databases. MySQL is serious junk: it has
                  problems with transactions (if "dirty reads" are disabled and the standard
                  ANSI-level transaction isolation is enabled, database speed is 25% of the
                  original), it is prone to lock escalation (many row locks are converted to
                  a table lock; row locking was only introduced in MySQL v4, so it is very
                  new), and it is notorious for index corruption and for indexes having to
                  be rebuilt. Postgres has no lock escalation, normally operates at the
                  ANSI-standard serialization level, and its indexes have to be rebuilt only
                  on rare occasions. For large loads, which is precisely what he's
                  describing, nothing in the world beats Oracle RDBMS with parallel/direct
                  insert. Postgres is the best free alternative. The ease with which you
                  dismissed Postgres tells me that you probably don't know much about
                  databases.

                  > changereality@gmail.com wrote:
                  >
                  >> $list = array();
                  >> $buffer = fgets($handle, 20000);
                  >>
                  >> if (! preg_match("/^\s*?#/", $buffer) ){
                  >
                  > Here's your first problem. Regexes are slow.

                  I love unexplained declarations/qualifications like this! Delivered as a
                  verdict, such absolute statements apparently need no explanation.
                  Unfortunately, this verdict is bound to be struck down on appeal. Regex is
                  a standard programming mechanism and is not "slow" unless used carelessly.
                  Loops can also be slow if used carelessly - should we avoid loops, too?
                  How about switch statements? I will not even mention PHP5's new try-catch
                  combination with exceptions. Creating monstrosities without regex will
                  save you nothing, because the few instructions saved by refraining from
                  regular expressions will be more than compensated for by the monstrosity
                  of the program without them. Your logic leads directly to assembly-language
                  programming - without a doubt much faster than PHP, Perl, Java or anything
                  else, and you will not have to suffer from regex. Just buy a few
                  mathematics books describing the theory of algorithms, including finite
                  automata, and implement that in assembly. The speed will be unbeatable,
                  once you finish the program in the year 2525. Obey the principles of good
                  programming and your programs will work fine, even with REs. Write stupid
                  programs that are algorithmically incorrect, and not even coding in binary
                  will make them less stupid.
                  > My Perl REs are a bit rusty - but that looks a bit suspect anyway.
                  > Try coding it without REs.

                  Some advice is not meant to be followed, and this is one such example. The
                  first thing to do is to profile your program, using Zend Studio if
                  possible, and find out where the time is spent. That is the most logical
                  thing to do. If your complaint is that "the program is slow", find out
                  exactly what is slow and where the time goes. My experience tells me that
                  the main cause of slowness is usually inadequate use of the database, and
                  with MySQL that is not hard to do. Unless you know the database very well,
                  you will likely have performance problems. Avoiding REs will then do
                  nothing but mutilate, scar and disembowel the program.

                  --


