faster way to get from SQL database into array

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Toby Inkster

    #16
    Re: faster way to get from SQL database into array

    Norman Peelman wrote:
    The odds of any two strings having the same md5 are on the order of 2^64
    operations.
    The odds of any two *given* strings having the same MD5 are about 3.4E-38.

    The odds that some two strings have the same MD5 are 1. On a modern
    computer, you should be able to calculate a "collision" in minutes.
    <http://www.links.org/?p=6>

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact

    Comment

    • Jerry Stuckle

      #17
      Re: faster way to get from SQL database into array

      Norman Peelman wrote:
      "Toby Inkster" <usenet200611@t obyinkster.co.u kwrote in message
      news:l22354-d0o.ln1@ophelia .g5n.co.uk...
      >
      >>Norman Peelman wrote:
      >>
      >>
      >>>Well, I knew i'd get some flak for that. :) I was assuming they would
      >
      be
      >
      >>>unique enough for the situation.
      >>
      >>There are varying degrees of uniqueness?
      >>
      >>--
      >>Toby A Inkster BSc (Hons) ARCS
      >>Contact Me ~ http://tobyinkster.co.uk/contact
      >>
      >
      >
      The poster is converting strings not files to md5. md5 is not a method
      of encryption per se but a more complicated method of CRC to 'help' validate
      data... It is convienent to use md5 to 'one-way encrypt' small strings used
      as passwords. I would assume that the poster is also storing the matching
      plaintext along with the md5 so if a match was found he could then see if
      the plaintext matched as well but, according to the definitions i've read on
      the net; the odds are extremely high, paraphrasing:
      >
      Taken from

      >
      The odds of any two strings having the same md5 are on the order of 2^64
      operations.
      The odds of finding a string having a given md5 is on the order of 2^128
      operations.
      >
      ...so yes, i'd say that there are variable degrees of uniqueness. So maybe
      an easier way than my first post would be:
      >
      Using textmd5 as PRIMARY KEY
      >
      $db_link = mysql_connect(. ..,...,...); //etc.
      $query = "INSERT INTO logs_full VALUES ($textmd5, $textplain)";
      $resone = mysql_query($qu ery, $dbc);
      if (!mysql_errno($ db_link))
      {
      // match found (duplicate entry for key error 1062 returned from MySQL)
      // could then double check textplain here although odds of being here
      are very high
      // do other stuff here
      }
      else
      {
      // no match found
      // new entry has just been inserted
      }
      >
      Of course this assumes he is also storing new md5's to check against. I
      gueuss we don't fully know what the poster completely intends on doing
      unless he pops back in with a little more info.
      >
      Norm
      >
      >
      And when you're comparing unknown strings, the odds go down rapidly.

      The odds of anyone having the same birthday as you is 1 in 365.25.
      However, in a group of 24 people, the odds of any two of them having the
      same birthday are 1 in 2.

      I don't have the time (or inkling) to determine how many md5 hashes you
      would have to generate to get a reasonable chance of hitting a
      duplicate. But it is much less than 2^64.

      --
      =============== ===
      Remove the "x" from my email address
      Jerry Stuckle
      JDS Computer Training Corp.
      jstucklex@attgl obal.net
      =============== ===

      Comment

      • Norman Peelman

        #18
        Re: faster way to get from SQL database into array

        "Toby Inkster" <usenet200611@t obyinkster.co.u kwrote in message
        news:f91454-2rt.ln1@ophelia .g5n.co.uk...
        Norman Peelman wrote:
        >
        The odds of any two strings having the same md5 are on the order of 2^64
        operations.
        >
        The odds of any two *given* strings having the same MD5 are about 3.4E-38.
        >
        The odds that some two strings have the same MD5 are 1. On a modern
        computer, you should be able to calculate a "collision" in minutes.
        <http://www.links.org/?p=6>
        >
        --
        Toby A Inkster BSc (Hons) ARCS
        Contact Me ~ http://tobyinkster.co.uk/contact
        >
        Ok, not sure what's going on (read a bunch of posts concerning collisions)
        but I cannot reproduce the output of the link above. Give the two separate
        strings I get two different hashes:

        simple script to show differences in input strings and the md5 hashes

        <?php
        $str1 =
        'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5871246 7eab4004583eb8f b7f8955ad340609 f
        4b30283e4888325 71415a085125e8f 7cdc99fd91dbdf2 80373c5bd8823e3 156348f5bae6dac d
        436c919c6dd53e2 b487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080a80d1ec6 9
        821bcb6a8839396 f9652b6ff72a70' ;
        $str2 =
        'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5071246 7eab4004583eb8f b7f8955ad340609 f
        4b30283e4888325 f1415a085125e8f 7cdc99fd91dbd72 80373c5bd8823e3 156348f5bae6dac d
        436c919c6dd53e2 3487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080280d1ec6 9
        821bcb6a8839396 f965ab6ff72a70' ;
        $md51 = md5($str1);
        $md52 = md5($str2);

        $str_diffs = show_diff($str1 , $str2);

        echo '<pre>';
        echo "$str_diffs<br> ";
        echo "$str1<br>" ;
        echo 'Length: '.strlen($str1) .'<br>';
        echo "md5: $md51<br>";
        echo 'md5 length: '.strlen($md51) .'<br>';
        echo "$str2<br>" ;
        echo 'Length: '.strlen($str2) .'<br>';
        echo "md5: $md52<br>";
        echo 'md5 length: '.strlen($md52) .'<br>';
        echo '</pre>';
        echo '<br>';

        function show_diff($str1 , $str2)
        {
        $str1_tmp = '';
        $str2_tmp = '';
        for ($loop = 0; $loop <= strlen($str1)-1; $loop++)
        {
        if ($str1{$loop} != $str2{$loop})
        {
        $font_start = '<font color="red">';
        $font_end = '</font>';
        }
        else
        {
        $font_start = '';
        $font_end = '';
        }
        $str1_tmp .= $font_start.$st r1{$loop}.$font _end;
        $str2_tmp .= $font_start.$st r2{$loop}.$font _end;
        }
        $final = $str1_tmp.'<br> '.$str2_tmp.'<b r>';
        return $final;
        }
        ?>

        output:

        d131dd02c5e6eec 4693d9a0698aff9 5c2fcab58712467 eab4004583eb8fb 7f8955ad340609f 4
        b30283e48883257 1415a085125e8f7 cdc99fd91dbdf28 0373c5bd8823e31 56348f5bae6dacd 4
        36c919c6dd53e2b 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080a80d1ec69 8
        21bcb6a8839396f 9652b6ff72a70

        d131dd02c5e6eec 4693d9a0698aff9 5c2fcab50712467 eab4004583eb8fb 7f8955ad340609f 4
        b30283e4888325f 1415a085125e8f7 cdc99fd91dbd728 0373c5bd8823e31 56348f5bae6dacd 4
        36c919c6dd53e23 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080280d1ec69 8
        21bcb6a8839396f 965ab6ff72a70


        d131dd02c5e6eec 4693d9a0698aff9 5c2fcab58712467 eab4004583eb8fb 7f8955ad340609f 4
        b30283e48883257 1415a085125e8f7 cdc99fd91dbdf28 0373c5bd8823e31 56348f5bae6dacd 4
        36c919c6dd53e2b 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080a80d1ec69 8
        21bcb6a8839396f 9652b6ff72a70

        Length: 256

        md5: edde4181249fea6 8547c2fd0edd2e2 2f

        md5 length: 32

        d131dd02c5e6eec 4693d9a0698aff9 5c2fcab50712467 eab4004583eb8fb 7f8955ad340609f 4
        b30283e4888325f 1415a085125e8f7 cdc99fd91dbd728 0373c5bd8823e31 56348f5bae6dacd 4
        36c919c6dd53e23 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080280d1ec69 8
        21bcb6a8839396f 965ab6ff72a70

        Length: 256

        md5: e234dbc6aa0932d 9dd5facd53ba037 2a

        md5 length: 32



        Is it possible this has been fixed (running PHP4.3.4) as I notice the post
        is from 2005... Or am I not looking at something right?

        Norm


        Comment

        • Erwin Moller

          #19
          Re: faster way to get from SQL database into array

          Toby Inkster wrote:
          Erwin Moller wrote:
          >
          >You have a certain value that you transform to a md5 and check if it is
          >your db allready, right?
          >
          No -- he has *hundred of thousands* of "certain values" that he needs to
          transform to an MD5 and check to see if it's in his DB already.
          >
          What you are suggesting is (pseudo-code abound):
          No, I am explicitely NOT suggesting that because that would be very stupid
          and I am not. :P
          I suggested he does 1 query with the precalculated md5-hash (You must have
          misread my post: please reread my post, it clearly suggest 1 query).

          But it is entirely possible I do not understand the problem.
          That has to do with the fact I cannot imagine a setup that needs to do this,
          and if it is needed, that design was bad and needs to be redone IMHO.

          I understood that the database holds the md5-hashes, not the raw data, and
          the OP said this in his original message:
          "I currently have a list of md5 strings and need to check if a new
          string is in that list hundreds of thousands of times."

          What is unclear to me is WHY he has zillion values to transform to
          md5-hashes.
          Sounds like very bad design to me.

          But since Cleverburn dropped out of this discussion, it will be hard to tell
          what he actually is doing. We can only guess.

          Regards,
          Erwin Moller


          Comment

          • Toby Inkster

            #20
            Re: faster way to get from SQL database into array

            Norman Peelman wrote:
            Is it possible this has been fixed (running PHP4.3.4) as I notice the post
            is from 2005... Or am I not looking at something right?
            The strings given in the article need to be un-wordwrapped and un-hexed
            before comparison...

            <?php

            # These are hexadecimal versions of the true strings that need to be
            # tested for MD5 values.
            $hex_A =
            'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5871246 7eab4004583eb8f b7f8955ad340609 f
            4b30283e4888325 71415a085125e8f 7cdc99fd91dbdf2 80373c5bd8823e3 156348f5bae6dac d
            436c919c6dd53e2 b487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080a80d1ec6 9
            821bcb6a8839396 f9652b6ff72a70' ;
            $hex_B =
            'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5071246 7eab4004583eb8f b7f8955ad340609 f
            4b30283e4888325 f1415a085125e8f 7cdc99fd91dbd72 80373c5bd8823e3 156348f5bae6dac d
            436c919c6dd53e2 3487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080280d1ec6 9
            821bcb6a8839396 f965ab6ff72a70' ;

            # First, remove extra characters introduced by wrapping in article.
            $hex_A = preg_replace('/[\r\n\s]/', '', $hex_A);
            $hex_B = preg_replace('/[\r\n\s]/', '', $hex_B);

            # Now, unhex them.
            $str_A = pack('H*', $hex_A);
            $str_B = pack('H*', $hex_B);

            $sha1_A = sha1($str_A);
            $sha1_B = sha1($str_B);

            $crc32_A = crc32($str_A);
            $crc32_B = crc32($str_B);

            $md5_A = md5($str_A);
            $md5_B = md5($str_B);

            echo "<pre>\n";
            echo "To prove the strings are different, look at their SHA-1 and CRC32 values.\n";
            echo "\n";
            echo "SHA-1 A: {$sha1_A}\n";
            echo "SHA-1 B: {$sha1_B}\n";
            echo "\n";
            echo "CRC32 A: {$crc32_A}\n";
            echo "CRC32 B: {$crc32_B}\n";
            echo "\n";
            echo "Now behold...\n";
            echo "\n";
            echo "MD5 A: {$md5_A}\n";
            echo "MD5 B: {$md5_B}\n";
            echo "</pre>\n";

            ?>


            --
            Toby A Inkster BSc (Hons) ARCS
            Contact Me ~ http://tobyinkster.co.uk/contact

            Comment

            • Norman Peelman

              #21
              Re: faster way to get from SQL database into array

              "Erwin Moller"
              <since_humans_r ead_this_I_am_s pammed_too_much @spamyourself.c omwrote in
              message news:457fdab9$0 $326$e4fe514c@n ews.xs4all.nl.. .
              Toby Inkster wrote:
              >
              Erwin Moller wrote:
              You have a certain value that you transform to a md5 and check if it is
              your db allready, right?
              No -- he has *hundred of thousands* of "certain values" that he needs to
              transform to an MD5 and check to see if it's in his DB already.

              What you are suggesting is (pseudo-code abound):
              >
              No, I am explicitely NOT suggesting that because that would be very stupid
              and I am not. :P
              I suggested he does 1 query with the precalculated md5-hash (You must have
              misread my post: please reread my post, it clearly suggest 1 query).
              >
              But it is entirely possible I do not understand the problem.
              That has to do with the fact I cannot imagine a setup that needs to do
              this,
              and if it is needed, that design was bad and needs to be redone IMHO.
              >
              I understood that the database holds the md5-hashes, not the raw data, and
              the OP said this in his original message:
              "I currently have a list of md5 strings and need to check if a new
              string is in that list hundreds of thousands of times."
              >
              What is unclear to me is WHY he has zillion values to transform to
              md5-hashes.
              Sounds like very bad design to me.
              >
              But since Cleverburn dropped out of this discussion, it will be hard to
              tell
              what he actually is doing. We can only guess.
              >
              Regards,
              Erwin Moller
              >
              >
              Yeah, that's what I said...

              Norm
              --
              FREE Avatar hosting at www.easyavatar.com


              Comment

              • Norman Peelman

                #22
                Re: faster way to get from SQL database into array

                "Toby Inkster" <usenet200611@t obyinkster.co.u kwrote in message
                news:cv1554-stm.ln1@ophelia .g5n.co.uk...
                Norman Peelman wrote:
                >
                Is it possible this has been fixed (running PHP4.3.4) as I notice the
                post
                is from 2005... Or am I not looking at something right?
                >
                The strings given in the article need to be un-wordwrapped and un-hexed
                before comparison...
                >
                <sniped>
                >
                --
                Toby A Inkster BSc (Hons) ARCS
                Contact Me ~ http://tobyinkster.co.uk/contact
                >
                Ok, I guess that headecimal conversion wasn't apparrant (to me). I
                understand it now. Wondered what I was doing wrong. My question becomes
                that of feasability... how bad is this for passwords as opposed to forging
                documents?

                Norm
                --
                FREE Avatar hosting at www.easyavatar.com


                Comment

                • Cleverbum@hotmail.com

                  #23
                  Re: faster way to get from SQL database into array


                  Erwin Moller wrote:
                  Toby Inkster wrote:
                  >
                  Erwin Moller wrote:
                  You have a certain value that you transform to a md5 and check if it is
                  your db allready, right?
                  No -- he has *hundred of thousands* of "certain values" that he needs to
                  transform to an MD5 and check to see if it's in his DB already.

                  What you are suggesting is (pseudo-code abound):
                  >
                  No, I am explicitely NOT suggesting that because that would be very stupid
                  and I am not. :P
                  I suggested he does 1 query with the precalculated md5-hash (You must have
                  misread my post: please reread my post, it clearly suggest 1 query).
                  >
                  But it is entirely possible I do not understand the problem.
                  That has to do with the fact I cannot imagine a setup that needs to do this,
                  and if it is needed, that design was bad and needs to be redone IMHO.
                  >
                  I understood that the database holds the md5-hashes, not the raw data, and
                  the OP said this in his original message:
                  "I currently have a list of md5 strings and need to check if a new
                  string is in that list hundreds of thousands of times."
                  >
                  What is unclear to me is WHY he has zillion values to transform to
                  md5-hashes.
                  Sounds like very bad design to me.
                  >
                  But since Cleverburn dropped out of this discussion, it will be hard to tell
                  what he actually is doing. We can only guess.
                  >
                  Regards,
                  Erwin Moller
                  Sorry to have dropped out for a day, I've got a datafile which contains
                  some duplicate entries, to check that the entry is not a duplicate I
                  first check that the MD5 of that entry is not in a list of the MD5s of
                  lines which have already been processed.
                  Computing the md5 and then checking it against a database is very much
                  quicker than comparing the raw data strings to one another.

                  Comment

                  • Cleverbum@hotmail.com

                    #24
                    Re: faster way to get from SQL database into array


                    Norman Peelman wrote:
                    <Cleverbum@hotm ail.comwrote in message
                    news:1165846992 .423101.215040@ f1g2000cwa.goog legroups.com...
                    I currently have a list of md5 strings and need to check if a new
                    string is in that list hundreds of thousands of times. I've found that
                    the fastest way to do this is to have all the md5's stored in an array
                    and use the php function in_array().
                    my only problem now is that populating this array with data from my sql
                    server is rather slow, I currently use the lines:

                    $resone = mysql_query("SE LECT * FROM logs_full");
                    mysql_close();

                    while ($row = mysql_fetch_arr ay($resone)) {
                    $md5array[$md5count]= $row['textmd5'];
                    $md5count++;
                    }

                    to do this. does anyone have a faster method?
                    >
                    I can only shorten it:
                    >
                    $resone = mysql_query("SE LECT * FROM logs_full");
                    mysql_close();
                    >
                    while ($row = mysql_fetch_ass oc($resone)) {
                    $md5array[]= $row['textmd5'];
                    }
                    >
                    How big (how many rows) is the table you are reading in? Are you storing new
                    md5's when they aren't found in the db?
                    >
                    >
                    Norm
                    --
                    FREE Avatar hosting at www.easyavatar.com
                    The table currently has about 390,000 rows, and no I don't send the new
                    MD5s to it as soon as they are found to be new, I just put them into
                    the array.
                    I've found that moving away from single insert queries speeds up the
                    script many times over and so now I insert the data into the SQL tables
                    when I have 1000 new unique values.

                    Comment

                    • Cleverbum@hotmail.com

                      #25
                      Re: faster way to get from SQL database into array


                      Norman Peelman wrote:
                      <Cleverbum@hotm ail.comwrote in message
                      news:1165846992 .423101.215040@ f1g2000cwa.goog legroups.com...
                      I currently have a list of md5 strings and need to check if a new
                      string is in that list hundreds of thousands of times. I've found that
                      the fastest way to do this is to have all the md5's stored in an array
                      and use the php function in_array().
                      my only problem now is that populating this array with data from my sql
                      server is rather slow, I currently use the lines:

                      $resone = mysql_query("SE LECT * FROM logs_full");
                      mysql_close();

                      while ($row = mysql_fetch_arr ay($resone)) {
                      $md5array[$md5count]= $row['textmd5'];
                      $md5count++;
                      }

                      to do this. does anyone have a faster method?
                      >
                      Maybe,
                      >
                      I think you are going about this project all wrong:
                      >
                      Assumption - your 'list' of md5's is actually the md5's in the database.
                      Assumption - you arecreating md5's from strings in text files.
                      Assumption - you need to check to make sure that the new value isn't already
                      in the database.
                      >
                      Problem - You are reading in the ENTIRE database.
                      Problem - you may be tempted to think in_array() would be faster, but it has
                      to start at the beginning of the array for each new value every time.
                      Problem - big waste of time and resources. You are doing double work.
                      >
                      Solution - Let MySQL do what it was designed to do. Since md5's are meant to
                      be unique in their own right, simply make your 'textmd5' field the PRIMARY
                      KEY and it will automatically be indexed. Now only do the operations you
                      actually require:
                      >
                      $query = "SELECT textmd5 FROM logs_full WHERE textmd5 = 'search_md5';
                      $resone = mysql_query($qu ery, $dbc);
                      I don't know if it's because my SQL server and webserver are on
                      different machines, or because it's a feature of the language, but this
                      simply isn't as fast as the binary search which I am now using.
                      if (!mysql_num_row s($resone))
                      {
                      // no match found
                      // insert new info into database
                      }
                      else
                      {
                      //match found
                      // no need to insert
                      }
                      >
                      no arrays being used and I guarantee it will be WAY faster for any size of
                      database. Now for the real question. Any particular reason you are creating
                      an md5 database? It's already being done...
                      >
                      Norm
                      --
                      FREE Avatar hosting at www.easyavatar.com

                      Comment

                      • Norman Peelman

                        #26
                        Re: faster way to get from SQL database into array

                        <Cleverbum@hotm ail.comwrote in message
                        news:1166018776 .381774.325990@ f1g2000cwa.goog legroups.com...
                        >
                        Norman Peelman wrote:
                        <Cleverbum@hotm ail.comwrote in message
                        news:1165846992 .423101.215040@ f1g2000cwa.goog legroups.com...
                        I currently have a list of md5 strings and need to check if a new
                        string is in that list hundreds of thousands of times. I've found that
                        the fastest way to do this is to have all the md5's stored in an array
                        and use the php function in_array().
                        my only problem now is that populating this array with data from my
                        sql
                        server is rather slow, I currently use the lines:
                        >
                        $resone = mysql_query("SE LECT * FROM logs_full");
                        mysql_close();
                        >
                        while ($row = mysql_fetch_arr ay($resone)) {
                        $md5array[$md5count]= $row['textmd5'];
                        $md5count++;
                        }
                        >
                        to do this. does anyone have a faster method?
                        >
                        I can only shorten it:

                        $resone = mysql_query("SE LECT * FROM logs_full");
                        mysql_close();

                        while ($row = mysql_fetch_ass oc($resone)) {
                        $md5array[]= $row['textmd5'];
                        }

                        How big (how many rows) is the table you are reading in? Are you storing
                        new
                        md5's when they aren't found in the db?


                        Norm
                        --
                        FREE Avatar hosting at www.easyavatar.com
                        >
                        The table currently has about 390,000 rows, and no I don't send the new
                        MD5s to it as soon as they are found to be new, I just put them into
                        the array.
                        I've found that moving away from single insert queries speeds up the
                        script many times over and so now I insert the data into the SQL tables
                        when I have 1000 new unique values.
                        >
                        Ok, so then the next set of questions would be:
                        1) How large do you expect this table to get and how soon? (already consumes
                        12.5megs just for the textmd5 field alone)
                        2) How are you INSERTING the new info at the 1000 mark?


                        ....what about LOAD DATA INFILE with the IGNORE switch?


                        ....or using array_diff()?




                        Norm


                        Comment

                        • Toby Inkster

                          #27
                          Re: faster way to get from SQL database into array

                          Norman Peelman wrote:
                          My question becomes that of feasability... how bad is this for
                          passwords as opposed to forging documents?
                          For low to medium security purposes, I'm guessing that MD5 will still be a
                          useful hash function for a few years yet -- I don't think it is yet the
                          time to rip all the md5() functions out of your existing programmes and
                          replace them with more modern hashes. But my advice would certainly be to
                          consider hashes such as SHA256 and Whirlpool for future projects.

                          For checksummy-type purposes, like verifying a file hasn't been
                          accidentally damaged during a download or compression/decompression, then
                          I'd recommend comparing not just hashes, but also file sizes.

                          For passwordy-type purposes, I'd recommend *not* storing the length of the
                          un-hashed password, as this gives an additional clue to what the password
                          is.

                          (Also, Google for "md5deep".)

                          --
                          Toby A Inkster BSc (Hons) ARCS
                          Contact Me ~ http://tobyinkster.co.uk/contact

                          Comment

                          • Toby Inkster

                            #28
                            Re: faster way to get from SQL database into array

                            Erwin Moller wrote:
                            I suggested he does 1 query with the precalculated md5-hash.
                            I know you did -- but there are "hundreds of thousands" of precalculated
                            ms5-hashes (needles) -- it said so in the original post -- thus hundreds
                            of thousands of queries.
                            You must have misread my post: please reread my post, it clearly
                            suggest 1 query
                            .... per hash to be found!

                            Please reread my code: it clearly implements the technique you suggested.

                            --
                            Toby A Inkster BSc (Hons) ARCS
                            Contact Me ~ http://tobyinkster.co.uk/contact

                            Comment

                            Working...