faster way to get from SQL database into array

**Toby Inkster** · Dec 13 '06, 01:35 AM

Re: faster way to get from SQL database into array

Norman Peelman wrote:

The odds of any two strings having the same md5 are on the order of 2^64
operations.

The odds of any two *given* strings having the same MD5 are about 3.4E-38.

The odds that some two strings have the same MD5 are 1. On a modern
computer, you should be able to calculate a "collision" in minutes.
<http://www.links.org/?p=6>

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

**Jerry Stuckle** · Dec 13 '06, 02:15 AM

Re: faster way to get from SQL database into array

Norman Peelman wrote:

"Toby Inkster" <usenet200611@t obyinkster.co.u kwrote in message
news:l22354-d0o.ln1@ophelia .g5n.co.uk...
>

>>Norman Peelman wrote:
>>
>>

>>>Well, I knew i'd get some flak for that. :) I was assuming they would

>
be
>

>>>unique enough for the situation.

>>
>>There are varying degrees of uniqueness?
>>
>>--
>>Toby A Inkster BSc (Hons) ARCS
>>Contact Me ~ http://tobyinkster.co.uk/contact
>>

>
>
The poster is converting strings not files to md5. md5 is not a method
of encryption per se but a more complicated method of CRC to 'help' validate
data... It is convienent to use md5 to 'one-way encrypt' small strings used
as passwords. I would assume that the poster is also storing the matching
plaintext along with the md5 so if a match was found he could then see if
the plaintext matched as well but, according to the definitions i've read on
the net; the odds are extremely high, paraphrasing:
>
Taken from

define:MD5 - Google Search

http://www.google.com/search?hl=en&lr=&defl=en&q=define:MD5&sa=X&oi=glossary_definition&ct=title

>
The odds of any two strings having the same md5 are on the order of 2^64
operations.
The odds of finding a string having a given md5 is on the order of 2^128
operations.
>
...so yes, i'd say that there are variable degrees of uniqueness. So maybe
an easier way than my first post would be:
>
Using textmd5 as PRIMARY KEY
>
$db_link = mysql_connect(. ..,...,...); //etc.
$query = "INSERT INTO logs_full VALUES ($textmd5, $textplain)";
$resone = mysql_query($qu ery, $dbc);
if (!mysql_errno($ db_link))
{
// match found (duplicate entry for key error 1062 returned from MySQL)
// could then double check textplain here although odds of being here
are very high
// do other stuff here
}
else
{
// no match found
// new entry has just been inserted
}
>
Of course this assumes he is also storing new md5's to check against. I
gueuss we don't fully know what the poster completely intends on doing
unless he pops back in with a little more info.
>
Norm
>
>

And when you're comparing unknown strings, the odds go down rapidly.

The odds of anyone having the same birthday as you is 1 in 365.25.
However, in a group of 24 people, the odds of any two of them having the
same birthday are 1 in 2.

I don't have the time (or inkling) to determine how many md5 hashes you
would have to generate to get a reasonable chance of hitting a
duplicate. But it is much less than 2^64.

--
=============== ===
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attgl obal.net
=============== ===

**Norman Peelman** · Dec 13 '06, 03:55 AM

Re: faster way to get from SQL database into array

"Toby Inkster" <usenet200611@t obyinkster.co.u kwrote in message
news:f91454-2rt.ln1@ophelia .g5n.co.uk...

Norman Peelman wrote:
>

The odds of any two strings having the same md5 are on the order of 2^64
operations.

>
The odds of any two *given* strings having the same MD5 are about 3.4E-38.
>
The odds that some two strings have the same MD5 are 1. On a modern
computer, you should be able to calculate a "collision" in minutes.
<http://www.links.org/?p=6>
>
--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
>

Ok, not sure what's going on (read a bunch of posts concerning collisions)
but I cannot reproduce the output of the link above. Give the two separate
strings I get two different hashes:

simple script to show differences in input strings and the md5 hashes

<?php
$str1 =
'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5871246 7eab4004583eb8f b7f8955ad340609 f
4b30283e4888325 71415a085125e8f 7cdc99fd91dbdf2 80373c5bd8823e3 156348f5bae6dac d
436c919c6dd53e2 b487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080a80d1ec6 9
821bcb6a8839396 f9652b6ff72a70' ;
$str2 =
'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5071246 7eab4004583eb8f b7f8955ad340609 f
4b30283e4888325 f1415a085125e8f 7cdc99fd91dbd72 80373c5bd8823e3 156348f5bae6dac d
436c919c6dd53e2 3487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080280d1ec6 9
821bcb6a8839396 f965ab6ff72a70' ;
$md51 = md5($str1);
$md52 = md5($str2);

$str_diffs = show_diff($str1 , $str2);

echo '<pre>';
echo "$str_diffs ";
echo "$str1 " ;
echo 'Length: '.strlen($str1) .' ';
echo "md5: $md51 ";
echo 'md5 length: '.strlen($md51) .' ';
echo "$str2 " ;
echo 'Length: '.strlen($str2) .' ';
echo "md5: $md52 ";
echo 'md5 length: '.strlen($md52) .' ';
echo '</pre>';
echo ' ';

function show_diff($str1 , $str2)
{
$str1_tmp = '';
$str2_tmp = '';
for ($loop = 0; $loop <= strlen($str1)-1; $loop++)
{
if ($str1{$loop} != $str2{$loop})
{
$font_start = '';
$font_end = '';
}
else
{
$font_start = '';
$font_end = '';
}
$str1_tmp .= $font_start.$st r1{$loop}.$font _end;
$str2_tmp .= $font_start.$st r2{$loop}.$font _end;
}
$final = $str1_tmp.' '.$str2_tmp.'';
return $final;
}
?>

output:

d131dd02c5e6eec 4693d9a0698aff9 5c2fcab58712467 eab4004583eb8fb 7f8955ad340609f 4
b30283e48883257 1415a085125e8f7 cdc99fd91dbdf28 0373c5bd8823e31 56348f5bae6dacd 4
36c919c6dd53e2b 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080a80d1ec69 8
21bcb6a8839396f 9652b6ff72a70

d131dd02c5e6eec 4693d9a0698aff9 5c2fcab50712467 eab4004583eb8fb 7f8955ad340609f 4
b30283e4888325f 1415a085125e8f7 cdc99fd91dbd728 0373c5bd8823e31 56348f5bae6dacd 4
36c919c6dd53e23 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080280d1ec69 8
21bcb6a8839396f 965ab6ff72a70

d131dd02c5e6eec 4693d9a0698aff9 5c2fcab58712467 eab4004583eb8fb 7f8955ad340609f 4
b30283e48883257 1415a085125e8f7 cdc99fd91dbdf28 0373c5bd8823e31 56348f5bae6dacd 4
36c919c6dd53e2b 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080a80d1ec69 8
21bcb6a8839396f 9652b6ff72a70

Length: 256

md5: edde4181249fea6 8547c2fd0edd2e2 2f

md5 length: 32

d131dd02c5e6eec 4693d9a0698aff9 5c2fcab50712467 eab4004583eb8fb 7f8955ad340609f 4
b30283e4888325f 1415a085125e8f7 cdc99fd91dbd728 0373c5bd8823e31 56348f5bae6dacd 4
36c919c6dd53e23 487da03fd023963 06d248cda0e99f3 3420f577ee8ce54 b67080280d1ec69 8
21bcb6a8839396f 965ab6ff72a70

Length: 256

md5: e234dbc6aa0932d 9dd5facd53ba037 2a

md5 length: 32

Is it possible this has been fixed (running PHP4.3.4) as I notice the post
is from 2005... Or am I not looking at something right?

Norm

**Erwin Moller** · Dec 13 '06, 10:25 AM

Re: faster way to get from SQL database into array

Toby Inkster wrote:

Erwin Moller wrote:
>

>You have a certain value that you transform to a md5 and check if it is
>your db allready, right?

>
No -- he has *hundred of thousands* of "certain values" that he needs to
transform to an MD5 and check to see if it's in his DB already.
>
What you are suggesting is (pseudo-code abound):

No, I am explicitely NOT suggesting that because that would be very stupid
and I am not. :P
I suggested he does 1 query with the precalculated md5-hash (You must have
misread my post: please reread my post, it clearly suggest 1 query).

But it is entirely possible I do not understand the problem.
That has to do with the fact I cannot imagine a setup that needs to do this,
and if it is needed, that design was bad and needs to be redone IMHO.

I understood that the database holds the md5-hashes, not the raw data, and
the OP said this in his original message:
"I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times."

What is unclear to me is WHY he has zillion values to transform to
md5-hashes.
Sounds like very bad design to me.

But since Cleverburn dropped out of this discussion, it will be hard to tell
what he actually is doing. We can only guess.

Regards,
Erwin Moller

**Toby Inkster** · Dec 13 '06, 10:35 AM

Re: faster way to get from SQL database into array

Norman Peelman wrote:

Is it possible this has been fixed (running PHP4.3.4) as I notice the post
is from 2005... Or am I not looking at something right?

The strings given in the article need to be un-wordwrapped and un-hexed
before comparison...

<?php

# These are hexadecimal versions of the true strings that need to be
# tested for MD5 values.
$hex_A =
'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5871246 7eab4004583eb8f b7f8955ad340609 f
4b30283e4888325 71415a085125e8f 7cdc99fd91dbdf2 80373c5bd8823e3 156348f5bae6dac d
436c919c6dd53e2 b487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080a80d1ec6 9
821bcb6a8839396 f9652b6ff72a70' ;
$hex_B =
'd131dd02c5e6ee c4693d9a0698aff 95c2fcab5071246 7eab4004583eb8f b7f8955ad340609 f
4b30283e4888325 f1415a085125e8f 7cdc99fd91dbd72 80373c5bd8823e3 156348f5bae6dac d
436c919c6dd53e2 3487da03fd02396 306d248cda0e99f 33420f577ee8ce5 4b67080280d1ec6 9
821bcb6a8839396 f965ab6ff72a70' ;

# First, remove extra characters introduced by wrapping in article.
$hex_A = preg_replace('/[\r\n\s]/', '', $hex_A);
$hex_B = preg_replace('/[\r\n\s]/', '', $hex_B);

# Now, unhex them.
$str_A = pack('H*', $hex_A);
$str_B = pack('H*', $hex_B);

$sha1_A = sha1($str_A);
$sha1_B = sha1($str_B);

$crc32_A = crc32($str_A);
$crc32_B = crc32($str_B);

$md5_A = md5($str_A);
$md5_B = md5($str_B);

echo "<pre>\n";
echo "To prove the strings are different, look at their SHA-1 and CRC32 values.\n";
echo "\n";
echo "SHA-1 A: {$sha1_A}\n";
echo "SHA-1 B: {$sha1_B}\n";
echo "\n";
echo "CRC32 A: {$crc32_A}\n";
echo "CRC32 B: {$crc32_B}\n";
echo "\n";
echo "Now behold...\n";
echo "\n";
echo "MD5 A: {$md5_A}\n";
echo "MD5 B: {$md5_B}\n";
echo "</pre>\n";

?>

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

**Norman Peelman** · Dec 13 '06, 12:25 PM

Re: faster way to get from SQL database into array

"Erwin Moller"
<since_humans_r ead_this_I_am_s pammed_too_much @spamyourself.c omwrote in
message news:457fdab9$0 $326$e4fe514c@n ews.xs4all.nl.. .

Toby Inkster wrote:
>

Erwin Moller wrote:

You have a certain value that you transform to a md5 and check if it is
your db allready, right?

No -- he has *hundred of thousands* of "certain values" that he needs to
transform to an MD5 and check to see if it's in his DB already.

What you are suggesting is (pseudo-code abound):

>
No, I am explicitely NOT suggesting that because that would be very stupid
and I am not. :P
I suggested he does 1 query with the precalculated md5-hash (You must have
misread my post: please reread my post, it clearly suggest 1 query).
>
But it is entirely possible I do not understand the problem.
That has to do with the fact I cannot imagine a setup that needs to do

this,

and if it is needed, that design was bad and needs to be redone IMHO.
>
I understood that the database holds the md5-hashes, not the raw data, and
the OP said this in his original message:
"I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times."
>
What is unclear to me is WHY he has zillion values to transform to
md5-hashes.
Sounds like very bad design to me.
>
But since Cleverburn dropped out of this discussion, it will be hard to

tell

what he actually is doing. We can only guess.
>
Regards,
Erwin Moller
>
>

Yeah, that's what I said...

Norm
--
FREE Avatar hosting at www.easyavatar.com

**Norman Peelman** · Dec 13 '06, 01:15 PM

Re: faster way to get from SQL database into array

"Toby Inkster" <usenet200611@t obyinkster.co.u kwrote in message
news:cv1554-stm.ln1@ophelia .g5n.co.uk...

Norman Peelman wrote:
>

Is it possible this has been fixed (running PHP4.3.4) as I notice the

post

is from 2005... Or am I not looking at something right?

>
The strings given in the article need to be un-wordwrapped and un-hexed
before comparison...
>

>
--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
>

Ok, I guess that headecimal conversion wasn't apparrant (to me). I
understand it now. Wondered what I was doing wrong. My question becomes
that of feasability... how bad is this for passwords as opposed to forging
documents?

Norm
--
FREE Avatar hosting at www.easyavatar.com

**Cleverbum@hotmail.com** · Dec 13 '06, 01:35 PM

Re: faster way to get from SQL database into array

Erwin Moller wrote:

Toby Inkster wrote:
>

Erwin Moller wrote:

You have a certain value that you transform to a md5 and check if it is
your db allready, right?

No -- he has *hundred of thousands* of "certain values" that he needs to
transform to an MD5 and check to see if it's in his DB already.

What you are suggesting is (pseudo-code abound):

>
No, I am explicitely NOT suggesting that because that would be very stupid
and I am not. :P
I suggested he does 1 query with the precalculated md5-hash (You must have
misread my post: please reread my post, it clearly suggest 1 query).
>
But it is entirely possible I do not understand the problem.
That has to do with the fact I cannot imagine a setup that needs to do this,
and if it is needed, that design was bad and needs to be redone IMHO.
>
I understood that the database holds the md5-hashes, not the raw data, and
the OP said this in his original message:
"I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times."
>
What is unclear to me is WHY he has zillion values to transform to
md5-hashes.
Sounds like very bad design to me.
>
But since Cleverburn dropped out of this discussion, it will be hard to tell
what he actually is doing. We can only guess.
>
Regards,
Erwin Moller

Sorry to have dropped out for a day, I've got a datafile which contains
some duplicate entries, to check that the entry is not a duplicate I
first check that the MD5 of that entry is not in a list of the MD5s of
lines which have already been processed.
Computing the md5 and then checking it against a database is very much
quicker than comparing the raw data strings to one another.

**Cleverbum@hotmail.com** · Dec 13 '06, 01:45 PM

Re: faster way to get from SQL database into array

Norman Peelman wrote:

<Cleverbum@hotm ail.comwrote in message
news:1165846992 .423101.215040@ f1g2000cwa.goog legroups.com...

I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SE LECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_arr ay($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

>
I can only shorten it:
>
$resone = mysql_query("SE LECT * FROM logs_full");
mysql_close();
>
while ($row = mysql_fetch_ass oc($resone)) {
$md5array[]= $row['textmd5'];
}
>
How big (how many rows) is the table you are reading in? Are you storing new
md5's when they aren't found in the db?
>
>
Norm
--
FREE Avatar hosting at www.easyavatar.com

The table currently has about 390,000 rows, and no I don't send the new
MD5s to it as soon as they are found to be new, I just put them into
the array.
I've found that moving away from single insert queries speeds up the
script many times over and so now I insert the data into the SQL tables
when I have 1000 new unique values.

**Cleverbum@hotmail.com** · Dec 13 '06, 01:45 PM

Re: faster way to get from SQL database into array

Norman Peelman wrote:

<Cleverbum@hotm ail.comwrote in message
news:1165846992 .423101.215040@ f1g2000cwa.goog legroups.com...

I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my sql
server is rather slow, I currently use the lines:

$resone = mysql_query("SE LECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_arr ay($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}

to do this. does anyone have a faster method?

>
Maybe,
>
I think you are going about this project all wrong:
>
Assumption - your 'list' of md5's is actually the md5's in the database.
Assumption - you arecreating md5's from strings in text files.
Assumption - you need to check to make sure that the new value isn't already
in the database.
>
Problem - You are reading in the ENTIRE database.
Problem - you may be tempted to think in_array() would be faster, but it has
to start at the beginning of the array for each new value every time.
Problem - big waste of time and resources. You are doing double work.
>
Solution - Let MySQL do what it was designed to do. Since md5's are meant to
be unique in their own right, simply make your 'textmd5' field the PRIMARY
KEY and it will automatically be indexed. Now only do the operations you
actually require:
>
$query = "SELECT textmd5 FROM logs_full WHERE textmd5 = 'search_md5';
$resone = mysql_query($qu ery, $dbc);

I don't know if it's because my SQL server and webserver are on
different machines, or because it's a feature of the language, but this
simply isn't as fast as the binary search which I am now using.

if (!mysql_num_row s($resone))
{
// no match found
// insert new info into database
}
else
{
//match found
// no need to insert
}
>
no arrays being used and I guarantee it will be WAY faster for any size of
database. Now for the real question. Any particular reason you are creating
an md5 database? It's already being done...
>
Norm
--
FREE Avatar hosting at www.easyavatar.com

**Norman Peelman** · Dec 13 '06, 04:15 PM

Re: faster way to get from SQL database into array

<Cleverbum@hotm ail.comwrote in message
news:1166018776 .381774.325990@ f1g2000cwa.goog legroups.com...

>
Norman Peelman wrote:

<Cleverbum@hotm ail.comwrote in message
news:1165846992 .423101.215040@ f1g2000cwa.goog legroups.com...

I currently have a list of md5 strings and need to check if a new
string is in that list hundreds of thousands of times. I've found that
the fastest way to do this is to have all the md5's stored in an array
and use the php function in_array().
my only problem now is that populating this array with data from my

sql

server is rather slow, I currently use the lines:
>
$resone = mysql_query("SE LECT * FROM logs_full");
mysql_close();
>
while ($row = mysql_fetch_arr ay($resone)) {
$md5array[$md5count]= $row['textmd5'];
$md5count++;
}
>
to do this. does anyone have a faster method?
>

I can only shorten it:

$resone = mysql_query("SE LECT * FROM logs_full");
mysql_close();

while ($row = mysql_fetch_ass oc($resone)) {
$md5array[]= $row['textmd5'];
}

How big (how many rows) is the table you are reading in? Are you storing

new

md5's when they aren't found in the db?

Norm
--
FREE Avatar hosting at www.easyavatar.com

>
The table currently has about 390,000 rows, and no I don't send the new
MD5s to it as soon as they are found to be new, I just put them into
the array.
I've found that moving away from single insert queries speeds up the
script many times over and so now I insert the data into the SQL tables
when I have 1000 new unique values.
>

Ok, so then the next set of questions would be:
1) How large do you expect this table to get and how soon? (already consumes
12.5megs just for the textmd5 field alone)
2) How are you INSERTING the new info at the 1000 mark?

....what about LOAD DATA INFILE with the IGNORE switch?

fw_error_www

http://dev.mysql.com/doc/refman/4.1/en/load-data.html

....or using array_diff()?

PHP: array_diff - Manual

http://us2.php.net/manual/en/function.array-diff.php

Computes the difference of arrays

Norm

**Toby Inkster** · Dec 13 '06, 05:35 PM

Re: faster way to get from SQL database into array

Norman Peelman wrote:

My question becomes that of feasability... how bad is this for
passwords as opposed to forging documents?

For low to medium security purposes, I'm guessing that MD5 will still be a
useful hash function for a few years yet -- I don't think it is yet the
time to rip all the md5() functions out of your existing programmes and
replace them with more modern hashes. But my advice would certainly be to
consider hashes such as SHA256 and Whirlpool for future projects.

For checksummy-type purposes, like verifying a file hasn't been
accidentally damaged during a download or compression/decompression, then
I'd recommend comparing not just hashes, but also file sizes.

For passwordy-type purposes, I'd recommend *not* storing the length of the
un-hashed password, as this gives an additional clue to what the password
is.

(Also, Google for "md5deep".)

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

**Toby Inkster** · Dec 14 '06, 10:25 PM

Re: faster way to get from SQL database into array

Erwin Moller wrote:

I suggested he does 1 query with the precalculated md5-hash.

I know you did -- but there are "hundreds of thousands" of precalculated
ms5-hashes (needles) -- it said so in the original post -- thus hundreds
of thousands of queries.

You must have misread my post: please reread my post, it clearly
suggest 1 query

.... per hash to be found!

Please reread my code: it clearly implements the technique you suggested.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

faster way to get from SQL database into array

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment