Hey all,
Sorry, the subject should have said "best way to match values in TWO tables".
I have two tables that I need to match on a unique ID present in both tables. I'm running this process using Hadoop streaming with Python, so the actual code is a bit different (i.e. I'm using CSV files to debug locally). I've tried a couple of different methods, and neither is quite fast enough... ha!
My first attempt loaded the two tables as lists and compared the IDs (first code block below).
This works, but it is slow and very memory-intensive (my table1 is 500 MB and table2 is extremely large).
I also tried using a list and a dictionary (second code block below).
This is better, with a loop through table1 taking about 4 seconds...
That is still too long.
Any thoughts on how best to match my two IDs? I haven't tried using two dictionaries yet. Would that improve performance?
Thanks ahead of time!
Code (first attempt, two lists):
TmpArr = []
reader = open("D:\\temp\\table1.csv", 'r')
for line in reader:
    line = line.strip()
    TmpArr.append(line.split(','))
reader.close()
reader = open("D:\\temp\\table2.csv", 'r')
for line in reader:
    line = line.strip()
    Tmp2Arr = line.split(',')
    # Compare this table2 ID against every row of table1
    for line2 in TmpArr:
        if Tmp2Arr[0] == line2[0]:
            pass  # Do some stuff...
reader.close()
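Code (second attempt, list and dictionary):
TmpDict = {}
reader = open("D:\\temp\\table1.csv", 'r')
for line in reader:
    line = line.strip()
    TmpArr = line.split(',')
    # Key on the ID; keep columns 1, 5 and 6 as one comma-joined string
    TmpDict[TmpArr[0]] = TmpArr[1] + ',' + TmpArr[5] + ',' + TmpArr[6]
reader.close()
reader = open("D:\\temp\\table2.csv", 'r')
for line in reader:
    line = line.strip()
    Tmp2Arr = line.split(',')
    # Scan every key in the dictionary for each table2 row (Python 2 iteritems)
    for k, v in TmpDict.iteritems():
        if Tmp2Arr[0] == k:
            pass  # Do some stuff...
reader.close()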
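For comparison, here is a rough sketch of how the dictionary version might look if each table2 ID were looked up directly instead of being compared against every key in turn. The function names (build_index, match_rows) and the sample rows are made up for illustration. Only the column positions (0 for the ID, then 1, 5 and 6) come from the code above. It has not been run against the real tables or under Hadoop streaming.

```python
def build_index(table1_lines):
    """Map each table1 ID (column 0) to columns 1, 5 and 6, comma-joined."""
    index = {}
    for line in table1_lines:
        fields = line.strip().split(',')
        index[fields[0]] = ','.join([fields[1], fields[5], fields[6]])
    return index


def match_rows(index, table2_lines):
    """Yield (id, table1 value, table2 fields) for table2 rows whose ID is indexed."""
    for line in table2_lines:
        fields = line.strip().split(',')
        # One hash lookup per row instead of a loop over all keys
        value = index.get(fields[0])
        if value is not None:
            yield fields[0], value, fields


# Hypothetical sample rows standing in for table1.csv and table2.csv;
# any iterable of lines (such as an open file) could be passed instead.
table1 = ["a1,x,0,0,0,y,z\n", "b2,p,0,0,0,q,r\n"]
table2 = ["b2,foo\n", "c3,bar\n", "a1,baz\n"]

matches = list(match_rows(build_index(table1), table2))
print(matches)
```

In this form only table1 has to fit in memory, while table2 is read one row at a time. A dictionary lookup takes roughly constant time on average, so the work per table2 row no longer grows with the size of table1.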