best way to match values in two tables...

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • erbrose
    New Member
    • Oct 2006
    • 58

    best way to match values in two tables...

    Hey all,
    Sorry the subject should have said..
    "best way to match values in TWO tables"
    I have two tables that I need to match based off an Unique ID in both tables. Im running this process using hadoop streaming with python, so the actual code is a bit different (ie using csv files to debug locally). I've tried a couple different methods and both are not quite fast enough... ha!
    First was like this, using the two tables as lists and comparing the ID's

    Code:
    reader = open("D:\\temp\\table1.csv",'r')
    for line in reader:
        line = line.strip()
        TmpArr.append( line.split(',') )    
    reader.close()
    
    reader = open("D:\\temp\\table2.csv",'r')
    for line in reader:
        line = line.strip()
        Tmp2Arr = line.split(',')
        For line2 in TmpArr:
            If Tmp2Arr[0] == TmpArr[i][0]:
                Do some stuff...
    This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)

    I also tried using a list and a dictionary
    Code:
    reader = open("D:\\temp\\table1.csv",'r')
    for line in reader:
        line = line.strip()
        TmpArr=line.split(',')
    TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
    reader.close()
    
    reader = open("D:\\temp\\table2.csv",'r')
    for line in reader:
        line = line.strip()
        Tmp2Arr = line.split(',')
        for k,v in TmmDict.iteritems():
            If Tmp2Arr[0] == k:
                Do some stuff...
    This is better, with a loop through table1 taking about 4 seconds...
    This is still too long..
    Any thoughts on how to best match my two ID's? I haven't tried using two dictionaries? would that improve performance?
    Thanks ahead of time!
    Last edited by bvdet; Aug 4 '10, 12:13 PM. Reason: corrrected title
  • dwblas
    Recognized Expert Contributor
    • May 2008
    • 626

    #2
    You want to do this the other way around, i.e, look up the items in the dictionary. I would also suggest that you strip() the keys before using them to get rid of any spaces.
    Code:
    reader = open("D:\\temp\\table2.csv",'r')
    for line in reader:
        line = line.strip()
        Tmp2Arr = line.split(',')
        If Tmp2Arr[0].strip() in TmpDict:
            print TmpDict[Tmp2Arr[0]]
    You can also use the intersection of 2 sets if you only want to identify the elements that are the same.
    This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)
    If a dictionary does not work for whatever reason, the next step up is an SQL database. SQLite comes with Python, so post back if you want some help in that area.
    Code:
    reader = open("D:\\temp\\table1.csv",'r')
    for line in reader:
        line = line.strip()
        TmpArr=line.split(',')
        ##
        ## indentation error here in the code as posted (should be indented)
        TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
    ##
    ## also a list is slightly faster
        TmpDict[TmpArr[0]]=[TmpArr[1], TmpArr[5], TmpArr[6]]
    ## then you can just join them
    ## an example
    test_list = ["one", "two", "three"]
    print ",".join(test_list)

    reader.close()

    Comment

    • erbrose
      New Member
      • Oct 2006
      • 58

      #3
      wow, thanks dwblas!
      That dramatically increased the search speed! Still alot to learn with python..
      Very much appreciated!
      Eric

      Comment

      • MMcCarthy
        Recognized Expert MVP
        • Aug 2006
        • 14387

        #4
        If your query has been answered please mark as Best Answer the post you feel gave you the solution.

        Mary

        Comment

        Working...