Removing duplicate entries in a csv file using a python script

**KaezarRex** · Nov 19 '07, 03:28 PM

Originally posted by sathish119

I'm a beginner to Python. Could you tell me how I should proceed to remove duplicate rows in a csv file?

If the order of the information in your csv file doesn't matter, you could put each line of the file into a list, convert the list into a set, and then write the set back into the file. When you convert the list to a set, all duplicate elements disappear.

[CODE=python]reader = open("file.csv", "r")
lines = reader.read().split("\n")
reader.close()

writer = open("file.csv", "w")
for line in set(lines):
    writer.write(line + "\n")
writer.close()[/CODE]

**bvdet** · Nov 19 '07, 04:03 PM

This code maintains the order of the data:[code=Python]>>> rows = open('data.txt').read().split('\n')
>>> newrows = []
>>> for row in rows:
...     if row not in newrows:
...         newrows.append(row)
...
>>> f = open('data1.txt', 'w')
>>> f.write('\n'.join(newrows))
>>> f.close()[/code]
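
For larger files, the `row not in newrows` test rescans the whole list for every row. A minimal sketch of the same order-preserving idea, assuming Python 3.7 or later (where dictionaries keep insertion order); the function name `dedupe_lines` is only illustrative and does not come from the posts above:

```python
def dedupe_lines(lines):
    """Return lines with duplicates removed, keeping first occurrences in order."""
    # dict keys are unique and, in Python 3.7+, remember insertion order
    return list(dict.fromkeys(lines))


rows = ["a,1", "b,2", "a,1", "c,3", "b,2"]
print(dedupe_lines(rows))
```

This only works when each line is a string (or another hashable value), which is the case for lines read with `read().split('\n')`.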

**KaezarRex** · Nov 19 '07, 04:23 PM

Here is another way to solve your problem using bvdet's method and the csv module.

[CODE=python]import csv
rows = csv.reader(open("file.csv", "rb"))
newrows = []
for row in rows:
    if row not in newrows:
        newrows.append(row)
writer = csv.writer(open("file.csv", "wb"))
writer.writerows(newrows)[/CODE]
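
On Python 3, the csv module expects text-mode files opened with `newline=""` rather than `"rb"`/`"wb"`. A hedged sketch of the same approach under that assumption; the name `dedupe_csv` and the choice to write to a separate output path are illustrative, and a set of tuples stands in for the list membership check:

```python
import csv


def dedupe_csv(src_path, dst_path):
    """Copy src_path to dst_path, dropping repeated rows and keeping order."""
    seen = set()        # rows already kept, stored as hashable tuples
    unique_rows = []
    with open(src_path, newline="") as src:
        for row in csv.reader(src):
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                unique_rows.append(row)
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(unique_rows)
    return len(unique_rows)
```

Calling `dedupe_csv("file.csv", "file_unique.csv")` would leave the input untouched, which makes it easier to compare before and after.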

**gpadmini24** · Nov 20 '07, 07:12 AM

In the csv module code above, when I use set(rows), I get the error 'list objects are unhashable'. I thought a list was hashable (by the hashq module). Why am I getting this error? Please explain.

**gpadmini24** · Nov 20 '07, 09:06 AM

Hi,
In the csv module code above, when I use set(rows), I get the error 'list objects are unhashable'. I thought a list was hashable (by the hashq module). Why am I getting this error? Please explain.

**bvdet** · Nov 20 '07, 11:03 AM

This error indicates you are attempting to create a set from objects that are mutable. A set is a collection of unique hashable objects, and Python's built-in mutable containers, such as lists, are not hashable. csv.reader() returns an iterator that yields each row as a list, so set(rows) tries to hash lists and fails. KaezarRex's earlier example in this thread was applying set() to a list of strings. A string is immutable and hashable.
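
A small illustration of the point, using made-up rows: building a set from lists raises `TypeError`, while converting each row to a tuple first works.

```python
rows = [["a", "1"], ["b", "2"], ["a", "1"]]   # made-up sample rows

error = None
try:
    set(rows)                       # lists are mutable, hence unhashable
except TypeError as exc:
    error = str(exc)
    print("set(rows) failed:", exc)

# Tuples of strings are hashable, so they can be set members
unique = set(tuple(row) for row in rows)
print(len(unique))
```

Note that a set does not keep the rows in their original order.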

**sathish119** · Nov 23 '07, 02:42 PM

Hey, I used KaezarRex's csv module code and was able to remove the duplicate entries. Thanks. This csv file is actually generated by Java code. If the Java code is modified, the output should remain the same. To check this, I found that the files need to be sorted in some order before comparing them, since the Java code selects the rows in random order. Could you tell me how to sort the contents, for example by priority: Column 5, Column 8, Column 1? Is it possible to sort the newrows list before writing?

**sathish119** · Nov 23 '07, 02:46 PM

Hey bvdet, thanks for your reply. I used the logic that KaezarRex suggested. Could you look at my previous post and give me your suggestion?

**bvdet** · Nov 23 '07, 04:39 PM

I am in Python 2.3. Define a comparison function to pass to the list sort method:[code=Python]def comp581(a, b):
    x = cmp(a[5], b[5])
    if not x:
        y = cmp(a[8], b[8])
        if not y:
            return cmp(a[1], b[1])
        return y
    return x

yourList.sort(comp581)[/code]In Python 2.4:[code=Python]yourList.sort(key=lambda i: (i[5], i[8], i[1]))[/code]
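
`cmp` and the comparison-function argument to `sort` were removed in Python 3, so only the `key` form carries over. A sketch using `operator.itemgetter`, with made-up rows. The indices are zero-based, so `5, 8, 1` select the sixth, ninth, and second fields; adjust them if the columns are counted from one.

```python
from operator import itemgetter

# Made-up rows with ten fields each; only indices 5, 8 and 1 matter here
rows = [
    ["r1", "b", "", "", "", "y", "", "", "2", ""],
    ["r2", "a", "", "", "", "x", "", "", "9", ""],
    ["r3", "a", "", "", "", "y", "", "", "2", ""],
]

# Sort by field 5, then field 8, then field 1
rows.sort(key=itemgetter(5, 8, 1))
print([row[0] for row in rows])
```

Fields read by the csv module are strings, so numeric columns sort lexicographically unless they are converted first.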

**sathish119** · Nov 26 '07, 06:52 AM

Thanks a lot. I'm able to sort the list now. (The version I'm using is 2.5.1, so I used the 'lambda' approach.)
