Removing duplicate entries in a csv file using a python script

  • sathish119
    New Member
    • Nov 2007
    • 4

    Removing duplicate entries in a csv file using a python script

    I'm a beginner to Python. Could you tell me how I should proceed to remove duplicate rows in a CSV file?
  • KaezarRex
    New Member
    • Sep 2007
    • 52

    #2
    Originally posted by sathish119
    I'm a beginner to Python. Could you tell me how I should proceed to remove duplicate rows in a CSV file?
    If the order of the rows in your CSV file doesn't matter, you could read each line of the file into a list, convert the list into a set, and then write the set's elements back to the file. Converting the list to a set discards all duplicate elements.

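    [CODE=python]# Read every line of the file into a list
    reader = open("file.csv", "r")
    lines = reader.read().splitlines()
    reader.close()

    # Write each distinct line back; a set does not preserve order
    writer = open("file.csv", "w")
    for line in set(lines):
        writer.write(line + "\n")
    writer.close()[/CODE]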


    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #3
      This code maintains the order of the data:[code=Python]>>> rows = open('data.txt').read().split('\n')
      >>> newrows = []
      >>> for row in rows:
      ...     if row not in newrows:
      ...         newrows.append(row)
      ...
      >>> f = open('data1.txt', 'w')
      >>> f.write('\n'.join(newrows))
      >>> f.close()[/code]
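
The `row not in newrows` test above scans the whole list for every row, which can get slow on large files. A minimal sketch of the same order-preserving idea, assuming Python 3 and tracking already-seen lines in a set; the function names `dedupe_lines` and `dedupe_file` are hypothetical, not from the thread:

```python
def dedupe_lines(lines):
    """Return the lines with later duplicates dropped, keeping first-seen order."""
    seen = set()      # lines already emitted (strings are hashable)
    unique = []       # result, in original order
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path without duplicate lines (hypothetical helper)."""
    with open(src_path) as src:
        lines = src.read().splitlines()
    with open(dst_path, "w") as dst:
        dst.write("".join(line + "\n" for line in dedupe_lines(lines)))


print(dedupe_lines(["a,1", "b,2", "a,1"]))  # ['a,1', 'b,2']
```

Writing to a separate output path, as the post above does with data1.txt, leaves the original file untouched if something goes wrong.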


      • KaezarRex
        New Member
        • Sep 2007
        • 52

        #4
        Here is another way to solve your problem using bvdet's method and the csv module.

        [CODE=python]import csv
        rows = csv.reader(open("file.csv", "rb"))
        newrows = []
        for row in rows:
            if row not in newrows:
                newrows.append(row)
        writer = csv.writer(open("file.csv", "wb"))
        writer.writerows(newrows)[/CODE]


        • gpadmini24
          New Member
          • Nov 2007
          • 2

          #5
          Originally posted by KaezarRex
          Here is another way to solve your problem using bvdet's method and the csv module.

          [CODE=python]import csv
          rows = csv.reader(open("file.csv", "rb"))
          newrows = []
          for row in rows:
              if row not in newrows:
                  newrows.append(row)
          writer = csv.writer(open("file.csv", "wb"))
          writer.writerows(newrows)[/CODE]

          With the code above, when I use set(rows), I get the error 'list objects are unhashable'. I thought a list is hashable (by the hashq module), so why am I getting this error? Please explain.


          • gpadmini24
            New Member
            • Nov 2007
            • 2

            #6
            Hi...
            With the code above, when I use set(rows), I get the error 'list objects are unhashable'. I thought a list is hashable (by the hashq module), so why am I getting this error? Please explain.


            • bvdet
              Recognized Expert Specialist
              • Oct 2006
              • 2851

              #7
              Originally posted by gpadmini24
              Hi...
              With the code above, when I use set(rows), I get the error 'list objects are unhashable'. I thought a list is hashable (by the hashq module), so why am I getting this error? Please explain.
              This error indicates you are attempting to create a set from objects that are not hashable. A set is a collection of unique hashable objects, and Python's built-in mutable containers, such as lists, are not hashable. csv.reader() returns a reader object that yields each row as a list of strings, so set(rows) tries to put lists into the set. KaezarRex's earlier example in this thread applied set() to a list of strings. A string is immutable and hashable.
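
One common way around the unhashable-list error is to convert each row to a tuple before putting it into a set, since a tuple of strings is hashable. A minimal sketch, assuming Python 3; the helper name `unique_rows` and the in-memory sample data are made up for illustration:

```python
import csv
import io


def unique_rows(rows):
    """Drop duplicate rows, keeping the first occurrence of each."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row)  # a list is unhashable; a tuple of strings is hashable
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# An in-memory stand-in for a CSV file (sample data only)
sample = io.StringIO("a,1\nb,2\na,1\n")
rows = unique_rows(csv.reader(sample))
print(rows)  # [['a', '1'], ['b', '2']]
```

If the order of the rows does not matter, `set(tuple(row) for row in reader)` is enough on its own.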


              • sathish119
                New Member
                • Nov 2007
                • 4

                #8
                Originally posted by KaezarRex
                Here is another way to solve your problem using bvdet's method and the csv module.

                [CODE=python]import csv
                rows = csv.reader(open("file.csv", "rb"))
                newrows = []
                for row in rows:
                    if row not in newrows:
                        newrows.append(row)
                writer = csv.writer(open("file.csv", "wb"))
                writer.writerows(newrows)[/CODE]
                Hey, I used this code and was able to remove the duplicate entries. Thanks. This CSV file is actually generated by Java code. If that code is modified, the output should remain the same. To check this, I found that the files should be sorted in some order before comparing them (since the Java code selects the rows randomly). Could you tell me how to sort the contents, for example with priority Column 5, Column 8, Column 1? Is it possible to sort the newrows list before writing?


                • sathish119
                  New Member
                  • Nov 2007
                  • 4

                  #9
                  Originally posted by bvdet
                  This code maintains the order of the data:[code=Python]>>> rows = open('data.txt').read().split('\n')
                  >>> newrows = []
                  >>> for row in rows:
                  ...     if row not in newrows:
                  ...         newrows.append(row)
                  ...
                  >>> f = open('data1.txt', 'w')
                  >>> f.write('\n'.join(newrows))
                  >>> f.close()[/code]
                  Hey, thanks for your reply. I used the logic that KaezarRex suggested. Could you look at my previous post and give me your suggestion?


                  • bvdet
                    Recognized Expert Specialist
                    • Oct 2006
                    • 2851

                    #10
                    Originally posted by sathish119
                    Hey, I used this code and was able to remove the duplicate entries. Thanks. This CSV file is actually generated by Java code. If that code is modified, the output should remain the same. To check this, I found that the files should be sorted in some order before comparing them (since the Java code selects the rows randomly). Could you tell me how to sort the contents, for example with priority Column 5, Column 8, Column 1? Is it possible to sort the newrows list before writing?
                    I am using Python 2.3. Define a comparison function to pass to the list sort method:[code=Python]def comp581(a, b):
                        # Compare by index 5 first, then index 8, then index 1
                        x = cmp(a[5], b[5])
                        if not x:
                            y = cmp(a[8], b[8])
                            if not y:
                                return cmp(a[1], b[1])
                            return y
                        return x

                    yourList.sort(comp581)[/code]In Python 2.4 and later:[code=Python]yourList.sort(key=lambda i: (i[5], i[8], i[1]))[/code]
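
For readers on Python 3, where `cmp` and the comparison-function argument to `sort` no longer exist, the key-based form is the one that carries over. A minimal sketch using `operator.itemgetter`; the helper name `sort_rows` and the sample rows are hypothetical. It assumes the indices 5, 8, 1 are zero-based list positions, as in the code above, so "Column 5" counted from one would be index 4.

```python
from operator import itemgetter


def sort_rows(rows, columns=(5, 8, 1)):
    """Return rows sorted by the given column indices, in priority order."""
    # itemgetter(5, 8, 1) builds the tuple (row[5], row[8], row[1]) for each row
    return sorted(rows, key=itemgetter(*columns))


# Sample rows with nine fields each; only indices 1, 5 and 8 are filled in
rows = [
    ["", "b", "", "", "", "x", "", "", "2"],
    ["", "a", "", "", "", "x", "", "", "2"],
    ["", "z", "", "", "", "a", "", "", "9"],
]
for row in sort_rows(rows):
    print(row[5], row[8], row[1])
```

Fields read by the csv module are strings, so they sort lexicographically ("10" comes before "9"); convert to int or float inside the key if a numeric order is needed.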


                    • sathish119
                      New Member
                      • Nov 2007
                      • 4

                      #11
                      Originally posted by bvdet
                      I am using Python 2.3. Define a comparison function to pass to the list sort method:[code=Python]def comp581(a, b):
                          # Compare by index 5 first, then index 8, then index 1
                          x = cmp(a[5], b[5])
                          if not x:
                              y = cmp(a[8], b[8])
                              if not y:
                                  return cmp(a[1], b[1])
                              return y
                          return x

                      yourList.sort(comp581)[/code]In Python 2.4 and later:[code=Python]yourList.sort(key=lambda i: (i[5], i[8], i[1]))[/code]
                      Thanks a lot. I'm able to sort the list now. (The version I'm using is 2.5.1, so I used the 'lambda' functionality.)
