How to remove duplicate lines while retaining first occurrences

  • helloR
    New Member
    • Jun 2015
    • 8

    Suppose an input text file, "input_msg.txt" (about 70,000 KB), contains the following records:

    Jan 1 02:32:40 hello welcome to python world
    Jan 1 02:32:40 hello welcome to python world
    Mar 31 23:31:55 learn python
    Mar 31 23:31:55 learn python be smart
    Mar 31 23:31:56 python is good scripting language
    Jan 1 00:00:01 hello welcome to python world
    Jan 1 00:00:02 hello welcome to python world
    Mar 31 23:31:55 learn python
    Mar 31 23:31:56 python is good scripting language

    The expected output file (say, outputfile.txt) should contain the records below:

    Jan 1 02:32:40 hello welcome to python world
    Jan 1 02:32:40 hello welcome to python world
    Mar 31 23:31:55 learn python
    Mar 31 23:31:55 learn python be smart
    Mar 31 23:31:56 python is good scripting language
    Jan 1 00:00:01 hello welcome to python world
    Jan 1 00:00:02 hello welcome to python world

    Note: I need to keep all records that start with "Jan 1", including duplicates, and remove duplicates only among records that do not start with "Jan 1".

    I have tried the following program, but it deletes all duplicate records, including those that start with "Jan 1".
    Code:
    from collections import OrderedDict

    def remove_Duplicate_Lines(inputfile, outputfile):
        with open(inputfile) as fin, open(outputfile, 'w') as out:
            lines = (line.rstrip() for line in fin)
            # OrderedDict keys keep first-seen order and drop every later duplicate
            unique_lines = OrderedDict.fromkeys(line for line in lines if line)
            out.write("\n".join(unique_lines))
        return 0
    The output of my program is below:

    Jan 1 02:32:40 hello welcome to python world
    Mar 31 23:31:55 learn python
    Mar 31 23:31:55 learn python be smart
    Mar 31 23:31:56 python is good scripting language
    Jan 1 00:00:01 hello welcome to python world

    Your help would be appreciated!!!
    Last edited by bvdet; Jul 15 '15, 04:56 PM. Reason: Please use code tags when posting code [code]......[/code]
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    Use a for loop and conditionally append to a list.
    Code:
    data = """Jan 1 02:32:40 hello welcome to python world
    Jan 1 02:32:40 hello welcome to python world
    Mar 31 23:31:55 learn python
    Mar 31 23:31:55 learn python be smart
    Mar 31 23:31:56 python is good scripting language
    Jan 1 00:00:01 hello welcome to python world
    Jan 1 00:00:02 hello welcome to python world
    Mar 31 23:31:55 learn python
    Mar 31 23:31:56 python is good scripting language"""
    
    output = []
    
    for line in data.split("\n"):
        if line.startswith("Jan 1"):
            output.append(line)
        elif line not in output:
            output.append(line)
    
    print("\n".join(output))
    The output:
    Code:
    >>> Jan 1 02:32:40 hello welcome to python world
    Jan 1 02:32:40 hello welcome to python world
    Mar 31 23:31:55 learn python
    Mar 31 23:31:55 learn python be smart
    Mar 31 23:31:56 python is good scripting language
    Jan 1 00:00:01 hello welcome to python world
    Jan 1 00:00:02 hello welcome to python world
    >>>

    • helloR
      New Member
      • Jun 2015
      • 8

      #3
      @bvdet: Thank you very much! I had already tried this solution, but the problem is that it takes a long time when the input file is large.

      Below is the program which I have tried:
      Code:
      inputFile = open("in.txt", "r")
      log = []
      for line in inputFile:
          if line in log and line[0:5] != "Jan 1":
              pass
          else:
              log.append(line)
      inputFile.close()
      outFile = open("out.txt", "w")
      for item in log:
          outFile.write(item)
      outFile.close()
      Note: With an input file of ~70,000 KB, it takes ~9 minutes to complete.

      Please let me know if there is a more elegant way to do this.
      Last edited by bvdet; Jul 15 '15, 07:55 PM. Reason: Fix code tags. Tag [code] should be before code and closing tag [/code] after code.

      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        Try writing to the file one time.
        Code:
        # Lines read from the file keep their trailing "\n", so join with "".
        outFile.write("".join(log))

        • helloR
          New Member
          • Jun 2015
          • 8

          #5
          @bvdet: Do you mean something like the code below?
          Code:
          inputFile = open("in.txt", "r")
          outFile = open("out.txt", "w")
          log = []
          for line in inputFile:
             if line in log and line[0:5] != "Jan 1":
                pass
             else:
                log.append(line)
             outFile.write("\n".join(log))
          inputFile.close()
          outFile.close()
          Please correct me if I am wrong.

          • bvdet
            Recognized Expert Specialist
            • Oct 2006
            • 2851

            #6
            No, write to the file outside of the for loop:
            Code:
            # Lines read from the file keep their trailing "\n", so join with "".
            outFile.write("".join(log))
            inputFile.close()
            outFile.close()

            • helloR
              New Member
              • Jun 2015
              • 8

              #7
              @bvdet: Thank you for your help! This works, but there is still a performance issue when the input file is large.

              Could you please take a look at the code below?
              Code:
              from collections import OrderedDict

              def remove_Duplicate_Lines(inputfile, outputfile):
                  with open(inputfile) as fin, open(outputfile, 'w') as out:
                      lines = (line.rstrip() for line in fin)
                      # OrderedDict keys keep first-seen order and drop every later duplicate
                      unique_lines = OrderedDict.fromkeys(line for line in lines if line)
                      out.write("\n".join(unique_lines))
                  return 0
              Last edited by helloR; Jul 18 '15, 04:26 PM. Reason: Adding more information...

              • bvdet
                Recognized Expert Specialist
                • Oct 2006
                • 2851

                #8
                You are iterating over the lines in the file twice. Try eliminating one of the passes. It is possible that OrderedDict is slower than a for loop; I don't know one way or the other. You can use the timeit module to compare different methods.
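
One likely cause of the slowdown reported in this thread is the `line in log` test on a list, which scans the whole list for every input line. The following is a minimal sketch of an alternative, not a measured fix: it assumes a set is acceptable for tracking lines already written, streams the input in a single pass, and skips blank lines as the OrderedDict version does. The function name `remove_duplicate_lines` and the `keep_prefix` parameter are hypothetical names chosen for illustration. As in the code above, the prefix test also matches lines such as "Jan 10 ...".

```python
def remove_duplicate_lines(inputfile, outputfile, keep_prefix="Jan 1"):
    """Copy inputfile to outputfile, dropping repeated lines.

    Lines that start with keep_prefix are always kept, duplicates included.
    Returns the number of lines written.
    """
    seen = set()  # non-prefix lines already written; average O(1) lookup
    written = 0
    with open(inputfile) as fin, open(outputfile, "w") as out:
        for raw in fin:  # single pass, one line in memory at a time
            line = raw.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if line.startswith(keep_prefix):
                out.write(line + "\n")  # always keep, even if repeated
                written += 1
            elif line not in seen:
                seen.add(line)
                out.write(line + "\n")  # first occurrence only
                written += 1
    return written
```

With the file names used earlier in the thread, a call would look like `remove_duplicate_lines("in.txt", "out.txt")`. Whether this is actually faster on the 70,000 KB file would need to be checked, for example with the timeit module mentioned above.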
