How to remove duplicate lines, retaining first occurrences

**bvdet** · Jul 15 '15, 05:04 PM

Use a for loop and conditionally append to a list.

Code:

data = """Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:56 python is good scripting language"""

output = []

for line in data.split("\n"):
    if line.startswith("Jan 1"):
        output.append(line)
    elif line not in output:
        output.append(line)

print("\n".join(output))

The output:

Code:

>>> Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world
>>>

**helloR** · Jul 15 '15, 07:42 PM

@bvdet: Thank you very much! I have already tried this solution, but the problem is that it takes much longer when the input file is large.

Below is the program I tried:

Code:

inputFile = open("in.txt", "r")
log = []
for line in inputFile:
    if line in log and line[0:5] != "Jan 1":
        pass
    else:
        log.append(line)
inputFile.close()
outFile = open("out.txt", "w")
for item in log:
    outFile.write(item)
outFile.close()

Note: I tried it with an input file of about 70,000 KB, and it took about 9 minutes to run.

Please let me know if there is a more elegant way to do this.
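
A likely cause of the slowdown is the `line in log` test: membership on a list scans the whole list, so the work grows roughly with the square of the number of distinct lines. Below is a minimal sketch of one alternative that keeps a set of lines already seen. The function name `dedupe_lines` and the `keep_prefix` parameter are hypothetical; the "Jan 1" exemption is copied from the program above.

```python
def dedupe_lines(lines, keep_prefix="Jan 1"):
    """Yield lines in order, dropping repeats unless they start with keep_prefix."""
    seen = set()  # set membership is O(1) on average, unlike a list scan
    for line in lines:
        if line.startswith(keep_prefix):
            yield line  # exempt lines are always kept, duplicates included
        elif line not in seen:
            seen.add(line)
            yield line


# Small in-memory sample standing in for the lines of in.txt
sample = [
    "Jan 1 02:32:40 hello\n",
    "Jan 1 02:32:40 hello\n",
    "Mar 31 23:31:55 learn python\n",
    "Mar 31 23:31:55 learn python\n",
]
result = list(dedupe_lines(sample))
```

To use it on files, the open input file can be passed as `lines` and the generator handed to `outFile.writelines(...)`, which accepts any iterable of strings.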

**bvdet** · Jul 15 '15, 07:58 PM

Try writing to the file one time.

Code:

outFile.write("".join(log))  # lines read from the file already end with "\n"

**helloR** · Jul 16 '15, 08:07 PM

@bvdet: Do you mean something like the code below?

Code:

inputFile = open("in.txt", "r")
outFile = open("out.txt", "w")
log = []
for line in inputFile:
   if line in log and line[0:5] != "Jan 1":
      pass
   else:
      log.append(line)
   outFile.write("".join(log))
inputFile.close()
outFile.close()

Please correct me if I am wrong.

**bvdet** · Jul 17 '15, 01:31 AM

No, write to the file outside of the for loop:

Code:

outFile.write("".join(log))  # lines read from the file already end with "\n"
inputFile.close()
outFile.close()
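
To see why the placement matters, here is a small sketch that uses in-memory `io.StringIO` objects as stand-ins for the files (an assumption, so the example runs without `in.txt`). Writing the joined list inside the loop re-emits everything collected so far on every pass; writing once after the loop emits each line once. The sample lines keep their trailing newline, as lines read from a file do, so they are joined with an empty string. Python 3 is assumed.

```python
import io

lines = ["a\n", "b\n", "c\n"]

# Write inside the loop: the whole list so far is written on every pass
inside = io.StringIO()
log = []
for line in lines:
    log.append(line)
    inside.write("".join(log))

# Write once, after the loop has finished
outside = io.StringIO()
log = []
for line in lines:
    log.append(line)
outside.write("".join(log))

inside_text = inside.getvalue()    # "a\na\nb\na\nb\nc\n"
outside_text = outside.getvalue()  # "a\nb\nc\n"
```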

**helloR** · Jul 18 '15, 04:23 PM

@bvdet: Thank you for your help! This works, but there is still a performance issue when the input file is large.

Could you please take a look at the code below?

Code:

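from collections import OrderedDict

def remove_Duplicate_Lines(inputfile, outputfile):
    with open(inputfile) as fin, open(outputfile, 'w') as out:
        # Strip trailing whitespace; blank lines are skipped below
        lines = (line.rstrip() for line in fin)
        # OrderedDict keeps the first occurrence of each key, in insertion order
        unique_lines = OrderedDict.fromkeys(line for line in lines if line)
        out.write("\n".join(unique_lines))
    return 0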

**bvdet** · Jul 18 '15, 06:48 PM

You are iterating over the lines in the file twice. Try eliminating one of them. OrderedDict may be slower than a for loop; I don't know one way or the other. You can use the timeit module to compare different methods.
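
A minimal sketch of such a timeit comparison follows. The synthetic data and the three function names are assumptions made for illustration, and the "Jan 1" exemption is left out to keep the comparison simple. Timings depend on the machine and the data, so none are shown.

```python
import timeit
from collections import OrderedDict

# Synthetic data (an assumption): 5,000 lines with 1,000 distinct values
lines = ["line %d" % (i % 1000) for i in range(5000)]

def with_list(lines):
    out = []
    for line in lines:
        if line not in out:  # linear scan of the list on every line
            out.append(line)
    return out

def with_set(lines):
    seen, out = set(), []
    for line in lines:
        if line not in seen:  # hash lookup
            seen.add(line)
            out.append(line)
    return out

def with_ordered_dict(lines):
    return list(OrderedDict.fromkeys(lines))

# Time one run of each method on the same input
for func in (with_list, with_set, with_ordered_dict):
    seconds = timeit.timeit(lambda: func(lines), number=1)
    print("%-18s %.4f s" % (func.__name__, seconds))
```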
