Ignoring similar entries in a column of many rows

**bvdet** · Sep 17 '10, 01:01 PM

This can easily be accomplished with the in operator:

Code:

s = '''45,56,890
67,92,4502
76,367,89
67,92,4502
67,92,4502
92,14,05
02,56,125
25,02,61
02,56,125'''

# create list of lines
lineList = s.split("\n")
output = []
# simulate iteration on file object or list of lines
for line in lineList:
    if line not in output:
        output.append(line)
print "\n".join(output)

Output:
>>> 45,56,890
67,92,4502
76,367,89
92,14,05
02,56,125
25,02,61
>>>

**dwblas** · Sep 17 '10, 07:15 PM

If you have a large amount of data, use sets, i.e each group of three numbers would become a tuple added to a set. Sets don't add duplicates but do alter the order. If you want to retain the original file order, then use an OrderedDict for a large data set. If bvdet's solution runs in a reasonable amount of time then there is no reason to change. Post back if you want more info on either alternate solution.

**Atrisa** · Sep 20 '10, 09:02 AM

Thanks a lot both of you. I need to use bvdet's code like in the following:

Code:

outfile = open('capture25000-column3.txt', 'w')

for (i, eachline) in enumerate(open('capture25000.txt','r')):
 column3 = eachline.split(' ')[2]
 newcolumn = column3.split('.')[-1]
 outfile.write(newcolumn + '\n')
 output = []
 lineList = newcolumn.split("\n")
 for line in lineList:
  if line not in output:
   output.append(line)
 print "\n".join(output)

Basically what I have is an input file of 25000 rows, 'capture25000.t xt'. I extract the column that I need from this file and put that column in another file 'capture25000-column3.txt'. So how do I use this new output file in bvdet's code? I have done the way above, but it again prints the same values as in the 'newcolumn' list or 'capture25000-column3.txt' file. I want to get rid of the similar lines in the 'capture25000-column3.txt' file.

**bvdet** · Sep 20 '10, 03:24 PM

Judging from your code, it appears that the file format is not like the sample data in your first post. Could you post a representative sample of your actual data?

**Atrisa** · Sep 20 '10, 03:38 PM

The actual data which is in a file called 'capture25000-column3.txt' is a set of port numbers like this:

52141
21283
43439
21283
21283
23479
80
10464
80
80
21283
43439
80
10464
10464
21283
.
.
.

and there are 25000 rows of those numbers. If you need all of them, I could send it to your email.

**bvdet** · Sep 20 '10, 05:44 PM

So that's supposed to be the output? What does the corresponding data look like?

**Atrisa** · Sep 20 '10, 06:26 PM

I first have an input file that has captured data. I need only the third column from that file, so I did it this way:

Code:

# for (i, eachline) in enumerate(open('capture25000.txt','r')):
#  column3 = eachline.split(' ')[2]

Then I get the list 'column3' which looks like this:

27.20.29.12.212 83
27.20.29.12.212 83
65.24.112.69.23 479
27.120.98.21.80
32.16.45.74.104 64
56.176.45.74.10 464
80.290.20.12.21 283
45.85.178.252.2 8302
80.91.188.120.1 328
27.144.102.81.1 3696
62.244.260.6.33 344
54.57.14.68.80
200.97.17.48.80

There 2500 rows of those. Then I use the following piece of code:

Code:

#  newcolumn = column3.split('.')[-1]

to get only the port numbers as in the following:

21283
21283
23479
80
10464
1046
1283
28302
1328
13696
33344
80
80
.
.
.

As you notice there are more than one occurrence of every port number. I want to leave only the first occurrence of any number and get rid of the rest, which your code is supposed to do. But since I have those port numbers in an external file, I didn't know how to use your code for my purpose.

Thanks a lot again for your time

**bvdet** · Sep 20 '10, 06:36 PM

See if this works for you:

Code:

outfile = open('capture25000-column3.txt', 'w')

f = open('capture25000.txt')
output = []
for line in f:
    column3 = line.strip().split()[2].split(".")[-1]
    if column3 not in output:
        output.append(column3)

f.close()
outfile.write("\n".join(output))
outfile.close()

**Atrisa** · Sep 21 '10, 07:44 AM

Thanks a lot bvdet. That worked well:)

**Atrisa** · Sep 21 '10, 11:41 AM

I also want to count how many times the port numbers appeared in the 'column3' and put the port number and beside it the number of appearances of that port in a file, so that I have something like this:

21283 3
21283 2
23479 1
80 7
10464 2

and so on. This is my code that tries to do that, but it gives me only 1 appearance for each port:

Code:

outfile = open('capture25000-column3.txt', 'w')

f = open('capture25000.txt')

output = []
for line in f:
   column3 = line.strip().split()[2].split(".")[-1].split("\n")
   for val in column3:   
      if not val in output:
         print val, column3.count(val)
         output.append(val)

f.close()
outfile.write("\n".join(output))  
outfile.close()

It seems to count the appearances from the file where I have only the unique numbers, so basically each of those appear for once there. What am I missing in the above code?

**bvdet** · Sep 21 '10, 01:44 PM

Variable column3 is a string representing a number. Splitting column3 on the "\n" character does nothing except return a one element list. column3 is only added to the output list one time, so it is only counted once.

I would recommend storing the data in a dictionary. The port numbers would be the dictionary keys and the counts would be the values.

Code:

outfile = open('capture25000-column3.txt', 'w')

f = open('capture25000.txt')
# initialize a dictionary
dd = {}
# iterate on the file object
for line in f:
    # get the port number from the third item in line
    column3 = line.strip().split()[2].split(".")[-1]
    # if column3 not in dd, add to dd and set quantity to 0
    dd.setdefault(column3, 0)
    # increment dd[port number] by one
    dd[column3] += 1

f.close()
# write the dictionary to disk
outfile.write("\n".join(["%s %s" % (key, dd[key]) for key in dd]))
outfile.close()

**Atrisa** · Sep 21 '10, 02:21 PM

Thanks a lot bvdet for your help and your time. That works perfectly. It's two weeks that I have started learning Python, that's why I am a bit lost.

**Atrisa** · Sep 21 '10, 07:18 PM

Now I have created the following method in the code that you did earlier. The method creates a dictionary that has the port numbers and their corresponding port names:

Code:

def get_port_name (name, port_table):
# ports.txt is a file where I have put the IANA PORT NUMBERS
    f= open('ports.txt','r')
    content=f.readline()
    keycount=1
    while content:
        key = str(keycount)
        line1=content.split()   
        if len(line1)>3 and line1[0]!="#":
            port=line1[1].split('/')
            if len(port)==2:
                port_table[int(port[0])]={'keycode':line1[0],'description':line1[2]}
        keycount +=1
        content=f.readline()

ports_name={}
get_port_name ('Ports', ports_name)
print ports_name;

A sample of the result of the code above is like this:

-----------------------------------------------------
80: {'keycode': 'www-http', 'description': 'World'},
82: {'keycode': 'xfer', 'description': 'XFER'},
83: {'keycode': 'mit-ml-dev', 'description': 'MIT'},
---------------------------------------------------

e.g. the first line shows port 80 with the name being 'www-http'. I want to grab those keycode values and put them in the 3rd column of the output file that you gave earlier and have a result like this for all the port numbers:

80 115 HTTP
443 52 HTTPS

and so on, with the following line:

Code:

outfile.write("\n".join(["%s %s %s" % (key, dd[key], port_table['keycode'][key]) for key in dd]))

but it doesn't work. How to use the method here for the output? I am yet well familiar with the syntax of Python methods I guess.

**bvdet** · Sep 21 '10, 07:25 PM

It may be as simple as:

Code:

port_table[key]['keycode']

Ignoring similar entries in a column of many rows

Ignoring similar entries in a column of many rows

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment