Ignoring similar entries in a column of many rows

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Atrisa
    New Member
    • Sep 2010
    • 22

    Ignoring similar entries in a column of many rows

    Hi everyone;

    I have a file with one column the elements of which are separated by commas (,). The column has many rows, as in below:

    45,56,890
    67,92,4502
    76,367,89
    67,92,4502
    67,92,4502
    92,14,05
    02,56,125
    25,02,61
    02,56,125
    .
    .
    .

    I want to go through all the rows in the column ignoring the similar rows and put the result in an output file. For example I want to count 67,92,4502 once and ignore the other similar ones (rows 4 and 5 are the same).

    I hope I have explained the problem clearly.
    Thanks in advance, Atrisa
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    This can easily be accomplished with the in operator:
    Code:
    s = '''45,56,890
    67,92,4502
    76,367,89
    67,92,4502
    67,92,4502
    92,14,05
    02,56,125
    25,02,61
    02,56,125'''
    
    # create list of lines
    lineList = s.split("\n")
    output = []
    # simulate iteration on file object or list of lines
    for line in lineList:
        if line not in output:
            output.append(line)
    print "\n".join(output)
    Output:
    >>> 45,56,890
    67,92,4502
    76,367,89
    92,14,05
    02,56,125
    25,02,61
    >>>

    Comment

    • dwblas
      Recognized Expert Contributor
      • May 2008
      • 626

      #3
      If you have a large amount of data, use sets, i.e each group of three numbers would become a tuple added to a set. Sets don't add duplicates but do alter the order. If you want to retain the original file order, then use an OrderedDict for a large data set. If bvdet's solution runs in a reasonable amount of time then there is no reason to change. Post back if you want more info on either alternate solution.

      Comment

      • Atrisa
        New Member
        • Sep 2010
        • 22

        #4
        Thanks a lot both of you. I need to use bvdet's code like in the following:

        Code:
        outfile = open('capture25000-column3.txt', 'w')
        
        for (i, eachline) in enumerate(open('capture25000.txt','r')):
         column3 = eachline.split(' ')[2]
         newcolumn = column3.split('.')[-1]
         outfile.write(newcolumn + '\n')
         output = []
         lineList = newcolumn.split("\n")
         for line in lineList:
          if line not in output:
           output.append(line)
         print "\n".join(output)
        Basically what I have is an input file of 25000 rows, 'capture25000.t xt'. I extract the column that I need from this file and put that column in another file 'capture25000-column3.txt'. So how do I use this new output file in bvdet's code? I have done the way above, but it again prints the same values as in the 'newcolumn' list or 'capture25000-column3.txt' file. I want to get rid of the similar lines in the 'capture25000-column3.txt' file.

        Comment

        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #5
          Judging from your code, it appears that the file format is not like the sample data in your first post. Could you post a representative sample of your actual data?

          Comment

          • Atrisa
            New Member
            • Sep 2010
            • 22

            #6
            The actual data which is in a file called 'capture25000-column3.txt' is a set of port numbers like this:

            52141
            21283
            43439
            21283
            21283
            23479
            80
            10464
            80
            80
            21283
            43439
            80
            10464
            10464
            21283
            .
            .
            .

            and there are 25000 rows of those numbers. If you need all of them, I could send it to your email.

            Comment

            • bvdet
              Recognized Expert Specialist
              • Oct 2006
              • 2851

              #7
              So that's supposed to be the output? What does the corresponding data look like?
              Last edited by bvdet; Sep 20 '10, 05:46 PM. Reason: Misread OP's last post

              Comment

              • Atrisa
                New Member
                • Sep 2010
                • 22

                #8
                I first have an input file that has captured data. I need only the third column from that file, so I did it this way:
                Code:
                # for (i, eachline) in enumerate(open('capture25000.txt','r')):
                #  column3 = eachline.split(' ')[2]
                Then I get the list 'column3' which looks like this:

                27.20.29.12.212 83
                27.20.29.12.212 83
                65.24.112.69.23 479
                27.120.98.21.80
                32.16.45.74.104 64
                56.176.45.74.10 464
                80.290.20.12.21 283
                45.85.178.252.2 8302
                80.91.188.120.1 328
                27.144.102.81.1 3696
                62.244.260.6.33 344
                54.57.14.68.80
                200.97.17.48.80

                There 2500 rows of those. Then I use the following piece of code:
                Code:
                #  newcolumn = column3.split('.')[-1]
                to get only the port numbers as in the following:

                21283
                21283
                23479
                80
                10464
                1046
                1283
                28302
                1328
                13696
                33344
                80
                80
                .
                .
                .

                As you notice there are more than one occurrence of every port number. I want to leave only the first occurrence of any number and get rid of the rest, which your code is supposed to do. But since I have those port numbers in an external file, I didn't know how to use your code for my purpose.

                Thanks a lot again for your time
                Last edited by bvdet; Sep 20 '10, 06:29 PM.

                Comment

                • bvdet
                  Recognized Expert Specialist
                  • Oct 2006
                  • 2851

                  #9
                  See if this works for you:
                  Code:
                  outfile = open('capture25000-column3.txt', 'w')
                  
                  f = open('capture25000.txt')
                  output = []
                  for line in f:
                      column3 = line.strip().split()[2].split(".")[-1]
                      if column3 not in output:
                          output.append(column3)
                  
                  f.close()
                  outfile.write("\n".join(output))
                  outfile.close()

                  Comment

                  • Atrisa
                    New Member
                    • Sep 2010
                    • 22

                    #10
                    Thanks a lot bvdet. That worked well:)

                    Comment

                    • Atrisa
                      New Member
                      • Sep 2010
                      • 22

                      #11
                      I also want to count how many times the port numbers appeared in the 'column3' and put the port number and beside it the number of appearances of that port in a file, so that I have something like this:

                      21283 3
                      21283 2
                      23479 1
                      80 7
                      10464 2

                      and so on. This is my code that tries to do that, but it gives me only 1 appearance for each port:

                      Code:
                      outfile = open('capture25000-column3.txt', 'w')
                      
                      f = open('capture25000.txt')
                      
                      output = []
                      for line in f:
                         column3 = line.strip().split()[2].split(".")[-1].split("\n")
                         for val in column3:   
                            if not val in output:
                               print val, column3.count(val)
                               output.append(val)
                      
                      f.close()
                      outfile.write("\n".join(output))  
                      outfile.close()
                      It seems to count the appearances from the file where I have only the unique numbers, so basically each of those appear for once there. What am I missing in the above code?

                      Comment

                      • bvdet
                        Recognized Expert Specialist
                        • Oct 2006
                        • 2851

                        #12
                        Variable column3 is a string representing a number. Splitting column3 on the "\n" character does nothing except return a one element list. column3 is only added to the output list one time, so it is only counted once.

                        I would recommend storing the data in a dictionary. The port numbers would be the dictionary keys and the counts would be the values.
                        Code:
                        outfile = open('capture25000-column3.txt', 'w')
                        
                        f = open('capture25000.txt')
                        # initialize a dictionary
                        dd = {}
                        # iterate on the file object
                        for line in f:
                            # get the port number from the third item in line
                            column3 = line.strip().split()[2].split(".")[-1]
                            # if column3 not in dd, add to dd and set quantity to 0
                            dd.setdefault(column3, 0)
                            # increment dd[port number] by one
                            dd[column3] += 1
                        
                        f.close()
                        # write the dictionary to disk
                        outfile.write("\n".join(["%s %s" % (key, dd[key]) for key in dd]))
                        outfile.close()

                        Comment

                        • Atrisa
                          New Member
                          • Sep 2010
                          • 22

                          #13
                          Thanks a lot bvdet for your help and your time. That works perfectly. It's two weeks that I have started learning Python, that's why I am a bit lost.

                          Comment

                          • Atrisa
                            New Member
                            • Sep 2010
                            • 22

                            #14
                            Now I have created the following method in the code that you did earlier. The method creates a dictionary that has the port numbers and their corresponding port names:

                            Code:
                            def get_port_name (name, port_table):
                            # ports.txt is a file where I have put the IANA PORT NUMBERS
                                f= open('ports.txt','r')
                                content=f.readline()
                                keycount=1
                                while content:
                                    key = str(keycount)
                                    line1=content.split()   
                                    if len(line1)>3 and line1[0]!="#":
                                        port=line1[1].split('/')
                                        if len(port)==2:
                                            port_table[int(port[0])]={'keycode':line1[0],'description':line1[2]}
                                    keycount +=1
                                    content=f.readline()
                            
                            ports_name={}
                            get_port_name ('Ports', ports_name)
                            print ports_name;
                            A sample of the result of the code above is like this:

                            -----------------------------------------------------
                            80: {'keycode': 'www-http', 'description': 'World'},
                            82: {'keycode': 'xfer', 'description': 'XFER'},
                            83: {'keycode': 'mit-ml-dev', 'description': 'MIT'},
                            ---------------------------------------------------

                            e.g. the first line shows port 80 with the name being 'www-http'. I want to grab those keycode values and put them in the 3rd column of the output file that you gave earlier and have a result like this for all the port numbers:

                            80 115 HTTP
                            443 52 HTTPS

                            and so on, with the following line:
                            Code:
                            outfile.write("\n".join(["%s %s %s" % (key, dd[key], port_table['keycode'][key]) for key in dd]))
                            but it doesn't work. How to use the method here for the output? I am yet well familiar with the syntax of Python methods I guess.

                            Comment

                            • bvdet
                              Recognized Expert Specialist
                              • Oct 2006
                              • 2851

                              #15
                              It may be as simple as:
                              Code:
                              port_table[key]['keycode']

                              Comment

                              Working...