How to do a report on a .txt log file

  • alexphd
    New Member
    • Dec 2006
    • 19

    How to do a report on a .txt log file

    I have a .txt log file, and most of it is crap. But some lines show when a user logs in and at what time they logged in. Below is a portion of the log file. For example, "mwoelk" is one user logging in and "dcurtin" is another. So far I have written a Python app that counts how many times a user logged in, but I'm a little clueless about how to pull out when the user logged in. Any help would be much appreciated.

    172.16.9.206 - mwoelk [01/Feb/2008:04:32:12 -0500] "GET /controller?method=getUser HTTP/1.0" 200 305
    172.16.9.166 - - [01/Feb/2008:04:57:38 -0500] "HEAD /images/DCI.gif HTTP/1.1" 200 -
    172.16.9.166 - - [01/Feb/2008:04:57:38 -0500] "HEAD /eagent.jnlp HTTP/1.1" 200 -
    172.16.9.166 - - [01/Feb/2008:04:57:38 -0500] "HEAD /jh.jnlp HTTP/1.1" 200 -
    172.16.9.166 - - [01/Feb/2008:04:57:38 -0500] "HEAD /smack.jar HTTP/1.1" 200 -
    172.16.9.166 - - [01/Feb/2008:04:57:38 -0500] "HEAD /jh.jar HTTP/1.1" 200 -
    172.16.9.166 - - [01/Feb/2008:04:57:39 -0500] "HEAD /images/DCI.gif HTTP/1.1" 200 -
    172.16.9.166 - noone [01/Feb/2008:04:57:40 -0500] "GET /controller?method=getNode&name=S14000068 HTTP/1.0" 200 499
    172.16.9.166 - - [01/Feb/2008:04:57:40 -0500] "GET /help/helpset.hs HTTP/1.1" 200 547
    172.16.9.166 - - [01/Feb/2008:04:57:43 -0500] "GET /help/map.jhm HTTP/1.1" 200 59650
    172.16.9.162 - dcurtin [01/Feb/2008:00:19:16 -0500] "GET /controller?method=getUser HTTP/1.0" 200 307

    Here is what I have done so far to count the frequency of a user logging in.

    Code:
    file = open("localhost_access_log.2008-02-01.txt", "r")
    text = file.read()
    file.close()
    
    word_list = text.lower().split(None)
    
    word_freq = {}
    for word in word_list:
        word_freq[word] = word_freq.get(word, 0) + 1
    
    keys = sorted(word_freq.keys())
    for word in keys:
        print "%-10s %d" % (word, word_freq[word])
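
The word count above tallies every token in the file, not only user names. A minimal sketch of counting just the user field, assuming the layout in the sample (the user name is the third whitespace-separated field, and "-" means no authenticated user). It uses Python 3 syntax, whereas the thread's code is Python 2; `count_logins` and `sample` are illustrative names, not part of the original program.

```python
from collections import Counter

def count_logins(lines):
    """Count requests per authenticated user (third field of each line)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip short or blank lines and anonymous requests, which log "-" as the user.
        if len(fields) >= 3 and fields[2] != "-":
            counts[fields[2]] += 1
    return counts

# Three lines copied from the sample log in the post.
sample = [
    '172.16.9.206 - mwoelk [01/Feb/2008:04:32:12 -0500] "GET /controller?method=getUser HTTP/1.0" 200 305',
    '172.16.9.166 - - [01/Feb/2008:04:57:38 -0500] "HEAD /images/DCI.gif HTTP/1.1" 200 -',
    '172.16.9.162 - dcurtin [01/Feb/2008:00:19:16 -0500] "GET /controller?method=getUser HTTP/1.0" 200 307',
]

for user, n in sorted(count_logins(sample).items()):
    print("%-10s %d" % (user, n))
```

For a real file, the same function could be fed the open file object directly, since iterating over a file yields its lines.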
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    alexphd,

    That's pretty good work you have done so far. The data is actually ordered very well for parsing. Since the log in time is enclosed in brackets, you can find the log in times by getting the index of the brackets (using the string method index()) and slicing the string (the_string[start:end]). You can also do it with a regular expression. While we're at it, why not get the user name at the same time? Let's assume the user name can only contain alphanumeric characters.[code=Python]
    import re

    fn = 'access_log.txt'
    # group 1: user name, group 2: the timestamp between the brackets
    pattLog = re.compile(r'([a-zA-Z0-9]+) \[(.+)\]')
    fileList = open(fn).readlines()
    logdict = {}
    for item in fileList:
        m = pattLog.search(item)
        if m:
            # collect every timestamp under its user name
            logdict.setdefault(m.group(1), []).append(m.group(2))

    for key in logdict:
        n = len(logdict[key])
        print 'User %s logged in %d time%s:\n%s\n' % \
              (key, n, ['','s'][n > 1 or 0], '\n'.join(logdict[key]))[/code]
    Output:
    >>> User mwoelk logged in 2 times:
    01/Feb/2008:04:32:12 -0500
    03/Feb/2008:12:01:10 -0500

    User noone logged in 2 times:
    01/Feb/2008:04:57:40 -0500
    02/Feb/2008:14:00:40 -0500

    User dcurtin logged in 1 time:
    01/Feb/2008:00:19:16 -0500

    >>>

    HTH :)
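
The index()-and-slice route mentioned at the start of this post is not shown in the thread. A rough sketch of it, assuming each line has exactly one bracketed timestamp and the user name in the third field, as in the sample log. It is written in Python 3 syntax, and `parse_login` is a hypothetical helper name.

```python
def parse_login(line):
    """Return (user, timestamp) for a log line, or None for anonymous requests."""
    fields = line.split()
    if len(fields) < 3 or fields[2] == "-":
        return None
    # index() raises ValueError if a bracket is missing from the line.
    start = line.index("[") + 1
    end = line.index("]", start)
    return fields[2], line[start:end]

line = '172.16.9.206 - mwoelk [01/Feb/2008:04:32:12 -0500] "GET /controller?method=getUser HTTP/1.0" 200 305'
print(parse_login(line))  # ('mwoelk', '01/Feb/2008:04:32:12 -0500')
```

The regular expression does the same job in one step; the slice version may be easier to follow while learning.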

    • alexphd
      New Member
      • Dec 2006
      • 19

      #3
      Okay, I see what you're doing, but I have a few questions. What exactly does logdict = {} do? Basically, I mean this block of code that you wrote:
      Code:
      for item in fileList:
          m = pattLog.search(item)
          if m:
              logdict.setdefault(m.group(1), []).append(m.group(2))
      I'm a little bit confused.

      Also, if I wanted to find the most frequently occurring user, would I add a counter to the for loop you created? Or would it be different? Sorry if I'm asking a lot of questions; I've just started to learn Python and am trying to understand the syntax fully.

      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        The keys in logdict are the user names and the values are lists of the log in times. The count of the number of log ins is the length of each log in list.[code=Python]>>> logdict
        {'mwoelk': ['01/Feb/2008:04:32:12 -0500', '03/Feb/2008:12:01:10 -0500'], 'noone': ['01/Feb/2008:04:57:40 -0500', '02/Feb/2008:14:00:40 -0500'], 'dcurtin': ['01/Feb/2008:00:19:16 -0500']}
        >>> [/code]From Python documentation:
        a.setdefault(k[, x]) returns a[k] if k in a, else x (also setting it)
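
To make the setdefault behaviour concrete, here is a small sketch in Python 3 syntax that replays the two mwoelk logins from the output earlier in the thread, one call at a time.

```python
logdict = {}

# First sighting of a key: setdefault stores the empty list and returns it,
# so append() adds the timestamp to the list now held in the dict.
logdict.setdefault("mwoelk", []).append("01/Feb/2008:04:32:12 -0500")

# Key already present: the default is ignored and the existing list is returned.
logdict.setdefault("mwoelk", []).append("03/Feb/2008:12:01:10 -0500")

print(logdict)
print(len(logdict["mwoelk"]))  # 2
```

A `collections.defaultdict(list)` would behave much the same way without the explicit setdefault call.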

        To determine the user with the most log ins:[code=Python]freqList = [[len(logdict[key]), key] for key in logdict]
        freqList.sort()

        print freqList
        print 'The user that logged in the most times is %s.' % (freqList[-1][1])
        [/code]Output:

        >>> [[1, 'dcurtin'], [2, 'mwoelk'], [4, 'noone']]
        The user that logged in the most times is noone.
        >>>
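
The same idea extends to a "top N" list. A possible sketch in Python 3 syntax, using the logdict contents printed earlier in this post; `top_users` is an illustrative name, and the alphabetical tie-break is a choice made here, so ties may resolve differently than with the plain sort above.

```python
def top_users(logdict, n=3):
    """Return the n (user, login_count) pairs with the most logins."""
    counts = [(user, len(times)) for user, times in logdict.items()]
    # Sort by count, highest first; ties are broken alphabetically by user name.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]

logdict = {
    "mwoelk": ["01/Feb/2008:04:32:12 -0500", "03/Feb/2008:12:01:10 -0500"],
    "noone": ["01/Feb/2008:04:57:40 -0500", "02/Feb/2008:14:00:40 -0500"],
    "dcurtin": ["01/Feb/2008:00:19:16 -0500"],
}
print(top_users(logdict, 2))  # [('mwoelk', 2), ('noone', 2)]
```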

        • alexphd
          New Member
          • Dec 2006
          • 19

          #5
          I actually did something very similar, except I wanted to display the top three users. So I reversed the sort and sliced the list. Below is my code.

          Code:
          freqList = [[len(logdict[key]), key] for key in logdict]
          freqList.sort(reverse=True)
          
          print freqList
          # print 'The user that logged in the most times is %s.' % (freqList[-2][1])
          
          print 'The user that logged in the most times is %s.' % (freqList[1:4])
          Right now I'm trying to display who logged in during a certain time frame. So, let's say I want to see who logged in from 8:00 to 10:00 and who logged in from 12:00 to 16:00, etc. Any idea how that can be done?

          • bvdet
            Recognized Expert Specialist
            • Oct 2006
            • 2851

            #6
            Check out the datetime module. It supports mathematical and comparison operations and is ideal for your application.
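
As a starting point with datetime, a sketch of parsing one log timestamp and testing its time of day against a window. It assumes Python 3 (where `%z` accepts offsets such as -0500) and an English locale for the `%b` month abbreviation; `in_window` and `LOG_FORMAT` are illustrative names, and the 09:15:00 timestamp is made up for the example.

```python
from datetime import datetime, time

# Matches timestamps such as '01/Feb/2008:04:57:40 -0500'.
LOG_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def in_window(stamp, start, end):
    """True if the time of day in stamp falls within [start, end]."""
    when = datetime.strptime(stamp, LOG_FORMAT)
    # .time() drops the date and the UTC offset, leaving a plain time of day.
    return start <= when.time() <= end

print(in_window("01/Feb/2008:04:57:40 -0500", time(8, 0), time(10, 0)))  # False
print(in_window("01/Feb/2008:09:15:00 -0500", time(8, 0), time(10, 0)))  # True
```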

            • alexphd
              New Member
              • Dec 2006
              • 19

              #7
              I added how many users logged in that day; now I want to narrow it down to how many users log in every three hours. I counted the users for the day as shown below, which works because the log file only covers one day.

              Code:
              count = 0
              for key in logdict:
                  count += 1
              
              print '%s users logged in today' % (count)
              But I am having trouble doing the three hours. I tried the datetime module, but I can't figure it out. I tried to do something like this:
              Code:
              datetime.datetime.fromtimestamp(mod_time)
              What do you think?
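
Two side notes and a sketch. The daily count above is simply `len(logdict)`, and `datetime.datetime.fromtimestamp()` expects a number of seconds since the epoch rather than a log string, so `strptime` is probably the better fit here. One way the three-hour grouping might look, assuming Python 3, an English locale for `%b`, and a logdict shaped like the one built earlier in the thread; `users_per_bucket` is a hypothetical name.

```python
from datetime import datetime

LOG_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def users_per_bucket(logdict, hours=3):
    """Map each bucket's starting hour to the set of users seen in that bucket."""
    buckets = {}
    for user, stamps in logdict.items():
        for stamp in stamps:
            hour = datetime.strptime(stamp, LOG_FORMAT).hour
            start = (hour // hours) * hours  # 0, 3, 6, ... for 3-hour buckets
            buckets.setdefault(start, set()).add(user)
    return buckets

# Timestamps taken from the sample log at the top of the thread.
logdict = {
    "mwoelk": ["01/Feb/2008:04:32:12 -0500"],
    "noone": ["01/Feb/2008:04:57:40 -0500"],
    "dcurtin": ["01/Feb/2008:00:19:16 -0500"],
}
for start, users in sorted(users_per_bucket(logdict).items()):
    print("%02d:00-%02d:00  %d user(s)" % (start, start + 3, len(users)))
```

Using a set per bucket means a user who logs in several times within the same three hours is counted once.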

              • bvdet
                Recognized Expert Specialist
                • Oct 2006
                • 2851

                #8
                Actually the time module can be used to compare time objects to see if a specific time falls in a range. Example:
                [code=Python]
                import time

                d1 = '01/Feb/2008:04:57:40 -0500'
                d2 = '01/Feb/2008:15:57:40 -0500'

                def time_comp(upper, lower, d):
                    # upper and lower format %H:%M:%S
                    tu = time.strptime(upper, '%H:%M:%S')
                    tl = time.strptime(lower, '%H:%M:%S')
                    # parse d
                    # example string: '01/Feb/2008:04:57:40 -0500'
                    tm = time.strptime(d.split()[0].split(':',1)[1], '%H:%M:%S')
                    if tl <= tm <= tu:
                        return True
                    return False

                print time_comp('16:00:00', '10:00:00', d1)
                print time_comp('16:00:00', '10:00:00', d2)

                if time_comp('16:00:00', '10:00:00', d2):
                    print 'User logged in during the target time.'
                else:
                    print 'Out of range'

                if time_comp('16:00:00', '10:00:00', d1):
                    print 'User logged in during the target time.'
                else:
                    print 'Out of range'[/code]

                Output:

                >>> False
                True
                User logged in during the target time.
                Out of range
                >>>

                • alexphd
                  New Member
                  • Dec 2006
                  • 19

                  #9
                  I got a different output: it printed None. Do you know where that comes from?

                  Also, how can I make that work for my whole txt file?

                  • bvdet
                    Recognized Expert Specialist
                    • Oct 2006
                    • 2851

                    #10
                    I don't know what output you are referring to. What do you want to do to your whole txt file?

                    • alexphd
                      New Member
                      • Dec 2006
                      • 19

                      #11
                      I get this output when I run it.

                      """
                      None
                      True
                      Out of range
                      User logged in during the target time.
                      """

                      What I want it to do is interpret the whole text file, not just d1 and d2. I also want it to display how many users logged in during that target time. I think I could do that with just a for loop.

                      I hope that helps clarify what I was trying to say. Thanks for your help.

                      • bvdet
                        Recognized Expert Specialist
                        • Oct 2006
                        • 2851

                        #12
                        You must be running the code I suggested. I'm not sure why your output is None (it should be False). My intent was not to provide you with a solution, but to give you a function for comparing dates so you can implement your own solution. You need to put some effort into it. I am not here to write your program for you.
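
One possible cause of the None, offered only as a guess since the thread never confirms it: if the indentation is changed when the code is retyped and `return False` ends up inside the `if` block, the function falls off the end whenever the test fails, and Python then returns None. A stripped-down illustration in Python 3 syntax, using plain integers instead of parsed times; both function names are made up for the example.

```python
def time_comp_misindented(lower, value, upper):
    if lower <= value <= upper:
        return True
        return False  # unreachable: sits inside the if, after return True

def time_comp_fixed(lower, value, upper):
    if lower <= value <= upper:
        return True
    return False

print(time_comp_misindented(10, 4, 16))  # None
print(time_comp_fixed(10, 4, 16))        # False
```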

                        • alexphd
                          New Member
                          • Dec 2006
                          • 19

                          #13
                          Sorry about that. I was not trying to make you write my whole program; there were just some parts I was confused about. I think I figured it out, though. I actually ended up using the datetime module and a for loop to count how many users were in the target time. I got a few errors because I was trying to pass a list to the time function that you gave me.

                          Thanks again for all your help; I really appreciate it. I just started to learn Python, and I'm sorry for asking so many questions.


                          Thanks again.
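
The final code was never posted. The approach described in this post (datetime plus a for loop that counts users in the target window) might look roughly like the sketch below. It assumes Python 3, an English locale for `%b`, and a logdict shaped like the one built earlier in the thread; `users_in_window` is a hypothetical name. Looping over each user's list and parsing one string at a time also avoids the error that comes from passing a whole list to a function that expects a single timestamp.

```python
from datetime import datetime, time

LOG_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def users_in_window(logdict, start, end):
    """Return the sorted names of users with at least one login in [start, end]."""
    found = set()
    for user, stamps in logdict.items():
        for stamp in stamps:
            if start <= datetime.strptime(stamp, LOG_FORMAT).time() <= end:
                found.add(user)
                break  # one hit is enough for this user
    return sorted(found)

# Timestamps taken from the sample log at the top of the thread.
logdict = {
    "mwoelk": ["01/Feb/2008:04:32:12 -0500"],
    "noone": ["01/Feb/2008:04:57:40 -0500"],
    "dcurtin": ["01/Feb/2008:00:19:16 -0500"],
}
users = users_in_window(logdict, time(4, 0), time(5, 0))
print("%d users logged in between 04:00 and 05:00: %s" % (len(users), ", ".join(users)))
```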

                          • bvdet
                            Recognized Expert Specialist
                            • Oct 2006
                            • 2851

                            #14
                            Sorry if I misunderstood you. I appreciate someone putting forth effort and getting results. It looks like you are getting there. Please ask if you have questions.
