tips requested for a log-processing script

  • Jaap

    tips requested for a log-processing script

    Pythoneers,
    As a relatively new user of Python, I would like to ask your advice on
    the following script I want to create.

    I have a logfile which contains records. All records have the same
    layout and are stored in CSV format. Each record is (non-uniquely)
    identified by a date and an itemID. Each itemID can occur 0 or more times
    per month. The item contains a figure/amount which I need to sum per
    month and per itemID. I have already managed to separate the individual
    parts of each logfile record by using the csv module from Python 2.5,
    which was very simple indeed.
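
A minimal sketch of the summing step described above, for readers following along. The post does not give the column layout or date format, so the column positions, the `%Y-%m-%d` format, and the function name `sum_per_month` are all assumptions made for illustration; the code is written for a current Python 3 rather than the 2.5 mentioned in the post.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def sum_per_month(lines, date_col=0, item_col=1, amount_col=2,
                  date_format="%Y-%m-%d"):
    """Return {(year, month, item_id): total} for an iterable of CSV lines.

    Column positions and date format are assumptions, not taken from the post.
    """
    totals = defaultdict(float)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        when = datetime.strptime(row[date_col], date_format)
        key = (when.year, when.month, row[item_col])
        totals[key] += float(row[amount_col])
    return dict(totals)


# Made-up sample data standing in for the real logfile.
sample = io.StringIO(
    "2007-01-03,123456,10.5\n"
    "2007-01-17,123456,4.5\n"
    "2007-02-01,123456,1.0\n"
    "2007-01-09,777,2.0\n"
)
totals = sum_per_month(sample)
```

Keying the dictionary on `(year, month, item_id)` keeps the whole summary in one flat structure, which is comfortable for a file of this size.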

    Apart from this I have a configuration file, which contains the list of
    itemIDs I need to focus on per month. Not all itemIDs are relevant for
    each month; some are relevant only every second or third month, for
    example. All records in the logfile with other itemIDs can be ignored.
    I have yet to define the format of this configuration file, but am
    thinking about a 0 or 1 for each month, followed by the itemID, like:
    "1 0 0 1 0 0 1 0 0 1 0 0 123456" for an itemID 123456 which only needs
    consideration in the first month of each quarter.
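
One possible way to read the proposed configuration format is sketched below. The format is still undecided in the post, so this simply assumes twelve whitespace-separated 0/1 flags followed by the itemID on each line; `parse_config` is a hypothetical helper name.

```python
def parse_config(lines):
    """Map each item ID to the set of month numbers (1..12) flagged with 1.

    Assumes the tentative line format from the post:
    twelve 0/1 flags followed by the item ID, separated by whitespace.
    """
    relevant = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        if len(fields) != 13:
            raise ValueError("expected 12 flags and an item ID: %r" % line)
        flags, item_id = fields[:12], fields[12]
        relevant[item_id] = {
            month for month, flag in enumerate(flags, start=1) if flag == "1"
        }
    return relevant


config = parse_config(["1 0 0 1 0 0 1 0 0 1 0 0 123456"])
```

With a set per itemID, checking whether a record matters reduces to a membership test such as `month in config.get(item_id, ())`.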

    My question to this forum is: which data structure would you propose?
    The logfile is not very big (about 200k max, average 200k) so I assume I
    can keep it in memory, in a list?

    How would you propose I tackle the filtering of relevant/non-relevant
    items from the logfile? Would you propose I use filter(func, list) for
    this task, or is something else better?

    In the end I want to mail the outcome of my process, but this seems
    straightforward from the documentation I have found, although I must
    connect to an external SMTP server.
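
For the mailing step, a rough sketch using the standard library's `smtplib` and `email.message` modules might look like this. The addresses, subject, and the helper names `build_summary_mail` and `send_mail` are placeholders, the host and port would come from whoever runs the external server, and the code targets Python 3. Authentication or TLS, if the server requires them, are not shown.

```python
import smtplib
from email.message import EmailMessage


def build_summary_mail(sender, recipient, subject, body):
    """Build a plain-text message; all arguments are caller-supplied."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_mail(msg, host, port=25):
    """Hand the message to an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


# Placeholder addresses and content; nothing is sent until send_mail is called.
msg = build_summary_mail(
    "reports@example.com",
    "jaap@example.com",
    "Monthly summary",
    "123456: 15.0\n",
)
```

Keeping message construction separate from sending makes it possible to inspect the summary without a server at hand.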

    Any tips, views, or advice are highly appreciated!


    Jaap

    PS: when I load the logfile in a spreadsheet I can create a pivot table
    which does about the same ;-] but that is not what I want; the
    processing must be automated in the end with a periodic script which
    e-mails the summary of the key figure every month.
  • martdi

    #2
    Re: tips requested for a log-processing script

    If you are running on Windows, you can use the win32com module to
    automate generating a pivot table in Excel, and then write code
    to send it via e-mail.





    • George Sakkis

      #3
      Re: tips requested for a log-processing script

      Jaap wrote:
      Apart from this I have a configuration file, which contains the list of
      itemIDs I need to focus on per month. Not all itemIDs are relevant for
      each month, but for example only every second or third month. All
      records in the logfile with other itemIDs can be ignored. I have yet to
      define the format of this configuration file, but am thinking about a 0
      or 1 for each month, and then the itemID, like:
      "1 0 0 1 0 0 1 0 0 1 0 0 123456" for an itemID 123456 which only needs
      consideration at first month of each quarter.

      It's probably not necessary if your records number on the order of 100K,
      but if you're dealing with millions or more, you can write your
      config file in binary using the struct module and condense it down to 6
      bytes per record (32 bits for the ID and 12 bits for the month
      occurrences, rounded up to whole bytes). Filtering will also be faster,
      as for each record you just have to do a bitwise AND with the
      0..010...0 mask corresponding to a given month.
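
The binary layout suggested here could be sketched as follows. The exact byte order and field types are not specified in the reply, so the `"<IH"` format (little-endian, 4-byte unsigned ID plus a 2-byte mask of which 12 bits are used) and the helper names `pack_record` and `is_relevant` are assumptions.

```python
import struct

# 4-byte unsigned ID + 2-byte unsigned month mask, no padding: 6 bytes total.
RECORD = struct.Struct("<IH")


def pack_record(item_id, months):
    """Pack an item ID and its relevant months (1..12) into 6 bytes."""
    mask = 0
    for month in months:
        mask |= 1 << (month - 1)  # bit 0 is January, bit 11 is December
    return RECORD.pack(item_id, mask)


def is_relevant(record, month):
    """Bitwise-AND the stored mask with the single-bit mask for `month`."""
    _item_id, mask = RECORD.unpack(record)
    return bool(mask & (1 << (month - 1)))


rec = pack_record(123456, [1, 4, 7, 10])
```

Here `is_relevant(rec, 4)` is true and `is_relevant(rec, 5)` is false, since only the quarter-start bits are set.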

      George


      • Hendrik van Rooyen

        #4
        Re: tips requested for a log-processing script

        I would do something like this (obviously untested):

            item_dict = {}
            for line in open(logfile, "r"):
                # (code to get hold of item, date, amount from line)
                if item not in item_dict:
                    item_dict[item] = [(date, amount)]
                else:
                    item_dict[item].append((date, amount))

        This will give you, for each unique item, a direct reference to wherever
        it's been used.

        I would then work through the config file and extract the items of
        interest for the run date...
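
The second step described above, working through the config and pulling out the items of interest for the run date, might look roughly like this. It assumes the dictionary maps each item to a list of `(date, amount)` tuples as in the loop above, that dates are `datetime.date` objects, and that the config has already been turned into a mapping from item to a set of month numbers; `totals_for_month` and that mapping are hypothetical.

```python
from datetime import date


def totals_for_month(item_dict, relevant_months, year, month):
    """Sum amounts per item for one month, for items flagged in the config.

    item_dict:       {item: [(date, amount), ...]}
    relevant_months: {item: set of month numbers 1..12} (assumed format)
    """
    totals = {}
    for item, months in relevant_months.items():
        if month not in months:
            continue  # item is not of interest this month
        totals[item] = sum(
            amount
            for when, amount in item_dict.get(item, [])
            if when.year == year and when.month == month
        )
    return totals


# Made-up data for illustration only.
item_dict = {
    "123456": [(date(2007, 1, 3), 10.5), (date(2007, 1, 17), 4.5),
               (date(2007, 2, 1), 1.0)],
    "777": [(date(2007, 1, 9), 2.0)],
}
relevant_months = {"123456": {1, 4, 7, 10}}
january = totals_for_month(item_dict, relevant_months, 2007, 1)
february = totals_for_month(item_dict, relevant_months, 2007, 2)
```

Driving the loop from the config rather than from the logfile means irrelevant items are never touched, and an item of interest with no records in the month still appears with a total of 0.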

        HTH - Hendrik




        • Jaap

          #5
          Re: tips requested for a log-processing script

          Thanks!
          All your replies have been both to the point and helpful to me.

          You have proven that both Python and its community are open and
          welcoming to new users.

          Jaap
