Finding messages in huge mboxes

  • Bastiaan Welmers

    Finding messages in huge mboxes

    Hi,

    I wondered if anyone has ever met this same mbox issue.

    I'm having the following problem:

    I need to find messages in huge mbox files (50MB or more).
    The following way is (of course?) not very usable:

    fp = open("mbox", "r")
    archive = mailbox.UnixMailbox(fp)
    i = 0
    while i < message_number_needed:
        i += 1
        archive.next()

    needed_message = archive.next()

    Especially because I often need messages at the end
    of the MBOX file.
    So I tried the following (scanning messages backwards
    on found "From " lines with readline())

    i = 0
    j = 0
    while 1:
        i += 1
        fp.seek(-i, 2)  # whence=2: offset is relative to the end of the file
        line = fp.readline()
        if not line:
            break
        if line[:5] == 'From ':
            j += 1
            if j == total_messages - message_number_needed:
                archive.seekp = fp.tell()
                message = archive.next()
                # message found

    But this also seems to be slow and CPU-consuming.

    Anyone who has a better idea?

    Regards,

    Bastiaan Welmers
  • Miklós

    #2
    Re: Finding messages in huge mboxes

    What about putting it into a database like MySQL? <pyWink>

    Miklós


    "Bastiaan Welmers" <haasje@welmers.net> wrote in message
    news:401eb54c$0$315$e4fe514c@news.xs4all.nl...
    > Hi,
    >
    > I wondered if anyone has ever met this same mbox issue.
    >
    > I'm having the following problem:
    >
    > I need to find messages in huge mbox files (50MB or more).
    > The following way is (of course?) not very usable:
    >
    > fp = open("mbox", "r")
    > archive = mailbox.UnixMailbox(fp)
    > i = 0
    > while i < message_number_needed:
    >     i += 1
    >     archive.next()
    >
    > needed_message = archive.next()
    >
    > Especially because I often need messages at the end
    > of the MBOX file.
    > So I tried the following (scanning messages backwards
    > on found "From " lines with readline())
    >
    > i = 0
    > j = 0
    > while 1:
    >     i += 1
    >     fp.seek(-i, 2)  # whence=2: offset is relative to the end of the file
    >     line = fp.readline()
    >     if not line:
    >         break
    >     if line[:5] == 'From ':
    >         j += 1
    >         if j == total_messages - message_number_needed:
    >             archive.seekp = fp.tell()
    >             message = archive.next()
    >             # message found
    >
    > But this also seems to be slow and CPU-consuming.
    >
    > Anyone who has a better idea?
    >
    > Regards,
    >
    > Bastiaan Welmers


    • Diez B. Roggisch

      #3
      Re: Finding messages in huge mboxes

      > Anyone who has a better idea?

      AFAIK MUAs usually use a mbox.index-file for faster access. The index is
      computed once, and updated whenever a new message is added. You could
      create this index quite easily yourself by looping over the mbox and
      pickling a list of tell'ed positions. If you also store the creation-date
      of the index and the filesize of the mbox-file, you should be able to
      create a function that will update the index whenever the underlying mbox
      has changed. Another approach would be to perform index creation on a
      regular basis using cron.

      Regards,

      Diez

      • Donn Cave

        #4
        Re: Finding messages in huge mboxes

        In article <401eb54c$0$315$e4fe514c@news.xs4all.nl>,
        Bastiaan Welmers <haasje@welmers.net> wrote:
        ....
        > I need to find messages in huge mbox files (50MB or more).
        ....
        > Especially because I often need messages at the end
        > of the MBOX file.
        > So I tried the following (scanning messages backwards
        > on found "From " lines with readline())

        readline() is not your friend here. I suggest that
        you read large blocks of data, like 8192 bytes for
        example, and search them iteratively. Like,
        next = block.find('\nFrom ', prev + 1)

        This will give you the location of each message in
        the current block, so you can split the block up
        into a list of messages. (There will be an extra
        chunk of data at the beginning of each block, before
        the first "From " - recycle that onto the end of the
        next block.)

        Since file object buffering is at best useless in this
        application, I would use posix.open, posix.lseek and
        posix.read. Taking this approach, I find that reading
        the last 10 messages in a 100 MB folder takes 0.05 sec.

        Donn Cave, donn@u.washington.edu
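
A minimal sketch of the backward block scan Donn describes, in current Python 3, where `os.open`, `os.lseek` and `os.read` take the place of the old `posix` module calls. The function name `last_messages` and its parameters are hypothetical, and the separator is the same `'\nFrom '` used in his snippet. This is one way to arrange the fragment recycling, not necessarily how his own code did it.

```python
import os


def last_messages(path, count, block_size=8192):
    """Return up to `count` raw messages from the end of an mbox, oldest first."""
    sep = b"\nFrom "
    fd = os.open(path, os.O_RDONLY)
    try:
        pos = os.lseek(fd, 0, os.SEEK_END)
        messages = []  # collected newest first
        tail = b""     # fragment before the first separator of the later block
        while pos > 0 and len(messages) < count:
            start = max(0, pos - block_size)
            os.lseek(fd, start, os.SEEK_SET)
            # Assumes a regular file, where os.read returns the whole range.
            block = os.read(fd, pos - start) + tail
            pos = start
            # Peel complete messages off the end of the block.
            while len(messages) < count:
                i = block.rfind(sep)
                if i < 0:
                    break
                messages.append(block[i + 1:])
                block = block[:i + 1]
            tail = block
        # The very first message has no newline in front of its "From " line.
        if len(messages) < count and pos == 0 and tail.startswith(b"From "):
            messages.append(tail)
        messages.reverse()
        return messages
    finally:
        os.close(fd)
```

Because the leftover fragment is glued onto the end of the preceding block before searching, a separator that straddles two blocks is still found. The cost is that a fragment may be searched more than once when a message is larger than `block_size`.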

        • David M. Cooke

          #5
          Re: Finding messages in huge mboxes

          At some point, Donn Cave <donn@u.washington.edu> wrote:

          > In article <401eb54c$0$315$e4fe514c@news.xs4all.nl>,
          > Bastiaan Welmers <haasje@welmers.net> wrote:
          > ...
          >> I need to find messages in huge mbox files (50MB or more).
          > ...
          >> Especially because I often need messages at the end
          >> of the MBOX file.
          >> So I tried the following (scanning messages backwards
          >> on found "From " lines with readline())
          >
          > readline() is not your friend here. I suggest that
          > you read large blocks of data, like 8192 bytes for
          > example, and search them iteratively. Like,
          > next = block.find('\nFrom ', prev + 1)

          Unless, of course, you read '\nFr', then 'om ' in the next block...

          I can't think of a simple way around this (except for reading by
          lines). Concatenating the last two together means having to keep track of
          what you've seen in the last block. Maybe pick off the last line
          from the last block (using line.rfind('\n')) and concatenate that
          to the beginning of the next.

          --
          |>|\/|<
          /--------------------------------------------------------------------------\
          |David M. Cooke
          |cookedm(at)physics(dot)mcmaster(dot)ca
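
David's idea of picking off the incomplete last line and prepending it to the next block can be sketched as a forward scan. This is an illustration in Python 3 under stated assumptions: the name `from_line_offsets` and the `carry` bookkeeping are hypothetical, and the function only collects the offsets of lines starting with "From ".

```python
def from_line_offsets(path, block_size=8192):
    """Return the byte offset of every line that starts with b'From '."""
    offsets = []
    carry = b""    # incomplete last line left over from the previous block
    carry_pos = 0  # file offset at which `carry` begins (always a line start)
    with open(path, "rb", buffering=0) as fp:
        while True:
            block = fp.read(block_size)
            if not block:
                break
            data = carry + block
            cut = data.rfind(b"\n") + 1  # 0 when the data holds no newline yet
            complete, carry = data[:cut], data[cut:]
            # `complete` starts at a line start, so check its first line directly.
            if complete.startswith(b"From "):
                offsets.append(carry_pos)
            i = complete.find(b"\nFrom ")
            while i >= 0:
                offsets.append(carry_pos + i + 1)
                i = complete.find(b"\nFrom ", i + 1)
            carry_pos += cut
    # A final line without a trailing newline is still sitting in `carry`.
    if carry.startswith(b"From "):
        offsets.append(carry_pos)
    return offsets
```

Since only whole lines are ever searched, a "From " split as `'\nFr'` and `'om '` across two reads is simply held back until the rest of the line arrives.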

          • Donn Cave

            #6
            Re: Finding messages in huge mboxes

            Quoth cookedm+news@physics.mcmaster.ca (David M. Cooke):
            | At some point, Donn Cave <donn@u.washington.edu> wrote:
            |> In article <401eb54c$0$315$e4fe514c@news.xs4all.nl>,
            |> Bastiaan Welmers <haasje@welmers.net> wrote:
            |> ...
            |>> I need to find messages in huge mbox files (50MB or more).
            |> ...
            |>> Especially because I often need messages at the end
            |>> of the MBOX file.
            |>> So I tried the following (scanning messages backwards
            |>> on found "From " lines with readline())
            |>
            |> readline() is not your friend here. I suggest that
            |> you read large blocks of data, like 8192 bytes for
            |> example, and search them iteratively. Like,
            |> next = block.find('\nFrom ', prev + 1)
            |
            | Unless, of course, you read '\nFr', then 'om ' in the next block...
            |
            | I can't think of a simple way around this (except for reading by
            | lines). Concatenating the last two together means having to keep track of
            | what you've seen in the last block. Maybe pick off the last line
            | from the last block (using line.rfind('\n')) and concatenate that
            | to the beginning of the next.

            I'm reading from the end backwards, so the fragment is block[:start].
            Append that to the block before it, and each block always will end at
            a message boundary. If you start in the middle, you have to deal with
            an extra boundary problem. If reading forward from the beginning, it
            would be about as simple.

            If I have overlooked some obvious problem with this, it wouldn't be
            the first time, but I think it's as simple as it could be. The only
            inelegance to it is that you have to scan the fragment at least twice
            (one extra time for each time it's added to a new block.)

            Donn Cave, donn@drizzle.com

            • Miki Tebeka

              #7
              Re: Finding messages in huge mboxes

              Hello Bastiaan,

              > I need to find messages in huge mbox files (50MB or more).
              > ...
              > Anyone who has a better idea?
              I find that sometimes using the little Unix utilities (which are
              available for M$ as well) gives very good performance.

              --- last.py ---
              #!/usr/bin/env python
              from os import popen
              from sys import argv

              # Find the line number of the last mbox "From " separator line
              last = popen("grep -n '^From ' %s | tail -1" % argv[1]).read()
              last = int(last.split(":")[0])
              # Find total number of lines
              size = popen("wc -l %s" % argv[1]).read()
              size = int(size.split()[0].strip())
              # Print the message, including its "From " line
              print popen("tail -%d %s" % (size - last + 1, argv[1])).read()
              --- last.py ---
              Took less than 1 sec on my computer on an 11MB mailbox.

              HTH.
              Miki

              • Cameron Laird

                #8
                Re: Finding messages in huge mboxes

                In article <4f0a9fdb.0402022331.394b3002@posting.google.com>,
                Miki Tebeka <miki.tebeka@zoran.com> wrote:
                >Hello Bastiaan,
                >
                >> I need to find messages in huge mbox files (50MB or more).
                >> ...
                >> Anyone who has a better idea?
                >I find that sometimes using the little Unix utilities (which are
                >available for M$ as well) gives very good performance.
                >
                >--- last.py ---
                >#!/usr/bin/env python
                >from os import popen
                >from sys import argv
                >
                ># Find the line number of the last mbox "From " separator line
                >last = popen("grep -n '^From ' %s | tail -1" % argv[1]).read()
                >last = int(last.split(":")[0])
                ># Find total number of lines
                >size = popen("wc -l %s" % argv[1]).read()
                >size = int(size.split()[0].strip())
                ># Print the message, including its "From " line
                >print popen("tail -%d %s" % (size - last + 1, argv[1])).read()
                >--- last.py ---
                >Took less than 1 sec on my computer on an 11MB mailbox.

                • Erno Kuusela

                  #9
                  Re: Finding messages in huge mboxes

                  Bastiaan Welmers <haasje@welmers.net> writes:

                  >
                  > Especially because I often need messages at the end
                  > of the MBOX file.
                  > So I tried the following (scanning messages backwards
                  > on found "From " lines with readline())
                  >
                  > i = 0
                  > j = 0
                  > while 1:
                  >     i += 1
                  >     fp.seek(-i, 2)  # whence=2: offset is relative to the end of the file
                  >     line = fp.readline()
                  >     if not line:
                  >         break
                  >     if line[:5] == 'From ':
                  >         j += 1
                  >         if j == total_messages - message_number_needed:
                  >             archive.seekp = fp.tell()
                  >             message = archive.next()
                  >             # message found
                  >
                  > But this also seems to be slow and CPU-consuming.

                  something like this might work. the loop below scanned a 115MB mailbox
                  in about 1 second on a 1.2ghz k7. extracts the next-to-last message,
                  but you get the idea. if you don't want to read the file into cache,
                  you could adapt it to start with a smaller mmapped chunk from the end
                  of the file and enlarge it until you find what you want.


                  import os, re, mmap, sys
                  from cStringIO import StringIO
                  import email

                  fd = os.open(sys.argv[1], os.O_RDONLY)
                  size = os.fstat(fd).st_size
                  print size
                  # Map the whole file read-only and record where each separator starts
                  buf = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
                  message_offsets = []
                  for m in re.finditer(r'(?s)\n\nFrom', buf):
                      message_offsets.append(m.start())

                  # Slice out the next-to-last message, skipping the two leading newlines
                  msgfp = StringIO(buf[message_offsets[-2] + 2:message_offsets[-1] + 2])
                  msg = email.message_from_file(msgfp)
                  print msg['to']

                  -- erno
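
A Python 3 variation on Erno's mmap approach, offered as a sketch rather than a port of his code. Instead of growing a mapped chunk from the end as he suggests, it maps the whole file and searches backward with `mmap.rfind`, which also avoids scanning the mailbox from the start. The function name `message_from_end` is hypothetical, and the separator is assumed to be a blank line followed by "From ".

```python
import email
import mmap


def message_from_end(path, n=1):
    """Parse the n-th message from the end of an mbox (n=1 is the last one)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    sep = b"\n\nFrom "
    with open(path, "rb") as fp:
        # Length 0 maps the whole file; an empty file raises ValueError here.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            end = len(buf)
            for remaining in range(n, 0, -1):
                i = buf.rfind(sep, 0, end)
                # No separator left means the message starts at offset 0.
                start = i + 2 if i >= 0 else 0
                if remaining == 1:
                    return email.message_from_bytes(buf[start:end])
                if i < 0:
                    raise IndexError("mbox has fewer than %d messages" % n)
                end = start
```

For example, `message_from_end("archive.mbox", 2)["to"]` would give the `To` header of the next-to-last message, matching what the script above prints.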

                  • Bastiaan Welmers

                    #10
                    Re: Finding messages in huge mboxes

                    Miki Tebeka wrote:

                    > Hello Bastiaan,
                    >
                    >> I need to find messages in huge mbox files (50MB or more).
                    >> ...
                    >> Anyone who has a better idea?
                    > I find that sometimes using the little Unix utilities (which are
                    > available for M$ as well) gives very good performance.
                    >
                    Sounds like a very good idea. Thanks.

                    /Bastiaan

                    • Bastiaan Welmers

                      #11
                      Re: Finding messages in huge mboxes

                      Miklós wrote:

                      > What about putting it into a database like MySQL? <pyWink>
                      >

                      Too much work to achieve this. It's just a Mailman archive mbox
                      that has to be opened, so I would then have to rewrite the
                      pipermail archiver.

                      /Bastiaan

                      • Bastiaan Welmers

                        #12
                        Re: Finding messages in huge mboxes

                        Diez B. Roggisch wrote:

                        >> Anyone who has a better idea?
                        >
                        > AFAIK MUAs usually use a mbox.index-file for faster access. The index is
                        > computed once, and updated whenever a new message is added. You could
                        > create this index quite easily yourself by looping over the mbox and
                        > pickling a list of tell'ed positions. If you also store the creation-date
                        > of the index and the filesize of the mbox-file, you should be able to
                        > create a function that will update the index whenever the underlying mbox
                        > has changed. Another approach would be to perform index creation on a
                        > regular basis using cron.
                        Also a good idea. It's a Mailman archive, so I would then have
                        to hack Mailman to create an index file beside the
                        mbox file.

                        /Bastiaan
