streaming a file object through re.finditer

  • Erick

    streaming a file object through re.finditer

    Hello,

    I've been looking for a while for an answer, but so far I haven't been
    able to turn anything up yet. Basically, what I'd like to do is to use
    re.finditer to search a large file (or a file stream), but I haven't
    figured out how to get finditer to work without loading the entire file
    into memory, or just reading one line at a time (or more complicated
    buffering).

    For example, say I do this:
    cat a b c > blah

    Then run this python script:
    >>> import re
    >>> for m in re.finditer('\w+', buffer(file('blah'))):
    ...     print m.group()
    ...
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    TypeError: buffer object expected

    Of course, this works fine, but it loads the file completely into
    memory (right?):
    >>> for m in re.finditer('\w+', buffer(file('blah').read())):
    ...     print m.group()
    ...
    a
    b
    c

    So, is there any way to do this?

    Thanks,

    -e

  • Erick

    #2
    Re: streaming a file object through re.finditer

    Ack, typo. What I meant was this:
    cat a b c > blah
    >>> import re
    >>> for m in re.finditer('\w+', file('blah')):
    ...     print m.group()
    ...
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    TypeError: buffer object expected

    Of course, this works fine, but it loads the file completely into
    memory (right?):
    >>> for m in re.finditer('\w+', file('blah').read()):
    ...     print m.group()
    ...
    a
    b
    c


    • Daniel Bickett

      #3
      Re: streaming a file object through re.finditer

      The following example loads the file into memory only one line at a
      time, so it should suit your purposes:
      >>> data = file("important.dat", "w")
      >>> data.write("this\nis\nimportant\ndata")
      >>> data.close()

      now read it....
      >>> import re
      >>> data = file("important.dat", "r")
      >>> line = data.readline()
      >>> while line:
      ...     for x in re.finditer("\w+", line):
      ...         print x.group()
      ...     line = data.readline()
      ...
      this
      is
      important
      data


      --
      Daniel Bickett
      dbickett at gmail.com


      • Erick

        #4
        Re: streaming a file object through re.finditer

        True, but it doesn't work with multiline regular expressions :(
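To make the limitation concrete, here is a minimal sketch in Python 3 (the sample text and pattern are made up for illustration, not taken from the thread). A pattern that spans a newline never matches when each line is searched on its own, because no single line contains both halves:

```python
import re

# Hypothetical sample data: the pattern below needs to see two lines at once.
text = "start\nmiddle\nend\n"
pattern = re.compile(r"start\nmiddle")

# Searching one line at a time: each line ends at its newline,
# so the text after the "\n" is never visible to the regex.
per_line = [
    m.group()
    for line in text.splitlines(keepends=True)
    for m in pattern.finditer(line)
]

# Searching the whole text at once finds the cross-line match.
whole = [m.group() for m in pattern.finditer(text)]

print(per_line)  # []
print(whole)     # ['start\nmiddle']
```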

        -e


        • Erik Johnson

          #5
          Re: streaming a file object through re.finditer


          Is it not possible to wrap your loop below within a loop doing
          file.read([size]) (or readline() or readlines([size])), reading
          the file a chunk at a time and then running your re on a per-chunk
          basis?
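One way the chunked idea could look, as a rough Python 3 sketch rather than a definitive implementation. The function name `finditer_chunked` and the parameters `chunk_size` and `max_match_len` are hypothetical. It assumes that no match is longer than `max_match_len` characters, that matches are non-empty, and that the pattern does not depend on lookaround or anchors near a chunk boundary. Matches that start too close to the end of the buffer are deferred until more data has been read, so a match split across two chunks is still found whole:

```python
import io
import re


def finditer_chunked(pattern, stream, chunk_size=64 * 1024, max_match_len=1024):
    """Yield matched strings from a text stream, reading it chunk by chunk.

    Assumptions (hypothetical, not from the thread): every match is non-empty
    and at most max_match_len characters long, and the pattern does not rely
    on lookaround or anchors near a chunk boundary.
    """
    regex = re.compile(pattern)
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        # A match starting at or after `hold` could still change once more
        # data arrives, so it is deferred to the next round (unless at EOF).
        hold = len(buf) if at_eof else len(buf) - max_match_len
        pos = 0
        for m in regex.finditer(buf):
            if m.start() >= hold:
                break
            yield m.group()
            pos = m.end()
        if at_eof:
            return
        # Keep only the unprocessed tail; everything before it is done.
        buf = buf[max(pos, hold, 0):]


# Tiny chunks force matches to straddle chunk boundaries.
sample = io.StringIO("a\nb\nc\n")
print(list(finditer_chunked(r"\w+", sample, chunk_size=2, max_match_len=4)))
```

Memory use is bounded by roughly `chunk_size + max_match_len` characters, at the cost of having to know an upper bound on the match length in advance.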

          -ej


          "Erick" <idadesub@gmail.com> wrote in message
          news:1107396614.888869.94640@f14g2000cwb.googlegroups.com...
          > Ack, typo. What I meant was this:
          > cat a b c > blah
          >
          > >>> import re
          > >>> for m in re.finditer('\w+', file('blah')):
          > ...     print m.group()
          > ...
          > Traceback (most recent call last):
          > File "<stdin>", line 1, in ?
          > TypeError: buffer object expected
          >
          > Of course, this works fine, but it loads the file completely into
          > memory (right?):
          > >>> for m in re.finditer('\w+', file('blah').read()):
          > ...     print m.group()
          > ...
          > a
          > b
          > c



          • Daniel Bickett

            #6
            Re: streaming a file object through re.finditer

            Erick wrote:
            > True, but it doesn't work with multiline regular expressions :(

            If your intent is for the expression to traverse multiple lines (and
            possibly match *across* multiple lines), then, as far as I know, you
            have no choice but to load the whole file into memory.

            --
            Daniel Bickett
            dbickett at gmail.com


            • Steven Bethard

              #7
              Re: streaming a file object through re.finditer

              Erick wrote:
              > Hello,
              >
              > I've been looking for a while for an answer, but so far I haven't been
              > able to turn anything up yet. Basically, what I'd like to do is to use
              > re.finditer to search a large file (or a file stream), but I haven't
              > figured out how to get finditer to work without loading the entire file
              > into memory, or just reading one line at a time (or more complicated
              > buffering).

              Can you use mmap?


              "You can use mmap objects in most places where strings are expected; for
              example, you can use the re module to search through a memory-mapped file."

              Seems applicable, and it should keep your memory use down, but I'm not
              very experienced with it...
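A minimal sketch of the mmap approach in Python 3, under a few assumptions: the temporary file is a made-up stand-in for the poster's "blah", and the pattern has to be a bytes pattern because a memory-mapped file exposes bytes rather than text. The operating system pages the file in on demand, so the whole file does not need to be read into a Python string first:

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample file standing in for the poster's "blah".
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\nb\nc\n")
    path = tmp.name

matches = []
with open(path, "rb") as f:
    # Length 0 maps the whole file; pages are loaded lazily by the OS.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The mapped file holds bytes, so a bytes pattern (rb"...") is needed.
        for m in re.finditer(rb"\w+", mm):
            matches.append(m.group())

os.remove(path)
print(matches)  # [b'a', b'b', b'c']
```

Because the regex sees the file as one contiguous buffer, patterns that span several lines work the same way they would on a fully loaded string. An empty file cannot be mapped with length 0 on every platform, so real code would want to check the file size first.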

              Steve
