streaming a file object through re.finditer

**Erick** · Jul 18 '05, 09:01 PM

Re: streaming a file object through re.finditer

Ack, typo. What I meant was this:
cat a b c > blah

>>> import re
>>> for m in re.finditer('\w+', file('blah')):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):

>>> for m in re.finditer('\w+', file('blah').read()):
...     print m.group()
...
a
b
c

**Daniel Bickett** · Jul 18 '05, 09:01 PM

The following example loads the file into memory only one line at a
time, so it should suit your purposes:

>>> data = file( "important.dat" , "w" )
>>> data.write("this\nis\nimportant\ndata")
>>> data.close()

Now read it:

>>> import re
>>> data = file( "important.dat" , "r" )
>>> line = data.readline()
>>> while line:
...     for x in re.finditer( "\w+" , line):
...         print x.group()
...     line = data.readline()
...
this
is
important
data
>>>

--
Daniel Bickett
dbickett at gmail.com

http://heureusement.org/
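
For readers on Python 3, where `file()` no longer exists, the same line-at-a-time idea can be sketched as a generator. This is only an illustration: the helper name `iter_line_matches` and the in-memory `io.StringIO` sample standing in for a real file are assumptions made for the example, not part of the original post.

```python
import io
import re


def iter_line_matches(pattern, lines):
    """Yield every match of pattern, scanning one line at a time.

    `lines` is any iterable of strings, such as an open text file.
    Hypothetical helper, named here only for illustration.
    """
    regex = re.compile(pattern)
    for line in lines:  # a file object yields one line per iteration
        for m in regex.finditer(line):
            yield m.group()


# io.StringIO stands in for open("important.dat") in this sketch.
sample = io.StringIO("this\nis\nimportant\ndata")
line_matches = list(iter_line_matches(r"\w+", sample))
print(line_matches)
```

Only one line is held in memory at a time, but a match can never extend past the end of a line.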

**Erick** · Jul 18 '05, 09:01 PM

True, but it doesn't work with multiline regular expressions :(

-e

**Erik Johnson** · Jul 18 '05, 09:01 PM

Is it not possible to wrap your loop below within a loop doing
file.read([size]) (or readline() or readlines([size])),
reading the file a chunk at a time and then running your re on a
per-chunk basis?

-ej
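
One way the chunk-at-a-time idea might look in Python 3 is sketched below. The helper name `finditer_chunked` and its `chunk_size` parameter are made up for illustration. The sketch assumes a text-mode file and a simple pattern that never matches the empty string and uses no lookahead or anchors. Any match that touches the end of the buffer is held back, because it might continue in the next chunk.

```python
import io
import re


def finditer_chunked(pattern, fileobj, chunk_size=64 * 1024):
    """Yield matched strings while reading fileobj in fixed-size chunks.

    Hypothetical helper: a sketch for simple patterns, not a general
    replacement for re.finditer.
    """
    regex = re.compile(pattern)
    buf = ""
    while True:
        chunk = fileobj.read(chunk_size)
        at_eof = not chunk  # read() returns "" at end of file
        buf += chunk
        consumed = 0
        for m in regex.finditer(buf):
            # A match ending exactly at the buffer end may be incomplete,
            # so wait for more data unless the file is exhausted.
            if m.end() == len(buf) and not at_eof:
                break
            yield m.group()
            consumed = m.end()
        # Keep everything after the last yielded match for the next pass.
        buf = buf[consumed:]
        if at_eof:
            break


sample = io.StringIO("a b c\nhello world")
chunked_matches = list(finditer_chunked(r"\w+", sample, chunk_size=3))
print(chunked_matches)
```

The buffer keeps growing while no match completes, so memory use stays small only when matches occur regularly in the input.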

"Erick" <idadesub@gmail.com> wrote in message
news:1107396614.888869.94640@f14g2000cwb.googlegroups.com...
> Ack, typo. What I meant was this:
> cat a b c > blah
>
> >>> import re
> >>> for m in re.finditer('\w+', file('blah')):
> ...     print m.group()
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: buffer object expected
>
> Of course, this works fine, but it loads the file completely into
> memory (right?):
> >>> for m in re.finditer('\w+', file('blah').read()):
> ...     print m.group()
> ...
> a
> b
> c

**Daniel Bickett** · Jul 18 '05, 09:01 PM

Erick wrote:
> True, but it doesn't work with multiline regular expressions :(

If your intent is for the expression to traverse multiple lines (and
possibly match *across* multiple lines), then, as far as I know, you
have no choice but to load the whole file into memory.

--
Daniel Bickett
dbickett at gmail.com

http://heureusement.org/
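
A small Python 3 sketch of the limitation under discussion: a pattern containing a newline finds nothing when fed one line at a time, but matches once the whole text is available. The sample text and the pattern are chosen only for illustration.

```python
import io
import re

text = "this\nis\nimportant\ndata"
pattern = re.compile(r"is\nimportant")  # needs to see two adjacent lines

# Line by line: each string ends at its newline, so the pattern never
# sees "is" and "important" together.
per_line = [m.group() for line in io.StringIO(text) for m in pattern.finditer(line)]

# Whole text in memory: the match across the line break is found.
whole = [m.group() for m in pattern.finditer(text)]

print(per_line)
print(whole)
```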

**Steven Bethard** · Jul 18 '05, 09:01 PM

Erick wrote:
> Hello,
>
> I've been looking for a while for an answer, but so far I haven't been
> able to turn anything up yet. Basically, what I'd like to do is to use
> re.finditer to search a large file (or a file stream), but I haven't
> figured out how to get finditer to work without loading the entire file
> into memory, or just reading one line at a time (or more complicated
> buffering).

Can you use mmap?

http://docs.python.org/lib/module-mmap.html

"You can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped file."

Seems applicable, and it should keep your memory use down, but I'm not
very experienced with it...

Steve
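
A minimal Python 3 sketch of the mmap suggestion follows. In Python 3 a memory-mapped file is a bytes-like object, so the pattern must be a bytes pattern and the matches come back as bytes. The helper name `iter_matches_mmap` and the temporary file named `blah` are assumptions made for the example.

```python
import mmap
import os
import re
import tempfile


def iter_matches_mmap(pattern, path):
    """Yield each match of a bytes pattern in the file at path.

    Hypothetical helper: the operating system pages the file in on
    demand, so the program does not read it into a string first.
    """
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in regex.finditer(mm):
                yield m.group()  # a bytes copy of the matched span


# Build a small sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blah")
    with open(path, "wb") as f:
        f.write(b"a\nb\nc\n")
    words = list(iter_matches_mmap(rb"\w+", path))

print(words)
```

Because the regex sees the mapping as one contiguous buffer, patterns that span several lines work without any manual chunking.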
