Looking for very simple general purpose tokenizer

  • Maarten van Reeuwijk

    Looking for very simple general purpose tokenizer

    Hi group,

    I need to parse various text files in python. I was wondering if there was
    a general purpose tokenizer available. I know about split(), but this
    (otherwise very handy) method does not allow me to specify a list of
    splitting characters, only one at a time, and it removes my splitting
    operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore, I
    tried tokenize, but that is specifically for Python source and is way too
    heavy for me. I am looking for something like this:


    splitchars = [' ', '\n', '=', '/', ....]
    tokenlist = tokenize(rawfile, splitchars)

    Is there something like this available inside Python, or did anyone
    already make this? Thank you in advance.

    Maarten
    --
    =======================================================================
    Maarten van Reeuwijk      Heat and Fluid Sciences
    PhD student               dept. of Multiscale Physics
    www.ws.tn.tudelft.nl      Delft University of Technology
  • Eric Brunel

    #2
    Re: Looking for very simple general purpose tokenizer

    Maarten van Reeuwijk wrote:
    > Hi group,
    >
    > I need to parse various text files in python. I was wondering if there
    > was a general purpose tokenizer available. I know about split(), but this
    > (otherwise very handy) method does not allow me to specify a list of
    > splitting characters, only one at a time, and it removes my splitting
    > operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore,
    > I tried tokenize, but that is specifically for Python source and is way
    > too heavy for me. I am looking for something like this:
    >
    >
    > splitchars = [' ', '\n', '=', '/', ....]
    > tokenlist = tokenize(rawfile, splitchars)
    >
    > Is there something like this available inside Python, or did anyone
    > already make this? Thank you in advance.

    You may use re.findall for that:
    >>> import re
    >>> s = "a = b+c; z = 34;"
    >>> pat = " |=|;|[^ =;]*"
    >>> re.findall(pat, s)
    ['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';', '']

    The pattern basically says: match either a space, a '=', a ';', or a
    sequence of any characters that are not space, '=' or ';'. You may have to
    take care beforehand of special characters like \n or \ (which are very
    special in regular expressions).
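
    For instance, a small sketch along the same lines (hypothetical, just to
    illustrate the idea) that builds the pattern from a list of splitting
    characters and escapes them automatically:

    # Sketch: derive the findall() pattern from a list of split characters.
    # re.escape() takes care of characters that are special in regexes.
    import re

    splitchars = [' ', '\n', '=', '/']
    escaped = ''.join([re.escape(c) for c in splitchars])
    pat = "[%s]|[^%s]+" % (escaped, escaped)

    print re.findall(pat, "a = b+c\nz = 34")
    # -> ['a', ' ', '=', ' ', 'b+c', '\n', 'z', ' ', '=', ' ', '34']

    Using '+' instead of '*' in the second alternative also avoids the empty
    string at the end of the result list.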

    HTH
    --
    - Eric Brunel <eric dot brunel at pragmadev dot com> -
    PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com


    • Paul McGuire

      #3
      Re: Looking for very simple general purpose tokenizer

      "Maarten van Reeuwijk" <maarten@remove _this_ws.tn.tud elft.nl> wrote in
      message news:bug9ij$30k $1@news.tudelft .nl...[color=blue]
      > Hi group,
      >
      > I need to parse various text files in python. I was wondering if there was[/color]
      a[color=blue]
      > general purpose tokenizer available. I know about split(), but this
      > (otherwise very handy method does not allow me to specify a list of
      > splitting characters, only one at the time and it removes my splitting
      > operators (OK for spaces and \n's but not for =, / etc. Furthermore I[/color]
      tried[color=blue]
      > tokenize but this specifically for Python and is way too heavy for me. I[/color]
      am[color=blue]
      > looking for something like this:
      >
      >
      > splitchars = [' ', '\n', '=', '/', ....]
      > tokenlist = tokenize(rawfil e, splitchars)
      >
      > Is there something like this available inside Python or did anyone already
      > make this? Thank you in advance
      >
      > Maarten
      > --
      > =============== =============== =============== =============== =======
      > Maarten van Reeuwijk Heat and Fluid Sciences
      > Phd student dept. of Multiscale Physics
      > www.ws.tn.tudelft.nl Delft University of Technology[/color]
      Maarten -
      Please give my pyparsing module a try. You can download it from
      SourceForge at http://pyparsing.sourceforge.net. I wrote it for just this
      purpose: it allows you to define your own parsing patterns for any text
      data file, and the tokenized results are returned in a dictionary or a
      list, as you prefer. The download also includes several examples - one
      especially difficult file-parsing solution is shown in the dictExample.py
      script. And if you get stuck, send me a sample of what you are trying to
      parse, and I can try to give you some pointers (or even tell you if
      pyparsing isn't the most appropriate tool for your job - it happens
      sometimes!).
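
      Just to give a flavour of the style (a made-up sketch, not one of the
      examples shipped with pyparsing; the grammar below is invented purely
      for illustration):

      # Hypothetical illustration of a pyparsing-style tokenizer; the grammar
      # is made up for this post, not part of the pyparsing distribution.
      from pyparsing import Word, alphanums, oneOf, OneOrMore

      token = Word(alphanums + "._") | oneOf("= / + - ( ) ; ,")
      tokenizer = OneOrMore(token)

      print tokenizer.parseString("Lz = 0.15 / nu")
      # -> ['Lz', '=', '0.15', '/', 'nu']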

      -- Paul McGuire

      Austin, Texas, USA



      • Alan Kennedy

        #4
        Re: Looking for very simple general purpose tokenizer

        Maarten van Reeuwijk wrote:
        > I need to parse various text files in python. I was wondering if
        > there was a general purpose tokenizer available.

        Indeed there is: python comes with batteries included. Try the shlex
        module.

        http://www.python.org/


        Try the following code: it seems to do what you want. If it doesn't,
        then please be more specific about your tokenisation rules.

        #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
        splitchars = [' ', '\n', '=', '/',]

        source = """
        thisshouldcome inthree parts
        thisshould comeintwo
        andso/shouldthis
        and=this
        """

        import shlex
        import StringIO

        def prepareToker(toker, splitters):
            for s in splitters: # resists People's Front of Judea joke ;-D
                if toker.whitespace.find(s) == -1:
                    toker.whitespace = "%s%s" % (s, toker.whitespace)
            return toker

        buf = StringIO.StringIO(source)
        toker = shlex.shlex(buf)
        toker = prepareToker(toker, splitchars)
        for num, tok in enumerate(toker):
            print "%s:%s" % (num, tok)
        #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

        Note that the use of the iteration-based interface in the above code
        requires Python 2.3. If you need it to run on earlier versions, please
        say which one.
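
        For what it's worth, on older versions the same loop can be written
        with the explicit get_token() call (a minimal sketch; get_token()
        returns an empty string at end of input):

        # Equivalent loop without the 2.3 iterator protocol or enumerate().
        num = 0
        tok = toker.get_token()
        while tok:
            print "%s:%s" % (num, tok)
            num = num + 1
            tok = toker.get_token()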

        regards,

        --
        alan kennedy
        ------------------------------------------------------
        check http headers here: http://xhaus.com/headers
        email alan: http://xhaus.com/contact/alan


        • Maarten van Reeuwijk

          #5
          Re: Looking for very simple general purpose tokenizer

          Thank you all for your very useful comments. Below I have included my
          source. Could you comment on whether there is a more elegant way of
          implementing the continuation character &?

          With the RE implementation I have noticed that the position of the
          '*' in spclist is very delicate. This order works, but other orders
          throw exceptions. Is this correct, or is it a bug? Lastly, is there
          more documentation and examples for the shlex module? Ideally I would
          like to see a full-scale example of how this module should be used to
          parse.

          Maarten

          import re
          import shlex
          import StringIO

          def splitf90(source):
              buf = StringIO.StringIO(source)
              toker = shlex.shlex(buf)
              toker.commenters = "!"
              toker.whitespace = " \t\r"
              return processTokens(toker)

          def splitf90_re(source):
              spclist = ['\*', '\+', '-', '/', '=', '\[', '\]', '\(', '\)' \
                         '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
              pat = '|'.join(spclist) + '|[^' + ''.join(spclist) + ']+'
              rawtokens = re.findall(pat, source)
              return processTokens(rawtokens)

          def processTokens(rawtokens):
              # substitute characters
              subst1 = []
              prevtoken = None
              for token in rawtokens:
                  if token == ';': token = '\n'
                  if token == ' ': token = ''
                  if token == '\n' and prevtoken == '&': token = ''
                  if not token == '':
                      subst1.append(token)
                  prevtoken = token

              # remove continuation chars
              subst2 = []
              for token in subst1:
                  if token == '&': token = ''
                  if not token == '':
                      subst2.append(token)

              # split into lines
              final = []
              curline = []
              for token in subst2:
                  if not token == '\n':
                      curline.append(token)
                  else:
                      if not curline == []:
                          final.append(curline)
                      curline = []

              return final

          # Example session
          src = """
          MODULE modsize
          implicit none

          integer, parameter:: &
          Nx = 256, &
          Ny = 256, &
          Nz = 256, &
          nt = 1, & ! nr of (passive) scalars
          Np = 16 ! nr of processors, should match mpirun -np .. command

          END MODULE
          """
          print splitf90(src)
          print splitf90_re(src)

          Output:
          [['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
          ':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
          ',', 'nt', '=', '1', ',', 'Np', '=', '16'], ['END', 'MODULE']]

          [['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
          ':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
          ',', 'nt', '=', '1', ',', '!', 'nr', 'of', '(', 'passive', 'scalars'],
          ['Np', '=', '16', '!', 'nr', 'of', 'processors', ',', 'should', 'match',
          'mpirun', '-', 'np', 'command'], ['END', 'MODULE']]

          --
          =======================================================================
          Maarten van Reeuwijk      Heat and Fluid Sciences
          PhD student               dept. of Multiscale Physics
          www.ws.tn.tudelft.nl      Delft University of Technology


          • Maarten van Reeuwijk

            #6
            Re: Looking for very simple general purpose tokenizer

            I found a complication with the shlex module. When I execute the
            following fragment you'll notice that doubles are split. Is there
            any way to avoid this splitting of numbers?


            source = """
            $NAMRUN
            Lz = 0.15
            nu = 1.08E-6
            """

            import shlex
            import StringIO

            buf = StringIO.StringIO(source)
            toker = shlex.shlex(buf)
            toker.commenters = ""
            toker.whitespace = " \t\r"
            print [tok for tok in toker]

            Output:
            ['\n', '$', 'NAMRUN', '\n', 'Lz', '=', '0', '.', '15', '\n', 'nu', '=', '1',
            '.', '08E', '-', '6', '\n']


            --
            =======================================================================
            Maarten van Reeuwijk      Heat and Fluid Sciences
            PhD student               dept. of Multiscale Physics
            www.ws.tn.tudelft.nl      Delft University of Technology


            • JanC

              #7
              Re: Looking for very simple general purpose tokenizer

              Maarten van Reeuwijk <maarten@remove_this_ws.tn.tudelft.nl> wrote:

              > I found a complication with the shlex module. When I execute the
              > following fragment you'll notice that doubles are split. Is
              > there any way to avoid this splitting of numbers?

              From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>:

              wordchars
                  The string of characters that will accumulate into
                  multi-character tokens. By default, includes all ASCII
                  alphanumerics and underscore.
              > source = """
              > $NAMRUN
              > Lz = 0.15
              > nu = 1.08E-6
              > """
              >
              > import shlex
              > import StringIO
              >
              > buf = StringIO.StringIO(source)
              > toker = shlex.shlex(buf)
              > toker.commenters = ""
              > toker.whitespace = " \t\r"

              toker.wordchars = toker.wordchars + ".-$" # etc.
              > print [tok for tok in toker]


              Output:

              ['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

              Is this what you want?
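
              For completeness, the whole adjusted fragment in one piece (the
              same code as above, just gathered together so it can be pasted
              and run as-is):

              import shlex
              import StringIO

              source = """
              $NAMRUN
              Lz = 0.15
              nu = 1.08E-6
              """

              buf = StringIO.StringIO(source)
              toker = shlex.shlex(buf)
              toker.commenters = ""
              toker.whitespace = " \t\r"
              # Let '.', '-' and '$' accumulate into word tokens instead of
              # being returned as single-character tokens.
              toker.wordchars = toker.wordchars + ".-$"
              print [tok for tok in toker]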

              --
              JanC

              "Be strict when sending and tolerant when receiving."
              RFC 1958 - Architectural Principles of the Internet - section 3.9
