My first Python program -- a lexer

  • Thomas Mlynarczyk

    My first Python program -- a lexer

    Hello,

    I started to write a lexer in Python -- my first attempt to do something
    useful with Python (rather than trying out snippets from tutorials). It
    is not complete yet, but I would like some feedback -- I'm a Python
    newbie and it seems that, with Python, there is always a simpler and
    better way to do it than you think.

    ### Begin ###

import re

class Lexer(object):

    def __init__( self, source, tokens ):
        # Normalize all line endings to "\n" before tokenizing.
        self.source = re.sub( r"\r?\n|\r", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line   = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                # No token pattern matched at the current offset.
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

    # Test Example

    source = r"""
    Name: "Thomas", # just a comment
    Age: 37
    """

    tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER' : r'[+-]?\d+',
    'T_STRING' : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR' : r'[=:,;]',
    'T_NEWLINE' : r'\n',
    'T_LWSP' : r'[ \t]+',
    'T_COMMENT' : r'(?:\#|//).*$' }

    print Lexer( source, tokens )

    ### End ###


    Greetings,
    Thomas

    --
    Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
    (Coluche)
  • Arnaud Delobelle

    #2
    Re: My first Python program -- a lexer

Thomas Mlynarczyk <thomas@mlynarczyk-webdesign.de> writes:

> Hello,
>
> I started to write a lexer in Python -- my first attempt to do
> something useful with Python (rather than trying out snippets from
> tutorials). It is not complete yet, but I would like some feedback --
> I'm a Python newbie and it seems that, with Python, there is always a
> simpler and better way to do it than you think.

Hi,

    Adding to John's comments, I wouldn't have source as a member of the
    Lexer object but as an argument of the tokenise() method (which I would
    make public). The tokenise method would return what you currently call
    self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
# Later:
>>> mylexer.tokenise(another_source)
    --
    Arnaud

    • Thomas Mlynarczyk

      #3
      Re: My first Python program -- a lexer

      Arnaud Delobelle schrieb:
> Adding to John's comments, I wouldn't have source as a member of the
> Lexer object but as an argument of the tokenise() method (which I would
> make public). The tokenise method would return what you currently call
> self.result. So it would be used like this.
>
> >>> mylexer = Lexer(tokens)
> >>> mylexer.tokenise(source)
> >>> mylexer.tokenise(another_source)

      At a later stage, I intend to have the source tokenised not all at once,
      but token by token, "just in time" when the parser (yet to be written)
      accesses the next token:

      token = mylexer.next( 'FOO_TOKEN' )
      if not token: raise Exception( 'FOO token expected.' )
      # continue doing something useful with token

      Where next() would return the next token (and advance an internal
      pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
      way, the total number of regex matchings would be reduced: Only that
      which is expected is "tried out".

      But otherwise, upon reflection, I think you are right and it would
      indeed be more appropriate to do as you suggest.

      Thanks for your feedback.

      Greetings,
      Thomas

      --
      Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
      (Coluche)

      • Paul McGuire

        #4
        Re: My first Python program -- a lexer

On Nov 9, 8:34 pm, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
> On Sun, 09 Nov 2008 23:33:30 +0100, Thomas Mlynarczyk
> <tho...@mlynarczyk-webdesign.de> declaimed the following in
> comp.lang.python:
>
> > Of course. For the actual message I would use at least the line number.
> > Still, the offset could be used to compute line/column in case of an
> > error, so I wouldn't really need to store line/column with each token,
> > but only the offset. And provide a method to "convert" offset values
> > into line/column tuples.
>
>         Are you forcing the use of fixed length lines then?
>
>         Otherwise, by what algorithm will you convert:
>
> >>> data = """
> ... one
> ... two
> ... three
> ... four
> ... five
> ... supercalifragilisticexpialidocious
> ... seven
> ... eight
> ... nine"""
> >>> ix = data.index("list")
> >>> ix
> 39

loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

        prints 5,14

        I'm sure it's non-optimal, but it *is* an algorithm that does not
        require keeping track of the start of every line...

        -- Paul

        • Robert Lehmann

          #5
          Re: My first Python program -- a lexer

          On Sun, 09 Nov 2008 15:53:01 +0100, Thomas Mlynarczyk wrote:
> Arnaud Delobelle schrieb:
>
> > Adding to John's comments, I wouldn't have source as a member of the
> > Lexer object but as an argument of the tokenise() method (which I would
> > make public). The tokenise method would return what you currently call
> > self.result. So it would be used like this.
> >
> > >>> mylexer = Lexer(tokens)
> > >>> mylexer.tokenise(source)
> > >>> mylexer.tokenise(another_source)
>
> At a later stage, I intend to have the source tokenised not all at once,
> but token by token, "just in time" when the parser (yet to be written)
> accesses the next token:

You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by
replacing ``self.result.append`` with `yield`. It gives you the
just-in-time part for free while not picking your algorithm apart into
tiny unrelated pieces.

> token = mylexer.next( 'FOO_TOKEN' )
> if not token: raise Exception( 'FOO token expected.' )
> # continue doing something useful with token
>
> Where next() would return the next token (and advance an internal
> pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
> way, the total number of regex matchings would be reduced: Only that
> which is expected is "tried out".

Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

          See http://www.python.org/dev/peps/pep-0342/ for details.
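
For illustration, the `yield` version could look roughly like this -- an
untested sketch, adapted from the `_tokenize` in your original post and
assuming the tokens are passed in as a list of (name, pattern) pairs, as
suggested elsewhere in the thread:

import re

class Lexer(object):
    def __init__( self, tokens ):
        # compile the (name, pattern) pairs once, up front
        self.tokens = [ ( name, re.compile( regex, re.M ) )
                        for name, regex in tokens ]

    def tokenize( self, source ):
        offset, line = 0, 1
        while offset < len( source ):
            for name, regex in self.tokens:
                match = regex.match( source, offset )
                if not match: continue
                offset += len( match.group(0) )
                yield ( name, match, line )  # hand out one token at a time
                line += match.group(0).count( "\n" )
                break
            else:
                raise Exception( 'Syntax error in source at offset %s' % offset )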

          HTH,

          --
          Robert "Stargaming " Lehmann

          • Thomas Mlynarczyk

            #6
            Re: My first Python program -- a lexer

            Robert Lehmann schrieb:
> You don't have to introduce a `next` method to your Lexer class. You
> could just transform your `tokenize` method into a generator by
> replacing ``self.result.append`` with `yield`. It gives you the
> just-in-time part for free while not picking your algorithm apart into
> tiny unrelated pieces.
>
> Python generators recently (2.5) grew a `send` method. You could use
> `next` for unconditional tokenization and ``mytokenizer.send("expected
> token")`` whenever you expect a special token.

I will try this. Thank you for the suggestion.

            Greetings,
            Thomas

            --
            Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
            (Coluche)

            • Thomas Mlynarczyk

              #7
              Re: My first Python program -- a lexer

              Paul McGuire schrieb:
> loc = data.index("list")
> print data[:loc].count("\n")-1
> print loc-data[:loc].rindex("\n")-1
>
> prints 5,14
>
> I'm sure it's non-optimal, but it *is* an algorithm that does not
> require keeping track of the start of every line...

              Yes, I was thinking of something like this. As long as the line/column
              are only needed in case of an error (i.e. at most once per script run),
              I consider this more performant than keeping track of line/column for
              every token.

              Greetings,
              Thomas

              --
              Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
              (Coluche)

              • Paul McGuire

                #8
                Re: My first Python program -- a lexer

On Nov 10, 7:29 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
> Paul McGuire schrieb:
>
> > loc = data.index("list")
> > print data[:loc].count("\n")-1
> > print loc-data[:loc].rindex("\n")-1
> >
> > prints 5,14
> >
> > I'm sure it's non-optimal, but it *is* an algorithm that does not
> > require keeping track of the start of every line...
>
> Yes, I was thinking of something like this. As long as the line/column
> are only needed in case of an error (i.e. at most once per script run),
> I consider this more performant than keeping track of line/column for
> every token.

                Just be sure to account for tabs when computing the column, which this
                simple-minded algorithm does not do.
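
One way to handle that, assuming 8-column tab stops (illustrative,
untested):

start = data[:loc].rindex("\n") + 1        # start of the offending line
col = len(data[start:loc].expandtabs(8))   # expand tabs, then count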

                -- Paul

                  • Thomas Mlynarczyk

                    #10
                    Re: My first Python program -- a lexer

                    Paul McGuire schrieb:
> Just be sure to account for tabs when computing the column, which this
> simple-minded algorithm does not do.

                    Another thing I had not thought of -- thanks for the hint.

                    Greetings,
                    Thomas

                    --
                    Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                    (Coluche)

                    • Thomas Mlynarczyk

                      #11
                      Re: My first Python program -- a lexer

                      John Machin schrieb:
> Single-character tokens like "<" may be more efficiently handled by
> doing a dict lookup after failing to find a match in the list of
> (name, regex) tuples.

                      Yes, I will keep that in mind. For the time being, I will use only
                      regexes to keep the code simpler. Later, or when the need for a speedup
                      arises, I can use your suggestion for optimizing my code. Depending on
                      the tokens, it might even be better to use the dict lookup right away
                      and the regex as a secondary means for more complex stuff.
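
Something along these lines, I imagine -- just a sketch, with made-up
token names:

# cheap dict lookup for one-character tokens, tried before the regexes
SINGLE_CHAR_TOKENS = { '<': 'T_LT', '>': 'T_GT', ',': 'T_COMMA', ':': 'T_COLON' }

def match_single_char( source, offset ):
    name = SINGLE_CHAR_TOKENS.get( source[offset] )
    if name is None:
        return None  # fall back to the (name, regex) list
    return ( name, source[offset] )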

[Mutually exclusive tokens = no ambiguities = same input -> same output]

> So what? That is useless knowledge.

For the lexer, perhaps. Not for the user. An ambiguous lexer will be of
no use.

> It is the ambiguous cases that you need to be concerned with.

Exactly. In which way does that contradict what I said?

                      [Using dict]
> No, not at all. The point is that you were not *using* any of the
> mapping functionality of the dict object, only ancillary methods like
> iteritems -- hence, you should not have been using a dict at all.

                      I /could/ have done it with a list of tuples. I use no functionality
                      that /only/ a dict can do. So using a dict here is like using a truck
                      for transporting a single sheet of paper?

                      Greetings,
                      Thomas

                      --
                      Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                      (Coluche)

                      • John Machin

                        #12
                        Re: My first Python program -- a lexer

On Nov 11, 8:35 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
> [Using dict]
>
> > No, not at all. The point is that you were not *using* any of the
> > mapping functionality of the dict object, only ancillary methods like
> > iteritems -- hence, you should not have been using a dict at all.
>
> I /could/ have done it with a list of tuples. I use no functionality
> that /only/ a dict can do. So using a dict here is like using a truck
> for transporting a single sheet of paper?

                        You are getting closer. A better analogy is that using a dict is like
                        transporting passengers along an autobahn in an aeroplane or
                        helicopter that never leaves the ground.

                        • Thomas Mlynarczyk

                          #13
                          Re: My first Python program -- a lexer

                          John Machin schrieb:
> You are getting closer. A better analogy is that using a dict is like
> transporting passengers along an autobahn in an aeroplane or
> helicopter that never leaves the ground.

                          It is not a bad idea to transport passengers in an airplane, but then
                          the airplane should not follow an autobahn, but use the shortest way --
                          at an appropriate altitude. Translated back to Python dicts, this would
                          mean that using a dict for my purposes is a good idea, but that I do not
                          make use of its full capabilities -- in other words, I should rewrite my
                          code -- still using dict but in a better way. Or do you mean that for 10
                          kilometers of autobahn, an airplane would be overkill?

                          Maybe I am a bit biased by my PHP background, but { name: regex, ... }
                          looks simpler to me than [ ( name, regex ), ... ], because the former is
                          not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.

                          Greetings,
                          Thomas

                          --
                          Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                          (Coluche)

                          • Steve Holden

                            #14
                            Re: My first Python program -- a lexer

                            Thomas Mlynarczyk wrote:
> John Machin schrieb:
>
> > You are getting closer. A better analogy is that using a dict is like
> > transporting passengers along an autobahn in an aeroplane or
> > helicopter that never leaves the ground.
>
> It is not a bad idea to transport passengers in an airplane, but then
> the airplane should not follow an autobahn, but use the shortest way --
> at an appropriate altitude. Translated back to Python dicts, this would
> mean that using a dict for my purposes is a good idea, but that I do not
> make use of its full capabilities -- in other words, I should rewrite my
> code -- still using dict but in a better way. Or do you mean that for 10
> kilometers of autobahn, an airplane would be overkill?

The latter.

> Maybe I am a bit biased by my PHP background, but { name: regex, ... }
> looks simpler to me than [ ( name, regex ), ... ], because the former is
> not a nested structure, while the latter would be a 2D-array in PHP.

                            But it's a question of the properties of the objects. Because lists are
                            ordered it's appropriate to use them when you want to iterate over them
                            and no random access is required.

                            If you need random access by key and no particular ordering is required
                            for the iteration then you should use a dict.

                            Forget "simplicity " until you know how the different objects are
                            implemented under the hood.

                            A good first rule is that simple, readable Python will usually be
                            efficient. As far as execution efficiency goes I'd be very surprised if
                            the dict solution was faster, but even if it were I would still prefer
                            the list-based solution because it's more appropriate to the problem.

                            "Premature optimization is the root of all evil" ...
> Suppose I use the dict and I want to access the regex associated with
> the token named "tokenname" (that is, no iteration, but a single
> access). I could simply write tokendict["tokenname"]. But with the list
> of tuples, I can't think of an equally easy way to do that. But then, as
> a beginner, I might be underestimating Python.

                            But when do you want to do that? There's no point inventing use cases -
                            they should be demonstrated needs.

                            The only time your original code needs to use the token name to retrieve
                            the value is for your compile() method, but assuming the passed-in
                            tokens was of the form [(name, pattern), ...] you could just as easily
                            throw the method away and in __init__() write

self.tokens = [(name, re.compile(pat)) for name, pat in tokens]

                            Or simply pass compiled token patterns in in the first place when they
                            are necessary ... then the caller has the option of not bothering to
                            optimize in the first place!

                            The advantage of the list-based approach is that it at least allows of
                            the possibility that you can observe the relative frequencies of the
                            tokens' appearance in real code, and then optimize the ordering to give
                            best (mean) performance (by putting the commonest token first) should
                            tokenization become a program hotspot.

                            With a dict you have no such opportunity, because the ordering is
                            determined by the implementation and not by your data structure.
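
For instance (a hypothetical sketch -- the counts would have to come
from profiling real input, and the names are from the example above):

# put the most frequently matched token patterns at the front of the list
observed = { 'T_LWSP': 5120, 'T_IDENTIFIER': 3044, 'T_OPERATOR': 1201 }
tokens.sort( key=lambda pair: observed.get( pair[0], 0 ), reverse=True )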

                            regards
                            Steve
                            --
                            Steve Holden +1 571 484 6266 +1 800 494 3119
                            Holden Web LLC http://www.holdenweb.com/

                            • Thomas Mlynarczyk

                              #15
                              Re: My first Python program -- a lexer

                              Steve Holden schrieb:
> > Suppose I use the dict and I want to access the regex associated with
> > the token named "tokenname" (that is, no iteration, but a single
> > access). I could simply write tokendict["tokenname"]. But with the list
> > of tuples, I can't think of an equally easy way to do that. But then,
> > as a beginner, I might be underestimating Python.
>
> But when do you want to do that? There's no point inventing use cases -
> they should be demonstrated needs.

                              Well, I had been thinking about further reducing the number of regex
                              matchings needed. So I wanted to modify my lexer not to tokenize the
                              whole input at once, but only try to grab the next token from the input
                              "just in time" / on demand. For that I was thinking of having a next()
                              method like this:

def next( self, nameOfExpectedToken ):
    regex = self.getRegexByTokenName( nameOfExpectedToken )
    match = regex.match( self.source, self.offset )
    if not match: return False
    line = self.line
    self.line += match.group(0).count( "\n" )
    self.offset += len( match.group(0) )
    return ( nameOfExpectedToken, match, line )

                              I'm not sure if this is a good idea, but it looks like one to me. The
                              problem is the first line of the method which retrieves the regex
                              associated with the given token name. Using a dict, I could simply write

regex = self.tokendict[nameOfExpectedToken]

But with a list I suppose I wouldn't get away without a loop, which I
assume is more expensive than the dict lookup.
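
(Although I suppose I could keep the ordered list and also build a dict
from it once in __init__ -- a sketch:

self.tokenlist = [ ( name, re.compile( pat, re.M ) ) for name, pat in tokens ]
self.tokendict = dict( self.tokenlist )  # name -> compiled regex

Then next() could use the dict and tokenise() the list.)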
> Or simply pass compiled token patterns in in the first place when they
> are necessary ... then the caller has the option of not bothering to
> optimize in the first place!

                              That would be an option. But shouldn't it be the lexer who takes care of
                              optimizing its own work as much as it can do without the caller's
                              assistance? After all, the caller should not need to know about the
                              internal workings of the lexer.

[Optimizing performance by putting most frequent tokens first]

> With a dict you have no such opportunity, because the ordering is
> determined by the implementation and not by your data structure.

                              True. Still, I should be able to gain even better performance with my
                              above approach using a next() function, as this would completely
                              eliminate all "useless" matching (like trying to match FOO where no foo
                              is allowed).

                              Greetings,
                              Thomas

                              --
                              Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                              (Coluche)
