My first Python program -- a lexer

  • Thomas Mlynarczyk

    My first Python program -- a lexer

    Hello,

    I started to write a lexer in Python -- my first attempt to do something
    useful with Python (rather than trying out snippets from tutorials). It
    is not complete yet, but I would like some feedback -- I'm a Python
    newbie and it seems that, with Python, there is always a simpler and
    better way to do it than you think.

    ### Begin ###

import re

class Lexer(object):

    def __init__( self, source, tokens ):
        # Normalize all line endings to "\n" before tokenizing.
        self.source = re.sub( r"\r?\n|\r", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line   = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                # No token pattern matched at the current offset.
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

    # Test Example

    source = r"""
    Name: "Thomas", # just a comment
    Age: 37
    """

    tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER' : r'[+-]?\d+',
    'T_STRING' : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR' : r'[=:,;]',
    'T_NEWLINE' : r'\n',
    'T_LWSP' : r'[ \t]+',
    'T_COMMENT' : r'(?:\#|//).*$' }

    print Lexer( source, tokens )

    ### End ###


    Greetings,
    Thomas

    --
    Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
    (Coluche)
  • Arnaud Delobelle

    #2
    Re: My first Python program -- a lexer

Thomas Mlynarczyk <thomas@mlynarczyk-webdesign.de> writes:

> Hello,
>
> I started to write a lexer in Python -- my first attempt to do
> something useful with Python (rather than trying out snippets from
> tutorials). It is not complete yet, but I would like some feedback --
> I'm a Python newbie and it seems that, with Python, there is always a
> simpler and better way to do it than you think.

Hi,

    Adding to John's comments, I wouldn't have source as a member of the
    Lexer object but as an argument of the tokenise() method (which I would
    make public). The tokenise method would return what you currently call
    self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
# Later:
>>> mylexer.tokenise(another_source)
    --
    Arnaud

    • Thomas Mlynarczyk

      #3
      Re: My first Python program -- a lexer

      Arnaud Delobelle schrieb:
> Adding to John's comments, I wouldn't have source as a member of the
> Lexer object but as an argument of the tokenise() method (which I would
> make public). The tokenise method would return what you currently call
> self.result. So it would be used like this.
>
> >>> mylexer = Lexer(tokens)
> >>> mylexer.tokenise(source)
> >>> mylexer.tokenise(another_source)

      At a later stage, I intend to have the source tokenised not all at once,
      but token by token, "just in time" when the parser (yet to be written)
      accesses the next token:

      token = mylexer.next( 'FOO_TOKEN' )
      if not token: raise Exception( 'FOO token expected.' )
      # continue doing something useful with token

      Where next() would return the next token (and advance an internal
      pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
      way, the total number of regex matchings would be reduced: Only that
      which is expected is "tried out".

      But otherwise, upon reflection, I think you are right and it would
      indeed be more appropriate to do as you suggest.

      Thanks for your feedback.

      Greetings,
      Thomas

      --
      Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
      (Coluche)

      • Paul McGuire

        #4
        Re: My first Python program -- a lexer

On Nov 9, 8:34 pm, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
> On Sun, 09 Nov 2008 23:33:30 +0100, Thomas Mlynarczyk
> <tho...@mlynarczyk-webdesign.de> declaimed the following in
> comp.lang.python:
>
> > Of course. For the actual message I would use at least the line number.
> > Still, the offset could be used to compute line/column in case of an
> > error, so I wouldn't really need to store line/column with each token,
> > but only the offset. And provide a method to "convert" offset values
> > into line/column tuples.
>
>         Are you forcing the use of fixed length lines then?
>
>         Otherwise, by what algorithm will you convert:
>
> >>> data = """
> ... one
> ... two
> ... three
> ... four
> ... five
> ... supercalifragilisticexpialidocious
> ... seven
> ... eight
> ... nine"""
> >>> ix = data.index("list")
> >>> ix
> 39

loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

        prints 5,14

        I'm sure it's non-optimal, but it *is* an algorithm that does not
        require keeping track of the start of every line...

        -- Paul

        • Robert Lehmann

          #5
          Re: My first Python program -- a lexer

          On Sun, 09 Nov 2008 15:53:01 +0100, Thomas Mlynarczyk wrote:
> Arnaud Delobelle schrieb:
>
> > Adding to John's comments, I wouldn't have source as a member of the
> > Lexer object but as an argument of the tokenise() method (which I would
> > make public). The tokenise method would return what you currently call
> > self.result. So it would be used like this.
> >
> > >>> mylexer = Lexer(tokens)
> > >>> mylexer.tokenise(source)
> > >>> mylexer.tokenise(another_source)
>
> At a later stage, I intend to have the source tokenised not all at once,
> but token by token, "just in time" when the parser (yet to be written)
> accesses the next token:

You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by
replacing ``self.result.append`` with `yield`. It gives you the
just-in-time part for free while not picking your algorithm apart into
tiny unrelated pieces.

> token = mylexer.next( 'FOO_TOKEN' )
> if not token: raise Exception( 'FOO token expected.' )
> # continue doing something useful with token
>
> Where next() would return the next token (and advance an internal
> pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
> way, the total number of regex matchings would be reduced: Only that
> which is expected is "tried out".

Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

          See http://www.python.org/dev/peps/pep-0342/ for details.
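
For illustration, the `yield` version could look roughly like this -- an
untested sketch, adapted from the `_tokenize` in your original post and
assuming the tokens are passed in as a list of (name, pattern) pairs, as
suggested elsewhere in the thread:

import re

class Lexer(object):
    def __init__( self, tokens ):
        # compile the (name, pattern) pairs once, up front
        self.tokens = [ ( name, re.compile( regex, re.M ) )
                        for name, regex in tokens ]

    def tokenize( self, source ):
        offset, line = 0, 1
        while offset < len( source ):
            for name, regex in self.tokens:
                match = regex.match( source, offset )
                if not match: continue
                offset += len( match.group(0) )
                yield ( name, match, line )  # hand out one token at a time
                line += match.group(0).count( "\n" )
                break
            else:
                raise Exception( 'Syntax error in source at offset %s' % offset )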

          HTH,

          --
          Robert "Stargaming " Lehmann

          • Thomas Mlynarczyk

            #6
            Re: My first Python program -- a lexer

            Robert Lehmann schrieb:
> You don't have to introduce a `next` method to your Lexer class. You
> could just transform your `tokenize` method into a generator by
> replacing ``self.result.append`` with `yield`. It gives you the
> just-in-time part for free while not picking your algorithm apart into
> tiny unrelated pieces.
>
> Python generators recently (2.5) grew a `send` method. You could use
> `next` for unconditional tokenization and ``mytokenizer.send("expected
> token")`` whenever you expect a special token.

I will try this. Thank you for the suggestion.

            Greetings,
            Thomas

            --
            Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
            (Coluche)

            • Thomas Mlynarczyk

              #7
              Re: My first Python program -- a lexer

              Paul McGuire schrieb:
> loc = data.index("list")
> print data[:loc].count("\n")-1
> print loc-data[:loc].rindex("\n")-1
>
> prints 5,14
>
> I'm sure it's non-optimal, but it *is* an algorithm that does not
> require keeping track of the start of every line...

              Yes, I was thinking of something like this. As long as the line/column
              are only needed in case of an error (i.e. at most once per script run),
              I consider this more performant than keeping track of line/column for
              every token.

              Greetings,
              Thomas

              --
              Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
              (Coluche)

              • Paul McGuire

                #8
                Re: My first Python program -- a lexer

On Nov 10, 7:29 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
> Paul McGuire schrieb:
>
> > loc = data.index("list")
> > print data[:loc].count("\n")-1
> > print loc-data[:loc].rindex("\n")-1
> >
> > prints 5,14
> >
> > I'm sure it's non-optimal, but it *is* an algorithm that does not
> > require keeping track of the start of every line...
>
> Yes, I was thinking of something like this. As long as the line/column
> are only needed in case of an error (i.e. at most once per script run),
> I consider this more performant than keeping track of line/column for
> every token.

                Just be sure to account for tabs when computing the column, which this
                simple-minded algorithm does not do.
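
One way to handle that, assuming 8-column tab stops (illustrative,
untested):

start = data[:loc].rindex("\n") + 1        # start of the offending line
col = len(data[start:loc].expandtabs(8))   # expand tabs, then count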

                -- Paul

                  • Thomas Mlynarczyk

                    #10
                    Re: My first Python program -- a lexer

                    Paul McGuire schrieb:
> Just be sure to account for tabs when computing the column, which this
> simple-minded algorithm does not do.

                    Another thing I had not thought of -- thanks for the hint.

                    Greetings,
                    Thomas

                    --
                    Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                    (Coluche)

                    • Thomas Mlynarczyk

                      #11
                      Re: My first Python program -- a lexer

                      John Machin schrieb:
> Single-character tokens like "<" may be more efficiently handled by
> doing a dict lookup after failing to find a match in the list of
> (name, regex) tuples.

                      Yes, I will keep that in mind. For the time being, I will use only
                      regexes to keep the code simpler. Later, or when the need for a speedup
                      arises, I can use your suggestion for optimizing my code. Depending on
                      the tokens, it might even be better to use the dict lookup right away
                      and the regex as a secondary means for more complex stuff.
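
Something along these lines, I imagine -- just a sketch, with made-up
token names:

# cheap dict lookup for one-character tokens, tried before the regexes
SINGLE_CHAR_TOKENS = { '<': 'T_LT', '>': 'T_GT', ',': 'T_COMMA', ':': 'T_COLON' }

def match_single_char( source, offset ):
    name = SINGLE_CHAR_TOKENS.get( source[offset] )
    if name is None:
        return None  # fall back to the (name, regex) list
    return ( name, source[offset] )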

[Mutually exclusive tokens = no ambiguities = same input -> same output]

> So what? That is useless knowledge.

For the lexer, perhaps. Not for the user. An ambiguous lexer will be of
no use.

> It is the ambiguous cases that you need to be concerned with.

Exactly. In which way does that contradict what I said?

                      [Using dict]
> No, not at all. The point is that you were not *using* any of the
> mapping functionality of the dict object, only ancillary methods like
> iteritems -- hence, you should not have been using a dict at all.

                      I /could/ have done it with a list of tuples. I use no functionality
                      that /only/ a dict can do. So using a dict here is like using a truck
                      for transporting a single sheet of paper?

                      Greetings,
                      Thomas

                      --
                      Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                      (Coluche)

                      • John Machin

                        #12
                        Re: My first Python program -- a lexer

On Nov 11, 8:35 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
> [Using dict]
>
> > No, not at all. The point is that you were not *using* any of the
> > mapping functionality of the dict object, only ancillary methods like
> > iteritems -- hence, you should not have been using a dict at all.
>
> I /could/ have done it with a list of tuples. I use no functionality
> that /only/ a dict can do. So using a dict here is like using a truck
> for transporting a single sheet of paper?

                        You are getting closer. A better analogy is that using a dict is like
                        transporting passengers along an autobahn in an aeroplane or
                        helicopter that never leaves the ground.

                        • Thomas Mlynarczyk

                          #13
                          Re: My first Python program -- a lexer

                          John Machin schrieb:
> You are getting closer. A better analogy is that using a dict is like
> transporting passengers along an autobahn in an aeroplane or
> helicopter that never leaves the ground.

                          It is not a bad idea to transport passengers in an airplane, but then
                          the airplane should not follow an autobahn, but use the shortest way --
                          at an appropriate altitude. Translated back to Python dicts, this would
                          mean that using a dict for my purposes is a good idea, but that I do not
                          make use of its full capabilities -- in other words, I should rewrite my
                          code -- still using dict but in a better way. Or do you mean that for 10
                          kilometers of autobahn, an airplane would be overkill?

                          Maybe I am a bit biased by my PHP background, but { name: regex, ... }
                          looks simpler to me than [ ( name, regex ), ... ], because the former is
                          not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.

                          Greetings,
                          Thomas

                          --
                          Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                          (Coluche)

                          • Steve Holden

                            #14
                            Re: My first Python program -- a lexer

                            Thomas Mlynarczyk wrote:
> John Machin schrieb:
>
> > You are getting closer. A better analogy is that using a dict is like
> > transporting passengers along an autobahn in an aeroplane or
> > helicopter that never leaves the ground.
>
> It is not a bad idea to transport passengers in an airplane, but then
> the airplane should not follow an autobahn, but use the shortest way --
> at an appropriate altitude. Translated back to Python dicts, this would
> mean that using a dict for my purposes is a good idea, but that I do not
> make use of its full capabilities -- in other words, I should rewrite my
> code -- still using dict but in a better way. Or do you mean that for 10
> kilometers of autobahn, an airplane would be overkill?

The latter.

> Maybe I am a bit biased by my PHP background, but { name: regex, ... }
> looks simpler to me than [ ( name, regex ), ... ], because the former is
> not a nested structure, while the latter would be a 2D-array in PHP.

                            But it's a question of the properties of the objects. Because lists are
                            ordered it's appropriate to use them when you want to iterate over them
                            and no random access is required.

                            If you need random access by key and no particular ordering is required
                            for the iteration then you should use a dict.

                            Forget "simplicity " until you know how the different objects are
                            implemented under the hood.

                            A good first rule is that simple, readable Python will usually be
                            efficient. As far as execution efficiency goes I'd be very surprised if
                            the dict solution was faster, but even if it were I would still prefer
                            the list-based solution because it's more appropriate to the problem.

                            "Premature optimization is the root of all evil" ...
> Suppose I use the dict and I want to access the regex associated with
> the token named "tokenname" (that is, no iteration, but a single
> access). I could simply write tokendict["tokenname"]. But with the list
> of tuples, I can't think of an equally easy way to do that. But then, as
> a beginner, I might be underestimating Python.

                            But when do you want to do that? There's no point inventing use cases -
                            they should be demonstrated needs.

                            The only time your original code needs to use the token name to retrieve
                            the value is for your compile() method, but assuming the passed-in
                            tokens was of the form [(name, pattern), ...] you could just as easily
                            throw the method away and in __init__() write

self.tokens = [(name, re.compile(pat)) for name, pat in tokens]

                            Or simply pass compiled token patterns in in the first place when they
                            are necessary ... then the caller has the option of not bothering to
                            optimize in the first place!

                            The advantage of the list-based approach is that it at least allows of
                            the possibility that you can observe the relative frequencies of the
                            tokens' appearance in real code, and then optimize the ordering to give
                            best (mean) performance (by putting the commonest token first) should
                            tokenization become a program hotspot.

                            With a dict you have no such opportunity, because the ordering is
                            determined by the implementation and not by your data structure.
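
For instance (a hypothetical sketch -- the counts would have to come
from profiling real input, and the names are from the example above):

# put the most frequently matched token patterns at the front of the list
observed = { 'T_LWSP': 5120, 'T_IDENTIFIER': 3044, 'T_OPERATOR': 1201 }
tokens.sort( key=lambda pair: observed.get( pair[0], 0 ), reverse=True )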

                            regards
                            Steve
                            --
                            Steve Holden +1 571 484 6266 +1 800 494 3119
                            Holden Web LLC http://www.holdenweb.com/

                            • Thomas Mlynarczyk

                              #15
                              Re: My first Python program -- a lexer

                              Steve Holden schrieb:
> > Suppose I use the dict and I want to access the regex associated with
> > the token named "tokenname" (that is, no iteration, but a single
> > access). I could simply write tokendict["tokenname"]. But with the list
> > of tuples, I can't think of an equally easy way to do that. But then,
> > as a beginner, I might be underestimating Python.
>
> But when do you want to do that? There's no point inventing use cases -
> they should be demonstrated needs.

                              Well, I had been thinking about further reducing the number of regex
                              matchings needed. So I wanted to modify my lexer not to tokenize the
                              whole input at once, but only try to grab the next token from the input
                              "just in time" / on demand. For that I was thinking of having a next()
                              method like this:

def next( self, nameOfExpectedToken ):
    regex = self.getRegexByTokenName( nameOfExpectedToken )
    match = regex.match( self.source, self.offset )
    if not match: return False
    line = self.line
    self.line += match.group(0).count( "\n" )
    self.offset += len( match.group(0) )
    return ( nameOfExpectedToken, match, line )

                              I'm not sure if this is a good idea, but it looks like one to me. The
                              problem is the first line of the method which retrieves the regex
                              associated with the given token name. Using a dict, I could simply write

regex = self.tokendict[nameOfExpectedToken]

But with a list I suppose I wouldn't get away without a loop, which I
assume is more expensive than the dict lookup.
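
(Although I suppose I could keep the ordered list and also build a dict
from it once in __init__ -- a sketch:

self.tokenlist = [ ( name, re.compile( pat, re.M ) ) for name, pat in tokens ]
self.tokendict = dict( self.tokenlist )  # name -> compiled regex

Then next() could use the dict and tokenise() the list.)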
> Or simply pass compiled token patterns in in the first place when they
> are necessary ... then the caller has the option of not bothering to
> optimize in the first place!

                              That would be an option. But shouldn't it be the lexer who takes care of
                              optimizing its own work as much as it can do without the caller's
                              assistance? After all, the caller should not need to know about the
                              internal workings of the lexer.

[Optimizing performance by putting most frequent tokens first]

> With a dict you have no such opportunity, because the ordering is
> determined by the implementation and not by your data structure.

                              True. Still, I should be able to gain even better performance with my
                              above approach using a next() function, as this would completely
                              eliminate all "useless" matching (like trying to match FOO where no foo
                              is allowed).

                              Greetings,
                              Thomas

                              --
                              Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
                              (Coluche)
