Looking for very simple general purpose tokenizer

  • Maarten van Reeuwijk

    Looking for very simple general purpose tokenizer

    Hi group,

    I need to parse various text files in python. I was wondering if there was
    a general purpose tokenizer available. I know about split(), but this
    (otherwise very handy) method does not allow me to specify a list of
    splitting characters, only one at a time, and it removes my splitting
    operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore, I
    tried tokenize, but that is specifically for Python source and is way too
    heavy for me. I am looking for something like this:


    splitchars = [' ', '\n', '=', '/', ....]
    tokenlist = tokenize(rawfile, splitchars)

    Is there something like this available inside Python, or did anyone
    already make this? Thank you in advance.

    Maarten
    --
    =======================================================================
    Maarten van Reeuwijk      Heat and Fluid Sciences
    PhD student               dept. of Multiscale Physics
    www.ws.tn.tudelft.nl      Delft University of Technology
  • Eric Brunel

    #2
    Re: Looking for very simple general purpose tokenizer

    Maarten van Reeuwijk wrote:
    > Hi group,
    >
    > I need to parse various text files in python. I was wondering if there
    > was a general purpose tokenizer available. I know about split(), but this
    > (otherwise very handy) method does not allow me to specify a list of
    > splitting characters, only one at a time, and it removes my splitting
    > operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore,
    > I tried tokenize, but that is specifically for Python source and is way
    > too heavy for me. I am looking for something like this:
    >
    >
    > splitchars = [' ', '\n', '=', '/', ....]
    > tokenlist = tokenize(rawfile, splitchars)
    >
    > Is there something like this available inside Python, or did anyone
    > already make this? Thank you in advance.

    You may use re.findall for that:
    >>> import re
    >>> s = "a = b+c; z = 34;"
    >>> pat = " |=|;|[^ =;]*"
    >>> re.findall(pat, s)
    ['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';', '']

    The pattern basically says: match either a space, a '=', a ';', or a
    sequence of any characters that are not space, '=' or ';'. You may have to
    take care beforehand of special characters like \n or \ (which are very
    special in regular expressions).
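
    For instance, a small sketch along the same lines (hypothetical, just to
    illustrate the idea) that builds the pattern from a list of splitting
    characters and escapes them automatically:

    # Sketch: derive the findall() pattern from a list of split characters.
    # re.escape() takes care of characters that are special in regexes.
    import re

    splitchars = [' ', '\n', '=', '/']
    escaped = ''.join([re.escape(c) for c in splitchars])
    pat = "[%s]|[^%s]+" % (escaped, escaped)

    print re.findall(pat, "a = b+c\nz = 34")
    # -> ['a', ' ', '=', ' ', 'b+c', '\n', 'z', ' ', '=', ' ', '34']

    Using '+' instead of '*' in the second alternative also avoids the empty
    string at the end of the result list.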

    HTH
    --
    - Eric Brunel <eric dot brunel at pragmadev dot com> -
    PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com


    • Paul McGuire

      #3
      Re: Looking for very simple general purpose tokenizer

      "Maarten van Reeuwijk" <maarten@remove _this_ws.tn.tud elft.nl> wrote in
      message news:bug9ij$30k $1@news.tudelft .nl...[color=blue]
      > Hi group,
      >
      > I need to parse various text files in python. I was wondering if there was[/color]
      a[color=blue]
      > general purpose tokenizer available. I know about split(), but this
      > (otherwise very handy method does not allow me to specify a list of
      > splitting characters, only one at the time and it removes my splitting
      > operators (OK for spaces and \n's but not for =, / etc. Furthermore I[/color]
      tried[color=blue]
      > tokenize but this specifically for Python and is way too heavy for me. I[/color]
      am[color=blue]
      > looking for something like this:
      >
      >
      > splitchars = [' ', '\n', '=', '/', ....]
      > tokenlist = tokenize(rawfil e, splitchars)
      >
      > Is there something like this available inside Python or did anyone already
      > make this? Thank you in advance
      >
      > Maarten
      > --
      > =============== =============== =============== =============== =======
      > Maarten van Reeuwijk Heat and Fluid Sciences
      > Phd student dept. of Multiscale Physics
      > www.ws.tn.tudelft.nl Delft University of Technology[/color]
      Maarten -
      Please give my pyparsing module a try. You can download it from
      SourceForge at http://pyparsing.sourceforge.net. I wrote it for just this
      purpose: it allows you to define your own parsing patterns for any text
      data file, and the tokenized results are returned in a dictionary or a
      list, as you prefer. The download also includes several examples - one
      especially difficult file-parsing solution is shown in the dictExample.py
      script. And if you get stuck, send me a sample of what you are trying to
      parse, and I can try to give you some pointers (or even tell you if
      pyparsing isn't the most appropriate tool for your job - it happens
      sometimes!).
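
      Just to give a flavour of the style (a made-up sketch, not one of the
      examples shipped with pyparsing; the grammar below is invented purely
      for illustration):

      # Hypothetical illustration of a pyparsing-style tokenizer; the grammar
      # is made up for this post, not part of the pyparsing distribution.
      from pyparsing import Word, alphanums, oneOf, OneOrMore

      token = Word(alphanums + "._") | oneOf("= / + - ( ) ; ,")
      tokenizer = OneOrMore(token)

      print tokenizer.parseString("Lz = 0.15 / nu")
      # -> ['Lz', '=', '0.15', '/', 'nu']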

      -- Paul McGuire

      Austin, Texas, USA



      • Alan Kennedy

        #4
        Re: Looking for very simple general purpose tokenizer

        Maarten van Reeuwijk wrote:
        > I need to parse various text files in python. I was wondering if
        > there was a general purpose tokenizer available.

        Indeed there is: python comes with batteries included. Try the shlex
        module.

        http://www.python.org/


        Try the following code: it seems to do what you want. If it doesn't,
        then please be more specific about your tokenisation rules.

        #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
        splitchars = [' ', '\n', '=', '/',]

        source = """
        thisshouldcome inthree parts
        thisshould comeintwo
        andso/shouldthis
        and=this
        """

        import shlex
        import StringIO

        def prepareToker(toker, splitters):
            for s in splitters: # resists People's Front of Judea joke ;-D
                if toker.whitespace.find(s) == -1:
                    toker.whitespace = "%s%s" % (s, toker.whitespace)
            return toker

        buf = StringIO.StringIO(source)
        toker = shlex.shlex(buf)
        toker = prepareToker(toker, splitchars)
        for num, tok in enumerate(toker):
            print "%s:%s" % (num, tok)
        #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

        Note that the use of the iteration-based interface in the above code
        requires Python 2.3. If you need it to run on earlier versions, please
        say which one.
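
        For what it's worth, on older versions the same loop can be written
        with the explicit get_token() call (a minimal sketch; get_token()
        returns an empty string at end of input):

        # Equivalent loop without the 2.3 iterator protocol or enumerate().
        num = 0
        tok = toker.get_token()
        while tok:
            print "%s:%s" % (num, tok)
            num = num + 1
            tok = toker.get_token()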

        regards,

        --
        alan kennedy
        ------------------------------------------------------
        check http headers here: http://xhaus.com/headers
        email alan: http://xhaus.com/contact/alan


        • Maarten van Reeuwijk

          #5
          Re: Looking for very simple general purpose tokenizer

          Thank you all for your very useful comments. Below I have included my
          source. Could you comment on whether there is a more elegant way of
          implementing the continuation character &?

          With the RE implementation I have noticed that the position of the
          '*' in spclist is very delicate. This order works, but other orders
          throw exceptions. Is this correct, or is it a bug? Lastly, is there
          more documentation and examples for the shlex module? Ideally I would
          like to see a full-scale example of how this module should be used to
          parse.

          Maarten

          import re
          import shlex
          import StringIO

          def splitf90(source):
              buf = StringIO.StringIO(source)
              toker = shlex.shlex(buf)
              toker.commenters = "!"
              toker.whitespace = " \t\r"
              return processTokens(toker)

          def splitf90_re(source):
              spclist = ['\*', '\+', '-', '/', '=', '\[', '\]', '\(', '\)' \
                         '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
              pat = '|'.join(spclist) + '|[^' + ''.join(spclist) + ']+'
              rawtokens = re.findall(pat, source)
              return processTokens(rawtokens)

          def processTokens(rawtokens):
              # substitute characters
              subst1 = []
              prevtoken = None
              for token in rawtokens:
                  if token == ';': token = '\n'
                  if token == ' ': token = ''
                  if token == '\n' and prevtoken == '&': token = ''
                  if not token == '':
                      subst1.append(token)
                  prevtoken = token

              # remove continuation chars
              subst2 = []
              for token in subst1:
                  if token == '&': token = ''
                  if not token == '':
                      subst2.append(token)

              # split into lines
              final = []
              curline = []
              for token in subst2:
                  if not token == '\n':
                      curline.append(token)
                  else:
                      if not curline == []:
                          final.append(curline)
                      curline = []

              return final

          # Example session
          src = """
          MODULE modsize
          implicit none

          integer, parameter:: &
          Nx = 256, &
          Ny = 256, &
          Nz = 256, &
          nt = 1, & ! nr of (passive) scalars
          Np = 16 ! nr of processors, should match mpirun -np .. command

          END MODULE
          """
          print splitf90(src)
          print splitf90_re(src)

          Output:
          [['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
          ':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
          ',', 'nt', '=', '1', ',', 'Np', '=', '16'], ['END', 'MODULE']]

          [['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
          ':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
          ',', 'nt', '=', '1', ',', '!', 'nr', 'of', '(', 'passive', 'scalars'],
          ['Np', '=', '16', '!', 'nr', 'of', 'processors', ',', 'should', 'match',
          'mpirun', '-', 'np', 'command'], ['END', 'MODULE']]

          --
          =======================================================================
          Maarten van Reeuwijk      Heat and Fluid Sciences
          PhD student               dept. of Multiscale Physics
          www.ws.tn.tudelft.nl      Delft University of Technology


          • Maarten van Reeuwijk

            #6
            Re: Looking for very simple general purpose tokenizer

            I found a complication with the shlex module. When I execute the
            following fragment you'll notice that doubles are split. Is there
            any way to avoid this splitting of numbers?


            source = """
            $NAMRUN
            Lz = 0.15
            nu = 1.08E-6
            """

            import shlex
            import StringIO

            buf = StringIO.StringIO(source)
            toker = shlex.shlex(buf)
            toker.commenters = ""
            toker.whitespace = " \t\r"
            print [tok for tok in toker]

            Output:
            ['\n', '$', 'NAMRUN', '\n', 'Lz', '=', '0', '.', '15', '\n', 'nu', '=', '1',
            '.', '08E', '-', '6', '\n']


            --
            =======================================================================
            Maarten van Reeuwijk      Heat and Fluid Sciences
            PhD student               dept. of Multiscale Physics
            www.ws.tn.tudelft.nl      Delft University of Technology


            • JanC

              #7
              Re: Looking for very simple general purpose tokenizer

              Maarten van Reeuwijk <maarten@remove_this_ws.tn.tudelft.nl> wrote:

              > I found a complication with the shlex module. When I execute the
              > following fragment you'll notice that doubles are split. Is
              > there any way to avoid this splitting of numbers?

              From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>:

              wordchars
                  The string of characters that will accumulate into
                  multi-character tokens. By default, includes all ASCII
                  alphanumerics and underscore.
              > source = """
              > $NAMRUN
              > Lz = 0.15
              > nu = 1.08E-6
              > """
              >
              > import shlex
              > import StringIO
              >
              > buf = StringIO.StringIO(source)
              > toker = shlex.shlex(buf)
              > toker.commenters = ""
              > toker.whitespace = " \t\r"

              toker.wordchars = toker.wordchars + ".-$" # etc.
              > print [tok for tok in toker]


              Output:

              ['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

              Is this what you want?
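
              For completeness, the whole adjusted fragment in one piece (the
              same code as above, just gathered together so it can be pasted
              and run as-is):

              import shlex
              import StringIO

              source = """
              $NAMRUN
              Lz = 0.15
              nu = 1.08E-6
              """

              buf = StringIO.StringIO(source)
              toker = shlex.shlex(buf)
              toker.commenters = ""
              toker.whitespace = " \t\r"
              # Let '.', '-' and '$' accumulate into word tokens instead of
              # being returned as single-character tokens.
              toker.wordchars = toker.wordchars + ".-$"
              print [tok for tok in toker]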

              --
              JanC

              "Be strict when sending and tolerant when receiving."
              RFC 1958 - Architectural Principles of the Internet - section 3.9
