pyparsing: match empty line

  • Marek Kubica

    pyparsing: match empty line

    Hi,

    I am trying to get this stuff working, but I still fail.

    I have a format which consists of three elements:
    \d{4}M?-\d (4 numbers, optional M, dash, another number)
    EMPTY (the <EMPTY> token)
    [Empty line] (the <PAGEBREAK> token. The line may contain whitespace,
    but nothing else)

    While the ``watchname`` and ``leaveempty`` were trivial, I cannot get
    ``pagebreak`` to work properly.

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-

    from pyparsing import (Word, Literal, Optional, Group, OneOrMore, Regex,
                           Combine, ParserElement, nums, LineStart, LineEnd,
                           White, replaceWith)

    # do not skip newlines automatically; they are significant in this format
    ParserElement.setDefaultWhitespaceChars(' \t\r')

    watchseries = Word(nums, exact=4)
    watchrev = Word(nums, exact=1)

    watchname = Combine(watchseries + Optional('M') + '-' + watchrev)

    leaveempty = Literal('EMPTY')

    def breaks(s, loc, tokens):
        print(repr(tokens[0]))
        #return ['<PAGEBREAK>' for token in tokens[0]]
        return ['<PAGEBREAK>']

    #pagebreak = Regex('^\s*$').setParseAction(breaks)
    pagebreak = LineStart() + LineEnd().setParseAction(
        replaceWith('<PAGEBREAK>'))

    parser = OneOrMore(watchname ^ pagebreak ^ leaveempty)

    tests = [
        "2134M-2",
        """3245-3
    3456M-5""",
        """3256-4

    4563-4""",
        """4562M-6
    EMPTY
    3246-5"""
    ]

    for test in tests:
        print(parser.parseString(test))

    The output should be:
    ['2134M-2']
    ['3245-3', '3456M-5']
    ['3256-4', '<PAGEBREAK>', '4563-4']
    ['4562M-6', '<EMPTY>', '3246-5']

    Thanks in advance!
    regards,
    Marek
  • Paul McGuire

    #2
    Re: pyparsing: match empty line

    On Sep 2, 11:38 am, Marek Kubica <ma...@xivilization.net> wrote:
    > Hi,
    >
    > I am trying to get this stuff working, but I still fail.
    >
    > I have a format which consists of three elements:
    > \d{4}M?-\d (4 numbers, optional M, dash, another number)
    > EMPTY (the <EMPTY> token)
    > [Empty line] (the <PAGEBREAK> token. The line may contain whitespace,
    > but nothing else)
    >
    > <snip>

    Marek -

    Here are some refinements to your program that will get you closer to
    your posted results.

    1) Well done in resetting the default whitespace characters, since you
    are doing some parsing that is dependent on the presence of line
    ends. When you do this, it is useful to define an expression for end
    of line so that you can reference it where you explicitly expect to
    find line ends:

    EOL = LineEnd().suppress()


    2) Your second test fails because there is an EOL between the two
    watchnames. Since you have removed EOL from the set of default
    whitespace characters (that is, whitespace that pyparsing will
    automatically skip over), then pyparsing will stop after reading the
    first watchname. I think that you want EOLs to get parsed if nothing
    else matches, so you can add it to the end of your grammar definition:

    parser = OneOrMore(watchname ^ pagebreak ^ leaveempty ^ EOL)

    This will now permit the second test to pass.
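A minimal sketch of point 2, assuming a reduced grammar that contains only the watchname expression from the original post. It shows that without an EOL alternative `parseString` simply stops at the first newline, and that adding EOL lets the loop continue:

```python
from pyparsing import Combine, LineEnd, OneOrMore, Optional, ParserElement, Word, nums

# Stop pyparsing from silently skipping newlines (affects elements defined below).
ParserElement.setDefaultWhitespaceChars(' \t\r')

watchname = Combine(Word(nums, exact=4) + Optional('M') + '-' + Word(nums, exact=1))
EOL = LineEnd().suppress()

text = "3245-3\n3456M-5"

# Without EOL the parser stops at the first newline and returns what it has so far.
without_eol = OneOrMore(watchname).parseString(text).asList()

# With EOL as an alternative, the newline is consumed (and suppressed),
# so the repetition can go on to the next line.
with_eol = OneOrMore(watchname | EOL).parseString(text).asList()

print(without_eol)  # ['3245-3']
print(with_eol)     # ['3245-3', '3456M-5']
```

Note that `parseString` does not complain about the unparsed remainder in the first case unless it is asked to match the whole input.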


    3) Your definition of pagebreak looks okay now, but I don't understand
    why your test containing 2 blank lines is only supposed to generate a
    single <PAGEBREAK>.

    pagebreak = (LineStart() +
                 LineEnd().setParseAction(replaceWith('<PAGEBREAK>')))

    If you really want to get only a single <PAGEBREAK> from your test
    case, then change pagebreak to:

    pagebreak = OneOrMore(LineStart() +
                          LineEnd()).setParseAction(replaceWith('<PAGEBREAK>'))


    4) leaveempty probably needs this parse action to be attached to it:

    leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))


    5) (optional) Your definition of parser uses '^' operators, which
    translate into Or expressions. Or expressions evaluate all the
    alternatives, and then choose the longest match. The expressions you
    have don't really have any ambiguity to them, and could be evaluated
    using:

    parser = OneOrMore(watchname | pagebreak | leaveempty | EOL)

    '|' operators generate MatchFirst expressions. MatchFirst will do
    short-circuit evaluation - the first expression that matches will be
    the one chosen as the matching alternative.
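The difference between the two operators only becomes visible when alternatives overlap. A small sketch, using two made-up literals (`short` and `longer` are illustrative names, not part of the grammar in this thread):

```python
from pyparsing import Literal

short = Literal("ab")
longer = Literal("abc")

# '|' builds a MatchFirst: alternatives are tried left to right, first success wins.
first_match = (short | longer).parseString("abc").asList()

# '^' builds an Or: every alternative is tried, and the longest match wins.
longest_match = (short ^ longer).parseString("abc").asList()

print(first_match)    # ['ab']
print(longest_match)  # ['abc']
```

With non-overlapping alternatives such as watchname, pagebreak, leaveempty and EOL, both forms give the same result, and `|` avoids evaluating the remaining alternatives.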


    If you have more pyparsing questions, you can also post them on the
    pyparsing wiki - the Discussion tab on the wiki Home page has become a
    running support forum - and there is also a Help/Discussion mailing
    list.

    Cheers,
    -- Paul


    • Marek Kubica

      #3
      Re: pyparsing: match empty line

      Hi,

      First of all a big thank you for your excellent library and of course
      also for your extensive and enlightening answer!
      > 1) Well done in resetting the default whitespace characters, since you
      > are doing some parsing that is dependent on the presence of line ends.
      > When you do this, it is useful to define an expression for end of line
      > so that you can reference it where you explicitly expect to find line
      > ends:
      >
      > EOL = LineEnd().suppress()

      Ok, I didn't think about this. But as my program is not only a parser but
      a long-running process, and setDefaultWhitespaceChars modifies a global
      variable, I don't feel too comfortable with it. I could set the whitespace
      on every element, but that is, as you surely agree, quite ugly. Do you
      accept patches? I'm thinking about some kind of factory class which would
      automatically set the whitespace:

      >>> factory = TokenFactory(' \t\r')
      >>> word = factory.Word(alphas)

      That way, one wouldn't need to set a global value which might interfere
      with other pyparsers running in the same process.

      > parser = OneOrMore(watchname ^ pagebreak ^ leaveempty ^ EOL)
      >
      > This will now permit the second test to pass.

      Right. It seems that working with whitespace requires a somewhat better
      understanding than I had.

      > 3) Your definition of pagebreak looks okay now, but I don't understand
      > why your test containing 2 blank lines is only supposed to generate a
      > single <PAGEBREAK>.

      No, it should be one <PAGEBREAK> per blank line; now it works as expected.

      > 4) leaveempty probably needs this parse action to be attached to it:
      >
      > leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))

      I added this in the meantime. replaceWith is really a handy helper.

      > parser = OneOrMore(watchname | pagebreak | leaveempty | EOL)
      >
      > '|' operators generate MatchFirst expressions. MatchFirst will do
      > short-circuit evaluation - the first expression that matches will be the
      > one chosen as the matching alternative.

      Okay, adjusted it.

      > If you have more pyparsing questions, you can also post them on the
      > pyparsing wiki - the Discussion tab on the wiki Home page has become a
      > running support forum - and there is also a Help/Discussion mailing
      > list.

      Which of these two would you prefer?

      Thanks again, it works now just as I imagined!
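For readers following along, the grammar with all five suggestions applied might look like the sketch below. It is assembled from the snippets in this thread rather than taken from the final program, which was not posted. No output is claimed for the blank-line test, because LineStart handling has been revised in later pyparsing releases.

```python
from pyparsing import (Combine, LineEnd, LineStart, Literal, OneOrMore,
                       Optional, ParserElement, Word, nums, replaceWith)

# Newlines are significant here, so they must not be skipped as whitespace.
ParserElement.setDefaultWhitespaceChars(' \t\r')

EOL = LineEnd().suppress()
watchname = Combine(Word(nums, exact=4) + Optional('M') + '-' + Word(nums, exact=1))
leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))
pagebreak = LineStart() + LineEnd().setParseAction(replaceWith('<PAGEBREAK>'))

# MatchFirst ('|'): the alternatives do not overlap, so the first hit is enough.
parser = OneOrMore(watchname | pagebreak | leaveempty | EOL)

tests = [
    "2134M-2",
    "3245-3\n3456M-5",
    "3256-4\n\n4563-4",        # blank line, expected to yield a <PAGEBREAK>
    "4562M-6\nEMPTY\n3246-5",
]

results = [parser.parseString(test).asList() for test in tests]
for result in results:
    print(result)
```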

      regards,
      Marek


      • Paul McGuire

        #4
        Re: pyparsing: match empty line

        On Sep 3, 4:26 am, Marek Kubica <ma...@xivilization.net> wrote:
        > Hi,
        >
        > First of all a big thank you for your excellent library and of course
        > also for your extensive and enlightening answer!
        >
        I'm glad pyparsing has been of help to you. Pyparsing is building its
        own momentum these days. I have a new release in SVN that I'll put
        out in the next week or so.

        > Ok, I didn't think about this. But as my program is not only a parser but
        > a long-running process and setDefaultWhitespaceChars modifies a global
        > variable I don't feel too comfortable with it.

        Pyparsing isn't really all that thread-friendly. You definitely
        should not have multiple threads using the same grammar. The
        approaches I've seen people use in multithread applications are: 1)
        synchronize access to a single parser across multiple threads, and 2)
        create a parser per-thread, or use a pool of parsers. Pyparsing
        parsers can be pickled, so a quick way to reconstitute a parser is to
        create the parser at startup time and pickle it to a string, then
        unpickle a new parser as needed.
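The first two approaches can be sketched with the standard `threading` module. The helper names (`build_parser`, `parse_shared`, `parse_per_thread`) and the toy grammar are assumptions made for illustration; the pickling variant is left out here.

```python
import threading

from pyparsing import OneOrMore, Word, nums


def build_parser():
    """Build a fresh grammar; used once for the shared parser and once per thread."""
    return OneOrMore(Word(nums))


# Approach 1: one shared parser, with a lock serializing access to it.
shared_parser = build_parser()
shared_lock = threading.Lock()


def parse_shared(text):
    with shared_lock:
        return shared_parser.parseString(text).asList()


# Approach 2: one parser per thread, created lazily on first use.
_thread_state = threading.local()


def parse_per_thread(text):
    parser = getattr(_thread_state, "parser", None)
    if parser is None:
        parser = _thread_state.parser = build_parser()
    return parser.parseString(text).asList()


results = {}


def worker(n):
    text = "%d %d" % (n, n + 1)
    results[n] = (parse_shared(text), parse_per_thread(text))


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results[0])  # (['0', '1'], ['0', '1'])
```

The lock is the simpler option when parsing is a small part of the workload; per-thread parsers avoid contention at the cost of building the grammar several times.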

        > I could set the whitespace
        > on every element, but that is as you surely agree quite ugly. Do you
        > accept patches? I'm thinking about some kind of factory-class which would
        > automatically set the whitespaces:
        >
        > >>> factory = TokenFactory(' \t\r')
        > >>> word = factory.Word(alphas)
        >
        > That way, one wouldn't need to set a global value which might interfere
        > with other pyparsers running in the same process.

        I tried to prototype up your TokenFactory class, but once I got as far
        as implementing __getattribute__ to return the corresponding pyparsing
        class, I couldn't see how to grab the object generated for that class,
        and modify its whitespace values. I did cook up this, though:

        class SetWhitespace(object):
            def __init__(self, whitespacechars):
                self.whitespacechars = whitespacechars

            def __call__(self, pyparsing_expr):
                # setWhitespaceChars changes the skipped whitespace for this
                # one expression only, leaving the global default untouched
                pyparsing_expr.setWhitespaceChars(self.whitespacechars)
                return pyparsing_expr

        noNLskipping = SetWhitespace(' \t\r')
        word = noNLskipping(Word(alphas))

        I'll post this on the wiki and see what kind of comments we get.

        By the way, setDefaultWhitespaceChars only updates global variables that
        are used at parser definition time, *not* at parser parse time. So,
        again, you can manage this class attribute at the initialization of
        your program, before any incoming requests need to make use of one
        parser or another.
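A short sketch of this definition-time behaviour: an element keeps the whitespace setting that was the default when it was created, so the default can be changed, used, and put back during initialization. The variable names are illustrative only.

```python
from pyparsing import ParseException, ParserElement, Word, alphas

# Change the default, define an element, then restore pyparsing's usual default.
ParserElement.setDefaultWhitespaceChars(' \t\r')
no_newline_skipping = Word(alphas)
ParserElement.setDefaultWhitespaceChars(' \n\t\r')

# Defined after the restore, so this element skips newlines as usual.
newline_skipping = Word(alphas)

text = "\nabc"

try:
    no_newline_skipping.parseString(text)
    first_result = "parsed"
except ParseException:
    # The leading newline is not skipped, so Word(alphas) cannot match.
    first_result = "failed"

second_result = newline_skipping.parseString(text).asList()

print(first_result)   # failed
print(second_result)  # ['abc']
```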

        >> 4) leaveempty probably needs this parse action to be attached to it:
        >>
        >> leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))
        >
        > I added this in the meantime. replaceWith is really a handy helper.

        After I released replaceWith, I received a parser from someone who
        hadn't read down to the 'R's yet in the documentation, and he
        implemented the same thing with this simple format:

        leaveempty = Literal('EMPTY').setParseAction(lambda: '<EMPTY>')

        These are pretty much equivalent, I was just struck at how easy Python
        makes things for us, too!
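A quick check that the two spellings behave the same for this input (a sketch; the variable names are illustrative):

```python
from pyparsing import Literal, replaceWith

with_helper = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))
# A parse action may take no arguments at all; its return value replaces the tokens.
with_lambda = Literal('EMPTY').setParseAction(lambda: '<EMPTY>')

helper_result = with_helper.parseString('EMPTY').asList()
lambda_result = with_lambda.parseString('EMPTY').asList()

print(helper_result)  # ['<EMPTY>']
print(lambda_result)  # ['<EMPTY>']
```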

        >> If you have more pyparsing questions, you can also post them on the
        >> pyparsing wiki - the Discussion tab on the wiki Home page has become a
        >> running support forum - and there is also a Help/Discussion mailing
        >> list.
        >
        > Which of these two would you prefer?

        They are equivalent, I monitor them both, and you can browse through
        previous discussions using the Discussion tab online threads, or the
        mailing list archive on SF. Use whichever is easier for you to work
        with.

        Cheers, and Welcome to Pyparsing!
        -- Paul


        • Marek Kubica

          #5
          Re: pyparsing: match empty line

          On Wed, 03 Sep 2008 06:12:47 -0700, Paul McGuire wrote:
          > On Sep 3, 4:26 am, Marek Kubica <ma...@xivilization.net> wrote:
          >> I could set the whitespace
          >> on every element, but that is as you surely agree quite ugly. Do you
          >> accept patches? I'm thinking about some kind of factory-class which
          >> would automatically set the whitespaces:
          >>
          >> >>> factory = TokenFactory(' \t\r')
          >> >>> word = factory.Word(alphas)
          >>
          >> That way, one wouldn't need to set a global value which might interfere
          >> with other pyparsers running in the same process.
          >
          > I tried to prototype up your TokenFactory class, but once I got as far
          > as implementing __getattribute__ to return the corresponding pyparsing
          > class, I couldn't see how to grab the object generated for that class,
          > and modify its whitespace values.

          I have had the same problem, until I remembered that I can fake __init__
          using a function closure.

          I have imported pyparsing.py into a hg repository with a patchstack, here
          is my first patch:

          diff -r 12e2bbff259e pyparsing.py
          --- a/pyparsing.py Wed Sep 03 09:40:09 2008 +0000
          +++ b/pyparsing.py Wed Sep 03 14:08:15 2008 +0000
          @@ -1400,9 +1400,38 @@
               def __req__(self,other):
                   return self == other
           
          +class TokenFinder(type):
          +    """Collects all classes that are derived from Token"""
          +    token_classes = dict()
          +    def __init__(cls, name, bases, dict):
          +        # save the class
          +        TokenFinder.token_classes[cls.__name__] = cls
          +
          +class WhitespaceTokenFactory(object):
          +    def __init__(self, whitespace):
          +        self._whitespace = whitespace
          +
          +    def __getattr__(self, name):
          +        """Get an attribute of this class"""
          +        # check whether there is such a Token
          +        if name in TokenFinder.token_classes:
          +            token = TokenFinder.token_classes[name]
          +            # construct a closure which fakes the constructor
          +            def _callable(*args, **kwargs):
          +                obj = token(*args, **kwargs)
          +                # set the whitespace on the token
          +                obj.setWhitespaceChars(self._whitespace)
          +                return obj
          +            # return the function which returns an instance of the Token
          +            return _callable
          +        else:
          +            raise AttributeError("'%s' object has no attribute '%s'" % (
          +                WhitespaceTokenFactory.__name__, name))
           
           class Token(ParserElement):
               """Abstract ParserElement subclass, for defining atomic matching
               patterns."""
          +    __metaclass__ = TokenFinder
          +
               def __init__( self ):

          I used metaclasses for getting all Token-subclasses so new classes that
          are created are automatically accessible via the factory, without any
          additional registration.
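The same registry-plus-closure pattern can be tried without patching pyparsing. The sketch below uses Python 3 metaclass syntax and two stand-in classes, `Token` and `Word`, which are hypothetical and only mimic the whitespace attribute; they are not pyparsing's classes.

```python
class TokenFinder(type):
    """Metaclass that records every class created with it, keyed by name."""
    token_classes = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        TokenFinder.token_classes[name] = cls


class Token(metaclass=TokenFinder):
    """Stand-in base class; it only tracks a whitespace setting."""
    def __init__(self):
        self.white_chars = ' \n\t\r'

    def set_whitespace_chars(self, chars):
        self.white_chars = chars
        return self


class Word(Token):
    """Stand-in subclass, registered automatically through the metaclass."""
    def __init__(self, init_chars):
        super().__init__()
        self.init_chars = init_chars


class WhitespaceTokenFactory:
    def __init__(self, whitespace):
        self._whitespace = whitespace

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            token_class = TokenFinder.token_classes[name]
        except KeyError:
            raise AttributeError("%r object has no attribute %r"
                                 % (type(self).__name__, name)) from None

        # Closure standing in for the constructor: build, then set whitespace.
        def construct(*args, **kwargs):
            return token_class(*args, **kwargs).set_whitespace_chars(self._whitespace)
        return construct


factory = WhitespaceTokenFactory(' \t\r')
word = factory.Word('abc')
print(type(word).__name__, repr(word.white_chars))  # Word ' \t\r'
```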

          Oh and yes, more patches will follow. I'm currently editing the second
          patch, but I better mail it directly to you as it is not really
          interesting for this list.

          regards,
          Marek
