Using Groups inside Braces with Regular Expressions

  • Chris

    Using Groups inside Braces with Regular Expressions

    I'm trying to delimit sentences in a block of text by defining the
    end-of-sentence marker as a period followed by a space followed by an
    uppercase letter or end-of-string.

    I'd imagine the regex for that would look something like:
    [^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)

    However, Python keeps giving me an "unbalanced parenthesis" error for
    the [^] part. If this isn't valid regex syntax, how else would I match
    a block of text that doesn't match the delimiter pattern?

    Thanks,
    Chris
  • MRAB

    #2
    Re: Using Groups inside Braces with Regular Expressions

    On Jul 14, 12:05 am, Chris <chriss...@gmail.com> wrote:
    > I'm trying to delimit sentences in a block of text by defining the
    > end-of-sentence marker as a period followed by a space followed by an
    > uppercase letter or end-of-string.
    >
    > I'd imagine the regex for that would look something like:
    > [^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)
    >
    > However, Python keeps giving me an "unbalanced parenthesis" error for
    > the [^] part. If this isn't valid regex syntax, how else would I match
    > a block of text that doesn't match the delimiter pattern?

    What is the [^(?:[A-Z]|$)] part meant to be doing? Is it meant to be
    matching everything up to the end of the sentence?

    [...] is a character class, so Python is parsing the character class
    as:

    [^(?:[A-Z]|$)]
    ^^^^^^^^^^
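
To make the parse concrete, here is a minimal sketch, assuming a recent Python 3 `re` module. The variable names are illustrative only, and the warning filter is a precaution rather than something the thread mentions. It shows that the class ends at the first `]`, and that the leftover `|$)` is what triggers the error.

```python
import re
import warnings

pattern = r'[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)'

with warnings.catch_warnings():
    # Precaution (an assumption): some Python versions may warn about
    # bracket-like characters inside a class; silence any such warning.
    warnings.simplefilter('ignore')

    # The class ends at the first ']', so this is all Python treats as the
    # class: "any one character except '(', '?', ':', '[' or A-Z".
    char_class = re.compile(r'[^(?:[A-Z]')
    matches_lowercase = char_class.fullmatch('a') is not None
    matches_uppercase = char_class.fullmatch('B') is not None
    matches_paren = char_class.fullmatch('(') is not None

    # What follows the class is '|$)', whose ')' has no opening '('.
    try:
        re.compile(pattern)
        error_message = None
    except re.error as exc:
        error_message = str(exc)

print(matches_lowercase, matches_uppercase, matches_paren)
print(error_message)
```

So the `(?:`, `|` and `$` inside the brackets are never treated as grouping, alternation or an anchor; inside `[...]` they are just literal characters.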

    • Chris

      #3
      Re: Using Groups inside Braces with Regular Expressions

      On Jul 13, 8:14 pm, MRAB <goo...@mrabarnett.plus.com> wrote:
      > On Jul 14, 12:05 am, Chris <chriss...@gmail.com> wrote:
      > > I'm trying to delimit sentences in a block of text by defining the
      > > end-of-sentence marker as a period followed by a space followed by an
      > > uppercase letter or end-of-string.
      > >
      > > I'd imagine the regex for that would look something like:
      > > [^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)
      > >
      > > However, Python keeps giving me an "unbalanced parenthesis" error for
      > > the [^] part. If this isn't valid regex syntax, how else would I match
      > > a block of text that doesn't the delimiter pattern?
      >
      > What is the [^(?:[A-Z]|$)] part meant to be doing? Is it meant to be
      > matching everything up to the end of the sentence?
      >
      > [...] is a character class, so Python is parsing the character class
      > as:
      >
      > [^(?:[A-Z]|$)]
      > ^^^^^^^^^^

      It was meant to include everything except the end-of-sentence pattern.
      However, I just realized that I can simply replace it with ".*?"
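
A minimal sketch of that replacement, for anyone reading along. The sample text, the capturing group, and the `re.DOTALL` flag are assumptions added for illustration; only the `.*?` idea and the end-of-sentence marker come from the posts above.

```python
import re

# Hypothetical sample text; note the trailing space after the last period.
text = 'Hello there. This is Chris. Bye. '

# Lazy ".*?" takes as little as possible before the end-of-sentence marker.
# re.DOTALL is an added assumption so that a sentence may span line breaks.
sentence = re.compile(r'(.*?)\.\s+(?=[A-Z]|$)', re.DOTALL)

sentences = sentence.findall(text)
print(sentences)  # ['Hello there', 'This is Chris', 'Bye']
```

Because the quantifier is lazy, each match stops at the first period that is followed by whitespace and then a capital letter or the end of the string.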

      • John Machin

        #4
        Re: Using Groups inside Braces with Regular Expressions

        On Jul 14, 9:05 am, Chris <chriss...@gmail.com> wrote:

        Misleading subject.

        [] brackets or "square brackets"
        {} braces or "curly brackets"
        () parentheses or "round brackets"

        > I'm trying to delimit sentences in a block of text by defining the
        > end-of-sentence marker as a period followed by a space followed by an
        > uppercase letter or end-of-string.

        ... which has at least two problems:

        (1) You are insisting on at least one space between the period and the
        end-of-string (this can be overcome, see later).
        (2) Periods are often dropped in after abbreviations and contractions
        e.g. "Mr. Geo. Smith". You will get three "sentences" out of that.
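
Both problems are easy to reproduce. The following is a rough sketch that uses only the marker as originally described (period, whitespace, then a capital or end-of-string); the variable names and the first sample string are made up for illustration.

```python
import re

# The end-of-sentence marker as originally described: a period, at least one
# whitespace character, then (without consuming it) a capital or end-of-string.
marker = re.compile(r'\.\s+(?=[A-Z]|$)')

# Problem (1): with no whitespace after the final period, that period is not
# recognised as a marker and stays attached to the last piece.
no_trailing_space = marker.split('One. Two.')

# Problem (2): each abbreviation followed by a capital ends a "sentence".
abbreviations = marker.split('Mr. Geo. Smith')

print(no_trailing_space)  # ['One', 'Two.']
print(abbreviations)      # ['Mr', 'Geo', 'Smith']
```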

        > I'd imagine the regex for that would look something like:
        > [^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)
        >
        > However, Python keeps giving me an "unbalanced parenthesis" error for
        > the [^] part.

        It's nice to know that Python is consistent with its error messages.

        > If this isn't valid regex syntax,

        If? It definitely isn't valid syntax. The brackets should delimit a
        character class. You are trying to cram a somewhat complicated
        expression into a character class, or you should be using parentheses.
        However, it's a bit hard to determine what you really meant that part
        of the pattern to achieve.

        > how else would I match
        > a block of text that doesn't match the delimiter pattern?

        Start from the top down:
        A sentence is:
            anything (with some qualifications)
        followed by (but not including):
            a period
            followed by
            either
                1 or more whitespaces then a capital letter
            or
                0 or more whitespaces then end-of-string

        So something like this might do the trick:
        >>> import re
        >>> sep = re.compile(r'\.(?:\s+(?=[A-Z])|\s*(?=\Z))')
        >>> sep.split('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. ')
        ['Hello', 'Mr', 'Chris X\nis here', 'IP addr 1.2.3.4', '']
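
The top-down description maps almost line for line onto a verbose-mode pattern. This is only a sketch of one way to write it: `SENTENCE_SEP` and `split_sentences` are hypothetical names, and dropping the empty trailing piece is an added choice, not something the post above asks for.

```python
import re

SENTENCE_SEP = re.compile(r"""
    \.                  # a period
    (?:
        \s+ (?=[A-Z])   # 1 or more whitespace, then a capital letter (not consumed)
      |
        \s* (?=\Z)      # 0 or more whitespace, then end-of-string
    )
""", re.VERBOSE)


def split_sentences(text):
    """Split text on the separator and drop empty pieces (hypothetical helper)."""
    return [piece for piece in SENTENCE_SEP.split(text) if piece]


sentences = split_sentences('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. ')
print(sentences)  # ['Hello', 'Mr', 'Chris X\nis here', 'IP addr 1.2.3.4']
```

Note that "Mr." is still split off as its own piece, so the abbreviation problem remains; the periods inside 1.2.3.4 survive only because no whitespace follows them.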
