Identifying unicode punctuation characters with Python regex

  • Shiao

    Identifying unicode punctuation characters with Python regex

    Hello,
    I'm trying to build a regex in python to identify punctuation
    characters in all the languages. Some regex implementations support an
    extended syntax \p{P} that does just that. As far as I know, python re
    doesn't. Any idea of a possible alternative?

    Apart from manually including the punctuation character range for each
    and every language, I don't see how this can be done.

    Thanks in advance for any suggestions.

    John
  • Martin v. Löwis

    #2
    Re: Identifying unicode punctuation characters with Python regex

    > I'm trying to build a regex in python to identify punctuation
    > characters in all the languages. Some regex implementations support an
    > extended syntax \p{P} that does just that. As far as I know, python re
    > doesn't. Any idea of a possible alternative?

    You should use character classes. You can generate them automatically
    from the unicodedata module: check whether unicodedata.category(c)
    starts with "P".

    Regards,
    Martin
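
A minimal sketch of the check Martin describes. The helper name is_punctuation is illustrative and not from the thread. Note that the "P" categories cover punctuation only, so characters such as $ and + (categories Sc and Sm) are not included.

```python
import unicodedata


def is_punctuation(ch):
    """Return True if ch belongs to a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


# Print the general category of a few sample characters.
for ch in ",。¿-$+a":
    print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
```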


    • Shiao

      #3
      Re: Identifying unicode punctuation characters with Python regex

      On Nov 14, 11:27 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
      > You should use character classes. You can generate them automatically
      > from the unicodedata module: check whether unicodedata.category(c)
      > starts with "P".

      Thanks Martin. I'll do this.


      • Mark Tolonen

        #4
        Re: Identifying unicode punctuation characters with Python regex

        You can always build your own pattern. Something like (Python 3.0rc2):

        >>> import unicodedata
        >>> Po = ''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
        >>> import re
        >>> r = re.compile('[' + Po + ']')
        >>> x = '我是美國人。'
        >>> x
        '我是美國人。'
        >>> r.findall(x)
        ['。']

        -Mark


        • Mark Tolonen

          #5
          Re: Identifying unicode punctuation characters with Python regex

          This was an interesting problem. Both \ and ] need to be escaped to find
          all the punctuation correctly. It turns out those two characters are
          adjacent in the Unicode character set (U+005C and U+005D), so ] was
          coincidentally escaped in my first attempt.

          IDLE 3.0rc2
          >>> import re
          >>> import unicodedata as u
          >>> A = ''.join(chr(i) for i in range(65536))
          >>> P = ''.join(chr(i) for i in range(65536) if u.category(chr(i))[0] == 'P')
          >>> len(A)
          65536
          >>> len(P)
          491
          >>> len(re.findall('[' + P + ']', A))  # ] was naturally escaped
          490
          >>> set(P) - set(re.findall('[' + P + ']', A))  # so only missing \
          {'\\'}
          >>> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them.
          >>> len(re.findall('[' + P + ']', A))
          491

          -Mark
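
The escaping effect Mark describes can be reproduced on a much smaller string. This is only an illustrative sketch: the five characters and the variable names below are chosen for the demonstration and do not come from the session above.

```python
import re

# Five punctuation characters in code-point order: @ [ \ ] _
chars = "@[\\]_"

# Naive class: the pattern text is [@[\]_] so the backslash is read as an
# escape for the ] that follows it, and the backslash itself is never matched.
naive = re.compile("[" + chars + "]")
missing = set(chars) - set(naive.findall(chars))
print(missing)

# Escape \ first, then ], so that both become literal members of the class.
escaped = chars.replace("\\", "\\\\").replace("]", "\\]")
fixed = re.compile("[" + escaped + "]")
print(fixed.findall(chars) == list(chars))
```

The order of the two replace() calls matters: escaping ] first would add a backslash that the second call then doubles, leaving the ] unescaped again.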


          • Shiao

            #6
            Re: Identifying unicode punctuation characters with Python regex

            Mark,
            Many thanks. I feel almost ashamed I got away with it so easily :-)


            • jhermann

              #7
              Re: Identifying unicode punctuation characters with Python regex

              > P = P.replace('\\', '\\\\').replace(']', '\\]')   # escape both of them.

              re.escape() does this without your code making any assumptions about
              the regex implementation.
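
A sketch of what that could look like. Two things here are assumptions rather than anything stated in the thread: the name PUNCT_RE, and the use of sys.maxunicode to scan every code point instead of only the first 65536.

```python
import re
import sys
import unicodedata

# Every code point whose general category starts with "P"
# (Pc, Pd, Ps, Pe, Pi, Pf, Po), across all Unicode planes.
punct_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
]

# re.escape() backslash-escapes the characters that are special in a
# pattern, including \ ] and - which matter inside a character class.
PUNCT_RE = re.compile("[" + re.escape("".join(punct_chars)) + "]")

print(PUNCT_RE.findall("Hello, world! ¿Qué tal? 我是美國人。"))
```

The number of characters collected depends on the Unicode database shipped with the Python version in use, so it will generally differ from the 491 reported for 3.0rc2.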
