Using re to find unicode ranges

  • Eric Abrahamsen

    Using re to find unicode ranges

    Is it possible to use the re module to find runs of characters within
    a certain Unicode range?

    I'm writing a Markdown extension to go over text and wrap blocks of
    consecutive Chinese characters in <span class="char"></span> tags for
    nice styling in an HTML page. The available hooks appear to be a pre-
    processor (which is a "for line in lines" situation) or an inline
    pattern (which uses regular expressions). The regular expression
    solution would be much simpler and faster, but something tells me
    there's no way to use a regex to find character ranges... Chinese
    characters appear to fall between 19968 and 40959 using ord(), and I
    suppose I can go that route if necessary, but I think it would be ugly.

    Any hints or suggestions would be appreciated!

    Eric
  • Paul McGuire

    #2
    Re: Using re to find unicode ranges

    On Sep 29, 8:17 am, Eric Abrahamsen <e...@ericabrahamsen.net> wrote:
    > Is it possible to use the re module to find runs of characters within
    > a certain Unicode range?
    >
    > I'm writing a Markdown extension to go over text and wrap blocks of
    > consecutive Chinese characters in <span class="char"></span> tags for
    > nice styling in an HTML page. The available hooks appear to be a
    > preprocessor (which is a "for line in lines" situation) or an inline
    > pattern (which uses regular expressions). The regular expression
    > solution would be much simpler and faster, but something tells me
    > there's no way to use a regex to find character ranges... Chinese
    > characters appear to fall between 19968 and 40959 using ord(), and I
    > suppose I can go that route if necessary, but I think it would be ugly.
    >
    > Any hints or suggestions would be appreciated!
    >
    > Eric

    Eric -

    This sounds similar to what zhpy
    (http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to
    extract Chinese words from code, to generate executable English
    Python. You might give that a look.

    -- Paul


    • Mark Tolonen

      #3
      Re: Using re to find unicode ranges

      "Eric Abrahamsen" <eric@ericabrahamsen.net> wrote in message
      news:mailman.1674.1222694261.3487.python-list@python.org...
      > Is it possible to use the re module to find runs of characters within a
      > certain Unicode range?
      >
      > I'm writing a Markdown extension to go over text and wrap blocks of
      > consecutive Chinese characters in <span class="char"></span> tags for
      > nice styling in an HTML page. The available hooks appear to be a
      > preprocessor (which is a "for line in lines" situation) or an inline
      > pattern (which uses regular expressions). The regular expression solution
      > would be much simpler and faster, but something tells me there's no way
      > to use a regex to find character ranges... Chinese characters appear to
      > fall between 19968 and 40959 using ord(), and I suppose I can go that
      > route if necessary, but I think it would be ugly.

      # coding: utf-8
      # Python 2: the character class spans U+4E00..U+9FFF (ord 19968..40959)
      import re
      sample = u'My name is 马克. I am 美国人.'
      for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
          print n

      output:

      马克
      美国人

      --Mark
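
For readers on Python 3, where the `ur''` prefix and the `print` statement no longer exist, the same idea might look like the sketch below. It assumes the same code-point range as Mark's pattern; the name `CJK_RUN` is made up for illustration and does not come from the thread.

```python
import re

# U+4E00..U+9FFF corresponds to ord() values 19968..40959,
# the range quoted in the original question.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

sample = 'My name is 马克. I am 美国人.'

# findall returns each maximal run of consecutive characters in the range
runs = CJK_RUN.findall(sample)
for run in runs:
    print(run)
```

In Python 3 every `str` is Unicode, so no `u` prefix is needed, and the `re` module itself interprets the `\uXXXX` escapes inside the raw pattern string.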


      • Eric Abrahamsen

        #4
        Re: Using re to find unicode ranges

        On Sep 29, 11:03 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:
        > "Eric Abrahamsen" <e...@ericabrahamsen.net> wrote in message
        > news:mailman.1674.1222694261.3487.python-list@python.org...
        >
        > > Is it possible to use the re module to find runs of characters within a
        > > certain Unicode range?
        > >
        > > I'm writing a Markdown extension to go over text and wrap blocks of
        > > consecutive Chinese characters in <span class="char"></span> tags for
        > > nice styling in an HTML page. The available hooks appear to be a
        > > preprocessor (which is a "for line in lines" situation) or an inline
        > > pattern (which uses regular expressions). The regular expression solution
        > > would be much simpler and faster, but something tells me there's no way
        > > to use a regex to find character ranges... Chinese characters appear to
        > > fall between 19968 and 40959 using ord(), and I suppose I can go that
        > > route if necessary, but I think it would be ugly.
        >
        > # coding: utf-8
        > import re
        > sample = u'My name is 马克. I am 美国人.'
        > for n in re.findall(ur'[\u4e00-\u9fff]+',sample):
        >     print n

        Of course! And obvious, once you point it out. Thanks for the help.


        > This sounds similar to what zhpy
        > (http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to
        > extract Chinese words from code, to generate executable English
        > Python. You might give that a look.
        > --Mark

        Mark - not quite what I'm after here, but pretty interesting
        nonetheless...

        E
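
The thread stops at finding the runs; the wrapping step from the original question could be sketched with `re.sub` as below. This is only an illustration in Python 3: `wrap_cjk` and `CJK_RUN` are hypothetical names, the `char` class comes from Eric's post, and nothing here uses the Markdown extension hooks he mentions, whose details are not shown in the thread.

```python
import re

# Same range as the pattern posted above: U+4E00..U+9FFF.
# Assumption: ideographs outside this range are not wrapped.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def wrap_cjk(text, css_class='char'):
    """Wrap each run of consecutive characters in the range in a <span>."""
    def repl(match):
        # match.group(0) is the whole run that was found
        return '<span class="%s">%s</span>' % (css_class, match.group(0))
    return CJK_RUN.sub(repl, text)

wrapped = wrap_cjk('My name is 马克. I am 美国人.')
print(wrapped)
```

A replacement function is used rather than a template string so that the class name can be passed in as a parameter; for a fixed class, `CJK_RUN.sub(r'<span class="char">\g<0></span>', text)` would do the same job.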
