Need a Regular expression to remove a char for Unicode text

  • శ్రీనివాస

    Need a Regular expression to remove a char for Unicode text

    Hi friends,
    Can anyone tell me how to remove a character from Unicode text?
    కల్‌&హార is a Telugu word in Unicode. Here I want to
    remove the '&' without replacing it with a zero-width character. One more thing:
    if there is whitespace before and after the '&', the text should
    be left as it is. Please tell me how to do this with regular
    expressions.

    Thanks and regards
    Srinivasa Raju Datla

  • harvey.thomas@informa.com

    #2
    Re: Need a Regular expression to remove a char for Unicode text

    శ్రీనివాస wrote:
    > Hi friends,
    > Can anyone tell me how to remove a character from Unicode text?
    > కల్‌&హార is a Telugu word in Unicode. Here I want to
    > remove the '&' without replacing it with a zero-width character. One more thing:
    > if there is whitespace before and after the '&', the text should
    > be left as it is. Please tell me how to do this with regular
    > expressions.
    >
    > Thanks and regards
    > Srinivasa Raju Datla

    Don't know anything about Telugu, but is this the approach you want?

    >>> import re
    >>> x = u'\xfe\xff & \xfe\xff \xfe\xff&\xfe\xff'
    >>> noampre = re.compile(r'(?<!\s)&(?!\s)', re.UNICODE).sub
    >>> noampre('', x)
    u'\xfe\xff & \xfe\xff \xfe\xff\xfe\xff'

    The regular expression uses negative lookbehind and lookahead
    assertions to check that there is no whitespace on either side of the '&'
    character. Each match found is then replaced with the empty string.
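
For readers trying this on the original Telugu word, here is a minimal sketch of the same lookaround idea, assuming Python 3 (where every str is Unicode). The names AMP_NO_SPACE and strip_tight_ampersands are made up for illustration and do not come from the thread. Note that the pattern leaves an '&' in place when there is whitespace on either side, not only when there is whitespace on both.

```python
import re

# Hypothetical name: matches '&' only when neither neighbour is whitespace.
# The zero width non-joiner (U+200C) is a format character, not whitespace,
# so it does not stop the match.
AMP_NO_SPACE = re.compile(r"(?<!\s)&(?!\s)")


def strip_tight_ampersands(text):
    """Remove every '&' that has no whitespace directly before or after it."""
    return AMP_NO_SPACE.sub("", text)


# The word from the question, spelled out as code points:
# KA, LA, VIRAMA, ZWNJ, '&', HA, VOWEL SIGN AA, RA
word = "\u0c15\u0c32\u0c4d\u200c&\u0c39\u0c3e\u0c30"

cleaned = strip_tight_ampersands(word)
print(len(word), len(cleaned))          # the cleaned word is one code point shorter
print("\u200c" in cleaned)              # the ZWNJ itself is left untouched
print(strip_tight_ampersands("a & b"))  # '&' surrounded by spaces is kept
```

The empty replacement string is what keeps a zero-width character from being inserted in place of the '&'.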


    • Sybren Stuvel

      #3
      Re: Need a Regular expression to remove a char for Unicode text

      శ్రీనివాస enlightened us with:
      > Can anyone tell me how to remove a character from Unicode
      > text? కల్<200c>&హార is a Telugu word in Unicode. Here I want to
      > remove the '&' without replacing it with a zero-width character. One more
      > thing: if there is whitespace before and after the '&', the
      > text should be left as it is.

      So basically, you want to match <200c>& and replace it with <200c>,
      but only if it's not surrounded by whitespace, right?

      r"(?<!\s)\u200c&(?!\s)" should match (on Python 2, write it as a
      ur"..." literal so that the \u200c escape is interpreted). I'm sure
      you'll be able to take it from there.

      Sybren
      --
      Sybren Stüvel
      Stüvel IT - http://www.stuvel.eu/
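
One way the suggestion above could be taken further is sketched below, assuming Python 3.3 or later, where the re module itself understands \uXXXX escapes inside a raw pattern. The capturing group and the names ZWNJ_AMP and drop_amp_after_zwnj are illustrative additions, not part of the original post. The group lets the substitution put the zero width non-joiner back while the '&' is dropped.

```python
import re

# Hypothetical name. Group 1 captures the zero width non-joiner (U+200C)
# so the replacement can keep it; the '&' that follows is discarded.
# The lookarounds reject matches that touch whitespace on either side.
ZWNJ_AMP = re.compile(r"(?<!\s)(\u200c)&(?!\s)")


def drop_amp_after_zwnj(text):
    """Replace '<ZWNJ>&' with '<ZWNJ>' unless whitespace is adjacent."""
    return ZWNJ_AMP.sub(r"\1", text)


# KA, LA, VIRAMA, ZWNJ, '&', HA, VOWEL SIGN AA, RA
word = "\u0c15\u0c32\u0c4d\u200c&\u0c39\u0c3e\u0c30"

print(drop_amp_after_zwnj(word) == word.replace("&", ""))  # '&' removed, ZWNJ kept
print(drop_amp_after_zwnj("a&b"))                          # no ZWNJ before '&': unchanged
```

Unlike the plain '&' pattern, this version only touches an ampersand that directly follows a zero width non-joiner, so it is narrower than what the question strictly asks for.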


      • Leo Kislov

        #4
        Re: Need a Regular expression to remove a char for Unicode text

        On Oct 13, 4:44 am, harvey.tho...@informa.com wrote:
        > శ్రీనివాస wrote:
        > > Hi friends,
        > > Can anyone tell me how to remove a character from Unicode text?
        > > కల్‌&హార is a Telugu word in Unicode. Here I want to
        > > remove the '&' without replacing it with a zero-width character. One more thing:
        > > if there is whitespace before and after the '&', the text should
        > > be left as it is. Please tell me how to do this with regular
        > > expressions.
        > >
        > > Thanks and regards
        > > Srinivasa Raju Datla
        >
        > Don't know anything about Telugu, but is this the approach you want?
        >
        > >>> x = u'\xfe\xff & \xfe\xff \xfe\xff&\xfe\xff'
        > >>> noampre = re.compile(r'(?<!\s)&(?!\s)', re.UNICODE).sub
        > >>> noampre('', x)

        He wants to replace & with zero width joiner so the last call should be
        noampre(u"\u200D", x)


        • Leo Kislov

          #5
          Re: Need a Regular expression to remove a char for Unicode text

          On Oct 13, 4:55 am, "Leo Kislov" <Leo.Kis...@gmail.com> wrote:
          > On Oct 13, 4:44 am, harvey.tho...@informa.com wrote:
          > > శ్రీనివాస wrote:
          > > > Hi friends,
          > > > Can anyone tell me how to remove a character from Unicode text?
          > > > కల్‌&హార is a Telugu word in Unicode. Here I want to
          > > > remove the '&' without replacing it with a zero-width character. One more thing:
          > > > if there is whitespace before and after the '&', the text should
          > > > be left as it is. Please tell me how to do this with regular
          > > > expressions.
          > > >
          > > > Thanks and regards
          > > > Srinivasa Raju Datla
          > >
          > > Don't know anything about Telugu, but is this the approach you want?
          > >
          > > >>> x = u'\xfe\xff & \xfe\xff \xfe\xff&\xfe\xff'
          > > >>> noampre = re.compile(r'(?<!\s)&(?!\s)', re.UNICODE).sub
          > > >>> noampre('', x)
          >
          > He wants to replace & with zero width joiner so the last call should be
          > noampre(u"\u200D", x)

          Pardon my poor reading comprehension, OP doesn't want zero width
          joiner. Though I'm confused why he mentioned it at all.
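
To make the difference between the two calls discussed above concrete, here is a small sketch, assuming Python 3. The names AMP and describe are hypothetical. It runs the same pattern with an empty replacement and with a zero width joiner (U+200D), then lists the code point names so the invisible characters can be told apart.

```python
import re
import unicodedata

# Same lookaround pattern as in the earlier reply; the name AMP is illustrative.
AMP = re.compile(r"(?<!\s)&(?!\s)")

# KA, LA, VIRAMA, ZWNJ, '&', HA, VOWEL SIGN AA, RA
word = "\u0c15\u0c32\u0c4d\u200c&\u0c39\u0c3e\u0c30"

removed = AMP.sub("", word)        # what the question asks for: '&' simply dropped
joined = AMP.sub("\u200d", word)   # the retracted suggestion: '&' becomes a ZWJ


def describe(text):
    """Return the Unicode name of every code point, including invisible ones."""
    return [unicodedata.name(ch) for ch in text]


for label, text in (("removed", removed), ("joined", joined)):
    print(label, len(text), describe(text))
```

Printing the names is useful here because both results can look identical on screen even though one of them carries an extra invisible code point.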
