How to remove words from a text file using re

  • kshw
    New Member
    • Jun 2010
    • 3

    How to remove words from a text file using re

    Hi,

    I'm trying to remove non-stop words from a text file using regular expressions, but it is not working. I used something like ('^[a-z]?or') to avoid removing (or) from the middle of words, e.g. morning.

    Code:
    Temp = [] 
    Original_File = open('out.txt', 'r') 
    Original_File_Content = Original_File.read() Original_File.close() Temp.append("".join(Original_File_Content)) 
    FileString = "".join(Temp) 
    
    p = re.compile( "^[a-z]?is|^[a-z]?or|^[a-z]?in") 
    RemoveWords = p.sub( '', FileString)
    
    print RemoveWords
    Thanks
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    Use the backslash sequence "\b" to limit matches to whole words; it matches only at word boundaries.
    Code:
    >>> s = "Or do it now or in the morning or tomorrow?"
    >>> import re
    >>> wordtoremove = "or"
    >>> patt = re.compile(r'\b%s\b' % (wordtoremove))
    >>> s1 = patt.sub("", s)
    Output:
    Code:
    >>> s1
    'Or do it now  in the morning  tomorrow?'
    >>>
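
The same idea can be extended to several words at once. The following is only a sketch under a few assumptions: the helper name `remove_words` is made up for illustration, matching is made case-insensitive with `re.IGNORECASE` so that a capitalised "Or" is removed too, and leftover runs of spaces are collapsed afterwards. It also shows the pattern from the question for comparison: without `re.MULTILINE`, `^` matches only at the very start of the string, so that pattern leaves this sentence untouched.

```python
import re

def remove_words(text, words):
    """Remove each whole word in `words` from `text`, ignoring case."""
    # re.escape guards against words that contain regex metacharacters.
    alternation = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)
    stripped = pattern.sub("", text)
    # Collapse the runs of spaces left behind by the removed words.
    return re.sub(r" {2,}", " ", stripped).strip()

s = "Or do it now or in the morning or tomorrow?"

# Pattern from the question: every alternative is anchored with ^,
# so it can only ever match at the start of the string.
anchored = re.compile("^[a-z]?is|^[a-z]?or|^[a-z]?in")
unchanged = anchored.sub("", s)

cleaned = remove_words(s, ["is", "or", "in"])
print(cleaned)  # do it now the morning tomorrow?
```

"morning" and "tomorrow" survive because "\b" requires a word boundary on both sides of the match. `print(...)` with a single argument runs on both Python 2 and Python 3.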


    • dwblas
      Recognized Expert Contributor
      • May 2008
      • 626

      #3
      FileString will not contain anything because Temp is empty. You have to spend time on the basics first; there are a lot of tutorials on the web.


      • kshw
        New Member
        • Jun 2010
        • 3

        #4
        Thanks all..

        dwblas, Temp is not empty:

        Temp.append("".join(Original_File_Content))


        • dwblas
          Recognized Expert Contributor
          • May 2008
          • 626

          #5
          Temp is "empty" because the append statement is never reached: several statements are run together on the same line, which Python rejects with a SyntaxError before anything executes, and that particular statement was not even visible unless you scrolled over. It was a hint. Your code should read:
          Code:
          Temp = [] 
          Original_File = open('out.txt', 'r') 
          Original_File_Content = Original_File.read() 
          Original_File.close() 
          Temp.append("".join(Original_File_Content)) 
          FileString = "".join(Temp)
          Also, read() reads the entire file into one string, which I assume was intentional. FileString will therefore contain one string, Temp will contain that same string, and Original_File_Content will also contain that same single string, so join() is unnecessary.
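
To illustrate that last point, here is a minimal sketch of the read step without the list and join(). The helper name `read_text` and the temporary file standing in for out.txt are assumptions made so the example runs on its own; the stop-word pattern is just the three words from the question wrapped in "\b".

```python
import os
import re
import tempfile

def read_text(path):
    # read() already returns the whole file as a single string,
    # so no intermediate list or join() is needed.
    with open(path, "r") as handle:
        return handle.read()

# Throwaway file standing in for out.txt (an assumption for this demo).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("this is in the morning or not\n")

file_string = read_text(path)
os.remove(path)

# Remove the whole words "is", "or" and "in" from the file contents.
cleaned = re.sub(r"\b(?:is|or|in)\b", "", file_string)
print(cleaned)
```

The `with` block closes the file even if read() raises, which replaces the separate Original_File.close() call.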
