How to split a text to words and then filter it in Python?

  • Jonas Peter

    Hi guys,
    I have a question, and I would love some ideas on how to get started with it.

    First of all, I'm using Windows 7 and Python 2.7

    I've got a text file, and I'm trying to read the text from that file and then check every word against the 40 words around it, to make sure the word in question does not appear more than once in that range.

    In other words, I want to first split the text into words, put them in a list and then check [0] against [1] all the way to [39]. Then I want to check [1] against [40], then check [2] against [41] etc.

    Splitting the words is not that hard, I think; I just need to split at every space and every dot. What I am not sure how to do is check each word against the other words in the text.
    Any ideas guys on how that could be done? =)
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    Yes, I have an idea on how it could be done. Split the text into a list of words, convert to lower case and strip any punctuation. Iterate on the list and create a sublist by slicing the list as in words[lowIdx:highIdx]. Adjust the low and high indices as required when near the start and end of the word list. Pop the current word from the sublist. Iterate on the remaining members of the sublist to compare to the current word.

    The best way to learn how to program in Python is to write programs. Try writing the code and post back with your questions.
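
For reference, the slicing approach outlined above could be sketched roughly as follows. This is only one possible reading of it: the function name `find_repeats`, the `window` parameter, and the choice to look 40 words on each side of the current word are assumptions, not something stated in the thread.

```python
import string


def find_repeats(text, window=40):
    """Return (index, word) pairs for words that also occur among their neighbours.

    `find_repeats` and `window` are hypothetical names used for illustration.
    """
    # Split on whitespace, lower-case, and strip punctuation from both ends.
    words = [w.lower().strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation

    repeats = []
    for i, word in enumerate(words):
        # Clamp the slice indices near the start and end of the list.
        low = max(0, i - window)
        high = min(len(words), i + window + 1)
        neighbours = words[low:high]
        neighbours.pop(i - low)  # remove the current word itself
        if word in neighbours:
            repeats.append((i, word))
    return repeats
```

Calling `find_repeats("The cat saw the dog. A bird sang.", window=3)` reports both occurrences of "the", because each one lies within three words of the other.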


    • Jonas P

      #3
      Hi again
      I'll explain further what I want to do. I want to read in a text file and check whether a word appears more than once in the last 40 words. In other words, I want to flag such words by marking them like this: *RandomWordThatAppearedMoreThanOnceInTheLast40Words*.

      Here is the code I've been working on so far. Currently I'm ignoring all the dots, semicolons etc. I just want to get the basics done.


      Code:
      infil = open ('story.txt')
      
      line = infil.readlines()
      
      wordlist = list()
      
      allTheWords = line.split()
      
      
      if string in dictionary:
          dictionary(string) += 1
          else:
              dictionary(string) = 1
      
      
      if len(wordlist) > 40:
          del wordlist[0]

      finishedText = (' ').join(allTheWords)


      • dwblas
        Recognized Expert Contributor
        • May 2008
        • 626

        #4
        You should print "line", "allTheWords", and "wordlist" to see if they contain what you think they do. Also, the indentation for the if and else is incorrect. You can get the first 40 words with
        wordlist[0:40]
        (use wordlist[-40:] for the last 40).
        See Section 14.5 here for an example of reading a file, and then substitute the name of the file you wish to read.
        Some info on lists http://www.greenteapress.com/thinkpy...l/book011.html
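
As a rough illustration of where the code in post #3 seems to be heading, here is a minimal sketch that marks any word already seen among the previous 40 words. The function name `mark_repeats` and the `window` parameter are made up for this example, and punctuation is ignored, as in the original attempt. Note that `readlines()` returns a list of lines, which has no `split()` method, whereas `read()` returns the whole file as one string that can be split.

```python
from collections import deque


def mark_repeats(text, window=40):
    """Wrap a word in asterisks if it occurred among the previous `window` words.

    `mark_repeats` and `window` are hypothetical names used for illustration.
    """
    # A deque with maxlen drops its oldest item automatically, which replaces
    # the manual "if len(wordlist) > 40: del wordlist[0]" bookkeeping.
    recent = deque(maxlen=window)
    marked = []
    for word in text.split():
        key = word.lower()  # compare case-insensitively
        if key in recent:
            marked.append('*' + word + '*')
        else:
            marked.append(word)
        recent.append(key)
    return ' '.join(marked)


# Possible usage with the file from post #3:
#     with open('story.txt') as infil:
#         finishedText = mark_repeats(infil.read())
```

For example, `mark_repeats("a b a c b", window=2)` returns `"a b *a* c b"`: the second "a" is within two words of the first, but the second "b" is not.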
