How to read the text from file and check for repeated words

  • Peter Peter

    Hi guys,

    I've got a text file, and I'm trying to read the text from that file and then check every word against the 40 words around it, to make sure the word in question is not repeated within that range.

    In other words, I want to first split the text into words, put them in a list and then check [0] against [1] all the way to [39]. Then I want to check [1] against [40], then check [2] against [41] etc.

    Splitting the words is not that hard, I think; I just need to split at every space and every dot. What I am not sure how to do is check each word against the other words in the text.
    Any ideas, guys, on how that could be done? =)
    Last edited by MMcCarthy; Nov 2 '10, 07:05 AM. Reason: moved from old thread
  • Arty Breen

    #2
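    The count method of a string should work. If the count is greater than 1, then the word is repeated.

    # text is the string read from the file; start and end are character positions
    # note: count matches substrings, so "the" is also found inside "other"
    text.count("word you want to check", start, end)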

    I believe you can use a loop and string indexing to define the start and end.

    (But all of this could be utter nonsense, I'm pretty new to Python. Sorry if I've led you astray...)
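A minimal sketch of the count idea, applied to a list of words instead of the raw string so that only whole words are matched. The function name `repeated_in_window`, the `window` parameter, and the sample text are made up for illustration; the regular expression is just one possible way to split on spaces and dots.

```python
import re


def repeated_in_window(words, index, window=40):
    """Return True if words[index] occurs more than once in the
    `window` words that start at position `index`."""
    chunk = words[index:index + window]
    return chunk.count(words[index]) > 1


# Sample text standing in for the contents of the file (an assumption).
text = "The cat sat on the mat. The dog sat too."

# Lower-case the text and keep runs of letters/apostrophes as words,
# which drops spaces, dots and other punctuation.
words = re.findall(r"[a-z']+", text.lower())

for i, word in enumerate(words):
    if repeated_in_window(words, i, window=5):
        print(i, word)
```

A loop over every index, as above, supplies the start of each window; the slice takes care of the end, and it simply gets shorter near the end of the list.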


    • dwblas
      Recognized Expert Contributor
      • May 2008
      • 626

      #3
      "In other words, I want to first split the text into words, put them in a list and then check [0] against [1] all the way to [39]. Then I want to check [1] against [40], then check [2] against [41] etc."
      You want to split the text into a list of words (all lower case), sort the list, and compare each word with the next word. An easier way is to convert the list to a set: a set cannot hold duplicate elements, so if the set is shorter than the original list, at least one word is repeated. You can also use a dictionary, with the word as the key, pointing to an integer that counts how many times the word appears.
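The set and dictionary ideas could be sketched roughly as follows, applied to one 40-word chunk at a time so that the check stays local as the original question asks. The helper names (`has_duplicates`, `repeated_words`, `scan`) are hypothetical, and `collections.Counter` is used here as a ready-made counting dictionary rather than a hand-written one.

```python
from collections import Counter


def has_duplicates(chunk):
    # A set drops duplicates, so a shorter set means something repeated.
    return len(set(chunk)) < len(chunk)


def repeated_words(chunk):
    # Counter is a dict subclass: word -> number of occurrences.
    counts = Counter(chunk)
    return {word: n for word, n in counts.items() if n > 1}


def scan(words, window=40):
    # Slide a window over the list one word at a time and report
    # the start position and the repeats of every chunk that has any.
    last_start = max(len(words) - window, 0)
    for start in range(last_start + 1):
        chunk = words[start:start + window]
        if has_duplicates(chunk):
            yield start, repeated_words(chunk)
```

For example, `list(scan("a b c a d e f".split(), window=4))` reports only the first chunk, where "a" appears twice. The set test is a cheap yes/no check; the dictionary is only needed when you want to know which words repeat and how often.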
