I am trying to figure out how, given a text file, you can determine if there are repeating phrases of any given length. I can get the word count, line count and other statistical information but no idea where to even start for trying to find repeating phrases. Any ideas? It does not have to be in C++, I just need some help with picking an algorithm. Looked around on google but nothing showed up. Thought you guys might be able to help. Thanks.
text analysis repeating phrases
Collapse
X
-
Tags: None
-
Originally posted by maxinuruguayI am trying to figure out how, given a text file, you can determine if there are repeating phrases of any given length. I can get the word count, line count and other statistical information but no idea where to even start for trying to find repeating phrases. Any ideas? It does not have to be in C++, I just need some help with picking an algorithm. Looked around on google but nothing showed up. Thought you guys might be able to help. Thanks.
the entire string? Finding a pattern/phrase in a string can be done in O(n) where
n is the length of the pattern. Checking all possible patterns in a string takes
O(n^3) for non-adjacent strings (O(n) for finding one phrase and O(n^2) for checking
all the phrases).
If the phrases are adjacent, this number goes down to O(n^2).
kind regards,
Jos -
Originally posted by JosAHDo these repeating phrases have to be adjacent or can they occur anywhere in
the entire string? Finding a pattern/phrase in a string can be done in O(n) where
n is the length of the pattern. Checking all possible patterns in a string takes
O(n^3) for non-adjacent strings (O(n) for finding one phrase and O(n^2) for checking
all the phrases).
If the phrases are adjacent, this number goes down to O(n^2).
kind regards,
Jos
I think the answer is non adjacent phrases. Let me explain with an example.
the fat cat jumped over the fat cat
Result would be:
the
the fat
the fat cat
I have been looking at regex library trying to find something that could make this task easier but no go. My attempts to work it out by hand have failed. My first attempt was:
take one word, search rest of string for match and if a match is found, search again for word + 1 == match + 1. You loop and stop when when there is no match for word + 1 == match + 1. What is left over is the longest repeating phrase starting at first word.
This process, however, needs to be done with every word in the sentence so it is not efficient at all
My second attempt was to do it incrementally but I got confused and quit L
Do you know of a better approach? Optimizations to the described approach would be useful (using regex to search will probably guarantee me the best search algorithm so I do not have to worry about search optimization). Regards, MaxComment
Comment