text analysis repeating phrases

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • maxinuruguay
    New Member
    • Jun 2007
    • 2

    text analysis repeating phrases

    I am trying to figure out how, given a text file, you can determine if there are repeating phrases of any given length. I can get the word count, line count and other statistical information but no idea where to even start for trying to find repeating phrases. Any ideas? It does not have to be in C++, I just need some help with picking an algorithm. Looked around on google but nothing showed up. Thought you guys might be able to help. Thanks.
  • JosAH
    Recognized Expert MVP
    • Mar 2007
    • 11453

    #2
    Originally posted by maxinuruguay
    I am trying to figure out how, given a text file, you can determine if there are repeating phrases of any given length. I can get the word count, line count and other statistical information but no idea where to even start for trying to find repeating phrases. Any ideas? It does not have to be in C++, I just need some help with picking an algorithm. Looked around on google but nothing showed up. Thought you guys might be able to help. Thanks.
    Do these repeating phrases have to be adjacent or can they occur anywhere in
    the entire string? Finding a pattern/phrase in a string can be done in O(n) where
    n is the length of the pattern. Checking all possible patterns in a string takes
    O(n^3) for non-adjacent strings (O(n) for finding one phrase and O(n^2) for checking
    all the phrases).

    If the phrases are adjacent, this number goes down to O(n^2).

    kind regards,

    Jos

    Comment

    • maxinuruguay
      New Member
      • Jun 2007
      • 2

      #3
      Originally posted by JosAH
      Do these repeating phrases have to be adjacent or can they occur anywhere in
      the entire string? Finding a pattern/phrase in a string can be done in O(n) where
      n is the length of the pattern. Checking all possible patterns in a string takes
      O(n^3) for non-adjacent strings (O(n) for finding one phrase and O(n^2) for checking
      all the phrases).

      If the phrases are adjacent, this number goes down to O(n^2).

      kind regards,

      Jos

      I think the answer is non adjacent phrases. Let me explain with an example.

      the fat cat jumped over the fat cat

      Result would be:

      the
      the fat
      the fat cat

      I have been looking at regex library trying to find something that could make this task easier but no go. My attempts to work it out by hand have failed. My first attempt was:

      take one word, search rest of string for match and if a match is found, search again for word + 1 == match + 1. You loop and stop when when there is no match for word + 1 == match + 1. What is left over is the longest repeating phrase starting at first word.

      This process, however, needs to be done with every word in the sentence so it is not efficient at all

      My second attempt was to do it incrementally but I got confused and quit L

      Do you know of a better approach? Optimizations to the described approach would be useful (using regex to search will probably guarantee me the best search algorithm so I do not have to worry about search optimization). Regards, Max

      Comment

      Working...