Diff for words in string

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Csaba Gabor

    Diff for words in string

    I'm comparing the text of (snippets of) web pages which I expect to be
    quite different or quite similar. In the case where they are similar,
    I would like to display the more recent one and say something like:
    Word 2 added [before word 2 in original]: "Jack be nimble"

    Words 10-11 changed to: "the quick brown fox"
    [from words 9-11 in original]: "the brown fast quick fox"

    Words [20-22 in original] before word 20 removed: "sat in a corner on"


    One way to do this is to replace all spacing chars with \n, write the
    strings to two files, and then run diff (FC on my Win XP Pro = file
    compare), collecting the output. Does anyone happen to have PHP code
    for this where I don't have to write files? Note that the diff is on
    words of the strings and not characters.

    In particular, the normal algorithms for this (longest common
    subsequence) only produce a number, and don't note the differences.

    Also, I wanted to give a threshhold (about 10, say) and if the longest
    common subsequence differs from the shorter string (strings are on the
    order of 100 words) by at least this amount, then simply fail (since
    the difference between the two strings would be deemed to great). This
    should make the algorithm far more efficient. The corresponding
    argument in FC would be /LB10.

    Thanks,
    Csaba Gabor from Vienna

    Ref: http://www.ics.uci.edu/~dan/class/16...6/Dynamic.html

    Note that this problem is distinct from the longest common substring
    problem.

  • Andy Jeffries

    #2
    Re: Diff for words in string

    On Wed, 21 Jun 2006 02:56:37 -0700, Csaba Gabor wrote:[color=blue]
    > One way to do this is to replace all spacing chars with \n, write the
    > strings to two files, and then run diff (FC on my Win XP Pro = file
    > compare), collecting the output. Does anyone happen to have PHP code for
    > this where I don't have to write files? Note that the diff is on words of
    > the strings and not characters.[/color]



    The code in that PEAR library may give you a starting point to
    implementing it in PHP rather than calling fc.

    Cheers,


    Andy

    --
    Andy Jeffries MBCS CITP ZCE | gPHPEdit Lead Developer
    http://www.gphpedit.org | PHP editor for Gnome 2
    http://www.andyjeffries.co.uk | Personal site and photos

    Comment

    • the DtTvB

      #3
      Re: Diff for words in string



      Comment

      • Csaba Gabor

        #4
        Re: Diff for words in string

        Thanks both Andy and Jeff for your suggestions regarding diff / longest
        common subsequence. It got my creative juices flowing and I rolled my
        own, which I didn't see in the literature. There may be some overlap
        with Ukkonen (see
        http://www.csse.monash.edu.au/~lloyd.../Dynamic/Edit/) since
        the complexity appears to have similar overtones. The running time for
        sequences which are highly similar should be linear in the length of
        the strings. The one thing that is missing is an efficient
        dissimilarity test so that one may bail out of the routine early.


        Step 1: For each word in the first string, have an array of where it
        is in the first string

        Step 2: From this it is possible to construct what is known as a
        permutation graph:
        Start $ctr at 0; for each word in string 2 find the array of indeces
        in string 1 and (iterating backwards through that array), unshift
        ++$ctr onto an array at that index point.

        The array of arrays built up in this step should now be collapsed,
        forming a permuation of the integers from 1 ... $ctr. The
        corresponding indeces are assumed to be numbered consecutively starting
        at 1 for string 2. This constitutes a permutation graph (a permutation
        graph is one where the corresponding integers are connected, which
        connections correspond to vertices; and vertices are adjacent iff their
        corresponding segments intersect. This happens iff the order of the
        two integers corresponding to the two segments differs above vs.
        below).

        The relevance to the problem is that any segment from the graph
        indicates a common word in the original sequences. Two segments may be
        in a common subsequence as long as they don't intersect.

        Step 3:
        Thus, the problem is reduced to finding the longest increasing
        monotonic subsequence of the integers constructed in Step 2. I do this
        by dynamic programming, working backwards through the sequence of
        integers. For each position, I keep track of the longest monotonic
        increasing subsequence (rightwards) from that point. I also keep track
        of the maximum longest monotonic increasing subsequence (rightwards)
        from that point as a speedup convenience.

        Step 4:
        Working from the left, reconstruct the sequence (longest increasing
        monotonic subsequence).

        Step 5:
        Recover the mapping that the sequence from Step 4 represents (along
        with the indexes into the original strings for each word). This is the
        longest common subsequence between the two original strings.

        Step 6.
        Using Step 5, we can also determine which words from string 1 didn't
        make it and which words from string 2 were not included, so we can
        easily construct the diff where the words from string one that didn't
        make it will have a strikethrough and the words from string 2 will be
        bolded.

        Csaba Gabor from Vienna

        Comment

        Working...