Diff for words in string

**Andy Jeffries** · Jun 21 '06, 09:55 AM

Re: Diff for words in string

On Wed, 21 Jun 2006 02:56:37 -0700, Csaba Gabor wrote:[color=blue]
> One way to do this is to replace all spacing chars with \n, write the
> strings to two files, and then run diff (FC on my Win XP Pro = file
> compare), collecting the output. Does anyone happen to have PHP code for
> this where I don't have to write files? Note that the diff is on words of
> the strings and not characters.[/color]

Text_Diff

http://pear.php.net/package/Text_Diff/

The code in that PEAR library may give you a starting point to
implementing it in PHP rather than calling fc.

Cheers,

Andy

--
Andy Jeffries MBCS CITP ZCE | gPHPEdit Lead Developer
http://www.gphpedit.org | PHP editor for Gnome 2
http://www.andyjeffries.co.uk | Personal site and photos

**the DtTvB** · Jun 21 '06, 10:05 AM

Re: Diff for words in string

diff example

http://www.holomind.de/phpnet/diff.src.php

**Csaba Gabor** · Jun 22 '06, 11:25 AM

Re: Diff for words in string

Thanks both Andy and Jeff for your suggestions regarding diff / longest
common subsequence. It got my creative juices flowing and I rolled my
own, which I didn't see in the literature. There may be some overlap
with Ukkonen (see
http://www.csse.monash.edu.au/~lloyd.../Dynamic/Edit/) since
the complexity appears to have similar overtones. The running time for
sequences which are highly similar should be linear in the length of
the strings. The one thing that is missing is an efficient
dissimilarity test so that one may bail out of the routine early.

Step 1: For each word in the first string, have an array of where it
is in the first string

Step 2: From this it is possible to construct what is known as a
permutation graph:
Start $ctr at 0; for each word in string 2 find the array of indeces
in string 1 and (iterating backwards through that array), unshift
++$ctr onto an array at that index point.

The array of arrays built up in this step should now be collapsed,
forming a permuation of the integers from 1 ... $ctr. The
corresponding indeces are assumed to be numbered consecutively starting
at 1 for string 2. This constitutes a permutation graph (a permutation
graph is one where the corresponding integers are connected, which
connections correspond to vertices; and vertices are adjacent iff their
corresponding segments intersect. This happens iff the order of the
two integers corresponding to the two segments differs above vs.
below).

The relevance to the problem is that any segment from the graph
indicates a common word in the original sequences. Two segments may be
in a common subsequence as long as they don't intersect.

Step 3:
Thus, the problem is reduced to finding the longest increasing
monotonic subsequence of the integers constructed in Step 2. I do this
by dynamic programming, working backwards through the sequence of
integers. For each position, I keep track of the longest monotonic
increasing subsequence (rightwards) from that point. I also keep track
of the maximum longest monotonic increasing subsequence (rightwards)
from that point as a speedup convenience.

Step 4:
Working from the left, reconstruct the sequence (longest increasing
monotonic subsequence).

Step 5:
Recover the mapping that the sequence from Step 4 represents (along
with the indexes into the original strings for each word). This is the
longest common subsequence between the two original strings.

Step 6.
Using Step 5, we can also determine which words from string 1 didn't
make it and which words from string 2 were not included, so we can
easily construct the diff where the words from string one that didn't
make it will have a strikethrough and the words from string 2 will be
bolded.

Csaba Gabor from Vienna

Diff for words in string

Diff for words in string

Comment

Comment

Comment