LiveCode Forums.

Posted: **Tue Jul 18, 2017 11:12 am**

I am attempting to create a script that will compare two texts and show the differences. The texts will be lines of code and I wish to detect lines that have been added, deleted or changed. I have looked on the internet and think I need to be using something like a "longest common substring" algorithm . The information I have found tends to show routines that return counts and I am having difficulty in understanding how the routine should be applied. However, based on what I have read I believe that I need to build an array of word comparison counts. I have created a table that compares two phrases:

: LCS_Grid.png (15.33 KiB) Viewed 2978 times

The numbers in the table are generated by comparing each word of the original phrase with every word in the revised phrase. When a match is found a count is entered into the cell. This count is the value of the cell diagonally up and left (cell x-1, y-1) plus 1. So "cat" scores 0+1, "on" scores 2+1. The value of 3 means that there is a run of three words common to both phrases ending at word x in the revised phrase. A column of all zeros indicates a new word and a row of zeros indicates a deleted word.

The problem occurs with the false scores assigned to both "A" and "the" shown in red. The "A" has been replaced with the word "The" but the scores are confused by other changes made later on. Does anyone know how to eliminate these false scores: looking at the table it seems that I may have to divide the grid up based on the scores and re-test.

Any thoughts?

Posted: **Tue Jul 18, 2017 2:24 pm**

Hi.

I see how the values along the diagonals are created, as you structured it, but do not know

but the scores are confused by other changes made later on.

what engendered the false positives.

Craig Newman

Posted: **Tue Jul 18, 2017 2:39 pm**

Looking at the first word of the texts the "A" has been replaced with "The" but the original all ready contains the word "the" so a false positive occurs when the check is made.

I have been doing some more reading and realise that I have not used the correct name for the algorithm, it should be "Longest common string". Also, it seems that the way to get the results I require is to iterate over the strings removing the highest common string from the phrases on each iteration and parsing the remaining parts before and after the Longest Common String as two new comparisons.

I'm working a stack which I will post once it starts working on short texts.

Simon

LiveCode Forums.

Comparing two texts

Comparing two texts

Re: Comparing two texts

Re: Comparing two texts