Comparing two texts

Anything beyond the basics in using the LiveCode language. Share your handlers, functions and magic here.

Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller, robinmiller

Post Reply
Simon Knight
Posts: 854
Joined: Wed Nov 04, 2009 11:41 am
Location: Gunthorpe, North Lincs, UK

Comparing two texts

Post by Simon Knight » Tue Jul 18, 2017 11:12 am

I am attempting to create a script that will compare two texts and show the differences. The texts will be lines of code and I wish to detect lines that have been added, deleted or changed. I have looked on the internet and think I need to be using something like a "longest common substring" algorithm . The information I have found tends to show routines that return counts and I am having difficulty in understanding how the routine should be applied. However, based on what I have read I believe that I need to build an array of word comparison counts. I have created a table that compares two phrases:
LCS_Grid.png
LCS_Grid.png (15.33 KiB) Viewed 2450 times
The numbers in the table are generated by comparing each word of the original phrase with every word in the revised phrase. When a match is found a count is entered into the cell. This count is the value of the cell diagonally up and left (cell x-1, y-1) plus 1. So "cat" scores 0+1, "on" scores 2+1. The value of 3 means that there is a run of three words common to both phrases ending at word x in the revised phrase. A column of all zeros indicates a new word and a row of zeros indicates a deleted word.

The problem occurs with the false scores assigned to both "A" and "the" shown in red. The "A" has been replaced with the word "The" but the scores are confused by other changes made later on. Does anyone know how to eliminate these false scores: looking at the table it seems that I may have to divide the grid up based on the scores and re-test.

Any thoughts?
Last edited by Simon Knight on Tue Jul 18, 2017 2:40 pm, edited 1 time in total.
best wishes
Skids

dunbarx
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 9663
Joined: Wed May 06, 2009 2:28 pm
Location: New York, NY

Re: Comparing two texts

Post by dunbarx » Tue Jul 18, 2017 2:24 pm

Hi.

I see how the values along the diagonals are created, as you structured it, but do not know
but the scores are confused by other changes made later on.
what engendered the false positives.

Craig Newman

Simon Knight
Posts: 854
Joined: Wed Nov 04, 2009 11:41 am
Location: Gunthorpe, North Lincs, UK

Re: Comparing two texts

Post by Simon Knight » Tue Jul 18, 2017 2:39 pm

Looking at the first word of the texts the "A" has been replaced with "The" but the original all ready contains the word "the" so a false positive occurs when the check is made.

I have been doing some more reading and realise that I have not used the correct name for the algorithm, it should be "Longest common string". Also, it seems that the way to get the results I require is to iterate over the strings removing the highest common string from the phrases on each iteration and parsing the remaining parts before and after the Longest Common String as two new comparisons.

I'm working a stack which I will post once it starts working on short texts.

Simon
best wishes
Skids

Post Reply

Return to “Talking LiveCode”