Comparing two texts
Posted: Tue Jul 18, 2017 11:12 am
I am trying to write a script that compares two texts and shows the differences. The texts will be lines of code, and I want to detect lines that have been added, deleted or changed. From what I have read online, I think I need something like a "longest common substring" algorithm. The examples I have found tend to show routines that only return counts, and I am having difficulty understanding how such a routine should be applied. Based on what I have read, I believe I need to build an array of word comparison counts. I have created a table that compares two phrases:
The numbers in the table are generated by comparing each word of the original phrase with every word in the revised phrase. When a match is found, a count is entered into the cell. This count is the value of the cell diagonally up and to the left (cell x-1, y-1) plus 1. So "cat" scores 0+1 and "on" scores 2+1. A value of 3 means that there is a run of three words common to both phrases, ending at word x in the revised phrase. A column of all zeros indicates a new word, and a row of all zeros indicates a deleted word.
The problem occurs with the false scores assigned to both "A" and "the", shown in red. The "A" has been replaced with the word "The", but the scores are confused by other changes made later on. Does anyone know how to eliminate these false scores? Looking at the table, it seems that I may have to divide the grid up based on the scores and re-test.
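For anyone who wants to reproduce the grid, the filling rule described above can be sketched in Python roughly as follows. The two phrases are stand-ins made up for illustration (they are not necessarily the ones in my table), the name build_run_table is hypothetical, and the comparison is assumed to be case-insensitive, which is what lets "The" match "the".

```python
def build_run_table(original, revised):
    """Return (rows, cols, table), where table[y][x] is the length of the run
    of matching words ending at rows[y] and cols[x], or 0 if they differ."""
    rows = original.lower().split()   # assumption: case-insensitive compare
    cols = revised.lower().split()
    table = [[0] * len(cols) for _ in rows]
    for y, old_word in enumerate(rows):
        for x, new_word in enumerate(cols):
            if old_word == new_word:
                # Value of the cell diagonally up and to the left, plus 1.
                diagonal = table[y - 1][x - 1] if y > 0 and x > 0 else 0
                table[y][x] = diagonal + 1
    return rows, cols, table


# Hypothetical sample phrases, chosen to show the stray scores.
original = "A cat sat on the mat"
revised = "The cat sat on a mat"

rows, cols, table = build_run_table(original, revised)
for word, scores in zip(rows, table):
    print(f"{word:>4}", scores)
```

With these sample phrases, "cat", "sat" and "on" score 1, 2 and 3 along the diagonal, but the "a" row and the "the" column each also pick up a stray 1 from a word much later in the other phrase, so neither shows up as all zeros.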
Any thoughts?
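To make the "divide the grid up and re-test" idea concrete, here is a rough sketch of what I have in mind. It is not a finished solution and may not be the best approach. It finds the longest run, treats it as unchanged, then repeats the search separately on the words before the run and on the words after it, so a match in the upper-right or lower-left part of the grid can no longer score. The function names and the sample phrases are made up for illustration.

```python
def longest_run(a, b, a_lo, a_hi, b_lo, b_hi):
    """Longest run of equal words within a[a_lo:a_hi] and b[b_lo:b_hi].
    Returns (start_in_a, start_in_b, length); length 0 means no match."""
    best = (a_lo, b_lo, 0)
    prev = {}  # run lengths from the previous row, keyed by column
    for y in range(a_lo, a_hi):
        cur = {}
        for x in range(b_lo, b_hi):
            if a[y] == b[x]:
                n = prev.get(x - 1, 0) + 1  # diagonal cell plus 1
                cur[x] = n
                if n > best[2]:
                    best = (y - n + 1, x - n + 1, n)
        prev = cur
    return best


def diff_words(a, b, a_lo=0, a_hi=None, b_lo=0, b_hi=None):
    """Split the grid around the longest run and re-test each part."""
    a_hi = len(a) if a_hi is None else a_hi
    b_hi = len(b) if b_hi is None else b_hi
    y, x, n = longest_run(a, b, a_lo, a_hi, b_lo, b_hi)
    if n == 0:
        # Nothing in common here: everything old is deleted, everything new inserted.
        return ([("delete", w) for w in a[a_lo:a_hi]]
                + [("insert", w) for w in b[b_lo:b_hi]])
    return (diff_words(a, b, a_lo, y, b_lo, x)               # upper-left part
            + [("same", w) for w in a[y:y + n]]              # the run itself
            + diff_words(a, b, y + n, a_hi, x + n, b_hi))    # lower-right part


# Hypothetical sample phrases, lower-cased so "The" and "the" compare equal.
old_words = "A cat sat on the mat".lower().split()
new_words = "The cat sat on a mat".lower().split()

result = diff_words(old_words, new_words)
for action, word in result:
    print(action, word)
```

On these phrases the stray "a" and "the" matches fall outside the sub-grids that get re-tested, so both are reported as a delete followed by an insert rather than as matches. As far as I can tell, Python's difflib.SequenceMatcher works on a similar principle (longest contiguous match, then recursion on the pieces to the left and right), so it may be a ready-made alternative.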