My first question is: is there an algorithm that already exists for this? If not any thoughts and ideas are appreciated.
Let’s say I have two large text files (original file A and new file B). Each file is English prose text (including dialogue) with a typical size 256K to 500K characters.
I want to compare them to find out how similar the contents of B are to A.
Similar in this case means: all, or a significant part, of B exists in A, with the condition that there may be subtle differences, words changed here and there, or even globally.
In all cases we have to remember that this is looking for similarity, not (necessarily) identity.
Preprocessing for the text:
- Remove all punctuation (and close up gaps “didn’t” -> “didnt”);
- Lowercase everything;
- Remove common words;
- Reduce all whitespace to single space only, but keep paragraphs;
Other possible optimisations to reduce future workload:
Ignore any paragraph of less than a certain length. Why? Because there’s a higher probability of natural duplication in shorter paragraphs (though arguably not in the same overall position).
Have an arbitrary cut-off length on the paragraphs. Why? Mostly because it reduces workload.
For every word, turn it into a Metaphone. So instead of every paragraph being composed of normal words it becomes a list a metaphones which help in comparing slightly modified words.
We end up with paragraphs that look like this (each of these lines is a separate paragraph):
WNT TR0 ABT E0L JRTN TTKTF INSPK WLMS E0L UTRL OBNKS JRL TM RL SRPRS LKT TRKTL KM N SX WRT LT ASK W0R RT WRKS T ST N WLTNT RT 0M AL
But I admit, when it comes to the comparison I’m not sure how to approach it, beyond a brute force take the first encoded paragraph from B (B) and check every paragraph in A looking for a high match (maybe identical, maybe very similar). Perhaps we use Levenshtein to find a match percentage on the paragraphs.
If we find a match at A[n], then check B against A[n+1] and maybe a couple further A[n+2] and A[n+3] just in case something was inserted.
And proceed that way.
What should be detected:
- Near-identical text
- Global proper noun changes
- B is a subset of A