How do we detect plagiarism?

There’s probably many state of the art ways on how one could approach this problem; in this post we’ll explore how we can use standard Python library to do some simple diffs and comparison across large corpuses to fine out when things overlap or don’t overlap.

To use the difflib library to match arbritary strings is quite easy:

import difflib

text1 = 'text1 says hello world because there is only one!'
text2 = 'text2 says hello worlds because there are multiple'

seq = difflib.SequenceMatcher(None, text1, text2)
match_blocks = [x for x in seq.get_matching_blocks() if x.size > 2]

When using get_matching_blocks() method make sure that size is used as a filter otherwise, you’ll get some “no matches” within the results which is pretty pointless. From here one can retrieve the substrings which matched!

for m in match_blocks:
    print(m)
    print("{}:{}:{}".format(text1[:m.a], text1[m.a:m.a+m.size], text1[m.a+m.size:]))
    print("{}:{}:{}".format(text2[:m.b], text2[m.b:m.b+m.size], text2[m.b+m.size:]))
    print("--------------------\n")

This leads to the output:

Match(a=0, b=0, size=4)
:text:1 says hello world because there is only one!
:text:2 says hello worlds because there are multiple
--------------------

Match(a=5, b=5, size=17)
text1: says hello world: because there is only one!
text2: says hello world:s because there are multiple
--------------------

Match(a=22, b=23, size=15)
text1 says hello world: because there :is only one!
text2 says hello worlds: because there :are multiple
--------------------

From here we can begin to extract similar phrases and text simply using the standard library in Python.