Self-hosted Plagiarism Detection with OpenSearch
While building an LMS, I needed plagiarism detection that did not depend on external APIs.
Two-stage approach
First, find candidates with more_like_this:
search = cls.search().filter(
    "nested",
    path="answers",
    query={"term": {"answers.question_id": str(question_id)}},
)
search = search.exclude("term", user_id=user_id)
search = search.query(
    "nested",
    path="answers",
    query={
        "more_like_this": {
            "fields": ["answers.answer"],
            "like": text,
            "min_term_freq": 1,
            "minimum_should_match": "1%",
        }
    },
)
response = search.execute()
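For anyone not using the DSL wrapper, here is a rough sketch of the request body that this search corresponds to. The function name build_candidate_query is made up for illustration, and the JSON the DSL library actually emits may be nested a little differently (for example, in how it wraps the exclude clause). The field names are the ones from the snippet above.

```python
def build_candidate_query(question_id, user_id, text):
    """Build a search body for stage one (hypothetical helper, not from the original code)."""
    return {
        "query": {
            "bool": {
                # Scoring clause: answers whose text shares terms with `text`
                "must": [
                    {
                        "nested": {
                            "path": "answers",
                            "query": {
                                "more_like_this": {
                                    "fields": ["answers.answer"],
                                    "like": text,
                                    "min_term_freq": 1,
                                    "minimum_should_match": "1%",
                                }
                            },
                        }
                    }
                ],
                # Non-scoring filter: only documents that answered this question
                "filter": [
                    {
                        "nested": {
                            "path": "answers",
                            "query": {
                                "term": {"answers.question_id": str(question_id)}
                            },
                        }
                    }
                ],
                # Never compare a user against their own submission
                "must_not": [{"term": {"user_id": user_id}}],
            }
        }
    }


body = build_candidate_query(question_id=42, user_id="u1", text="some answer text")
```

The low minimum_should_match keeps this stage recall-oriented: it only has to surface plausible candidates, and the second stage does the actual scoring.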
Then re-rank with character n-grams:
import re

def normalize(t):
    # Drop all whitespace so reformatting does not hide copied text
    return re.sub(r"\s+", "", t.strip())

def char_ngrams(t, n=3):
    return set(t[i:i+n] for i in range(len(t) - n + 1))

norm_text = normalize(text)
text_ngrams = char_ngrams(norm_text)

for hit in response.hits:
    norm_answer = normalize(hit.answer)
    answer_ngrams = char_ngrams(norm_answer)
    intersection = len(text_ngrams & answer_ngrams)
    union = len(text_ngrams | answer_ngrams)
    # Jaccard similarity as a percentage; 0 if both texts are shorter than n
    ratio = int((intersection / union) * 100) if union else 0
    if ratio >= 60:
        pass  # flag as similar
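To make the scoring concrete, here is a small self-contained sketch of the same Jaccard calculation on plain strings, with no OpenSearch involved. The function name similarity_percent is hypothetical. It wraps the normalize and char_ngrams steps from above and returns 0 when neither text is long enough to produce an n-gram.

```python
import re


def normalize(t):
    # Remove all whitespace, so spacing and line breaks do not affect the score
    return re.sub(r"\s+", "", t.strip())


def char_ngrams(t, n=3):
    # Set of all overlapping character n-grams in t
    return set(t[i:i + n] for i in range(len(t) - n + 1))


def similarity_percent(a, b, n=3):
    """Jaccard similarity of character n-grams, as an integer percentage."""
    a_ngrams = char_ngrams(normalize(a), n)
    b_ngrams = char_ngrams(normalize(b), n)
    union = len(a_ngrams | b_ngrams)
    if union == 0:
        # Both texts are shorter than n characters after normalization
        return 0
    return int(len(a_ngrams & b_ngrams) / union * 100)


# Same text with different whitespace scores as identical
print(similarity_percent("the cat sat", "the  cat\nsat"))  # 100

# {"abc", "bcd"} vs {"abc", "bce"}: 1 shared out of 3 total
print(similarity_percent("abcd", "abce"))  # 33
```

Because the comparison works on sets of n-grams, repeated passages count only once, and the score is symmetric in the two texts.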
Works decently. The 60% threshold came from trial and error.
Self-hosted, simple ops, reuses existing search infrastructure.