DEV Community

Discussion on: How do I implement a plagiarism detector completely from scratch and it has to be fast as well?

JoelBonetR 🥇 • Edited

If you only need to catch copy-pasted text, ordinary text-comparison algorithms are fine: measure how much a paragraph differs from what you have crawled into your DB, and set a similarity threshold above which it gets flagged as plagiarism.
If you want better performance, pick keywords to index, keep an in-memory copy of the DB (if it suits your budget), and so on. Pretty standard.
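
A minimal sketch of that idea in Python, using only the standard library. Everything named here is an assumption for illustration: the `corpus` dict stands in for the crawled DB, `SIMILARITY_THRESHOLD` and `min_shared` are made-up values you would have to tune, and `difflib.SequenceMatcher` is just one of many possible comparison algorithms.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical cut-off: tune it against your own data.
SIMILARITY_THRESHOLD = 0.8


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """In-memory keyword index: word -> ids of stored paragraphs containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in set(tokenize(text)):
            index[word].add(doc_id)
    return index


def candidates(paragraph, index, min_shared=3):
    """Ids of stored paragraphs sharing at least min_shared distinct words."""
    counts = defaultdict(int)
    for word in set(tokenize(paragraph)):
        for doc_id in index.get(word, ()):
            counts[doc_id] += 1
    return [doc_id for doc_id, count in counts.items() if count >= min_shared]


def similarity(a, b):
    """Word-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def find_matches(paragraph, corpus, index, threshold=SIMILARITY_THRESHOLD):
    """Compare only against indexed candidates and keep scores above the threshold."""
    hits = []
    for doc_id in candidates(paragraph, index):
        score = similarity(paragraph, corpus[doc_id])
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy stand-in for the crawled DB.
corpus = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank",
    "b": "Preheat the oven to 180 degrees and bake the cake for forty minutes",
}
index = build_index(corpus)
hits = find_matches("The quick brown fox jumps over the lazy dog near the river", corpus, index)
```

The index is what keeps it fast: the expensive comparison only runs against paragraphs that already share a few words with the input, not against the whole DB.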

I remember reading something about that involving ML, but that's a bit tricky because there's a limited number of ways you can explain something.

I.e. there's a limited number of synonyms for a word, a limited number of meanings for an idiom, and so on.

Then each discipline has its own flexibility: there are more ways to explain how gravity works than there are to explain a cooking recipe step by step.

That's to say that training an AI to correctly spot plagiarism without reporting false positives is pretty hard.

Hope it helps somehow

SaptakBhoumik

Thanks for the suggestion :)