My girlfriend teaches Portuguese to students aged 10 to 14 (we live in Brazil, mind you, so Portuguese classes here are analogous to English classes in the US/UK).
Because of the pandemic, some students are still taking classes at home, which also means taking tests at home.
Since it isn't feasible to supervise students while they take these tests, cheating has become a much bigger problem, not only because students may look up answers on the internet or share them with one another, but also because these tests are administered via online forms, which makes copying and pasting answers much easier.
Now here comes the funny part: you'd think that these students, when sharing answers with one another, would at the very least rewrite them in their own words to disguise their cheating attempt, but what they did instead was mostly copy and paste answers with little to no modification!
By noticing this pattern she could then start to identify cheaters.
If you want to find out whether a given student has shared his answers with classmates, you have to compare his test with each of the other students' tests, and then repeat this process for every student.
There are about 50 students that took these tests, which means that the first test is compared with the other 49 tests, the second test doesn't need to be compared with the first one, but still has to be compared with the remaining 48 tests, and so on.
We're basically dealing with the sum of the terms of an arithmetic progression with common difference 1 (49 + 48 + ... + 1), so if we have n students, we'll need to perform (n^2 - n)/2 comparisons.
For 50 students this amounts to 1225 comparisons, and since the test has 5 questions, we're talking about 6125 comparisons.
Clearly, doing all these comparisons by hand would be a very tedious and time-consuming task, but thankfully she is dating a programmer.
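As a quick sanity check on those numbers, here is a minimal Python sketch. The helper name pair_count is made up for illustration; it simply evaluates the formula above and cross-checks it against an explicit enumeration of student pairs.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n students: (n^2 - n) / 2."""
    return (n * n - n) // 2


students = 50
questions = 5

# Cross-check the closed form by actually enumerating every pair.
enumerated = sum(1 for _ in combinations(range(students), 2))
assert pair_count(students) == enumerated

print(pair_count(students))              # 1225
print(pair_count(students) * questions)  # 6125
```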
At first sight, comparing sentences may seem quite simple, as our first instinct might be to compare them character by character. This, however, is not a good approach: even if we normalized the sentences before comparing them (by removing leading/trailing spaces, collapsing extra spaces between words, converting them to lowercase, etc.), the only thing it can tell us is whether the two answers are the same or not.
This naive approach has no gradient, no nuance. It only gives us a "yes" or "no" answer to the question of whether the sentences are equal, but it can't tell us how similar they are, and it only takes a minuscule difference between them to turn a "yes" into a "no".
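To illustrate how brittle this is, here is a small Python sketch of the naive comparison. The function names normalize and naive_equal and the three sample strings are invented for this example; the normalization shown (lowercasing and collapsing whitespace) is just one plausible choice.

```python
def normalize(answer: str) -> str:
    """Lowercase the answer and collapse every run of whitespace to one space."""
    return " ".join(answer.lower().split())


def naive_equal(first: str, second: str) -> bool:
    """All-or-nothing comparison: True only if the normalized texts match exactly."""
    return normalize(first) == normalize(second)


a = "Because there are microorganisms"
b = "  because  there are MICROORGANISMS "
c = "Because there are microorganism"  # a single letter is missing

print(naive_equal(a, b))  # True: only spacing and case differ
print(naive_equal(a, c))  # False: one missing letter flips the result
```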
Thus, we need a better way to do this.
Turns out there's a very neat algorithm called LCS (Longest Common Subsequence) which is perfect for this job and that programmers use almost every day as it is the basis for the diff algorithm, which is used extensively in git.
For instance, suppose the question is "Why do we need to wash our hands before eating?" and one answer is "Because there are microorganisms that when ingested may harm us" and the other is "We wash our hands because there are microorganisms that when ingested might harm us".
These two answers are worded in a very similar way, which might indicate that cheating has taken place; however, they are not exactly equal, as the second answer starts by restating part of the question and swaps the word "may" for "might".
The naive approach would flag these answers as different and wouldn't notice that the second answer, even though it is not exactly equal to the first, is a slight variation of it.
That is where LCS comes in handy, as it has a much cleverer way of comparing sentences, and when run at word level on the aforementioned sentences, it yields this:
The words in black are the ones that are common to both answers, the ones in green are the ones that were added to the second answer and the ones in red are the ones that were removed.
Notice that this algorithm not only tells us which words are common, were added or removed, but it also tells us where these insertions and deletions took place.
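For readers who want to try this themselves, the following is a minimal Python sketch of a word-level LCS diff. It is an illustration, not the code actually used for the tests: the names lcs_table and word_diff are made up, and it assumes answers are lowercased and split on whitespace before comparison. Words are tagged "=" (common), "+" (added in the second answer), or "-" (removed from the first).

```python
def lcs_table(a, b):
    """Return a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def word_diff(first, second):
    """Walk the LCS table and emit (tag, word) pairs in reading order."""
    a = first.lower().split()
    b = second.lower().split()
    table = lcs_table(a, b)
    ops = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append(("=", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))  # word only in the first answer
            i += 1
        else:
            ops.append(("+", b[j]))  # word only in the second answer
            j += 1
    # Whatever is left over on either side is a pure deletion or insertion.
    ops.extend(("-", word) for word in a[i:])
    ops.extend(("+", word) for word in b[j:])
    return ops


first = "Because there are microorganisms that when ingested may harm us"
second = ("We wash our hands because there are microorganisms "
          "that when ingested might harm us")

for tag, word in word_diff(first, second):
    print(tag, word)
```

On the two example answers, this marks "we", "wash", "our", "hands" and "might" as added, "may" as removed, and the remaining nine words as common.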
Back to the tests: on the one hand, administering them via online forms makes it easier to cheat, but on the other hand it also makes it possible to export the answers as a CSV file, which for the uninitiated is in many ways like an Excel spreadsheet and, most importantly, is very easy to manipulate programmatically.
With this in mind, once I had all the answers exported as a CSV file all I did was run the LCS algorithm and then present the results in a very simple web page so that it was easier to visually inspect them.
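A rough sketch of what such a pipeline could look like in Python is shown below. Everything specific in it is an assumption made for illustration: the column names "student" and "q1", the three invented rows, the helper names, and the similarity score (twice the LCS length divided by the total number of words), which is just one reasonable way to rank pairs. It prints plain text rather than building a web page.

```python
import csv
import io
from itertools import combinations

# Invented sample export; real data would be read from the exported file.
EXPORT = """student,q1
Ana,Because there are microorganisms that when ingested may harm us
Bruno,We wash our hands because there are microorganisms that when ingested might harm us
Carla,Dirty hands can carry germs into our food
"""


def lcs_length(a, b):
    """Length of the longest common subsequence, keeping one table row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(first, second):
    """Score in [0, 1]: 2 * LCS / total words (an assumed metric)."""
    a, b = first.lower().split(), second.lower().split()
    if not a or not b:
        return 0.0  # blank answers are never flagged
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def rank_pairs(rows, question):
    """Score every pair of students on one question, most similar first."""
    scored = [
        (similarity(r1[question], r2[question]), r1["student"], r2["student"])
        for r1, r2 in combinations(rows, 2)
    ]
    return sorted(scored, reverse=True)


rows = list(csv.DictReader(io.StringIO(EXPORT)))
for score, s1, s2 in rank_pairs(rows, "q1"):
    print(f"{s1} / {s2}: {score:.2f}")
```

With this sample, the Ana / Bruno pair comes out on top with a score of 0.75, while the other two pairs score close to zero.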
In order to respect the students' privacy I won't be able to show you the actual data, but there are still a lot of interesting things that can be said without violating it.
The first thing we noticed is that there are mainly two kinds of questions: the ones that have an "exact answer" and thus whose answers have little to no variation and the ones whose answer may be worded in various different ways.
Of course, there's no way to know if there were cheating attempts for the first kind so we'll concentrate on the second.
For questions of the second kind, even though most of the time there is a right answer (i.e. they are not open-ended questions), almost all answers were worded in completely different ways.
Obviously there are some sequences of words that are common to all of them, but nevertheless the "core" of each answer varies greatly from one answer to the next.
In the cases where answers were almost or exactly equal, we were wary of relying on this fact alone, as it could be a false positive: there was the very unlikely yet non-zero possibility that these answers were equal by pure chance.
However, upon further investigation, we discovered that every time a set of equal answers was found, the authors of these answers were closely connected (e.g. brothers, sisters, close friends) and some of them were already known for previous cheating attempts.
With all this information my girlfriend was then able to successfully identify all cheating attempts (or at least the most blatant ones).
Now, regarding what happened to the cheaters, I'm not allowed to disclose this information, but let's just say that they weren't very happy, nor were their parents.