By Joshua Benton
It’s the sort of case you might expect Encyclopedia Brown to tackle.
Two kids seem to have cheated on Professor Harpp’s final exam. Can he prove the culprits did it – before it’s too late?
But when McGill University professor David Harpp suspected some of his students were up to no good, he didn’t hire a boy detective for a shiny new quarter. He did the job himself.
He devised a statistical method to determine whether two students were copying test answers from each other. He found that, on a 98-question multiple-choice test, the pair of students had 97 answers exactly the same – including 23 wrong answers.
Confronted with the evidence, the students confessed.
To the untrained observer, it may seem strange that cheating can be reliably detected with statistics, formulas and math – which is exactly what Texas officials have hired an outside firm to do. But decades of research around the world have produced methods that prove quite effective at smoking out cheaters in ways even the best proctors often can’t.
“We had always worried that cheating was happening, but we had to find a way to figure out who was doing it,” said Chris McManus, a professor of psychology and medical education at University College London who hunts cheaters.
In Texas, the test-security firm Caveon has identified 699 Texas schools – nearly 10 percent of the state’s total – where cheating may have occurred. The Texas Education Agency is planning how it will deal with those schools, some of which will be the targets of a full state investigation.
Caveon used several methods to look for bad behavior. But the most common problem it found was classrooms or schools where a group of students had identical or nearly identical answer sheets, suggesting they may have copied from one another.
That method of detecting cheating has a long history and, according to researchers in the field, does a good job of identifying the right suspects. And those researchers say Texas could do a better job of preventing students from cheating in the first place.
Academics have come up with dozens of methods, dating back nearly a century. They differ in details, but nearly all are founded on one key principle: It’s rare for a pair of students to make exactly the same mistakes on a multiple-choice test.
Having lots of identical correct answers, of course, doesn’t raise red flags. If the correct answer to Question 21 is “B,” you’d expect many students to choose it. But when two students get many questions wrong in exactly the same way – particularly if few other classmates made the same mistakes – it can be a sign something is up.
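That intuition can be sketched with a toy probability calculation – an illustration only, not any researcher’s published formula. Assume each question has five options and that two students who both miss a question pick independently, at random, among the four wrong choices:

```python
# Toy model (illustration only): five answer options per question;
# two students who both miss a question each choose one of the
# four wrong options independently and uniformly at random.
p_match_one = 1 / 4  # chance their wrong answers coincide on a single item

def p_all_match(shared_wrong: int) -> float:
    """Probability two independent students give the SAME wrong answer
    on every one of `shared_wrong` jointly missed questions."""
    return p_match_one ** shared_wrong

print(p_all_match(3))   # a few shared errors: quite plausible by chance
print(p_all_match(23))  # 23 identical wrong answers: astronomically unlikely
```

Under these toy assumptions, matching on three shared errors happens by chance more than 1 percent of the time; matching on 23, as in the McGill case, essentially never does. Real methods use more careful models, but the logic is the same.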
The earliest known statistical test for cheating was explained in a 1927 scholarly paper. During an exam, four students were observed acting suspiciously. The instructor thought to examine the number of wrong answers the four had in common with each other on the 149-item test.
The suspects had 31, 28, 25 and 17 wrong answers, respectively, in common. The average for the rest of the class: only four.
The students denied cheating at first, but three “quickly confessed guilt when confronted with the evidence,” the paper’s author wrote.
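The 1927 instructor’s comparison can be sketched in a few lines – a reconstruction of the idea as the article describes it, not the paper’s exact procedure:

```python
def wrong_in_common(a: str, b: str, key: str) -> int:
    """Count questions two students answered identically AND incorrectly."""
    return sum(1 for x, y, k in zip(a, b, key) if x == y != k)

def class_baseline(suspect: str, others: list[str], key: str) -> float:
    """Average wrong-answers-in-common between one student and the rest
    of the class - the yardstick against which suspects were judged."""
    counts = [wrong_in_common(suspect, other, key) for other in others]
    return sum(counts) / len(counts)

# Tiny hypothetical class: key, two suspects with identical sheets,
# two unrelated classmates with their own scattered errors.
key = "AAAAA"
suspect_1, suspect_2 = "BBBAA", "BBBAA"
classmates = ["ABAAA", "AAABA"]

print(wrong_in_common(suspect_1, suspect_2, key))   # errors shared by the pair
print(class_baseline(suspect_1, classmates, key))   # typical overlap with the class
```

A suspect pair sharing far more errors than the class average – 31 versus four, in the 1927 case – is what the instructor treated as evidence.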
Dr. Harpp is a chemistry professor at McGill, one of the most prestigious universities in Canada. He didn’t know much about cheating research in 1989, when one of his students told him that two peers had shared answers on a final exam.
“I felt I was being ripped off, which I was,” Dr. Harpp said.
The student worried about being a rat. But he agreed to tell Dr. Harpp the suspects’ names on the condition that no disciplinary action would be taken based solely on his accusation. Dr. Harpp had to find a statistical way to detect copycats.
He and a colleague, Jim Hogan, wrote a computer program that, for every possible pair of students, compared the number of identical wrong answers with the number of questions the pair answered differently. Sure enough, the results for the two students stood out.
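The pairwise comparison described above can be sketched as follows. This is a minimal illustration of the idea – identical wrong answers weighed against the number of answers that differ – with a made-up scoring threshold; it is not Dr. Harpp’s actual program:

```python
from itertools import combinations

def suspicious_pairs(answers: dict, key: str, threshold: float = 1.0):
    """For every possible pair of answer sheets, compare identical wrong
    answers against the number of questions the pair answered differently.
    Pairs with a high ratio are flagged for human review. The threshold
    here is an arbitrary assumption for illustration."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(answers.items(), 2):
        same_wrong = sum(1 for x, y, k in zip(a, b, key) if x == y != k)
        differ = sum(1 for x, y in zip(a, b) if x != y)
        score = same_wrong / max(differ, 1)  # guard against division by zero
        if score >= threshold:
            flagged.append((name_a, name_b, same_wrong, differ))
    return flagged

key = "ABCDABCDAB"
sheets = {
    "pat": "ABCDABCDAB",  # all correct
    "sam": "ABDDACCDAB",  # two independent errors
    "lee": "BBDDACCDAB",  # repeats sam's two errors, adds one of its own
}
print(suspicious_pairs(sheets, key))  # only the sam/lee pair stands out
```

Only the pair sharing wrong answers while differing on almost nothing gets flagged; an honest student who merely sat nearby does not.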
But the analysis found more – 18 suspect students in all. Dr. Harpp then obtained exam seating charts from the registrar’s office. It turned out that every suspect pair had sat together on test day.
“I was flummoxed,” he said. “I thought: What am I going to do?”
He gathered data from dozens of other exams and found the same patterns. His method consistently found that the results for between 3 percent and 8 percent of students were suspicious – a range that is common in cheating research. And in each instance, the pairs were seated together.
The results were considered strong enough that McGill now uses Dr. Harpp’s method to check for cheating on all the university’s multiple-choice final exams.
“We want to create an ethos of academic integrity on campus,” said Morton Mendelson, McGill’s deputy provost. “Our goal is not to catch cheaters. Our goal is to persuade students they really do want to be honest. And that if they’re not honest, they will be caught.”
Like Dr. Harpp, Chris McManus does his cheating research as a sideline to his main academic work, which is training British doctors. For a study published in The British Medical Journal last year, Dr. McManus looked at more than 11,000 people who had taken the postgraduate exams given to aspiring British pediatricians.
The exams were taken at dozens of testing sites in Britain and around the world, on many dates. The study used a computer program of Dr. McManus’ design, Acinonyx. (Showing no field is immune to puns, the name comes from Acinonyx jubatus, the scientific name of the cheetah.)
The analysis found 13 pairs of answer sheets that suggested copying. The computer had no way of knowing where or when the tests were taken. But test records confirmed that all 13 pairs had taken the examination at the same locations and on the same days.
Seating charts had been kept for only six of the 13 pairs, but they showed that all six were sitting next to each other on test day. It was slam-dunk evidence of cheating.
“We went back and sat in some of the rooms, and it was very obvious that one could cheat if one wanted to,” Dr. McManus said.
The great risk of a statistical tool is the possibility of false positives. The best way to protect against them is to seek supporting evidence, such as seating charts. Texas does not require schools to record where students sit on test day, nor does it mandate how they must be arranged.
To guard against mistakes, researchers with different methods compare their findings to test their validity. While little is known about the specifics of Caveon’s methods in Texas, most approaches broadly agree on which students they flag.
“All of these methods are very much consistent for students with a high level of suspicion,” said George Wesolowsky, a professor of management at McMaster University who has created his own formula. “It’s on the marginal pairs where methods might disagree.”
For instance, Dr. Harpp has run much of his testing data through a popular formula designed by a professor at Virginia Tech. The two methods are substantially different, he said, but they nearly always flag the same students as suspicious.
“We wound up with the same conclusions the vast majority of the time,” he said. “I can’t say it’s totally foolproof, but it’s pretty, pretty close.”
At McGill, Dr. Harpp has encountered many cheating schemes in the years since 1989. One echoed the famous restaurant shooting scene in The Godfather: A student recorded an exam’s answers on his graphing calculator, then hid it in a bathroom for another student to retrieve.
But the number of cheaters showing up in Dr. Harpp’s statistical samples has plummeted over the past decade. That doesn’t necessarily mean students have become more honest. Instead, the university has used Dr. Harpp’s findings to change the way final exams are administered.
First, student seating is assigned randomly on exam day, so students can’t pick their neighbors. That almost completely eliminates cooperative cheating, in which one student agrees to secretly supply answers to a friend.
In some cases, exams in different subjects are given in the same room, with each subject assigned to alternating rows of desks. That way, the people seated next to you aren’t even being tested on the same subject.
Second, professors prepare different, scrambled versions of their exams. The questions are identical, but they appear in a different order. That makes a glance at another student’s answer sheet useless to an aspiring cheater, because his neighbor’s Question 6 may be his Question 39.
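Producing scrambled versions is mechanically simple. The sketch below is a generic illustration of the practice, not McGill’s system: each version shuffles the master question order and keeps a map back to the master numbering so every version can be graded against one key.

```python
import random

def scramble_exam(questions: list, seed: int):
    """Return a scrambled version of an exam plus a map from each
    scrambled position back to the master numbering, so all versions
    can be graded against a single answer key."""
    order = list(range(len(questions)))
    random.Random(seed).shuffle(order)  # seeded: reproducible per version
    version = [questions[i] for i in order]
    return version, order  # order[scrambled_pos] == master_pos

master = ["Q1 text", "Q2 text", "Q3 text", "Q4 text", "Q5 text"]
version_a, map_a = scramble_exam(master, seed=1)
version_b, map_b = scramble_exam(master, seed=2)
# A glance at a neighbor's sheet is now useless: their Question 1
# may correspond to your Question 4.
```

Grading simply translates each student’s answers back through the version’s map before comparing against the master key.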
“A scrambled exam isn’t that difficult to do,” Dr. Harpp said. “The temptation to cheat is removed, in a quiet, dignified way.”
Since those reforms were put into place, answer copying has nearly disappeared from his statistical samples, he said.
At McGill, Dr. Harpp said, a statistical finding is considered strong evidence in disciplinary proceedings. But for a student to earn official sanction from the university, it generally must be supported by other evidence such as a seating chart showing the students next to each other.
But not every place is as aggressive as McGill. On the British pediatrics test Dr. McManus studied, none of the students identified as cheaters were punished – despite strong statistical evidence and proof the offenders were seated next to one another.
“Colleges are very unwilling to act on the data,” he said. “If you don’t get a confession, there’s not much you can do about it.”
One problem: Knowing that two answer sheets were identical doesn’t tell you whether both parties were in on the cheating. Did one simply have a good view of an innocent’s answer sheet?
Students are also quick to argue that their similar wrong answers are simply the result of studying with their friends – or being taught poorly by the same teacher.
“That’s a crock,” Dr. Harpp said. If that were true, he said, you’d find suspiciously similar answer sheets for friends sitting apart as often as for friends sitting together.
That’s not what the numbers show, he said. Best friends, identical twins, and even a student taking the same exam twice – they all produce answer sheets different enough to avoid a flag, as long as they aren’t sitting next to each other.
In addition, few organizations that rely on testing are excited about uncovering cheaters in their midst. Dr. McManus initially wanted to study a different medical exam. But the organization that produced it was unwilling to let the results be published, he said, forcing his shift to the group that gives the pediatrics test.
“Most exam boards see it as an indictment of themselves,” he said. “They don’t like to publicize it.”
It can be an uphill battle to convince people that statistics can spot human foibles, often better than human investigators. “If you’re trying to investigate immoral or illegal activities,” he said, “people don’t tell you the truth.”