Can you really detect cheating through statistics?

The science of detecting too-similar answer sheets is decades old and well accepted among psychometricians – the people who design standardized tests. It has been the subject of dozens of academic papers in respected journals. Statistical methods similar to The News’ are used to detect cheating on major national tests like the SAT and are sometimes used to invalidate a test taker’s scores. At least one university (McGill in Montreal) has invalidated scores on final exams based on a statistical cheating analysis. Different statistical methods generally flag the same students for suspicious scores. And experts say the methods are reliable windows into the scale of cheating on school campuses.

One weakness of the statistical analysis is that it cannot detect in what direction the cheating occurred. In other words, if Johnny and Jimmy are flagged, the analysis cannot tell which one cheated off the other. Ideally, in an investigation, schools would have supporting evidence, such as a seating chart showing that a flagged pair sat next to one another on test day. Current Texas testing rules do not require schools to keep seating charts – or even to record in which classrooms students took the TAKS.

How do we know to trust the data? Couldn’t you just be flagging random kids? Isn’t there a risk of a false positive?

That’s always a risk. Looking for cheaters is a bit like DNA analysis: It can’t identify a match with 100 percent certainty. It can only say that the chance of a false positive is very, very small. The News’ analysis was designed to minimize that chance. It was based on the detection methodology of cheating researcher George Wesolowsky and used very conservative assumptions. According to Dr. Wesolowsky, those assumptions should result in a completely innocent school being falsely flagged fewer than once in every 10,000 cases.

Here’s another test of the effectiveness of the methodology. At one point, Dr. Wesolowsky deliberately entered answer sheets from more than 100 schools into his computer program without telling it which students went to which schools. He then asked the program to determine which pairs of students had answer sheets that suggested cheating.

If the program were flagging kids willy-nilly – that is, if it weren’t catching true cases of collusion between students or adults – you’d expect only a small fraction of the pairs it found to be from within the same school. But that wasn’t the case. The program flagged 8,548 different pairs of students in that data. Of those, only 57 paired students from different schools. In other words, without knowing where students were, the program flagged pairs within the same school 99.3 percent of the time. (None of those cross-school pairs is included in The News’ analysis, although many of them connected students from nearby schools, leaving open the possibility that text messaging was used to cheat in those cases.)
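
The arithmetic behind that percentage, and the contrast with random flagging, can be checked with a short Python sketch. The pair counts come from the figures above. The random-flagging baseline assumes, purely for illustration, 100 equal-size schools of 200 test takers each; those two numbers are hypothetical and are not from Dr. Wesolowsky's data.

```python
from math import comb

# Figures reported in the article.
total_pairs = 8548
cross_school_pairs = 57


def same_school_share(total, cross):
    """Fraction of flagged pairs whose two students attend the same school."""
    return (total - cross) / total


def random_same_school_share(num_schools, students_per_school):
    """Chance that a pair drawn at random from the pooled students shares a school.

    Assumes every school has the same number of students, which is an
    illustrative simplification and not a property of the real data.
    """
    within = num_schools * comb(students_per_school, 2)
    overall = comb(num_schools * students_per_school, 2)
    return within / overall


observed = same_school_share(total_pairs, cross_school_pairs)
baseline = random_same_school_share(100, 200)  # hypothetical school sizes

print(f"Observed same-school share: {observed:.1%}")
print(f"Share expected from random flagging: {baseline:.1%}")
```

Under those assumed school sizes, random flagging would put both students in the same school about 1 percent of the time, compared with the 99.3 percent the program actually produced.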

Couldn’t these kids have all the same answers because they studied together? Or couldn’t they have had a bad teacher who taught them all the wrong answers?

Experts say those aren’t valid reasons for the sort of identical answers found in The News’ study. First, kids study together in every Texas school – but two-thirds of all Texas schools had not even a single student flagged for cheating. If studying together led to flagging, you’d expect flagging to be much more common than it is. In fact, a number of studies have found that studying together does not actually lead to markedly increased similarity among students’ answer sheets.

Second, if teachers were teaching the material incorrectly, you’d expect the entire class (or close to it) to get those questions wrong. That’s not true in the vast majority of cases found in The News’ analysis. The most common form of cheating entailed a small group of students who had identical wrong answers that differed significantly from the rest of their class.

Perhaps most important, prior studies have shown that statistical detection correlates almost perfectly with where students sit. When seating positions are known, students with too-similar answer sheets are found to be seated next to one another in every or nearly every case. In other words, neither studying together nor improper teaching leads to flags, but sitting within cheating range does.
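
To make the idea of "identical wrong answers" concrete, here is a minimal Python sketch that counts them for a pair of answer sheets. The answer key and the three sheets are invented for illustration. The simple count is only a stand-in: it is not Dr. Wesolowsky's method, whose details are not given here.

```python
def identical_wrong_answers(key, sheet_a, sheet_b):
    """Count questions where both students chose the same incorrect answer."""
    return sum(
        1
        for correct, a, b in zip(key, sheet_a, sheet_b)
        if a == b and a != correct
    )


# Hypothetical 10-question multiple-choice test (not real TAKS data).
answer_key = "ABCDABCDAB"
student_1 = "ABDDACCDBB"  # misses questions 3, 6 and 9
student_2 = "ABDDACCDBB"  # same three misses, same wrong choices
student_3 = "ABCAABCDAC"  # misses questions 4 and 10

print(identical_wrong_answers(answer_key, student_1, student_2))
print(identical_wrong_answers(answer_key, student_1, student_3))
```

In this made-up example the first pair shares three identical wrong answers and the second pair shares none. A real analysis would also have to weigh how likely such matches are by chance, which this count does not attempt.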