Faking the Grade: About the analysis

By Holly K. Hacker and Joshua Benton
Staff Writers

Page 21A

The Dallas Morning News wanted to check whether cheating occurs on the Texas Assessment of Knowledge and Skills. So it turned to an expert in the field: George Wesolowsky, a professor of management science at McMaster University in Canada.

Nearly a decade ago, Dr. Wesolowsky developed a software program that uses statistical methods to detect cheating on multiple-choice exams. (An academic paper about his method is available on dallasnews.com.) He used the program to analyze 2005 and 2006 TAKS answer sheets for evidence that students may have copied answers from each other. The News analyzed those results to see how much cheating occurs across various schools, grades and subjects.

Here’s how Dr. Wesolowsky’s program works. Let’s say two students take a multiple-choice test with 50 questions, and answer 48 identically. The program calculates the chances that could happen if they were answering independently, with no cheating. It examines how common those shared answer choices are among other students. Sharing only popular right answers won’t trigger red flags – but a long string of uncommon identical wrong answers could. If the odds of such a match arising by chance are extremely low, the students’ answers are flagged as suspect.
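The idea can be illustrated with a rough sketch. This is not Dr. Wesolowsky’s program. It is a simplified illustration that assumes each student picks an answer independently, with a probability equal to that answer’s share among all test takers. The function names are hypothetical, and the sketch only counts matching answers; it does not separately weigh how rare each shared wrong answer is, as the real method does.

```python
from collections import Counter


def match_probabilities(answer_sheets):
    """Per question, the chance two independent students give the same answer.

    Simplifying assumption (not from the article): each student picks a
    choice with probability equal to that choice's share among all sheets.
    """
    n_questions = len(answer_sheets[0])
    probs = []
    for q in range(n_questions):
        counts = Counter(sheet[q] for sheet in answer_sheets)
        total = sum(counts.values())
        # Two independent students match if both pick the same choice.
        probs.append(sum((c / total) ** 2 for c in counts.values()))
    return probs


def prob_at_least(match_probs, k):
    """Probability of k or more matching answers when questions are independent."""
    dist = [1.0]  # dist[j] = probability of exactly j matches so far
    for p in match_probs:
        new = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * (1 - p)   # no match on this question
            new[j + 1] += mass * p     # match on this question
        dist = new
    return sum(dist[k:])
```

Calling prob_at_least(match_probabilities(sheets), 48) on a 50-question test would give the chance of 48 or more identical answers under these assumptions. A question on which nearly everyone picks the same answer contributes a high match probability, so agreement there is unsurprising; agreement on questions where answers are scattered is what drives the probability down.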

If two students are flagged, it doesn’t mean both are cheaters. In many cases, one could be the innocent victim of the other’s wandering eyes.

Dr. Wesolowsky’s method considers several factors, including the difficulty of each question and how the entire class performed. Other researchers in the field said that Dr. Wesolowsky’s is the best or among the best methods for cheating detection yet devised.

Tests examined

Using open records laws, The News requested answer data from the 2005 and 2006 TAKS tests for all public schools in the state. The data covered grades three through 11 and included student responses on each test. (The Texas Education Agency withheld information for about 20 percent of students because of federal privacy laws.)

Student names and other identifying information were not included.

The analysis examined reading and math answers in each grade, plus social studies and science for grades eight, 10 and 11.

Unusual cases

Dr. Wesolowsky’s method assumes that most students in a school are not cheating, so exceptions to the rule stand out. But in a small number of cases, many students taking a test had very similar wrong answers. That could indicate widespread cheating. Many of those schools were among the group the test-security firm Caveon had considered most suspicious in its analysis.

Dr. Wesolowsky analyzed those unusual cases again, this time lumping them with dozens of other schools. That allowed cheating at each school to be properly detected, because the larger pool included enough noncheaters from other schools for the model to work.

To review the results of the analysis, The News turned to two more experts: David Harpp, a professor at McGill University, and Robert Frary, a professor emeritus at Virginia Tech.

The two men examined the results independently, and both supported Dr. Wesolowsky’s findings. Dr. Harpp also ran his own analysis of several schools using a different method, and it pointed to the same conclusions.

It’s important to note that the purpose of The News’ study was not to make cases against specific individuals, but to estimate the extent of cheating in Texas schools.

Faking the Grade: A conservative estimate

By Joshua Benton and Holly K. Hacker
Staff Writers

Page 21A

The News’ analysis found more than 50,000 students whose answer sheets on the TAKS in 2005 and 2006 appear to have been involved in cheating. But there are several reasons to believe that underestimates the problem – perhaps severely.

The News’ data did not include about 20 percent of the state’s answer sheets.

The Texas Education Agency withheld those students’ scores because of federal privacy laws. If those missing kids were flagged at the same rate as the rest of the state, about 12,000 additional tainted answer sheets would have been detected.
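The arithmetic behind that estimate can be sketched as follows. The inputs are the rounded figures quoted in this story ("more than 50,000" and "about 20 percent"), so the result only approximates the published number; the function name is hypothetical.

```python
def extrapolate_missing(flagged, withheld_share):
    """Estimate flagged sheets among the withheld data.

    Assumes the withheld sheets would be flagged at the same rate as the
    sheets that were actually analyzed.
    """
    observed_share = 1 - withheld_share
    # flagged / observed_share is the implied statewide total;
    # the withheld portion is withheld_share of that total.
    return flagged * withheld_share / observed_share


# Rounded figures from the story, not the exact counts used by The News.
estimate = extrapolate_missing(flagged=50_000, withheld_share=0.20)
print(round(estimate))
```

With these rounded inputs the estimate lands near 12,500, in line with the roughly 12,000 additional answer sheets cited above.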

The News’ analysis used conservative assumptions at each step of the way.

The News and researcher George Wesolowsky set a high threshold for how similar student answer sheets had to be to be flagged for cheating. The intention was to minimize the chances of a false positive – students being flagged improperly.

The threshold for most of the analysis was set so that the probability that a school with no cheaters will have students flagged is less than about one in 10,000. That means there has to be extensive copying before a pair of students is flagged. The side effect is that some students – including those who copied only a few answers – will go undetected.
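To see why a school-level threshold demands extensive copying, consider one common way to translate it into a cutoff for each pair of students. The story does not say how the threshold was actually derived; the union bound (Bonferroni correction) below is a standard statistical technique swapped in purely for illustration, and the function name is hypothetical.

```python
def per_pair_threshold(n_students, school_level_rate=1e-4):
    """Per-pair probability cutoff under a union bound.

    Illustrative assumption: if every pair is flagged only when its match
    probability falls below this cutoff, the chance that a cheat-free school
    has any pair flagged is at most school_level_rate.
    """
    if n_students < 2:
        raise ValueError("need at least two students to form a pair")
    n_pairs = n_students * (n_students - 1) // 2
    return school_level_rate / n_pairs


# A test taken by 186 students has 17,205 possible pairs, so each pair
# must clear a far stricter bar than one in 10,000.
cutoff = per_pair_threshold(186)
```

Under this assumption, the more students who take a test, the more pairs there are to compare, and the more extreme a single pair’s similarity has to be before it is flagged.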

The methodology used by The News is substantially more conservative than the one used by the test-security firm Caveon in its analysis of Texas scores.

As a for-profit company, Caveon keeps much about its methods secret. But by analyzing the technical parts of its report on Texas, it’s possible to see how many students the company flagged for possible cheating on the 2005 TAKS. In its search for improper collaboration among students, Caveon consistently found more cheating than The News’ analysis did.

Caveon provided comparable data for only five tests. In 11th-grade math in 2005, Caveon flagged 6.1 percent of all answer sheets statewide. The News flagged 1.7 percent. In sixth-grade reading, Caveon flagged 1.9 percent of answer sheets. The News flagged 0.4 percent.

The methods used by The News don’t identify students or teachers who copy mostly correct answers.

The News’ method looks for students who share large numbers of unusual incorrect answers. But if students – or teachers – copy only right answers, resulting in perfect or near-perfect scores, it’s unlikely they’ll be detected.

For example, at Jesse Jackson Academy last year, 11th-graders did very poorly on the science test. Nearly every student bombed the test with almost identical wrong answers, seemingly copied from a single – and very bad – source. More than 90 percent of the students were flagged for cheating; only 5 percent passed.

But on social studies tests, Jackson’s students were superstars. At 10th grade, students had the highest average score of any school in the state – beating out even the state’s best schools. Thirty-two students did not miss a single question. (To put that in context, on the same test at Dallas’ School for the Talented and Gifted, which is about the same size, only three students had perfect scores.) And in 11th grade – the same group of students who bombed the science test – Jackson had the 16th-highest score in the state, out of nearly 1,500 high schools.

Those are remarkable results for a school that has earned the state’s lowest rating in six of the last seven years and caters primarily to recovered dropouts. But because the students got so few questions wrong on the social studies tests, none of those answer sheets was flagged.

There is no way to know how many students might be missed by The News’ analysis because they cheat effectively.

The News’ analysis uses only one detection method.

Caveon, for instance, used four different methods, including ones that look for unexplained sudden gains in performance and high levels of erasures on answer sheets. The News’ analysis looks only for unusually high similarity among answer sheets. That’s the cheating-detection method with the most support from researchers, but it leaves behind cheaters who might be detected through other means.

Faking the Grade: Common questions about the analysis

Page 20A

Can you really detect cheating through statistics?

The science of detecting too-similar answer sheets is decades old and well accepted among psychometricians – the people who design standardized tests. It has been the subject of dozens of academic papers in respected journals. Statistical methods similar to The News’ are used to detect cheating on major national tests like the SAT and are sometimes used to invalidate a test taker’s scores. At least one university (McGill in Montreal) has invalidated scores on final exams based on a statistical cheating analysis. Different statistical methods generally flag the same students for suspicious scores. And experts say the methods are reliable windows into the scale of cheating on school campuses.

One weakness of the statistical analysis is that it cannot detect in what direction the cheating occurred. In other words, if Johnny and Jimmy are flagged, the analysis cannot tell which one cheated off the other. Ideally, in an investigation, schools would have supporting evidence, such as a seating chart showing that a flagged pair sat next to one another on test day. Current Texas testing rules do not require schools to keep seating charts – or even to record in which classrooms students took the TAKS.

How do we know to trust the data? Couldn’t you just be flagging random kids? Isn’t there a risk of a false positive?

That’s always a risk. Looking for cheaters is a bit like DNA analysis: It can’t identify a match with 100 percent certainty. It can only say that the chance of a false positive is very, very small. The News’ analysis was designed to minimize that chance. It was based on the detection methodology of cheating researcher George Wesolowsky and used very conservative assumptions. According to Dr. Wesolowsky, those assumptions should result in a completely innocent school being falsely flagged less than once out of every 10,000 cases.

Here’s another test of the effectiveness of the methodology. At one point, Dr. Wesolowsky purposefully entered answer sheets from more than 100 schools into his computer program – without telling it which students went to which schools. He then asked his program to determine which pairs of students had answer sheets that suggested cheating.

If the program was flagging kids willy-nilly – that is, if it wasn’t catching cases of true collusion between students or adults – you’d expect only a small fraction of the pairs it found to be from within the same school. But that wasn’t the case. The program flagged 8,548 different pairs of students out of that data. Of those, only 57 featured students from different schools. In other words, without knowing where students were, the program flagged pairs within the same school 99.3 percent of the time. (None of those cross-school pairs are included in The News’ analysis – although many of them connected pairs of students from nearby schools, leaving open the possibility that text messaging was used to cheat in those cases.)
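The figures in that check can be reproduced with a short calculation. The pair counts come from the story; the school sizes in the baseline are hypothetical, chosen only to show roughly what random flagging would look like across 100 schools.

```python
flagged_pairs = 8_548       # pairs flagged in the pooled run (from the story)
cross_school_pairs = 57     # flagged pairs spanning two schools (from the story)

# Share of flagged pairs whose two students attend the same school: about 0.993.
same_school_share = (flagged_pairs - cross_school_pairs) / flagged_pairs


def expected_same_school_share(school_sizes):
    """Share of all possible student pairs that fall within a single school.

    This is the same-school share one would expect if pairs were flagged
    purely at random, with no regard to where students were enrolled.
    """
    total = sum(school_sizes)
    all_pairs = total * (total - 1) / 2
    within = sum(n * (n - 1) / 2 for n in school_sizes)
    return within / all_pairs


# Hypothetical baseline: 100 schools with 200 test takers each.
baseline = expected_same_school_share([200] * 100)
```

With 100 equally sized schools, only about 1 percent of randomly chosen pairs would come from the same school, compared with the 99.3 percent the program actually found.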

Couldn’t these kids have all the same answers because they studied together? Or couldn’t they have had a bad teacher who taught them all the wrong answers?

Experts say those aren’t valid reasons for the sort of identical answers found in The News’ study. First, kids study together in every Texas school – but two-thirds of all Texas schools had not even a single student flagged for cheating. If studying together led to flagging, you’d expect flagging to be much more common than it is. In fact, a number of studies have found that studying together does not actually lead to markedly increased similarity among students’ answer sheets.

Second, if teachers were teaching the material incorrectly, you’d expect the entire class (or close to it) to get those questions wrong. That’s not true in the vast majority of cases found in The News’ analysis. The most common form of cheating entailed a small group of students who had identical wrong answers that differed significantly from the rest of their class.

Perhaps most important, prior studies have shown that statistical detection correlates almost perfectly with where students sit. When seating positions are known, students with too-similar answer sheets are found to be seated next to one another in every or nearly every case. In other words, neither studying together nor improper teaching leads to flags – but sitting within cheating range does.

Faking the Grade: Failing to catch cheaters: State says it’s addressed the problem, but News uncovers over 50,000 cases on TAKS

By Joshua Benton and Holly K. Hacker
Staff Writers

Page 1A

First of three parts

Tens of thousands of students cheat on the TAKS test every year, including thousands on the high-stakes graduation test, according to an in-depth data analysis by The Dallas Morning News.

The analysis – among the first of its kind on this scale – found cases where 30, 50 or even 90 percent of students had suspicious answer patterns that researchers say indicate collusion, either between students or with school staff. Perpetrators go almost entirely undetected and unpunished by state officials.

The study contradicts the Texas Education Agency’s stance that cheating on the TAKS is extraordinarily rare and that the agency has done a good job of policing it. Many schools with big cheating problems, including some in North Texas, have officially been cleared by recent state investigations – in most cases simply by proclaiming their innocence on a state questionnaire.

The findings also show that on a high-stakes test like the TAKS – which can determine a school’s reputation, a teacher’s salary and whether a student walks across the stage on graduation day – some people will seek whatever advantage they can find.

“What we have here in many of the schools, particularly charter schools, is rampant cheating involving many students,” said David Harpp, a professor at Montreal’s McGill University who studies cheating and reviewed the analysis.

What the study found

The study examined statewide scores from 2005 and 2006 on the all-important Texas Assessment of Knowledge and Skills – the state test given in grades three through 11. Some of the key findings:

* The test scores of more than 50,000 students show evidence of cheating. Some of those students were the innocent victims of others copying their answers. But experts say most were likely either deliberately copying answers or had their answer sheets doctored by school staff.

* That total is a small percentage of all Texas students. (Two-thirds of Texas schools showed no evidence of cheating.) But the suspicious scores are focused on the state’s 11th-grade tests. Those are the ones students must pass to earn a diploma.

At more than 100 high schools, at least one in 10 juniors was flagged for having extremely suspicious answer patterns on the TAKS graduation tests. Many of those students graduated last month.

* Cheating is concentrated in the state’s two largest districts – Dallas and Houston – and in charter schools.

Even after accounting for their larger size, cheating is more than three times as common in Dallas and Houston as it is in the state’s other large urban school districts. In Dallas, one out of every six high school juniors was flagged for cheating in 2006.

And in the state’s lightly regulated charter schools – which are funded with tax dollars but run by private companies or groups – cheating was detected at almost four times the rate of traditional public schools. Cheating was more common at underachieving schools, where the pressure to boost scores is the highest.

* Most of the cheating appears to be driven by students copying off of each other, in pairs or small groups. But at a handful of the most flagrant schools, cheating is systemic. On several subject tests, one Houston charter school had 80 percent or more of its answer sheets flagged for cheating – a scale that seems difficult to contemplate without the passive or active involvement of an adult.

“The evidence of substantial cheating is beyond any reasonable doubt,” said George Wesolowsky, a professor at McMaster University in Canada who studies cheating on multiple-choice tests like the TAKS. He worked with The News on the analysis, which used his methodology to identify pairs of student answers that were, statistically, too similar to each other to be the result of chance.

Officials at the Texas Education Agency have consistently argued that statistical analysis can’t prove cheating and that they must rely on other forms of evidence – like getting teachers to confess to misbehavior – in their investigations. TEA decided not to use data drawn from student answer sheets – even with evidence of widespread copying in a classroom.

That approach has not been fruitful. The agency has cleared 98 percent of the schools in its recent round of investigations, in most cases because school officials did not volunteer knowledge of improprieties. Many of those schools were found to have widespread cheating in The News’ analysis.

State officials have said they are willing to reverse course and consider using statistical methods in the future. “I’m certainly open to the idea,” said Criss Cloudt, the TEA associate commissioner who recently assumed oversight over the state’s testing program.

Different school officials had different reactions to The News’ findings.

“I’m not going to dispute the methodology,” Dallas Superintendent Michael Hinojosa said. “Your study came to the conclusions on what seem like reasonably objective measures.”

He said that after suspicions were raised about TAKS cheating last year, Dallas instituted new test-security policies for this spring’s tests. There must now be two adults in every classroom, and their doors must be kept open during testing. Extra monitors were assigned to schools with suspected problems. Those reforms and others, he hopes, will reduce incidents of cheating from the levels found in 2005 and 2006.

“We’ve had issues regarding our assessment program,” he said. “That’s why we decided to change our protocols.”

Houston school officials, in contrast, issued a statement calling the analysis part of a “continued effort by The Dallas Morning News to dismiss the real academic progress in Texas schools.” The statement said there is “absolutely no evidence” of cheating in Houston schools.

Researchers say The News’ study raises serious questions about the legitimacy of the state’s methods of evaluating schools, which in some cases have given public praise – and promised hard cash – to schools with major cheating problems.

They say that on a test with such high stakes – for every level of the educational system – confronting cheating honestly can be difficult. The incentives for improved scores are strong; those for vigorously fighting cheating are weaker.

“People often don’t want to know what’s happening,” said Robert Frary, a professor emeritus of educational measurement at Virginia Tech who has studied cheating for more than 30 years.

Established methods

The News’ analysis was based on a well-established method for detecting answer-copying developed by Dr. Wesolowsky. Research in these methods dates back more than 80 years; variations of them are used to detect cheating on tests like the SAT, the ACT and some college final exams.

“Some of the methods work better than others, but they all work pretty well,” Dr. Frary said. “Wesolowsky’s is one of the best, maybe the best.” Dr. Frary is considered by some to be the modern godfather of the field, having studied it since the 1970s.

Methods like Dr. Wesolowsky’s look for pairs of students who share unusually high numbers of uncommon answers. A few shared answers won’t trigger any alarm bells. But extreme cases – when the run of identical and unusual answers stretches so long as to move past the boundaries of mere chance – lead to the pair of students being flagged.

Take the case of “Sara” and “Joe,” two students at Dallas’ South Oak Cliff High. (They’re real students, but those aren’t their real names.) In 2005, as juniors, Sara and Joe took the science portion of the graduation TAKS. Out of 55 questions, they answered 51 the same way.

That might not be unusual if they were answering them correctly. After all, the correct answer to a TAKS question is almost always the most popular response. But they consistently gave the same unusual wrong answers. In Sara and Joe’s case, their answers are so unusually similar to each other that their pattern would appear naturally among the school’s innocent students fewer than once in 277 million cases.

“I can’t think of any other plausible explanation except cheating,” Dr. Wesolowsky said. Dr. Harpp agreed, calling it “an extreme case of arrogant collusion.”

Sara and Joe weren’t the only South Oak Cliff juniors flagged for cheating on the science test. So were 31 others, The News’ analysis found. In all, 77 answer sheets on the 2005 graduation tests were flagged. (The graduation test is also given in English language arts, math and social studies.)

Administrators at several Dallas high schools, including South Oak Cliff, referred questions to district headquarters. But students at South Oak Cliff and other area high schools said they definitely hear about cheating on the TAKS, especially by kids who haven’t studied or have missed lots of classes. That’s not shocking, since in national surveys a majority of teenagers routinely report they have cheated on tests in school.

“It actually makes me angry that they do that, because other people work hard and they study and all that stuff,” said Alma Gonzalez, who just graduated from South Oak Cliff.

She said students would whisper answers to each other on TAKS day, especially when a teacher left the room for a moment. A few times when classmates asked her for answers, she said, she gave wrong ones on purpose.

“They think they can get their way through high school by just cheating, and they’re not really learning anything,” she said.

Caveon analysis

In 2005, after a series of articles in The News about TAKS cheating, TEA hired the test-security firm Caveon to analyze that year’s test scores. Caveon identified 700 schools whose scores it considered suspicious for one or more reasons, including having too many students with answer sheets suspiciously similar to one of their schoolmates.

In that analysis, South Oak Cliff was one of the schools Caveon flagged. But TEA announced in December that it had cleared South Oak Cliff, along with nearly 600 other schools, of any wrongdoing. That decision was based on the contents of questionnaires filled out by each school’s administrators about their test-security practices.

“It is with great pride and pleasure that we are now able to exonerate a large majority of the schools flagged by the Caveon report,” state Education Commissioner Shirley Neeley said in a prepared statement at the time. “It is imperative that Texans trust our test results and have confidence that they are valid and reliable.”

One member of the TEA panel that reviewed those questionnaires said they were not useful in determining whether cheating had actually taken place.

“That’s basically what this questionnaire process was about: asking schools, ‘Did you cheat or not?’ ” the panel member said. “We weren’t given anything else to go on – no statistical data.”

Currently, only 12 of the 700 schools Caveon identified remain under investigation. No schools, so far, have been cited for even a single incident of cheating.

Meanwhile, according to The News’ analysis, cheating continued at South Oak Cliff. In 2006, 36 percent of the school’s juniors had at least one answer sheet flagged as suspicious on the graduation TAKS.

Forest Brook

South Oak Cliff isn’t the only school cleared by TEA despite an apparent cheating problem.

The school that Caveon found the most suspicious was Houston’s Forest Brook High School, in the North Forest school district. Caveon’s analysis flagged the school in 52 ways, in every TAKS subject area. In particular, Caveon flagged Forest Brook repeatedly for having lots of students with answer sheets very similar to those of their peers.

After site visits and a paperwork review, TEA cleared Forest Brook. The state’s report on its investigation states that TEA did not examine any student answer sheets or use the data produced by Caveon’s analysis. Instead, agency officials interviewed district officials about whether testing procedures were followed on test day. Forest Brook’s leaders denied any wrongdoing.

TEA officials also accepted Forest Brook’s explanation for drawing Caveon’s attention: Teachers had made a “concentrated effort” to prepare students for the TAKS test, and school officials believed the way they had boxed student answer sheets could have triggered a Caveon flag.

Caveon’s work contract does not allow company officials to comment on their findings in Texas. But The News’ analysis found rampant answer copying on the graduation test at Forest Brook – and on a scale unmatched in Texas.

On the 2005 science test, for example, 93 of the 186 answer sheets were flagged for copying. That’s the highest number of tainted answer sheets on a single test at any school in the state. An additional 56 sheets were flagged on the graduation exams in the three other subjects tested.

Forest Brook – a historically poor-performing school – got a passing rate of 95 percent on the science test that year. That was up from 54 percent the year before.

Or, to put it another way: Forest Brook jumped from 23 percentage points below the state average to 14 points above it.

Dr. Harpp did his own analysis of the North Forest data – using a method different from Dr. Wesolowsky’s. He said the data “clearly shows that massive collusion took place” and that North Forest’s explanation was “completely unconvincing.”

“To dismiss this mountain of evidence merely on the word of a few teachers saying they did everything by the book defies all logic,” he said. “In effect, the TEA is certifying that it is more reasonable to believe that nature has completely deviated from its course than that someone has told a lie.”

For some schools, TAKS scores mean money. In recent years, a number of state programs have begun to reward schools and their teachers for good test scores. Forest Brook received a $165,000 state grant this year; the school’s eligibility depended in part on its 2005 TAKS scores.

North Forest representatives did not return multiple phone calls seeking comment last week.

In all, The News’ analysis found 112 schools where at least 10 percent of the answer sheets on a 2005 TAKS test were flagged for cheating. Of those, four are still under investigation by state officials. Another 33 were never flagged by Caveon in the first place, and thus were not part of the TEA investigation. The remaining 75 have been declared cheating-free by state officials.

Sophisticated tactics

Not all cheating schools are created equal.

At many schools, the students identified in The News’ analysis are in isolated pairs – the sort of pattern you might expect if the adults in a school are trying their best, but still don’t keep a close enough eye on each student to prevent one from sneaking answers off a neighbor.

More serious is the pattern common in many Dallas and Houston high schools on the graduation test. In those schools, there still isn’t the pattern one might expect if adults were actively doctoring answer sheets. But the amount of answer copying is large enough that it appears test proctoring is loose.

In 2006, for instance, 17.6 percent of Dallas juniors were flagged for cheating. So were 13.3 percent of Houston juniors. (The statewide average was 4.1 percent.)

“There’s always cheating going on, even when it isn’t the TAKS test,” said Priscilla Ramirez, a rising senior at Adamson High School. She, like all the students interviewed, said she doesn’t cheat. But she hears about students who do.

“It’s crazy how smart people are about cheating,” she said. “If it’s not one way, it’s another.”

Many Dallas-area high school students said they knew of no cheating. But many others said the tools of prospective cheaters have grown beyond the traditional to include text messaging and other electronic forms. Some tactics sound like urban legends – such as kids signaling question numbers and answers with prearranged finger codes – but students swear they’re real.

“It’s getting good enough where the teachers don’t notice it,” said Krysha Bluitt, who just finished her sophomore year at A. Maceo Smith High in Dallas.

Students say there is enormous pressure to do well on the TAKS. Performance on the test can have major impacts on the lives of students, teachers, and administrators. For adults, it can mean bonuses or raises. For schools, too many bad scores can mean permanent closure.

“From day one, when you get there, you’re there to pass the TAKS,” said Ulysses Hauxwell, who just graduated from North Dallas High.

Stephanie Westbrook, acting principal at A. Maceo Smith High, said that the school takes test security seriously and that she knows of no cheating on the TAKS. “Honestly, we try to teach our students about integrity, and it is made very clear to our teachers that [cheating] does not happen under their watch,” she said. District officials also send staffers to campuses on test days to provide “an extra set of eyes,” she said.

Other Dallas principals contacted by The News either denied there was any cheating on their campuses or declined comment. A Fort Worth official said that district is unaware of any cheating at its schools flagged by The News’ study.

The News’ analysis found 67 cases where a Dallas ISD high school had at least 10 percent of its answer sheets flagged for cheating. Those cases included nearly all of the district’s nonmagnet high schools. (A high school typically gives 10 TAKS tests, and The News looked at scores for two years.)

But in a small number of schools, the answer patterns were so off-kilter that Dr. Wesolowsky had to adapt his methodology to properly examine them, since his method is based on the assumption that most students are being tested honestly.

“This is completely outrageous,” Dr. Harpp said of the most extreme cases. “This is so mind-boggling – it requires a new language to describe.”

‘A useful tool’

TEA officials said they did not feel comfortable evaluating The News’ analysis without examining it more thoroughly. But they expressed somewhat less skepticism about the use of statistical analysis than agency officials have over the past year.

“Statistics can be a very useful tool to point you in the right direction,” said Michael Donley, TEA’s inspector general.

Despite that, he said he felt the agency had been correct to rely on interviews – and to exclude statistical evidence – in its recent investigations.

“I couldn’t prove it,” he said of accusations at Caveon-identified schools. “I tried. We talked to everyone we could think to talk to. Our investigators are pretty good at getting at when people are telling the truth.”

But after repeatedly saying that statistical analysis was not a legitimate tool in investigating cheating, officials said they would now consider using a methodology similar to The News’.

“If it works, we would absolutely look at it,” said Dr. Cloudt, the TEA associate commissioner. She assumed oversight of the state’s testing program earlier this year after the state’s assessment director, Lisa Chandler, was forced out.

No matter how the agency moves ahead, Dr. Harpp said there is no doubt in his mind that the cheating found in Texas is real and, in some places, systemic.

“At some point you have to stand up and say, ‘This runs in the face of common sense,’ ” he said.

50,000: The number of students whose TAKS answer sheets appear to have been involved in cheating in 2005 or 2006.

175: The number of Texas schools where, on at least one TAKS test, one in 10 answer sheets was flagged for cheating in 2005 or 2006.

74%: How many of the 50 worst cases of cheating were in Texas’ charter schools. Charter schools make up only 2 percent of the state’s campuses.