American education's use of "value added measures" is statistically bankrupt


If I understand the article correctly, the flaw is that some teachers teach students assigned to them because they are slower learners, while other teachers teach those who learn more rapidly. By design, the latter teachers will have a greater “value added” score than the former.
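To make that concrete, here is a minimal simulation sketch. It is not the model any district actually uses. The function name, the 5-point teacher effect, and the growth rates are all made-up assumptions for illustration. Both hypothetical teachers contribute exactly the same amount, and the only difference is which students they were assigned.

```python
import random
import statistics

def simulate_class(teacher_effect, learning_rate_mean, n_students=30, rng=None):
    """Return the mean pre-to-post gain for one hypothetical class."""
    rng = rng or random.Random()
    gains = []
    for _ in range(n_students):
        # Each student's own growth, independent of who teaches them (assumed).
        student_growth = rng.gauss(learning_rate_mean, 2.0)
        # Naive "value added": the student's gain is credited to the teacher.
        gains.append(student_growth + teacher_effect)
    return statistics.mean(gains)

rng = random.Random(0)
# Identical teacher effect; only the assigned students differ.
slow_track = simulate_class(teacher_effect=5.0, learning_rate_mean=3.0, rng=rng)
fast_track = simulate_class(teacher_effect=5.0, learning_rate_mean=8.0, rng=rng)
print(f"slow-track class mean gain: {slow_track:.1f}")
print(f"fast-track class mean gain: {fast_track:.1f}")
```

Under these assumptions the fast-track teacher looks better on raw gains even though both teachers add the same 5 points. Real value-added models add controls to try to correct for this; the sketch only shows the uncorrected version of the problem.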


I remember this grading system from when I took Chemistry in High School. The system has a major flaw. If you start with a high score, but do not advance very far, it does not count how much you know in relation to the rest of the class.

The teacher gave a college prep exam before and after the course. As a first year student, my starting score was higher than the final score of any of his second year students. (Always learn a subject before you take a class in it.) Blew his grading method out the window. (As did my experiments in his class.)
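A tiny worked example of that flaw, with invented scores on an assumed 100-point exam (the numbers are purely illustrative, not from my class):

```python
MAX_SCORE = 100  # assumed exam ceiling

def gain(pre, post):
    """Gain score: the only thing a pre/post grading scheme looks at."""
    return post - pre

# Hypothetical students: (pre-test score, post-test score)
students = {
    "knew the material already": (90, 95),
    "started from scratch": (30, 70),
}

for label, (pre, post) in students.items():
    print(f"{label}: final={post}, gain={gain(pre, post)}, "
          f"max possible gain={MAX_SCORE - pre}")
```

The first student finishes well ahead but can never show more than 10 points of gain, so a grade based on gain alone ranks them below the second student.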


This is a pretty bogus criticism. Sure, random assignment is necessary to establish a causal relationship between the factors you are comparing because otherwise there may be some confounding variable. But it all depends on what they are doing with these measures: what inferences are they trying to make? I would think we are most interested in having teachers do the things that work best for the students in their classroom, and not for a hypothetical randomly-chosen student who may have different needs. If for most kids, making them eat peanut butter helped them do math better, a teacher in a class with nut allergies might be expected to try other things.

The inferences we want to make might really be to only the sample space of the particular students in their class taking tests, or maybe to all the classes this teacher might teach with all the students who might take his/her class over his/her career. As such, measurement can help us know whether something the teacher did was better than what they did last time around. The entire field of epidemiology is essentially based on inference done from quasi-experiments that do not allow for random assignment, and it has saved countless lives.


But that’s not what these silly tests are used for. They are trying to compare student performance each year, and tie it to the effectiveness of the teacher. As the article states, that is inherently flawed. There are far too many variables involved to state that the only thing that matters in test scores is the effectiveness of the teacher. Anyone who believes in these tests just does not understand kids and how they learn. Too many know-it-alls who haven’t been in a classroom since they were students.


Genovese isn’t saying anything new, or anything particularly smart. Statisticians and econometricians have been investigating value-added models for quite some time. If you want to read an actual worthwhile take on the effects of non-random assignment you can read this:

and note, it’s a 5 year old paper. Genovese didn’t say anything insightful.

Or think of it this way, which is more likely?

  1. Genovese is right and thousands of Ph.D. statisticians and econometricians engage in statistically bankrupt methodologies.
  2. Genovese doesn’t fully understand the problem, nor the methodologies people use to try to mitigate the known issues.

The actual models use tons of controls, try very hard to account for all observable characteristics, as well as try to figure out the bias caused by what’s unobserved. They also track both teachers and students over the course of many years to help sort things out. They take into account a child’s past level of educational growth as well as the teacher’s history. They worry about whether effects are additive, cumulative, (or both) over time. It’s an immensely complex problem and many incredibly smart people have spent years trying to disentangle all the moving parts.

Unfortunately, people in fields like Genovese’s tend to only have a cursory knowledge of statistics. That’s all you need when you can actually do clean, simple, random experiments. They’re usually pretty much in the dark when it comes to understanding what people do in fields where the problems are much more complex.

None of this is to say that there aren’t real critiques of Value-Added measures. There are! It’s just that Genovese’s take on it is pretty dumb.

Here’s a RAND paper that has some models in it:

Tell me if they look anything like what Genovese is assuming value-added models are like.


This is absolutely not a bogus criticism, and saying “sure, but” doesn’t make it so. The way the law is structured right now guarantees that teachers of poor kids will be punished, and teachers of rich kids will get bonuses and job security. That fact didn’t even need a mathematical foundation - but now it has one anyway.

There is no more powerful predictor of academic achievement than poverty. This has been demonstrated in many, many studies. Until the 1980s, America had programs in place to fight this trend and give kids a more level playing field.

But beginning with Reagan, those programs were dismantled, because they were “big government interfering in our communities.” That trend culminated in No Child Left Behind, and the cult of standardized testing.

The result today is that inner-city schools are crumbling, and being closed, and tax dollars are being shifted to for-profit “charter” schools that, overall, have demonstrably worse performance than public schools. But the profits they turn can buy plenty of friends in high places. And so we continue down the road to inequality in education - which, for some, is a feature, not a bug.


The author of the article doesn’t know the first thing about the types of models being used. He’s assuming they’re the same as the super-simple, any undergrad can do it, models he uses. They’re really and truly not like that at all.

When a psychologist claims to know more about statistics than thousands of statisticians and econometricians, the smart bet is to assume that psychologist is wrong or doesn’t understand what the statisticians are actually doing. Does he really think statisticians don’t understand the concept of random sampling?!

As an aside, I used to teach “Statistical Methods for Social Sciences.” I.e., I was the one who taught statistics to the psychologists and the like.

Mind you, I’d make this same argument in reverse if any statistician dared to suggest that everything psychologists did was theoretically bankrupt. As a general rule, it’s not smart to assume you know more about another field than all the experts in that field.


Just because there are potential adjustments to the models does not mean:

  1. Statisticians can use them; and
  2. Policy-makers will use them.

For instance, Rothstein’s paper depends on having data from Grades 3-5, which is impossible when you are setting up your system for the first time and is never available for children in earlier grades (“though unavoidable data limitations would prevent its widespread adoption. Most importantly, this VAM is not available for the assessment of teachers in the first three grades in which students are tested.”). He ends with “Although some assumptions about the assignment process permit nearly unbiased estimation, other plausible assumptions yield large biases.”

This leads to the second point, which is that policy-makers, even the most lovely of them, have to use a model that works across the legislated or mandated scope and is interpretable by non-technical folks. Their incentive to have well-adjusted models may be much weaker than their incentive to have a model now for all of the teachers so they can implement the changes to the system requested, e.g. establish a penalty system for poor performance or re-allocate resources. This is often where most of the problems occur. Any global bias - and Rothstein is completely agreeing with Genovese that this is a large problem, see his self-referenced 2008 paper - is going to be a large and persistent source of errors that PhD level economists will be forced to shake their heads about for generations to come, since fixing it requires a lot of work.


If I could I would like this ten times.


So, in a meeting this week my Superintendent asked us to come up with a commercially available assessment to measure the arts achievement in our district. I asked if he wanted a test to measure the students’ achievement in the arts, or the impact of their arts education on their “core” classes. Both, he said. So, I need a simple, standardized test to measure performance in: visual, media arts, theater, dance, music, orchestra/band (separate from general music) and Phys. Ed. (included with the arts for no particular reason.) As well as Spanish and Chinese (also included with the arts, but only in grades 6-8.) And the impact of those arts classes on our students’ learning in the core classes. But no mention was made of any control group. But - hey, the entire continuation of our arts-education mission depends on our ability to prove the “value-added” aspect of arts education - WITHOUT A CONTROL GROUP! Sorry, was I shouting? So, any takers? Anyone out there have a good, commercially available assessment, preferably one we can administer in 45 minutes or less, that will prove our arts education is effective within the arts as well as impacting students’ core education, without using a control group exposed to the same curriculum without the arts education? Anyone? Bueller? Bueller? Yeah, I thought so.

But heh, no child left behind, testing is the be-all, end-all of education. Go team.


There are two separate issues at play here. First, the criticism that the value added measures are statistically bankrupt is methodologically complex, but not correct. The argument is that if you can’t randomize assignment to the experimental condition, then you can’t have valid or generalizable knowledge. That was one (of the many) arguments against the now well-accepted fact that cigarette smoking leads to lung cancer. Lack of random assignment complicates science, but there are plenty of things that can’t be randomized that we can nevertheless study.

Second, it’s also fairly clear that those who are pushing teacher rating systems do not have the best interests of teachers or even children at heart. Measurement of teacher effect is not simple, and there are many other variables involved - socio-economic factors, available resources, etc. However, I don’t think anyone on either side of this debate would argue that teachers don’t matter, and if they do matter, then it should be possible to devise a fair and valid way to evaluate teacher performance.


Which is awesome, if our mission was to test creative thinking. What if the arts teachers are actually judged, day to day, on the skill development of our students? Like, for instance, can they draw realistic figures, play Bach, be heard from stage while speaking? What? No creative thinking in that Skill Domain? How about knowledge of Art History and Concepts relevant to the arts disciplines? Not covered by creative thinking? What? Creative thinking is a poorly justified test to begin with? But does that add value to math? How about language arts? Oh, those tests don’t have anything to do with creative thinking, rather, they are solely about cultural knowledge of white main-stream culture? - It may be that this was a tough teacher week and I am totally not in an emotional place to speak to this issue at this time. But seriously? What we are told to do, and what we are tested for are not in agreement. Does anyone care if their kid can think “creatively?” No. Do they care if their kid can draw, can play piano, can be heard when they are on stage? Yes. So, again, what commercially available assessment do you have that will test ALL of the arts, as well as the impact of the arts on the “core” curriculum (other than music on math which is well-researched.) ? Yeah. Nothing.


This is why the superintendent has his job. He’s risen above his level of competency and is, thank god, out of a classroom. But he’ll still manage to do damage…

Pennsylvania has its own ideas about how Value Added works. PVAAS, our teacher-eval version, purports to use all sorts of magic computations to control for student backgrounds. Basically, PVAAS mathematically creates a picture of how the students would have done in some imaginary neutral universe, and the teacher is judged on how much better or worse the students do than their imaginary doppelgangers. What could possibly go wrong??

Sure. Like, did the teacher cleverly arrange to have better students this year than last year? Or did the teacher foolishly insist on teaching the lowest achieving students of that grade? Yes, this model totally tells us about the teacher’s choices from year to year.


Thank you for emphasizing this. As I’ve gotten older, it has become clearer to me that there are a fair number of facts that we are simply not capable of using correctly. As human beings, we have definite deficiencies in our ability and willingness to reason. At the point that we can say with certainty that certain facts will be abused in a manner that the facts themselves do not support, I have to question gathering those facts in the first place.

People in positions of responsibility have the responsibility to understand how the facts they gather will be used and more importantly, misused.

The “value-added” figures could theoretically be used in a productive and accurate fashion.

But they won’t.

And when the certainty that they won’t approaches 1, it’s time to face reality and stop gathering them.

Oddly enough, this could be interpreted two ways: One more favorable to the superintendent is “There is strong pressure to eliminate non-core subjects. Please help me provide some bogus figures (because everyone in the room realizes that the figures are bogus, but it could possibly be criminal to acknowledge it) that I can use to justify their preservation.”

Or he could just be an idiot who believed what he was asking for.

Famous, probably irrelevant, story about statistics error. Study of kidney disease found best kidney health in rural counties. Some conclusions were drawn and publicized and a well known statistician contacted the organization that did the study and asked them if they had looked to see where kidney health was worst. Oops, kidney health is worst in rural counties.

Turns out small counties have insufficient population to draw good statistical results and most of your high and low scores will be there because temporary spikes in the data will have a greater influence on averages in a smaller sample. Pushing the results out of shape.
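A rough simulation sketch of that effect, under made-up assumptions: every hypothetical county has the same true disease rate of 1%, and the county sizes of 100 and 10,000 are invented for illustration. Nothing here comes from the actual kidney study.

```python
import random

def observed_rate(population, true_rate, rng):
    """Simulate one county and return its observed disease rate."""
    cases = sum(1 for _ in range(population) if rng.random() < true_rate)
    return cases / population

rng = random.Random(1)
TRUE_RATE = 0.01  # identical in every county by construction

# 200 small counties and 200 large ones, all with the same underlying risk.
small = [observed_rate(100, TRUE_RATE, rng) for _ in range(200)]
large = [observed_rate(10_000, TRUE_RATE, rng) for _ in range(200)]

print(f"small counties: best={min(small):.4f}, worst={max(small):.4f}")
print(f"large counties: best={min(large):.4f}, worst={max(large):.4f}")
```

Both the healthiest and the sickest counties come out of the small group, even though nothing about the counties actually differs except their size.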

This is a well known problem in statistics, but still an issue in the social sciences, where mathematical models are less common than “intuitive” ones.


For what it’s worth, Diane Ravitch literally wrote half a book on the statistical flaws of Value Added Metrics. It’s a chapter or two in Death and Life of the Great American School System.