
A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines.

Ralph Waldo Emerson

A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines and measurement experts.

Ralph Waldo Emerson (Damian Betebenner)

Polly wants a cracker.

Policy maker wants a valid & reliable indicator.

Questions

Why do we desire having reliable indicators?
What is undesirable about unreliable indicators?
What is a reliable indicator?

Background

These questions began with my work with student growth over the last decade.
Indicators/measures that are inconsistent from year to year are seen as undesirable or corrupt.
For example, an important finding of the MET study was the low year-to-year correlation of teacher VAM scores.
Again, why is this a desirable or necessary characteristic of accountability indicators?

Two Dogmas of Measurement: Reliability & Validity

My answer: Reliability is desirable because of the desire to treat accountability indicators as measures.
And as the measurement catechism states: Reliability is a necessary condition for validity.
By abandoning the fallacy that we are engaged in a process of measurement akin to psychological measurement, we can jettison much of the "validation baggage" that we bring to this enterprise.

Status Quo

For various reasons, treating accountability indicators as "measures" has proliferated.
One result has been measurement specialists applying their tools of the trade to this enterprise.
But is it justified?

Samples versus Populations

Sampling approaches for accountability indicators are nearly ubiquitous (e.g., confidence intervals about percent proficient).
This presentation was motivated by a Slack discussion with Brian and others.
My interest in this topic is over a decade old. Many of my ideas on this were formalized in a JEM paper on the accuracy and precision of percent at performance level measures.
Issues related to reliability, precision, consistency, misclassification, and causality all seem to coalesce in discussions.

Convenience Samples

Random sampling is hardly universal. More typically, perhaps, the data in hand are simply the data most readily available. "Convenience samples" of this sort are not random samples. Still, researchers may quite properly be worried about replicability. The generic concern is the same as for random sampling: if the study were repeated, the results would be different. What, then, can be said about the results obtained?

David Freedman and Richard Berk (2008)

Accountability indicators as sampling

The result for a school for one year is just one observation from which to infer a school’s true score—what the school’s average would be if we could test an infinite number of students from the school’s catchment area an infinite number of times on all the test questions that might be asked.

Rich Hill (2002)

Superpopulation Fallacy

One way to treat uncertainty is to create an imaginary population from which the data are assumed to be a random sample. With this approach, the investigator does not explicitly define a population that could in principle be studied, with unlimited resources of time and money. The investigator merely assumes that such a population exists in some ill-defined sense. And there is a further assumption, that the dataset being analyzed can be treated as if it were based on a random sample from the assumed population. These are convenient fictions. Convenience will not be denied; the source of the fiction is two-fold: (i) the population does not have any empirical existence of its own, and (ii) the sample was not in fact drawn at random.

David Freedman and Richard Berk (2008)

Samples versus Populations

In our line of work there are multiple ways in which accountability results can vary.
Sources include measurement error from the instrument, sampling, and year-to-year inconsistency.
Most often, sample-based confidence intervals (i.e., students treated as sampled at random) are placed around the number (e.g., percent proficient).
An alternative would be to treat the results as a population of students, with confidence intervals derived from the imprecision of the assessment.
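
The contrast between the two kinds of interval can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended procedure: the school, the scale scores, the cut score of 500, and the standard error of measurement (`sem`) are all made up, and the function names are hypothetical. The first function treats the tested students as a random sample; the second treats them as the whole population and perturbs only their scores. Because observed scores already contain measurement error, adding more noise is a rough stand-in for assessment imprecision rather than a formal model of it.

```python
import math
import random

def sampling_interval(proficient_flags, z=1.96):
    """Wald interval that treats the tested students as a random sample."""
    n = len(proficient_flags)
    p = sum(proficient_flags) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def measurement_interval(scores, cut_score, sem, draws=2000, seed=0):
    """Percentile interval that treats the students as the whole population.

    Only the scores are perturbed, using a normal error with standard
    deviation `sem` (an assumed standard error of measurement).
    """
    rng = random.Random(seed)
    simulated = []
    for _ in range(draws):
        noisy = [s + rng.gauss(0, sem) for s in scores]
        simulated.append(sum(x >= cut_score for x in noisy) / len(noisy))
    simulated.sort()
    lower = simulated[int(0.025 * draws)]
    upper = simulated[int(0.975 * draws) - 1]
    return lower, upper

# Hypothetical school: 50 students, made-up scale scores, cut score of 500.
data_rng = random.Random(42)
scores = [data_rng.gauss(510, 40) for _ in range(50)]
flags = [s >= 500 for s in scores]

print("observed percent proficient:", sum(flags) / len(flags))
print("sampling-based interval:", sampling_interval(flags))
print("measurement-based interval:", measurement_interval(scores, 500, sem=15))
```

The two intervals answer different questions. The first asks how the percentage might change with a different draw of students; the second asks how it might change if the same students were tested again.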

To conclude on the basis of an assessment that a school is effective as an institution requires the assumption, implicit or explicit, that the positive outcome would appear with a student body other than the present one, drawn from the same population.

Lee Cronbach (1997)

Though correct, the statement is a red herring. Inferring effectiveness requires more than placing a confidence interval about a statistic. Indeed, one of the most challenging issues in growth modeling using student assessment data is in trying to make effectiveness claims based upon observational data. Unless certain design issues are met, judging a school to be effective based upon percent of proficient students is not defensible, with or without confidence bands. Moreover, if such confidence intervals embolden users into believing that it is safe to make school effectiveness claims, then perhaps their use should be avoided.

Damian Betebenner (2006)

Sports Statistics

Why aren't confidence intervals used for sports statistics?
Stakes can be associated with sports results without the use of dubious notions of sampling.
Results that are inconsistent from year-to-year can be important to decision making.

The Emperor wears no clothes.

Alternatives

Unfortunately, we're likely stuck with the current state of affairs.
One alternative is to not see unreliability/inconsistency as bad but as an accurate reflection of the system.
One could adopt a "TQM" approach which tries to find the source of the unreliability/inconsistency as opposed to throwing the indicator out.
For example, with student growth, ask what it is about our institutions that makes inconsistent growth from year-to-year the norm.
What we need (IMHO) is a validation theory for indicators, distinct from that of psychological measures.

A Foolish Consistency
