Managing Multiple Measures

Now that we have multiple measures, are we prepared to use them?
by Charles A. DePascale
Principal, May/June 2012

Regardless of how one might feel about the recent developments in teacher evaluation systems, No Child Left Behind (NCLB) and adequate yearly progress (AYP), or student assessments for high-stakes promotion decisions, educators overwhelmingly agree that use of multiple measures is better than reliance on a single measure such as a large-scale, standardized assessment.

In the current era of standards-based assessments and accountability dating from the late 1980s, the call for the use of multiple measures has been pervasive. Landmark education reform efforts in Kentucky, Maryland, and Massachusetts that gave rise to a new breed of state assessment and accountability programs in the 1990s all required the use of multiple measures. NCLB and its predecessor, the Improving America’s Schools Act (IASA), also mandated their use. Over the past three years, U.S. Secretary of Education Arne Duncan has repeatedly called for a multiple-measures approach in the design of teacher evaluation systems.

Consider the following scenarios that describe the types of student, school, and teacher accountability issues that principals regularly face. Each scenario includes multiple pieces of information that might contribute to making a more appropriate interpretation or a more valid or reliable decision.

  • As you prepare to assign a teacher’s rating on professional practice, you review your notes and preliminary ratings from the one formal and three informal observations made across the year as well as responses from parent surveys.
  • The state releases the annual school report cards. Last year, you were commended for a 25 percent increase in mathematics performance. This year, your overall classification is “below average” because of low growth in mathematics and an inadequate yearly progress rating for your students with disabilities subgroup. You recently were honored for the success of the reading literacy program instituted two years ago. Next week, you expect an overflow crowd at your annual art and music night, with two-thirds of the students performing in at least one group.
  • Two students have failed the fourth-grade mathematics state assessment required for promotion to fifth grade. Previously, the students have performed well and their teacher is confident that they are ready for fifth grade. She mentions another student who just made the cut on the state assessment, but has been struggling all year.
  • About 50 students scored below proficient on the annual state assessment. You have funding for an after-school intervention program for 35 students. In deciding which students most need the program’s services, you have access to the state assessment results, test results from the computer-adaptive interim assessments, and diagnostic assessments administered at the beginning of the year; class grades and comments from teachers; and information on the students in the program last year.

Pervasive but Perplexing
With such universal agreement on the importance of multiple measures over the past 30 years, you might assume that assessment and accountability systems in 2012 would be fully integrated systems of multiple measures, and that principals would be well versed in interpreting and acting on the multiple sources of information described in the scenarios. However, as Sue Brookhart points out in her 2009 Educational Leadership article “The Many Meanings of Multiple Measures,” the term multiple measures is open to a variety of interpretations and “both policymakers and practitioners need a clear understanding of what they mean by the term.”

From a historical measurement perspective, one identified 60 years ago in the 1951 edition of Educational Measurement, the use of multiple measures serves three key purposes:

1. To provide finer degrees of distinction;
2. To increase the reliability of measurement; and
3. To provide measures of unrelated aspects of the behavior to be predicted or evaluated.

The appropriate use and interpretation of information gained from multiple measures is dependent upon the intended purpose. For example, computing a simple average across three test scores might be appropriate if the assessments are measuring the same thing and the purpose of the multiple measurements is to increase the reliability of the score. However, a simple average, or any attempt to create a composite score, might be inappropriate if the three measures were designed to measure distinct components of English language arts (e.g., speaking, reading comprehension, and writing) and mastery of each component were required for proficiency.
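
The contrast between averaging and requiring mastery of each component can be made concrete with a small sketch. The cut score, the 0-100 scale, the student's scores, and the function names below are all hypothetical, chosen only for illustration; they do not come from any particular assessment program.

```python
from statistics import mean

# Hypothetical cut score on a 0-100 scale; not taken from any real program.
PROFICIENT_CUT = 70


def compensatory(scores, cut=PROFICIENT_CUT):
    """Average the scores, so strength on one measure can offset weakness on another."""
    return mean(scores.values()) >= cut


def conjunctive(scores, cut=PROFICIENT_CUT):
    """Require every component to meet the cut on its own."""
    return all(score >= cut for score in scores.values())


# Three measures of distinct components of English language arts (made-up scores).
student = {"speaking": 90, "reading_comprehension": 85, "writing": 55}

print(compensatory(student))  # True: the average is about 76.7
print(conjunctive(student))   # False: writing falls below the cut
```

The same three scores lead to opposite decisions. Which rule is appropriate depends on whether the measures are meant to be interchangeable estimates of one thing or separate requirements.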

Let’s consider the intended purposes of multiple measures in the context of our opening scenarios.

Finer Degrees of Distinction
The basic principle here is that more measures are needed to classify performance into greater numbers of categories. From a strictly student assessment perspective, the need for multiple measures to provide finer degrees of distinction was an afterthought for several decades. The number of measures (i.e., items) included on an assessment to attain reliability and validity was more important than the need to create distinct categories. A typical 40-item, multiple-choice test would produce 41 categories based on total score (0-40) or over a trillion categories if it mattered which items were answered correctly—clearly fine enough degrees of distinction.
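
The arithmetic behind those two counts is easy to check. The sketch below assumes each of the 40 items is scored simply right or wrong, which is the usual case for multiple-choice items but is an assumption here.

```python
N_ITEMS = 40  # items on the test, each assumed to be scored right (1) or wrong (0)

# Total scores run from 0 through 40, so there is one more category than items.
total_score_categories = N_ITEMS + 1

# If it matters which items were answered correctly, every right/wrong pattern
# is its own category: two possibilities per item, multiplied across all items.
response_patterns = 2 ** N_ITEMS

print(total_score_categories)     # 41
print(f"{response_patterns:,}")   # 1,099,511,627,776
```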

As our emphasis shifted to achievement-level classifications, performance tasks or events, and teacher ratings, the need to consider the degrees of distinction re-emerged. In the fourth scenario, for example, the state assessment classifies student performance into four categories. Additional information (i.e., finer degrees of distinction) was needed to determine which students would benefit most from the available supports. The computer-adaptive assessments, for example, might be better suited to distinguish among students scoring well below the proficient benchmark (perhaps even performing below grade level) than the state assessment. Similarly, the diagnostic assessment might be better suited than the state assessment in identifying specific areas of strength or weakness that would be relevant in determining the most appropriate supports for students.

Similar issues likely will arise with respect to teacher evaluation systems. Under many evaluation systems, teacher performance on components such as professional practice, professional responsibility, or impact on student learning is being reported as a classification on a 3-point or 4-point scale. Although the number of possible categories increases when combined across components, it might be difficult to pinpoint relatively small percentages of ineffective or highly effective teachers with such blunt measures. Evaluators might find it nearly impossible to distinguish among the vast majority of teachers performing in the middle of the distribution.
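
A rough count shows how much, and how little, combining components adds. The sketch assumes three components each rated on a 4-point scale and, as one hypothetical way of combining them, a simple sum; actual evaluation systems may combine ratings quite differently.

```python
from collections import Counter
from itertools import product

SCALE = (1, 2, 3, 4)  # assumed 4-point rating scale
COMPONENTS = (
    "professional_practice",
    "professional_responsibility",
    "impact_on_student_learning",
)

# Every possible combination of ratings across the three components.
profiles = list(product(SCALE, repeat=len(COMPONENTS)))

# Hypothetical aggregation rule: add the three ratings together.
totals = Counter(sum(profile) for profile in profiles)

print(len(profiles))  # 64 distinct rating profiles
print(len(totals))    # 10 distinct summed scores (3 through 12)
```

Sixty-four profiles sound like many categories, but a summed score collapses them to ten, and in practice most teachers are likely to share a handful of those.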

Increased Reliability
Every measurement contains some degree of error that decreases its reliability. Aggregating across multiple measures is one way to improve reliability. Somewhat paradoxically, the most common way that multiple measures are used to compensate for the unreliability of a single assessment score does not actually address the reliability issue.

That common context is expressed in the third scenario: students failing to meet the cut-off score for promotion on a high-stakes assessment. A typical approach in that situation is to quickly provide a retest opportunity with little or no intervening instruction. On the surface, that is a textbook example of the use of multiple measures to improve the reliability of the estimate of the student’s performance. However, the potential increase to reliability through multiple administrations can only be realized when results of all of the assessments administered are used to produce an overall estimate. In practice, the approach commonly used is to consider each administration individually, and a student must meet or exceed the required cut score only once. Rather than improving reliability, we have developed policies to take advantage of unreliability.

In the third scenario, the case of the struggling student who manages to meet the cut-off score on the state assessment is another example of the same issue. By not examining additional evidence for students performing just above the cut on the state assessment, we are reflecting a policy that places much more concern on false negatives than on false positives.
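
The difference between the two retest policies can be illustrated with a simplified model. The sketch below assumes that each attempt's observed score equals the student's true score plus independent, normally distributed measurement error, and that no learning occurs between attempts. The cut score, standard error of measurement, and true score are hypothetical numbers.

```python
from statistics import NormalDist

CUT = 240.0         # hypothetical cut score for promotion
SEM = 10.0          # hypothetical standard error of measurement for one attempt
TRUE_SCORE = 235.0  # hypothetical student whose true score is below the cut


def pass_once_probability(true_score, attempts, cut=CUT, sem=SEM):
    """Chance that at least one of several independent attempts reaches the cut."""
    p_single = 1.0 - NormalDist(true_score, sem).cdf(cut)
    return 1.0 - (1.0 - p_single) ** attempts


def pass_on_average_probability(true_score, attempts, cut=CUT, sem=SEM):
    """Chance that the mean of the attempts reaches the cut.

    The mean of n independent scores has standard error sem / sqrt(n),
    which is where the gain in reliability comes from.
    """
    return 1.0 - NormalDist(true_score, sem / attempts ** 0.5).cdf(cut)


for n in (1, 2, 3):
    print(
        n,
        round(pass_once_probability(TRUE_SCORE, n), 3),
        round(pass_on_average_probability(TRUE_SCORE, n), 3),
    )
```

Under these assumptions, each additional attempt raises the chance that this student passes under a "meet the cut once" policy and lowers it when the attempts are averaged. Only the second policy uses the repeated measurements to reduce error.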

Let’s also consider the use of repeated measurements in the context of the first scenario: a principal making multiple observations of a teacher’s instruction throughout the year. Multiple observations raise several questions about their purpose:

  1. Is the intent to observe the complete set of professional practices at each observation, or is it expected that particular practices might be exhibited in only one or two of the observations?
  2. Is the teacher expected to demonstrate particular practices consistently across the year to be considered effective or is just one positive or negative example sufficient to attain an “effective” or “ineffective” rating on that practice?
  3. Are some practices more critical than others or is each of the practices equally important?
  4. If two teachers have the same set of ratings across three observations (but not on each observation), what is the expected rating for a teacher who shows significant improvement over the year versus a teacher who declines in performance across the three observations?

Clearly, the system for collecting data at individual observations and aggregating data across multiple observations must be consistent with and reflect the answers to questions such as these.
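
The fourth question, in particular, shows how much the choice of summary matters. The sketch below uses made-up ratings on an assumed 4-point scale for two hypothetical teachers; the summary names are illustrative and not drawn from any evaluation system.

```python
from statistics import mean

# Hypothetical ratings on a 4-point scale from three observations, in order.
improving_teacher = [2, 3, 4]
declining_teacher = [4, 3, 2]


def summarize(ratings):
    """Return several candidate summaries of one teacher's ratings."""
    return {
        "mean": mean(ratings),                # treats the order of observations as irrelevant
        "most_recent": ratings[-1],           # rewards where the teacher ended the year
        "change": ratings[-1] - ratings[0],   # direction of movement across the year
        "lowest": min(ratings),               # one weak observation decides the rating
    }


print(summarize(improving_teacher))
print(summarize(declining_teacher))
```

The two teachers are identical on the mean and the lowest rating but differ on the most recent rating and the direction of change. Whichever summary a system adopts amounts to an answer to the questions above, whether or not that answer was chosen deliberately.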

Measures of Unrelated Aspects
Multiple measures also provide information on unrelated aspects of a particular construct or behavior to be evaluated. The school accountability scenario exemplifies the issues associated with relying on a narrow set of indicators to describe something as complex as school quality, as well as the challenges associated with attempting to expand the concept of school quality to include additional measures. A school, and its community, might receive contradictory signals because school performance varies across different indicators. As in the example provided, a school might have instituted a highly effective literacy program, but not addressed some other pressing issues. When administrators are forced to respond to a single measure as an indicator of overall school quality, the result is often overreaction to year-to-year fluctuations in results rather than long-term planning. This leads to actions such as reducing instruction in subject areas not included in the accountability system, eliminating recess, or cutting back critical programs in art and music.

High-stakes student testing and NCLB have shined a bright light on the issue of underrepresentation of the full domain or “narrowing of the curriculum” due to the requirements of the accountability system. The same conversation, however, is just beginning with regard to teacher evaluation. Educators must consider three issues in using multiple measures to provide sufficient evidence to determine teacher effectiveness or teacher quality:

  1. What are the potential unintended consequences of narrowing the focus of the evaluation to content areas in which there are high-quality, standardized assessments available?
  2. Particularly at the elementary level where teachers are responsible for multiple content areas, is it appropriate to combine results across content areas and/or to assume that effectiveness in one content area can be generalized to other content areas? Can an elementary teacher be considered effective if he or she is highly effective in teaching mathematics and perhaps only minimally effective in teaching reading?
  3. Are the multiple measures being considered under many teacher evaluation systems (i.e., impact on student learning, professional practices, professional responsibility) providing information on the full range of outcomes that an effective teacher is expected to impact or are the three components all related to the outcome of impact on student learning?

Appropriate Use
The use of multiple measures to inform decisions related to instruction and accountability should be the norm. In designing systems of multiple measures, however, it is critical to be aware of the different purposes of multiple measures and how those purposes affect the selection of measures, frequency of administration, aggregation of results across measures, and interpretation and use of results. Not all of the available measures were necessarily intended to serve the same purpose, so each should be used and interpreted in light of the purpose for which it was designed.

After decades of rhetoric on the importance of multiple measures, we might finally reach the point in time where multiple measures will be available for use by teachers, administrators, and education policymakers. We must devote sufficient resources to ensure their appropriate use.

Charles A. DePascale is a senior associate at the National Center for the Improvement of Educational Assessment Inc.

Copyright © National Association of Elementary School Principals. No part of the articles in NAESP magazines, newsletters, or Web site may be reproduced in any medium without the permission of the National Association of Elementary School Principals. For more information, view NAESP's reprint policy.

 
