Effects of Misbehaving Common Items on Aggregate Scores and an Application of the Mantel-Haenszel Statistic in Test Equating.
Author: Michaelides, Michalis P. (National Center for Research on Evaluation, Standards, and Student Testing)
Publisher: National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
Source: National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
Consistent behavior is a desirable characteristic that common items are expected to exhibit when administered to different groups. Findings from the literature have established that items do not always behave consistently; classical item indices and IRT item parameter estimates for the same items differ when obtained from different administrations. Content effects, such as discrepancies in instructional emphasis, and context effects, such as changes in the presentation, format, and positioning of an item, may result in differential item difficulty across groups. When common items are differentially difficult for two groups, using them to generate an equating transformation is questionable. The delta-plot method is a simple, graphical procedure that identifies such items by examining their classical test theory difficulty values. After inspection, flagged items are typically dropped to non-common-item status.

Two studies are described in this report. Study 1 investigates the influence of common items that behave inconsistently across two administrations on equated score summaries. Study 2 applies an alternative to the delta-plot method for flagging common items for differential behavior across administrations. The first study examines the effects of retaining versus discarding the common items flagged as outliers by the delta-plot method on equated score summary statistics. For four statewide assessments administered in two consecutive years under the common-item nonequivalent groups design, the equating functions that transform Year-2 scores to the Year-1 scale are estimated using four IRT equating methods (Stocking & Lord, Haebara, mean/sigma, mean/mean) under two IRT models--the three- and the one-parameter logistic models for dichotomous items, each combined with Samejima's (1969) graded response model for polytomous items.
The changes in Year-2 equated mean scores, mean gains or declines from Year 1 to Year 2, and proportions above a cut-off point are examined when all the common items are used in the equating process versus when the delta-plot outliers are excluded from the common-item pool. Results under the four equating methods were more consistent when a one-parameter rather than a three-parameter logistic model was fitted. In two of the four assessments, the treatment of outlying common items had an impact on aggregate statistics: equated mean scores, mean gains, and proportions above a cut-off differed considerably. Factors such as the number of outlying items, their type (dichotomously or polytomously scored), their level of difficulty, the direction and amount of their change from Year 1 to Year 2, and the IRT model and equating transformation fitted to the data are discussed with regard to their influence on equated summary statistics.

The differential behavior of common items can be considered a special case of Differential Item Functioning (DIF); the two groups that respond to a common item can be regarded as the focal and reference groups, and their performance can be compared for DIF. Study 2 applies the Mantel-Haenszel statistic (Mantel & Haenszel, 1959), which is widely used for DIF analysis, to one statewide assessment administered to two consecutive annual cohorts of students. Sixty-nine common items, including nine polytomous items, are analyzed first with the delta-plot method and then with the Mantel-Haenszel procedure. A scheme for flagging dichotomous items for negligible, intermediate, or large DIF takes into account both the significance of the Mantel-Haenszel statistic and the effect size of the log-odds ratio; an alternative scheme developed specifically for polytomous items utilizes Mantel's chi-square statistic (Mantel, 1963) and the Standardized Mean Difference (e.g., Dorans & Schmitt, 1991/1993).
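The core of the Mantel-Haenszel DIF effect size for a dichotomous item can be sketched as below. This is an illustrative outline under assumed names and data layout, not the report's code: examinees are stratified by total score, the common odds ratio is pooled across strata, and the result is converted to the ETS delta metric (MH D-DIF = -2.35 ln alpha). The chi-square significance test and the full negligible/intermediate/large classification are omitted here.

```python
import math
from collections import defaultdict


def mantel_haenszel_ddif(ref, focal):
    """MH D-DIF for one item; negative values favor the reference group.

    `ref` and `focal` are lists of (total_score, correct) pairs, one per
    examinee. Examinees are stratified by total score, the Mantel-Haenszel
    common odds ratio alpha is pooled across strata, and the log-odds ratio
    is rescaled to the ETS delta metric: MH D-DIF = -2.35 * ln(alpha).
    """
    # counts[k] = [A, B, C, D]: reference correct/incorrect, focal correct/incorrect
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for score, correct in ref:
        counts[score][0 if correct else 1] += 1
    for score, correct in focal:
        counts[score][2 if correct else 3] += 1
    num = den = 0.0
    for a, b, c, d in counts.values():
        n = a + b + c + d
        num += a * d / n  # reference-right, focal-wrong, weighted by stratum size
        den += b * c / n  # reference-wrong, focal-right
    alpha = num / den
    return -2.35 * math.log(alpha)
```

Because the comparison is made within score strata, the statistic conditions on ability, which is the property the report cites as making the procedure preferable to the delta-plot when the two cohorts' ability distributions differ.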
The Mantel-Haenszel procedure flagged three common items, including one polytomous item, for intermediate DIF. The delta-plot method identified only two dichotomous items, one of which was flagged by both procedures. Assumptions are examined, and it is argued that the Mantel-Haenszel procedure is more appropriate for comparing the performance of two groups because differences in the ability distributions of the two cohorts are taken into account. The availability of schemes that classify items according to the amount of DIF they exhibit can inform the judgmental decision on how to deal with flagged items. However, some caveats relating to test construction and implementation of the equating design are noted if the proposed procedures are to be applied effectively: the same common items, and an adequately large number of them, must be presented in corresponding forms across administrations. This is especially pertinent for assessments employing a matrix-sampling design, where the common items are spread among many forms. The following are appended: (1) ST Output--Equating Transformations; (2) Mantel-Haenszel Procedure and ETS Scheme for Flagging; and (3) SPSS Output for Principal Components of Responses to Common. (Contains 12 notes.)