Show simple item record

dc.contributor.authorMichaelides, Michalis P.en
dc.contributor.authorNational Center for Research on Evaluation, Standards, and Student Testingen
dc.creatorMichaelides, Michalis P.en
dc.creatorNational Center for Research on Evaluation, Standards, and Student Testingen
dc.date.accessioned2017-07-27T10:22:25Z
dc.date.available2017-07-27T10:22:25Z
dc.date.issued2006
dc.identifier.urihttps://gnosis.library.ucy.ac.cy/handle/7/37688
dc.description.abstractConsistent behavior is a desirable characteristic that common items are expected to exhibit when administered to different groups. Findings from the literature have established that items do not always behave consistently: item indices and IRT item parameter estimates of the same items differ when obtained from different administrations. Content effects, such as discrepancies in instructional emphasis, and context effects, such as changes in the presentation, format, and positioning of an item, may result in differential item difficulty for different groups. When common items are differentially difficult for two groups, using them to generate an equating transformation is questionable. The delta-plot method is a simple, graphical procedure that identifies such items by examining their classical test theory difficulty values; after inspection, flagged items are likely to be dropped to non-common-item status. Two studies are described in this report. Study 1 investigates the influence of common items that behave inconsistently across two administrations on equated score summaries. Study 2 applies an alternative to the delta-plot method for flagging common items for differential behavior across administrations.

The first study examines the effects of retaining versus discarding the common items flagged as outliers by the delta-plot method on equated score summary statistics. For four statewide assessments administered in two consecutive years under the common-item nonequivalent groups design, the equating functions that transform the Year-2 to the Year-1 scale are estimated using four IRT equating methods (Stocking & Lord, Haebara, mean/sigma, mean/mean) under two IRT models: the three- and one-parameter logistic models for dichotomous items, with Samejima's (1969) graded response model for polytomous items. The changes in Year-2 equated mean scores, mean gains or declines from Year 1 to Year 2, and proportions above a cut-off point are examined when all the common items are used in the equating process versus when the delta-plot outliers are excluded from the common-item pool. Results under the four equating methods were more consistent when a one-parameter rather than a three-parameter logistic model was fitted. In two of the four assessments, the treatment of outlying common items had an impact on aggregate statistics: equated mean scores, mean gains, and proportions above a cut-off differed considerably. Factors such as the number of outlying items, their type (dichotomously or polytomously scored), their level of difficulty, the direction and amount of their change from Year 1 to Year 2, and the IRT model and equating transformation fitted to the data are discussed with regard to their influence on equated summary statistics.

The differential behavior of common items can be considered a special case of Differential Item Functioning (DIF): the two groups that respond to a common item can be regarded as the focal and reference groups, and their performance can be compared for DIF. Study 2 applies the Mantel-Haenszel statistic (Mantel & Haenszel, 1959), which is widely used for DIF analysis, to one statewide assessment administered to two consecutive annual cohorts of students. Sixty-nine common items, including nine polytomous items, are analyzed first with the delta-plot method and then with the Mantel-Haenszel procedure. A scheme for flagging dichotomous items for negligible, intermediate, or large DIF takes into account both the significance of the Mantel-Haenszel statistic and the effect size of the log-odds ratio; an alternative scheme developed specifically for polytomous items utilizes Mantel's chi-square statistic (Mantel, 1963) and the Standardized Mean Difference (e.g., Dorans & Schmitt, 1991/1993). The Mantel-Haenszel procedure flagged three common items, including one polytomous item, for intermediate DIF. The delta-plot method identified only two dichotomous items, one of which was flagged by both procedures. Assumptions are examined, and it is argued that the Mantel-Haenszel procedure is more appropriate for comparing the performance of two groups because it takes into account differences in the ability distributions of the two cohorts. The availability of schemes that classify items according to the amount of DIF they exhibit can inform the judgmental decision on how to deal with flagged items.

However, some caveats relating to test construction and implementation of the equating design are noted if the proposed procedures are to be applied effectively: the same common items, and an adequately large number of them, must be presented in corresponding forms across administrations. This is especially pertinent for assessments employing a matrix-sampling design, where the common items are spread among many forms. The following are appended: (1) ST Output--Equating Transformations; (2) Mantel-Haenszel Procedure and ETS Scheme for Flagging; and (3) SPSS Output for Principal Components of Responses to Common Items. (Contains 12 notes.)en
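To make the delta-plot logic in the abstract concrete, the following is a minimal Python sketch, not drawn from the report itself: classical p-values from the two administrations are converted to the ETS delta metric (delta = 13 + 4 * inverse-normal of (1 - p)), a principal-axis line is fitted to the paired delta values, and items lying far from that line are flagged as outliers. The function names and the 1.5 distance threshold are illustrative assumptions, not settings taken from the report.

    import numpy as np
    from scipy.stats import norm

    def delta_values(p):
        # ETS delta metric: delta = 13 + 4 * Phi^{-1}(1 - p),
        # where p is the classical proportion-correct difficulty.
        return 13.0 + 4.0 * norm.ppf(1.0 - np.asarray(p, dtype=float))

    def delta_plot_outliers(p_year1, p_year2, threshold=1.5):
        # Hypothetical helper: flag common items whose point lies far from
        # the principal axis of the (delta_1, delta_2) scatter.
        x, y = delta_values(p_year1), delta_values(p_year2)
        sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
        sxy = np.cov(x, y, ddof=1)[0, 1]
        # Principal-axis (major-axis) slope and intercept
        b = (sy2 - sx2 + np.sqrt((sy2 - sx2) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
        a = y.mean() - b * x.mean()
        # Perpendicular distance of each item's point from the line y = b*x + a
        dist = np.abs(b * x - y + a) / np.sqrt(b ** 2 + 1.0)
        return np.where(dist > threshold)[0]

Items returned by such a routine would then be inspected judgmentally before being demoted from the common-item pool, as the abstract describes.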
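For the Mantel-Haenszel analysis of a dichotomous common item, here is a hedged sketch under the assumption that examinees from the two annual cohorts (reference and focal groups) have been matched on total-score strata; the constant -2.35 re-expresses the pooled log-odds ratio on the ETS delta scale (MH D-DIF), the effect-size metric behind the negligible/intermediate/large flagging scheme mentioned in the abstract. The function name and input layout are illustrative.

    import numpy as np

    def mh_ddif(n_right_ref, n_wrong_ref, n_right_foc, n_wrong_foc):
        # Inputs are per-stratum counts (one entry per matched score stratum k).
        # Mantel-Haenszel common odds ratio pooled over strata:
        #   alpha_MH = sum_k(A_k * D_k / N_k) / sum_k(B_k * C_k / N_k)
        # where A/B = reference-group right/wrong, C/D = focal-group right/wrong.
        A = np.asarray(n_right_ref, dtype=float)
        B = np.asarray(n_wrong_ref, dtype=float)
        C = np.asarray(n_right_foc, dtype=float)
        D = np.asarray(n_wrong_foc, dtype=float)
        N = A + B + C + D
        alpha_mh = np.sum(A * D / N) / np.sum(B * C / N)
        # ETS delta-scale effect size (MH D-DIF)
        return -2.35 * np.log(alpha_mh)

Under the ETS scheme the abstract refers to, absolute MH D-DIF values below 1.0 are commonly treated as negligible ("A") and values of 1.5 or more, when statistically significant, as large ("C"), with intermediate ("B") in between.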
dc.publisherNational Center for Research on Evaluation, Standards, and Student Testing (CRESST)en
dc.sourceNational Center for Research on Evaluation, Standards, and Student Testing (CRESST)en
dc.subjectEquated scoresen
dc.subjectTest itemsen
dc.subjectItem response theoryen
dc.subjectEvaluation methodsen
dc.subjectDifficulty levelen
dc.subjectContext effecten
dc.subjectComputationen
dc.subjectStudent evaluationen
dc.subjectMantel-Haenszel procedureen
dc.titleEffects of Misbehaving Common Items on Aggregate Scores and an Application of the Mantel-Haenszel Statistic in Test Equating.en
dc.typeinfo:eu-repo/semantics/report
dc.description.volume688
dc.author.facultyΣχολή Κοινωνικών Επιστημών και Επιστημών Αγωγής / Faculty of Social Sciences and Education
dc.author.departmentΤμήμα Ψυχολογίας / Department of Psychology
dc.type.uhtypeReporten
dc.description.notesAccession Number: ED492876; Sponsoring Agency: Office of Educational Research and Improvement (ED), Washington, DC.; Acquisition Information: National Center for Research on Evaluation, Standards, and Student Testing (CRESST). 300 Charles E Young Drive N, GSE&IS Building 3rd Floor, Mailbox 951522, Los Angeles, CA 90095-1522. Tel: 310-206-1532; Fax: 310-825-3883; Web site: http://www.cresst.org.; Reference Count: 74; Journal Code: JAN2017; Level of Availability: Available online; Publication Type: Reports - Research; Entry Date: 2006en
dc.contributor.orcidMichaelides, Michalis P. [0000-0001-6314-3680]
dc.gnosis.orcid0000-0001-6314-3680


Files in this item


There are no files associated with this item.
