Document Type: Research Paper

Authors

University of Tehran

Abstract

When constructing a test, an initial decision is choosing an appropriate item response format, which can be classified as either selected response or constructed response. In large-scale tests where time and cost are of concern, the use of selected-response items, commonly known as multiple-choice items, is quite widespread. This study investigated the impact of response format on performance on tests of language structure. A concurrent common-item equating design was used to compare multiple-choice items with their stem-equivalent constructed-response counterparts in a test of grammar. The Rasch model was employed to compare the item difficulties, fit statistics, ability estimates, and reliabilities of the two tests. Two independent-samples t-tests were also conducted to investigate whether the differences between the item difficulty estimates and the ability estimates of the two tests were statistically significant. A statistically significant difference was observed in item difficulties. However, no significant difference was detected between the ability estimates, fit statistics, or reliabilities of the two tests.
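For reference, the item difficulty and ability estimates mentioned above follow from the dichotomous Rasch model; a minimal sketch of that model, with the conventional notation of person ability and item difficulty assumed here (theta_n and b_i, both in logits), is:

\[
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
\]

where \(X_{ni}\) is the scored response of person \(n\) to item \(i\), \(\theta_n\) is the person's ability, and \(b_i\) is the item's difficulty. Under this model, comparing the difficulty estimates of the multiple-choice items with those of their stem-equivalent constructed-response counterparts is what the common-item equating design described above makes possible.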

Keywords
