The Impact of Raters’ and Test Takers’ Gender on Oral Proficiency Assessment: A Case of Multifaceted Rasch Analysis

Document Type: Research Paper


1 Islamic Azad University, Science and Research Branch

2 Assistant Professor, Islamic Azad University, Central Tehran Branch


The application of Multifaceted Rasch Measurement (MFRM) to rating test takers’ oral language proficiency has been investigated in several previous studies (e.g., Winke, Gass, & Myford, 2012). However, little research has documented the effect of test takers’ gender on their oral performance, and few studies have examined the impact of raters’ gender on the scores awarded to male and female test takers. This study aimed to address that gap. Twenty English as a Foreign Language (EFL) teachers rated the oral performances of 300 test takers. The results showed that test takers’ gender played no significant role in their performance, whether they were rated by raters of the same or the opposite gender. The findings also indicated that male and female raters did not exhibit bias when rating test takers of the same or opposite gender. Moreover, no significant difference was observed between male and female raters’ biases towards the rating scale categories. Overall, male and female raters assigned fairly similar scores, providing no evidence that raters of either gender should be excluded from the rating process. The findings imply that rater and test-taker gender need not be a concern in achieving valid and reliable oral proficiency assessment.
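For context, the many-facet Rasch model underlying MFRM analyses of this kind (Linacre, 1989) can be written, in its rating-scale form, as a log-odds model with a separate parameter for each facet. The symbols below follow common conventions in the MFRM literature rather than this study’s own notation:

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k$$

where $P_{nijk}$ is the probability that test taker $n$ receives category $k$ from rater $j$ on criterion $i$, $\theta_n$ is the test taker’s ability, $\delta_i$ the criterion’s difficulty, $\alpha_j$ the rater’s severity, and $\tau_k$ the difficulty of category $k$ relative to $k-1$. Gender-related bias is typically examined by testing interaction terms (e.g., rater gender × test-taker gender) for significant departures from this main-effects model.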


Ahmadi, A., & Sadeghi, E. (2016). Assessing English language learners’ oral performance: A comparison of monologue, interview, and group oral test. Language Assessment Quarterly, 13(4), 341-358.
Aryadoust, V. (2016). Gender and academic major bias in peer assessment of oral presentations. Language Assessment Quarterly, 13(1), 1-24.
Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99-115.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Baleghizadeh, S., & Gordani, Y. (2012). Core units of spoken grammar in global ELT textbooks. Issues in Language Teaching, 1(1), 33-58.
Barkaoui, K. (2011). Think-aloud protocols in research on essay rating: An empirical study on their veridicality and reactivity. Language Testing, 28(1), 51-75.
Barrett, S. (2001). The impact of training on rater variability. International Education Journal, 2(1), 49-58.
Brown, A. (2005). Interviewer variability in oral proficiency interviews. Frankfurt: Peter Lang Pub Inc.
Buckingham, A. (1997). Oral language testing: do the age, status and gender of the interlocutor make a difference? Unpublished MA dissertation, University of Reading.
Caban, H. L. (2003). Rater group bias in speaking assessment of four L1 Japanese ESL students. Second Language Studies, 21(1), 1-44.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31-51.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221.
Eckes, T. (2015). Introduction to many-facet Rasch measurement. Frankfurt: Peter Lang Edition.
ETS (2001). ETS Oral Proficiency Testing Manual. Princeton, NJ: Educational Testing Service.
Fall, T., Adair-Hauck, B., & Glisan, E. (2007). Assessing students’ oral proficiency: A case for online testing. Foreign Language Annals, 40(3), 377-406.
Fulcher, G., Davidson, F., & Kamp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.
Hughes, R. (2011). Teaching and researching speaking (2nd ed.). London: Pearson Education Limited.
Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104(1), 53-69.
In’nami, Y., & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341-366.
Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543-560.
Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press.
Lumley, T., & O’Sullivan, B. (2005). The effect of test-taker gender, audience and topic on task performance in tape-mediated assessment of speaking. Language Testing, 22(4), 415-437.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Maria-Ducasse, A., & Brown, A. (2009). Assessing paired orals: Raters’ orientation to interaction. Language Testing, 26(3), 423-443.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
McNamara, T. F., & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing, 14(2), 140-156.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement. Journal of Applied Measurement, 5(2), 189-227.
Nakatsuhara, F. (2011). Effect of test-taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483-508.
O’Loughlin, K. (2002). The impact of gender in oral proficiency testing. Language Testing, 19(2), 169-192.
O’Sullivan, B. (2000). Exploring gender and oral proficiency interview performance. System, 28(3), 373-386.
O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair task performance. Language Testing, 19(3), 277-295.
Porter, D. (1991). Affective factors in the assessment of oral interaction: Gender and status. In S. Anivan (Ed.), Current developments in language testing (pp. 99-102). Singapore: SEAMEO RELC.
Porter, D., & Shen, S. (1991). Sex, status and style in the interview. The Dolphin, 21(2), 117-128.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment: Reporting a score profile and a composite. Language Testing, 24(3), 355-390.
Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465-493.
Sunderland, J. (1995). Gender and language testing. Language Testing Update, 17(1), 24-35.
Van Moere, A. (2012). A psycholinguistics approach to oral language assessment. Language Testing, 29(3), 325-344.
Winke, P., Gass, S., & Myford, C. (2012). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231-252.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 369-386.
Xi, X., & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test. Language Learning, 61(4), 1222-1255.