A collection of resources on generative AI (such as ChatGPT) and its applications in producing language test items
Resources by Olena Rossi
Rossi, O., & Montcada, J. M. (2025). Generating reading test items with AI: Targeting higher-order thinking skills. Presentation delivered at the annual EALTA conference. May 2025, Salzburg. Download slides
Rossi, O., & Montcada, J. M. (2025). Using ChatGPT to generate True/False reading comprehension items: Recommendations for practice. Presentation delivered at the EALTA AI SIG online conference. March 2025. Download slides Watch presentation
Rossi, O. (2024). Automated item generation: An item writer perspective. Talk delivered at the BAAL TEASIG webinar. December 2024, online. Download slides
Rossi, O. (2024). Using ChatGPT to generate tasks for EAP reading and listening assessments. Workshop delivered as part of the BALEAP Assessment Roadshow. July 2024, online. Download slides Watch workshop Part 1 Watch workshop Part 2
Rossi, O. (2024). Item writing with generative AI: Current issues and future directions. Presentation delivered at the inaugural meeting of the EALTA SIG Artificial Intelligence for Language Assessment. June 2024, Belfast. Download slides
Rossi, O. (2024). Assessment of language through AI: Opportunities, challenges, and future directions. Plenary talk delivered at the 2024 International Conference Language Education 4.0: A Paradigm Shift towards Action-Oriented Approach, Artificial Intelligence Integration and Beyond. June 2024, Ankara. Download slides
Rossi, O. (2023). Using technology to write language test items. Talk delivered at the IATEFL TEASIG online conference, Developing Assessment Tasks for the Classroom. September 2023. Download slides
Rossi, O. (2023). Using AI for test item generation: Opportunities and challenges. Webinar delivered as part of the EALTA Webinar series. May 2023. Download workshop slides Watch webinar
Review studies
Circi, R., Hicks, J., & Sikali, E. (2023). Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education, 8:858273. https://doi.org/10.3389/feduc.2023.858273
This mini review summarizes the current state of knowledge about automatic item generation (AIG) in the context of educational assessment and discusses key points in the item generation pipeline. Assessment is critical in all learning systems, and digitalized assessments have shown significant growth over the last decade. This leads to an urgent need to generate more items in a fast and efficient manner. Continuous improvements in computational power and advancements in methodological approaches, specifically in the field of natural language processing, provide new opportunities as well as new challenges in automatic generation of items for educational assessment. This mini review asserts the need for more work across a wide variety of areas for the scaled implementation of AIG.
Song, Y., Du, J., & Zheng, Q. (2025). Automatic item generation for educational assessments: A systematic literature review. Interactive Learning Environments. https://doi.org/10.1080/10494820.2025.2482588
This study reviewed automatic item generation (AIG) applications for educational assessments from 2010 to 2024. The analysis included 71 articles and focused on examining types of generated items and assessments, technical approaches, and evaluation models. The results showed that most generated items related to multiple choice questions, and the generated assessments were mainly about computer and medical sciences at college and vocational levels. The technical approaches were classified into four categories: feature engineering, architecture engineering, objective engineering, and prompt engineering. The models employed for evaluation were defined as manual annotation, man-machine collaborative evaluation, item analysis, Turing test, and value-added models. These findings provided knowledge and understanding to researchers and practitioners, showing the significance of expanding research focus, maintaining the theoretical foundation about educational assessments, and enhancing evaluation evidence for future AIG research.
Tan, B., Armoush, N., Mazzullo, E., Bulut, O., & Gierl, M. (2025). A review of automatic item generation techniques leveraging large language models. International Journal of Assessment Tools in Education, 12(2), 317–340. https://doi.org/10.21449/ijate.1602294
This study reviews existing research on the use of large language models (LLMs) for automatic item generation (AIG). We performed a comprehensive literature search across seven research databases, selected studies based on predefined criteria, and summarized 60 relevant studies that employed LLMs in the AIG process. We identified the most commonly used LLMs in current AIG literature, their specific applications in the AIG process, and the characteristics of the generated items. We found that LLMs are flexible and effective in generating various types of items across different languages and subject domains. However, many studies have overlooked the quality of the generated items, indicating a lack of a solid educational foundation. Therefore, we share two suggestions to enhance the educational foundation for leveraging LLMs in AIG, advocating for interdisciplinary collaborations to exploit the utility and potential of LLMs.
Latest research
Alsagoafi, A. A., & Alomran, H. S. (2025). Revolutionizing assessment: Leveraging ChatGPT with EFL teachers. World Journal of English Language, 15(6), 385–401. https://doi.org/10.5430/wjel.v15n6p385
ChatGPT has gained widespread acceptance in many disciplines since its launch at the end of 2022. The impact of ChatGPT on education is evident, but there is a dearth of knowledge on how English as a Foreign Language (EFL) teachers benefit from this technology. Therefore, this study investigates the use of ChatGPT to generate exam questions among EFL educators in Saudi Arabia. Through a mixed-methods approach that included an online questionnaire and an experimental design, the study attempted to gain insights from educators on using artificial intelligence (AI) technology for assessment. An online questionnaire was shared with 200 public school EFL teachers at various grade levels in the Eastern Province of Saudi Arabia. The findings revealed a varied landscape of perspectives, with some educators approving ChatGPT’s efficiency in generating exam questions, whereas others expressed concerns about its limited application. A further examination of the instructor-designed and ChatGPT-generated test items revealed that ChatGPT has the potential to stimulate critical thinking and expand assessment formats. The results indicate that educators require professional development to leverage AI technology responsibly. Furthermore, this study highlights the importance of navigating the emerging ChatGPT in EFL classrooms to ensure reliability and consistency of the evaluation process.
Ma, W. A., Flor, M., & Wang, Z. (2025). Automatic generation of inference-making questions for reading comprehension. arXiv, June 2025. https://doi.org/10.48550/arXiv.2506.08260
Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.
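For readers who want to try the kind of few-shot, chain-of-thought prompting described in the abstract above, the following minimal Python sketch shows one way such a request might be sent to GPT-4o through the OpenAI chat API. It is not the authors' prompts or code: the instructions, the worked example, the passage, and the function name are illustrative placeholders added here for demonstration only.

```python
# Minimal sketch of few-shot, chain-of-thought prompting for bridging-inference item
# generation. This is NOT the prompt or code used by Ma, Flor, & Wang (2025); the
# instructions, example item, and passage below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One worked example shown to the model (the "few-shot" part of the prompt).
FEW_SHOT_EXAMPLE = (
    "Passage: Mia left her umbrella at home. When she arrived at school, her coat was soaked.\n"
    "Question: Why was Mia's coat soaked when she arrived at school?\n"
    "Why this is a bridging inference: the reader must link the forgotten umbrella "
    "to the wet coat via unstated rain."
)


def generate_bridging_item(passage: str, use_cot: bool = True) -> str:
    """Ask the model for one bridging-inference reading comprehension question."""
    instructions = (
        "You write diagnostic reading comprehension items for school-age readers. "
        "Given a passage, write ONE open-ended question that requires a bridging "
        "inference, i.e. the reader must connect information across sentences or "
        "supply an unstated link. Return the question and a brief answer key."
    )
    if use_cot:
        # Chain-of-thought condition: ask the model to reason before writing the item.
        instructions += (
            " First explain, step by step, which pieces of information must be linked; "
            "then write the question."
        )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"Example item:\n{FEW_SHOT_EXAMPLE}"},
            {"role": "user", "content": f"Passage:\n{passage}\n\nWrite the item."},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    sample_passage = (
        "The ground outside the cabin was white, and the children's boots left deep "
        "prints all the way to the frozen pond."
    )
    print(generate_bridging_item(sample_passage))
```

As the abstract notes, output generated this way still needs human review for item quality and for whether the targeted inference type was actually elicited.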
Zhang, T., Erlam, R., & de Magalhães, M. (2025). Exploring the dual impact of AI in post-entry language assessment: Potentials and pitfalls. Annual Review of Applied Linguistics, 1–20. https://doi.org/10.1017/S0267190525000030
This paper explores the complex dynamics of using AI, particularly generative artificial intelligence (GenAI), in post-entry language assessment (PELA) at the tertiary level. Empirical data from trials with Diagnostic English Language Needs Assessment (DELNA), the University of Auckland’s PELA, are presented. The first study examines the capability of GenAI to generate reading text and assessment items that might be suitable for use in DELNA. A trial of this GenAI-generated academic reading assessment on a group of target participants (n = 132) further evaluates its suitability. The second study investigates the use of a fine-tuned GPT-4o model for rating DELNA writing tasks, assessing whether automated writing evaluation (AWE) provides feedback of comparable quality to human raters. Findings indicate that while GenAI shows promise in generating content for reading assessments, expert evaluations reveal a need for refinement in question complexity and targeting specific subskills. In AWE, the fine-tuned GPT-4o model aligns closely with human raters in overall scoring but requires improvement in delivering detailed and actionable feedback. A Strengths, Weaknesses, Opportunities, and Threats analysis highlights AI’s potential to enhance PELA by increasing efficiency, adaptability, and personalization. AI could extend PELA’s scope to areas such as oral skills and dynamic assessment. However, challenges such as academic integrity and data privacy remain critical concerns. The paper proposes a collaborative model integrating human expertise and AI in PELA, emphasizing the irreplaceable value of human judgment. We also emphasize the need to establish clear guidelines for a human-centered AI approach within PELA to maintain ethical standards and uphold assessment integrity.
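As a rough illustration of the automated writing evaluation set-up trialled in the second study above, the sketch below sends an essay and a generic rubric to a chat model and asks for JSON scores plus brief feedback. It is not the DELNA rubric, the authors' prompts, or their fine-tuned GPT-4o model: the rubric text, criteria, and model id are placeholders; a real deployment would substitute a fine-tuned model id and validated band descriptors.

```python
# Illustrative sketch of rubric-based essay scoring with a chat model. This is NOT the
# DELNA rubric or the fine-tuned GPT-4o model used by Zhang, Erlam, & de Magalhães (2025);
# the rubric, criteria, and model id are placeholders.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the essay from 1 (lowest) to 6 (highest) on each criterion:\n"
    "- organisation: logical structure and paragraphing\n"
    "- coherence: clear linking of ideas\n"
    "- language: grammatical accuracy and lexical range"
)


def rate_essay(essay: str, model: str = "gpt-4o") -> dict:
    """Return rubric scores and brief feedback as a Python dict."""
    response = client.chat.completions.create(
        # A fine-tuned model id produced by OpenAI fine-tuning could be passed here instead.
        model=model,
        response_format={"type": "json_object"},
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a trained rater of academic writing. Apply the rubric and reply "
                    'with JSON of the form {"scores": {"criterion": 1}, "feedback": "..."}.'
                ),
            },
            {"role": "user", "content": f"Rubric:\n{RUBRIC}\n\nEssay:\n{essay}"},
        ],
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    print(rate_essay("Universities should prioritise critical thinking because ..."))
```

Scores produced this way would, as the paper argues, still need to be checked against human ratings before any operational use.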
Full bibliography
Alsagoafi, A. A., & Alomran, H. S. (2025). Revolutionizing assessment: Leveraging ChatGPT with EFL teachers. World Journal of English Language, 15(6), 385–401. https://doi.org/10.5430/wjel.v15n6p385
Aryadoust, V., Zakaria, A., & Jia, Y. (2024). Investigating the affordances of OpenAI’s large language model in developing listening assessments. Computers and Education: Artificial Intelligence, 6. https://doi.org/10.1016/j.caeai.2024.100204
Attali, Y., LaFlair, G., & Runge, A. (2023, March 31). A new paradigm for test development [Duolingo webinar series]. Watch webinar
Attali, Y., Runge, A., LaFlair, G.T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. (2022). The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5, 903077. https://doi.org/10.3389/frai.2022.903077
Belzak, W. C. M., Naismith, B., & Burstein, J. (2023). Ensuring fairness of human- and AI-generated test items. In N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, & O. C. Santos (Eds.), Artificial intelligence in education. Communications in Computer and Information Science, 1831. Springer, Cham. https://doi.org/10.1007/978-3-031-36336-8_108
Bezirhan, U., & von Davier, M. (2023). Automated reading passage generation with OpenAI’s large language model. Preprint. https://doi.org/10.48550/arXiv.2304.04616
Bolender, B., Foster, C., & Vispoel, S. (2023). The criticality of implementing principled design when using AI technologies in test development. Language Assessment Quarterly, 20(4-5), 512-519. https://doi.org/10.1080/15434303.2023.2288266
Bulut, O., & Yildirim-Erbasli, S.N. (2022). Automatic story and item generation for reading comprehension assessments with transformers. International Journal of Assessment Tools in Education, 9, pp.72-87. https://doi.org/10.21449/ijate.1124382
Choi, I., & Zu, J. (2022). The impact of using synthetically generated listening stimuli on test-taker performance: A case study with multiple-choice, single-selection items. ETS Research Report Series, 2022(1), 1–14. https://doi.org/10.1002/ets2.12347
Chun, J. Y., & Barley, N. (2024). A comparative analysis of multiple-choice questions: ChatGPT-generated items vs. human-developed items. In C. A. Chapelle, G. H. Beckett, & J. Ranalli (Eds.), Exploring AI in applied linguistics (pp.118-136). Iowa State University Digital Press. https://bit.ly/TSLL23openbook
Circi, R., Hicks, J., & Sikali, E. (2023). Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education, 8:858273. https://doi.org/10.3389/feduc.2023.858273
Dijkstra, R., Genç, Z., Kayal, S., & Kamps, J. (2022). Reading comprehension quiz generation using generative pre-trained transformers. Preprint. https://intextbooks.science.uu.nl/workshop2022/files/itb22_p1_full5439.pdf
Fei, Z., Zhang, Q., Gui, T., Liang, D., Wang, S., Wu, W., & Huang, X. (2022). CQG: A simple and effective controlled generation framework for multi-hop question generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. https://aclanthology.org/2022.acl-long.475
Felice, M., Taslimipoor, S., & Buttery, P. (2022). Constructing open cloze tests using generation and discrimination capabilities of transformers. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 1263–1273, Dublin, Ireland. Association for Computational Linguistics. https://arxiv.org/pdf/2204.07237.pdf
Ghanem, B., Coleman, L.L., Dexter, J. R., von der Ohe, S. M., & Fyshe, A. (2022). Question generation for reading comprehension assessment by modelling how and what to ask. https://doi.org/10.48550/arXiv.2204.02908
Kalpakchi, D., & Boye, J. (2021). BERT-based distractor generation for Swedish reading comprehension questions using a small-scale dataset. Paper presented at the 14th International Conference on Natural Language Generation (INLG 2021). https://arxiv.org/pdf/2108.03973.pdf
Kalpakchi, D., & Boye, J. (2023a). Quasi: A synthetic question-answering dataset in Swedish using GPT-3 and zero-shot learning. In T. Alumäe & M. Fishel (Eds.), Proceedings of the 24th Nordic Conference on Computational Linguistics (pp.477–491). https://aclanthology.org/2023.nodalida-1.48/
Kalpakchi, D., & Boye, J. (2023b). Generation and evaluation of multiple-choice reading comprehension questions for Swedish. https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-329400
Khademi, A. (2023). Can ChatGPT and Bard generate aligned assessment items? A reliability analysis against human performance. Journal of Applied Learning & Teaching, 6(1), pp.75-80. https://doi.org/10.37074/jalt.2023.6.1.28
Liusie, A., Raina, V., & Gales, M. (2023). “World knowledge” in multiple choice reading comprehension. In Proceedings of the Sixth Fact Extraction and VERification Workshop (FEVER). Association for Computational Linguistics. https://aclanthology.org/2023.fever-1.5
Ma, W. A., Flor, M., & Wang, Z. (2025). Automatic generation of inference-making questions for reading comprehension. arXiv, June 2025. https://doi.org/10.48550/arXiv.2506.08260
O’Grady, S. (2023). An AI generated test of pragmatic competence and connected speech. Language Teaching Research Quarterly, 37, 188-203. https://doi.org/10.32038/ltrq.2023.37.10
Raina, V., & Gales, M. (2022). Multiple-choice question generation: Towards an automated assessment framework. https://doi.org/10.48550/arXiv.2209.11830
Raina, V., Liusie, A., & Gales, M. (2023a). Analyzing multiple-choice reading and listening comprehension tests. https://doi.org/10.48550/arXiv.2307.01076
Raina, V., Liusie, A., & Gales, M. (2023b). Assessing distractors in multiple-choice tests. https://doi.org/10.48550/arXiv.2311.04554
Rathod, A., Tu, T., & Stasaski, K. (2022). Educational multi-question generation for reading comprehension. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (pp.216-223). https://aclanthology.org/2022.bea-1.26
Rodriguez-Torrealba, R., Gracia-Lopez, E., & Garcia-Cabot, A. (2022). End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Systems With Applications, 208, 118258. https://doi.org/10.1016/j.eswa.2022.118258
Runge, A., Attali, Y., LaFlair, G. T., Park, Y., & Church, J. (2024). A generative AI-driven interactive listening assessment task. Frontiers in Artificial Intelligence, 7, 1474019. https://doi.org/10.3389/frai.2024.1474019
Sayin, A., & Gierl, M. (2024). Using OpenAI GPT to generate reading comprehension items. Educational Measurement: Issues and Practice, 43(1), 5-18. https://doi.org/10.1111/emip.12590
Shin, I., & Gierl, M. (2022). Generating reading comprehension items using automated processes. International Journal of Testing, 22(3-4), 289-311. https://doi.org/10.1080/15305058.2022.2070755
Shin, D., & Lee, J. H. (2024). AI-powered automated item generation for language testing. ELT Journal, ccae016. https://doi.org/10.1093/elt/ccae016
Song, Y., Du, J., & Zheng, Q. (2025). Automatic item generation for educational assessments: A systematic literature review. Interactive Learning Environments. https://doi.org/10.1080/10494820.2025.2482588
Tan, B., Armoush, N., Mazzullo, E., Bulut, O., & Gierl, M. (2025). A review of automatic item generation techniques leveraging large language models. International Journal of Assessment Tools in Education, 12(2), 317–340. https://doi.org/10.21449/ijate.1602294
Uto, M., Tomikawa, Y., & Suzuki, A. (2023). Difficulty-controllable neural question generation for reading comprehension using item response theory. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (pp.119-129). https://aclanthology.org/2023.bea-1.10
von Davier, A. (2023, February 27). Generative AI for test development [Talk given for the Department of Education, University of Oxford]. Watch presentation
Wang, X., Liu, B., & Wu, L. (2023). SkillQG: Learning to generate question for reading comprehension assessment. https://doi.org/10.48550/arXiv.2305.04737
Yunjiu, L., Wei, W., & Zheng, Y. (2022). Artificial intelligence-generated and human expert-designed vocabulary tests: A comparative study. SAGE Open, 12(1). https://doi.org/10.1177/21582440221082130
Zhang, T., Erlam, R., & de Magalhães, M. (2025). Exploring the dual impact of AI in post-entry language assessment: Potentials and pitfalls. Annual Review of Applied Linguistics, 1–20. https://doi.org/10.1017/S0267190525000030