11:00 - 11:45
Parallel sessions 7
Location: Room 244
The last presenter of the session is kindly asked to moderate the session and help keep time.
Each presenter is invited to use up to 10 min for presentation and up to 5 min for Q&A.
Submission 155
Augment, Don't Automate: An Evidence-Based Workflow for Human-GenAI Collaboration in Exam Question Generation
Presented by: Remco Jongkind
Remco Jongkind, Sarah Otto, Inayah Hodzic, Floor van der Steijle, Tom Broens
Teaching & Learning Centre - Amsterdam UMC - University of Amsterdam

Medical education faces a structural tension between expanding student cohorts and stagnating resources, making the creation of high-quality Multiple-Choice Questions (MCQs) a costly, time-intensive burden. This study investigates the technical capability and ecological validity of using Large Language Models (LLMs) to generate undergraduate medical MCQs.

We implemented a curriculum-aligned GenAI question generator at the University of Amsterdam Faculty of Medicine. Generated items underwent standard expert peer review, yielding a 30% usability rate for high-stakes assessment without modification. We then evaluated a high-stakes physiology exam taken by 721 second-year students, embedding 5 GenAI-authored MCQs alongside 29 human-authored ones.

The results demonstrated a robust psychometric profile for GenAI items:

  • Difficulty: Statistically comparable to human items (mean 0.83 vs. 0.78), falling within optimum reference ranges.

  • Discrimination: GenAI items significantly outperformed human items in stratifying high- and low-performing students (0.392 vs. 0.283, p=0.046).

  • Acceptability: GenAI items received zero post-exam student objections, significantly fewer than human items, which averaged 0.6 objections per question (p=0.044).

  • Detectability: Experts and students correctly identified GenAI items only 45% of the time.
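For readers unfamiliar with the two psychometric indices reported above, the following is a minimal sketch of how they are commonly computed from scored responses (1 = correct, 0 = incorrect). It assumes difficulty is the proportion of students answering correctly and discrimination is a corrected item-total (point-biserial) correlation; the abstract does not state which discrimination index the study used, and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Proportion of students who answered the item correctly (the p-value)."""
    return mean(item_scores)


def item_discrimination(item_scores, total_scores):
    """Correlation between the item score and the rest-of-test score.

    Assumption: a corrected item-total correlation, in which the item's own
    score is removed from the total so it does not correlate with itself.
    """
    rest = [total - item for item, total in zip(item_scores, total_scores)]
    m_item, m_rest = mean(item_scores), mean(rest)
    s_item, s_rest = pstdev(item_scores), pstdev(rest)
    if s_item == 0 or s_rest == 0:
        return float("nan")  # undefined when everyone scores the same
    cov = mean((i - m_item) * (r - m_rest) for i, r in zip(item_scores, rest))
    return cov / (s_item * s_rest)


# Toy data (hypothetical): 6 students x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
first_item = [row[0] for row in responses]

difficulty = item_difficulty(first_item)
discrimination = item_discrimination(first_item, totals)
```

A higher difficulty value means an easier item, and a higher discrimination value means the item separates stronger from weaker students more clearly.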

While the 30% usability rate indicates the tool is not (yet) safe for unmediated student self-assessment due to hallucination risks, it offers immense value for faculty. GenAI overcomes drafting fatigue by generating high-volume, creative content. To operationalize this, we propose a "Hybrid Assessment Model", a Human-in-the-Loop workflow that shifts faculty roles from primary writers to critical editors.
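One way to picture the editor-in-the-loop idea is as a gate that keeps every generated item out of an exam until a named faculty reviewer has accepted or edited it. The sketch below is illustrative only: the abstract does not describe how the Hybrid Assessment Model is implemented, so all class, field, and function names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    DRAFT = "draft"        # raw LLM output, not yet reviewed
    ACCEPTED = "accepted"  # usable without modification
    EDITED = "edited"      # usable after faculty revision
    REJECTED = "rejected"  # discarded, e.g. hallucinated content


@dataclass
class Item:
    stem: str
    status: Status = Status.DRAFT
    reviewer: Optional[str] = None


def review(item, reviewer, decision, revised_stem=None):
    """Record a faculty decision; an edit must supply the revised text."""
    if decision is Status.DRAFT:
        raise ValueError("a review must end in accept, edit, or reject")
    if decision is Status.EDITED:
        if not revised_stem:
            raise ValueError("an edited item needs a revised stem")
        item.stem = revised_stem
    item.status = decision
    item.reviewer = reviewer
    return item


def exam_ready(items):
    """Only items a human has accepted or edited may reach students."""
    return [i for i in items if i.status in (Status.ACCEPTED, Status.EDITED)]


# Hypothetical usage with placeholder stems.
drafts = [Item("Draft stem A"), Item("Draft stem B"), Item("Draft stem C")]
review(drafts[0], "reviewer-1", Status.ACCEPTED)
review(drafts[1], "reviewer-1", Status.EDITED, revised_stem="Revised stem B")
usable = exam_ready(drafts)
```

Because unreviewed drafts are never returned by `exam_ready`, the generator's output cannot reach students without passing through a faculty editor.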

This presentation provides a concrete blueprint for human-AI collaboration, equipping attendees with MCQ prompt engineering frameworks, empirical psychometric evidence, workflow design principles, and an evidence-based question evaluation checklist.