Medical education faces a structural tension between expanding student cohorts and stagnating resources, making the creation of high-quality Multiple-Choice Questions (MCQs) a costly, time-intensive burden. This study investigates the technical capability and ecological validity of using Large Language Models (LLMs) to generate undergraduate medical MCQs.
We implemented a curriculum-aligned GenAI question generator at the University of Amsterdam Faculty of Medicine. Generated items underwent standard expert peer review, yielding a 30% usability rate for high-stakes assessment without modification. We then evaluated a high-stakes physiology exam taken by 721 second-year students, embedding 5 GenAI-authored MCQs alongside 29 human-authored ones.
The results demonstrated a robust psychometric profile for GenAI items:
Difficulty: Statistically comparable to human items (mean 0.83 vs. 0.78), falling within optimal reference ranges.
Discrimination: GenAI items significantly outperformed human items in stratifying high- and low-performing students (0.392 vs. 0.283, p=0.046).
Acceptability: GenAI items received zero post-exam student objections, significantly fewer than human items, which averaged 0.6 objections per question (p=0.044).
Detectability: Experts and students correctly identified GenAI items only 45% of the time.
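For readers less familiar with item analysis, the following is a minimal Python sketch of how the difficulty and discrimination indices reported above are commonly computed. It assumes that difficulty is the proportion of students answering an item correctly and that discrimination is the item-rest correlation; the abstract does not state which discrimination index was used, so this choice is an assumption. The function names and the toy response matrix are hypothetical illustrations, not study data.

```python
from math import sqrt


def item_difficulty(responses, item):
    """Proportion of students who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def pearson(x, y):
    """Plain Pearson correlation; returns NaN if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def item_rest_correlation(responses, item):
    """Correlate the item score with the total score on all other items.

    Assumption: discrimination is measured as item-rest correlation.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Toy data (hypothetical): rows are students, columns are items, 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

difficulty_first_item = item_difficulty(responses, 0)
discrimination_first_item = item_rest_correlation(responses, 0)
print(f"difficulty={difficulty_first_item:.2f}")
print(f"discrimination={discrimination_first_item:.3f}")
```

Under these assumptions, a higher difficulty value means an easier item, and a higher item-rest correlation means the item separates stronger from weaker students more clearly.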
While the 30% usability rate indicates that the tool is not yet safe for unmediated student self-assessment because of hallucination risks, it offers substantial value to faculty. GenAI reduces drafting fatigue by producing a high volume of creative draft items. To operationalize this, we propose a "Hybrid Assessment Model": a Human-in-the-Loop workflow that shifts faculty roles from primary writers to critical editors.
This presentation provides a concrete blueprint for human-AI collaboration, equipping attendees with MCQ prompt engineering frameworks, empirical psychometric evidence, workflow design principles, and an evidence-based question evaluation checklist.