A Novel Ensemble Method for Automated Short Answer Grading Based on Continuous Response IRT

Tue-02

Presented by: Tuo Liu

Tuo Liu*, Saba Mateen

Background, Objectives, and Research question: Automated Short Answer Grading (ASAG) is a field concerned with evaluating short answers written by students using machine learning techniques. As machine learning has developed, several ASAG approaches have been proposed. According to prior surveys, an ASAG approach can be divided into two parts: a language representation and a learning algorithm. The choice of these two components determines the theoretical underpinnings of an approach and how it uses the information in short answers. Ensemble methods that combine multiple approaches can leverage their respective strengths and may achieve better predictive performance than any individual approach alone. This study presents a novel ensemble method for integrating different ASAG approaches. It is based on the Continuous Response Item Response Theory (IRT) model, which is commonly used in psychometrics to combine the ratings of multiple human experts. Using the openly accessible ASAP-SAS dataset for validation, the performance of the new ensemble method will be compared with that of existing ensemble methods and individual approaches.

Method: The framework uses the IRT model to model the class probabilities predicted by each ASAG approach, taking into account the characteristics of each individual approach. The latent score of the IRT model is then used as the ensemble probability for final grading. The new method will be compared with three commonly used ensemble methods: majority voting, distribution summation, and the Generalized Many Facet Rasch Model (g-MFRM). The performance of all methods will be evaluated using two metrics: the Brier Score and Quadratic Kappa.
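The two simpler baseline ensembles and the two metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tie-breaking rule, and the multiclass (one-hot) form of the Brier Score are assumptions made for illustration, and Quadratic Kappa is taken here to mean quadratic weighted kappa. The IRT-based ensemble and g-MFRM are not shown.

```python
from collections import Counter

def majority_vote(prob_rows):
    """Each approach votes for its most probable class.
    Assumption: ties are broken in favor of the lowest class index."""
    votes = [max(range(len(p)), key=lambda k: p[k]) for p in prob_rows]
    counts = Counter(votes)
    top = max(counts.values())
    return min(k for k, c in counts.items() if c == top)

def distribution_summation(prob_rows):
    """Sum the class probabilities over approaches, then renormalize."""
    n_classes = len(prob_rows[0])
    totals = [sum(p[k] for p in prob_rows) for k in range(n_classes)]
    z = sum(totals)
    return [t / z for t in totals]

def brier_score(probs, labels):
    """Multiclass Brier score: mean squared distance between the predicted
    distribution and the one-hot true label. Lower is better."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p[k] - (1.0 if k == y else 0.0)) ** 2
                     for k in range(len(p)))
    return total / len(labels)

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Agreement between two integer score vectors with quadratic penalties.
    Higher is better; undefined (division by zero) if every score in both
    vectors falls in one and the same class."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes))
           for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]          # observed disagreement
            den += w * row[i] * col[j] / n     # disagreement expected by chance
    return 1.0 - num / den

# Hypothetical predictions of three approaches for one answer (scores 0, 1, 2).
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.5, 0.4]]
voted = majority_vote(probs)              # class chosen by most approaches
summed = distribution_summation(probs)    # pooled class distribution
```

Majority voting discards each approach's confidence, whereas distribution summation keeps it but weights every approach equally; neither models differences between approaches, which is the gap the IRT-based ensemble is meant to address.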

Selected individual approaches in ASAG: In this study, we aim to incorporate all possible combinations of several commonly used language representations and learning algorithms. The language representations we plan to consider are the Bag-of-Words model (n-gram), Distributed Representations (neural network-based approaches such as NNLM and RNNLM), and Pre-trained Models (word2vec, GloVe, BERT). The learning algorithms to be included are Logistic Regression, Random Forest, and Multi-layer Perceptron (MLP). Additionally, pre-trained models like BERT can be extended with additional layers and fine-tuned, combining the language representation and the learning algorithm into one model. This hybrid approach will be considered as a separate category.
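The resulting pool of individual approaches can be enumerated as a simple cross product. The sketch below is only an illustration: the identifiers are hypothetical, and treating each named model (for example NNLM and RNNLM) as its own representation is an assumption about how the combinations would be counted.

```python
from itertools import product

# Language representations and learning algorithms named in the abstract.
REPRESENTATIONS = ["n-gram", "NNLM", "RNNLM", "word2vec", "GloVe", "BERT"]
LEARNERS = ["LogisticRegression", "RandomForest", "MLP"]

# Hybrid category: a pre-trained model with added layers, fine-tuned end to end.
HYBRID = [("BERT", "fine-tuned")]

def candidate_approaches():
    """Return every (representation, learner) pair plus the hybrid category."""
    return list(product(REPRESENTATIONS, LEARNERS)) + HYBRID

approaches = candidate_approaches()
```

Under these assumptions the ensemble would combine 19 approaches: 18 representation-learner pairs and one fine-tuned hybrid.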

Expected results and implications: We expect the new ensemble method to outperform the other ensemble methods and the individual approaches, as indicated by a lower Brier Score and a higher Quadratic Kappa. Theoretically, this research attempts to combine knowledge from psychometrics and machine learning. The results are expected to provide insights into how different language representations and learning algorithms can best be integrated to improve predictive performance in ASAG.