11:20 - 13:00
P7-S186
Room: 1A.12
Chair/s: Francisco Tomás-Valiente
Discussant/s: Daniel Weitzel
Uncertain performance: How to quantify uncertainty and draw test sets when evaluating classifiers
P7-S186-2
Presented by: Francisco Tomás-Valiente
Francisco Tomás-Valiente, ETH Zürich
Supervised machine learning has become increasingly popular in social science research, often employing classifiers to measure political constructs of interest. While evaluating classifiers on a test set is critical for establishing measurement validity, guidance on how to conduct such evaluation is limited. In particular, researchers have paid insufficient attention to the variance of performance metrics estimated on test sets. This paper demonstrates that typically sized test sets yield highly noisy and often uninformative estimates of classifier performance, particularly for imbalanced tasks, where one outcome class is rare, as is common in social science applications such as identifying hateful tweets or political speeches containing emotional language. To address this issue, the paper introduces simple methods to quantify uncertainty in performance estimates, based on their sampling distribution and validated with simulations. It also introduces to political science a design-based approach that enhances statistical precision in evaluation tasks through efficient stratified sampling based on the outcome's predicted probability. Simulations show that this strategy meaningfully improves the precision of performance estimates, especially in imbalanced tasks, while ensuring they remain unbiased. Finally, the paper illustrates how to calculate the test set size required to achieve a desired level of statistical precision. Accompanying statistical software enables researchers to easily implement these methods, estimate uncertainty around performance metrics, implement optimal stratified sampling, and conduct power calculations for test set design.
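The abstract does not include code; as a rough illustration of the kind of workflow it describes (uncertainty around a test-set performance metric, a test set stratified on predicted probability with inverse-inclusion-probability weights, and a back-of-the-envelope sample-size calculation), the sketch below uses simulated data and standard normal approximations. All variable names, thresholds, and stratification choices are illustrative assumptions, not the paper's method or its accompanying software.

```python
# Illustrative sketch only (not the paper's software): uncertainty for a
# test-set recall estimate, probability-stratified test-set sampling, and a
# rough sample-size calculation, all under normal approximations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated corpus: rare positive class (~3%) with toy classifier scores.
N = 100_000
y_true = rng.binomial(1, 0.03, size=N)
scores = np.clip(0.6 * y_true + rng.normal(0.2, 0.15, N), 0, 1)
y_pred = (scores > 0.5).astype(int)

# 1. Uncertainty of recall on a simple random test set.
def recall_ci(y_t, y_p, level=0.95):
    """Normal-approximation CI for recall, treated as a proportion among
    the true positives that happen to land in the test set."""
    pos = y_t == 1
    n_pos = pos.sum()
    if n_pos == 0:
        return np.nan, (np.nan, np.nan)
    r = y_p[pos].mean()
    se = np.sqrt(r * (1 - r) / n_pos)
    z = stats.norm.ppf(0.5 + level / 2)
    return r, (r - z * se, r + z * se)

srs_idx = rng.choice(N, size=1_000, replace=False)   # a "typical" test-set size
print("SRS recall (point, CI):", recall_ci(y_true[srs_idx], y_pred[srs_idx]))

# 2. Stratified test set: oversample documents with high predicted probability.
strata = np.digitize(scores, bins=[0.5])             # 0 = low score, 1 = high score
n_per_stratum = {0: 500, 1: 500}
idx_parts, w_parts = [], []
for s, n_s in n_per_stratum.items():
    pool = np.flatnonzero(strata == s)
    take = rng.choice(pool, size=min(n_s, pool.size), replace=False)
    idx_parts.append(take)
    w_parts.append(np.full(take.size, pool.size / take.size))  # inverse inclusion prob.
idx, weights = np.concatenate(idx_parts), np.concatenate(w_parts)

# Weighted (Horvitz-Thompson-style) recall estimate over the stratified sample.
tp = weights[(y_true[idx] == 1) & (y_pred[idx] == 1)].sum()
fn = weights[(y_true[idx] == 1) & (y_pred[idx] == 0)].sum()
print("Stratified recall estimate:", tp / (tp + fn))

# 3. Rough number of labeled positives needed for a target CI half-width.
def n_for_halfwidth(p_guess, halfwidth, level=0.95):
    z = stats.norm.ppf(0.5 + level / 2)
    return int(np.ceil(p_guess * (1 - p_guess) * (z / halfwidth) ** 2))

print("Positives needed for ±0.05 around recall ≈ 0.8:", n_for_halfwidth(0.8, 0.05))
```

With a rare positive class, the simple random sample contains few positives, so the recall interval is wide; the stratified design concentrates labeling effort where positives are likely while the weights keep the estimate unbiased, which is the intuition the abstract describes.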
Keywords: machine learning, validation, evaluation, sampling