When can we ignore missing values? A Test of Conditional Independence in Missing-Data Analysis
P1-S14-3
Presented by: Thomas Robinson
Discarding incomplete rows, the conventional approach to handling missing data, produces unbiased regression estimates when the outcome variable is independent of the missingness pattern, conditional on the regressors. This situation is critical but often overlooked. We call it conditional independence in missing-data analysis (CIMDA). We propose a general test of CIMDA that uses machine learning algorithms to identify complex patterns of missingness and relationships between variables accurately and efficiently. Our test compares the generalization error of supervised learning models trained on samples that include the outcome variable with that of models trained on samples that exclude it. Large differences in model performance indicate that CIMDA is unlikely to hold, and therefore that omitting incomplete rows carries a serious risk of biasing regression estimates. We validate the test through a series of experimental studies involving simulated as well as real social science data. Applications to regression analyses from recently published studies illustrate both the inferential pitfalls of potential CIMDA violations and the utility of our test in guiding analysts toward appropriate missing-data strategies. We make available easy-to-use software for implementing the test.
Keywords: Missing data, multiple imputation, machine learning, econometrics, inference
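
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' software, and every name in it (fit_logistic, heldout_loss_gap, simulate) is hypothetical. It assumes that the prediction target is a binary missingness indicator m, that a single fully observed regressor x is available, that plain logistic regression stands in for the machine learning model, and that one held-out split stands in for a full estimate of generalization error. Under CIMDA, adding the outcome y to the features should not improve held-out prediction of m.

```python
import math
import random


def predict(w, row):
    """Predicted probability that m = 1 under logistic weights w (bias first)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent (stand-in learner)."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, m in zip(rows, labels):
            err = predict(w, row) - m
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def log_loss(w, rows, labels):
    """Mean negative log-likelihood of the labels under the fitted model."""
    eps = 1e-12
    total = 0.0
    for row, m in zip(rows, labels):
        p = min(max(predict(w, row), eps), 1.0 - eps)
        total -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
    return total / len(rows)


def heldout_loss_gap(x, y, m, train_frac=0.7):
    """Held-out loss using x only, minus held-out loss using x and y.

    A gap near zero is consistent with CIMDA; a large positive gap means
    the outcome helps predict missingness, which argues against CIMDA.
    """
    cut = int(train_frac * len(m))
    feature_sets = ([[xi] for xi in x], [[xi, yi] for xi, yi in zip(x, y)])
    losses = []
    for feats in feature_sets:
        w = fit_logistic(feats[:cut], m[:cut])
        losses.append(log_loss(w, feats[cut:], m[cut:]))
    return losses[0] - losses[1]


def simulate(depends_on_y, n=1000, seed=0):
    """Toy data (an assumption for illustration): m depends on y or only on x."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]
    driver = y if depends_on_y else x
    m = [1 if rng.random() < 1 / (1 + math.exp(-2 * d)) else 0 for d in driver]
    return x, y, m


if __name__ == "__main__":
    print("gap when CIMDA holds:   ", heldout_loss_gap(*simulate(False)))
    print("gap when CIMDA violated:", heldout_loss_gap(*simulate(True)))
```

In this toy setup the gap should sit close to zero when missingness depends only on x and be clearly positive when it depends on y. How large a gap counts as evidence against CIMDA is a calibration question that the sketch does not address.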