Systematic data cleaning using R
Over the last decades there have been several attempts to set up frameworks for statistical data processing and statistical data cleaning. One of the key notions is that a data cleaning procedure can be decomposed into a sequence of fundamental steps, where each step is controlled by external information defined by experts. In this model, an imperfect data set is the input for a processing step. The processing step is generally parameterized by two types of metadata. First, a set of validation rules describes the desired ultimate state of the data set. Second, there are parameters that control the details of the process. For example, if the processing step concerns an imputation procedure, an imputation model specification may enter as a parameter. The process then yields an improved data set while keeping a log of its activities that can be used for monitoring. In this presentation we demonstrate how a set of tools built in R can be flexibly combined to follow precisely this model.
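The model described above can be illustrated with a minimal base R sketch. It does not show the tools demonstrated in the presentation; the function names (`count_violations`, `impute_step`), the example data, the rules, and the parameter list are all hypothetical and chosen only for illustration. The step takes a data set, a set of validation rules, and process parameters, and returns the improved data set together with a log.

```r
# Hypothetical imperfect input data (one missing income value).
data <- data.frame(age = c(25, 40, 31, 58),
                   income = c(2000, NA, 3000, 4000))

# Metadata 1: validation rules describing the desired state of the data.
# Each rule is an expression that should evaluate to TRUE for every record.
rules <- list(income_present = quote(!is.na(income)),
              income_nonneg  = quote(income >= 0))

# Metadata 2: parameters controlling the details of the process.
params <- list(variable = "income", method = "median")

# Count records that do not demonstrably satisfy each rule
# (a result of FALSE or NA is counted as a violation).
count_violations <- function(data, rules) {
  vapply(rules, function(r) sum(!(eval(r, data) %in% TRUE)), integer(1))
}

# One processing step: impute missing values, and log what happened.
impute_step <- function(data, rules, params) {
  before <- count_violations(data, rules)
  v <- params$variable
  missing <- is.na(data[[v]])
  fill <- switch(params$method,
                 mean   = mean(data[[v]], na.rm = TRUE),
                 median = stats::median(data[[v]], na.rm = TRUE),
                 stop("unknown imputation method"))
  out <- data
  out[[v]][missing] <- fill
  after <- count_violations(out, rules)
  log <- data.frame(rule = names(rules),
                    violations_before = before,
                    violations_after = after,
                    row.names = NULL)
  list(data = out, log = log, cells_changed = sum(missing))
}

result <- impute_step(data, rules, params)
print(result$data)
print(result$log)
```

Because the rules and parameters are passed in as arguments rather than hard-coded, the same step could be rerun with different expert-defined metadata, and the returned log could be compared across steps for monitoring.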
Reference:
IPS04-003
Session:
Using R in the Statistical Institutes
Presenter/s:
Mark van der Loo
Presentation type:
Oral presentation
Room:
GASP
Chair:
Alexander KOWARIK, Statistics Austria, Austria
Date:
Wednesday, 13 March
Time:
10:00 - 11:00
Session times:
10:00 - 11:00