Systematic data cleaning using R

Mark van der Loo, Edwin de Jonge

Statistics Netherlands (CBS), Den Haag

Over the last decades there have been several attempts to set up frameworks for statistical data processing and statistical data cleaning. One of the key notions is that a data cleaning procedure can be decomposed into a sequence of fundamental steps, where each step is controlled by external information defined by experts. In this model, an imperfect data set is the input for a processing step. The processing step is generally parameterized by two types of metadata. First, a set of validation rules describes the desired ultimate state of the data set. Second, there are parameters that control the details of the process. For example, if the processing step concerns an imputation procedure, an imputation model specification may enter as a parameter. The process then yields an improved data set while keeping a log of its activities that can be used for monitoring. In this presentation we demonstrate how a set of tools built in R can be flexibly combined to follow precisely this model.


Reference:

IPS04-003

Session:

Using R in the Statistical Institutes

Presenter/s:

Mark van der Loo

Presentation type:

Oral presentation

Room:

GASP

Chair:

Alexander KOWARIK, Statistics Austria, Austria

Date:

Wednesday, 13 March

Time:

10:00 - 11:00

Session times:

10:00 - 11:00