Statistical learning in official statistics: the case of statistical matching

Marcello D'Orazio

Office of the Chief Statistician, Food and Agriculture Organization (FAO) of the United Nations, Rome

National statistical offices face the challenge of modernizing their statistical production processes, moving beyond traditional sample surveys and censuses, so as to exploit all available data from administrative registers and big data. Taking advantage of large data sources requires the adoption of modern statistical methods, such as those based on machine learning. In addition, the availability of different data sources on the same phenomena poses the challenge of integrating them to produce a wider set of statistical outputs and so satisfy users’ requests. This work shows how statistical learning methods can be beneficial in integrating data. Statistical learning (SL) is a relatively recent area of statistics (see e.g. [1] and [2]) that includes a wide set of techniques that “learn from the data”. These techniques have become very popular in marketing, finance, and other domains because they allow the analysis of large data sources with many variables and observations. Many recent methods for classification, regression and clustering fall under the SL umbrella (generalized additive models, classification and regression trees, neural networks, etc.). Integration is at the core of new statistical production processes that aim to provide a richer set of statistical outputs by taking advantage of existing data, thereby avoiding the need to set up new surveys. The focus here is on statistical matching (SM, also known as data fusion), whose objective is to integrate data sources (mainly sample surveys) that lack unit identifiers, in order to investigate the relationship between variables not jointly observed in the same survey (see e.g. [3]). These methods are frequently applied to integrate a survey on household income with one on expenditures to get a thorough picture of people’s well-being [4]. SM draws on a variety of well-known methods originally developed to impute missing values in a dataset (predictive mean matching, hot deck imputation, etc.), adapted to the specific SM setting.
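
To make the idea of adapting hot deck imputation to the SM setting concrete, the following is a minimal sketch, not the author's method. It assumes a recipient survey A that observes shared variables X (plus income, not needed for the matching step) and a donor survey B that observes the same X plus expenditure Z. Each recipient record receives the Z value of its nearest donor on the standardized X variables. The function name, variable names and toy figures are all hypothetical. A fused file built this way implicitly relies on the conditional independence of the two non-jointly-observed variables given X, which cannot be tested on the matched data alone.

```python
import numpy as np


def nearest_neighbour_hotdeck(x_rec, x_don, z_don):
    """Impute Z for each recipient from its closest donor on shared variables X.

    Hypothetical helper for illustration: distance hot deck with
    unconstrained donor reuse (a donor may be used more than once).
    """
    x_rec = np.asarray(x_rec, dtype=float)
    x_don = np.asarray(x_don, dtype=float)
    z_don = np.asarray(z_don)

    # Standardize X with pooled mean and standard deviation so that
    # variables on different scales contribute comparably to the distance.
    pooled = np.vstack([x_rec, x_don])
    mu = pooled.mean(axis=0)
    sd = pooled.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    r = (x_rec - mu) / sd
    d = (x_don - mu) / sd

    # Squared Euclidean distances, shape (n_recipients, n_donors).
    dist = ((r[:, None, :] - d[None, :, :]) ** 2).sum(axis=2)

    # Index of the closest donor for every recipient.
    donor_idx = dist.argmin(axis=1)
    return z_don[donor_idx], donor_idx


# Toy data (hypothetical): X = (household size, age of reference person).
x_rec = [[2, 35], [4, 50], [1, 70]]           # survey A (recipients)
x_don = [[1, 68], [2, 33], [5, 52], [3, 40]]  # survey B (donors)
z_don = [900, 1500, 2600, 1800]               # expenditure, observed in B only

z_imputed, donor_idx = nearest_neighbour_hotdeck(x_rec, x_don, z_don)
print(donor_idx.tolist())  # donor chosen for each recipient
print(z_imputed.tolist())  # expenditure values attached to survey A
```

In this sketch the distance function is the main design choice. A statistical learning variant could instead fit a model for Z on X in the donor file and match on predicted values, in the spirit of predictive mean matching.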

Presentation

Reference:

CPS05-003

Session:

Data linking and statistical matching

Presenter/s:

Marcello D'Orazio

Presentation type:

Oral presentation

Room:

JENK

Chair:

Roeland Beerten, Statistics Flanders, Belgium

Date:

Wednesday, 13 March

Time:

11:30 - 12:30

Session times:

11:30 - 12:30