Outlier Detection Methods for mixed-type and large-scale data like Census

Frantisek Hajnovic, Alessandra Sozzi

Office for National Statistics, London

Outlier detection (OD) refers to the problem of finding patterns in data that do not conform to expected normal behaviour. OD has been widely researched and is used in a wide variety of application domains. In this paper we consider building automated OD methods to quality assure the upcoming 2021 UK Census. The scale and nature of such a dataset pose computational challenges to traditional OD methods. In general, the full Census is too large for sequential execution of OD methods: most of them scale super-linearly with the size of the dataset and need either a distributed implementation or separate runs of the algorithm on chunks of the dataset. Additionally, Census questions are of mixed type (numeric, categorical, ordinal, free-text and date), and detecting outliers in this multi-dimensional space is an open area of research with no optimal solution yet. Experience from previous census processing shows that it is easy to be quickly overwhelmed with data, and a mechanism for pointing in the right direction will save huge amounts of time and improve quality where it is needed most. It will also help minimize the risk of serious errors by identifying them earlier. This work is ongoing and will culminate in the development of a set of lightweight tools ready to be tested on the mid-2019 UK Census rehearsal. Ultimately, these could be run against the full-scale 2021 Census data in a distributed fashion to automatically flag anomalous observations in the dataset. The up-to-date results of experiments with such methods will be the main focus of the presentation.
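To illustrate the idea of running an OD method separately on chunks of a mixed-type dataset, here is a minimal Python sketch. It is not the method developed in this work. The scoring rule (a median/MAD deviation for numeric fields plus a rarity term for categorical fields), the column names `age` and `tenure`, the function names and the threshold are all assumptions chosen for illustration.

```python
import statistics
from collections import Counter


def score_chunk(rows, numeric_cols, categorical_cols):
    """Return one outlier score per row; higher means more unusual within this chunk."""
    # Robust centre and spread per numeric column: median and median absolute deviation.
    stats = {}
    for col in numeric_cols:
        values = [r[col] for r in rows]
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        stats[col] = (med, mad)

    # Count how often each category occurs within the chunk.
    freqs = {col: Counter(r[col] for r in rows) for col in categorical_cols}

    scores = []
    for r in rows:
        score = 0.0
        for col in numeric_cols:
            med, mad = stats[col]
            # Fall back to a unit scale when the MAD is zero to avoid division by zero.
            score += abs(r[col] - med) / (mad if mad > 0 else 1.0)
        for col in categorical_cols:
            # Rare categories contribute more: 1 minus the relative frequency.
            score += 1.0 - freqs[col][r[col]] / len(rows)
        scores.append(score)
    return scores


def flag_outliers(rows, numeric_cols, categorical_cols, chunk_size, threshold):
    """Score each chunk independently and return global indices of flagged rows."""
    flagged = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        scores = score_chunk(chunk, numeric_cols, categorical_cols)
        flagged.extend(start + i for i, s in enumerate(scores) if s > threshold)
    return flagged


# Hypothetical toy records; real Census fields and values would differ.
rows = [
    {"age": 30, "tenure": "owned"},
    {"age": 32, "tenure": "owned"},
    {"age": 31, "tenure": "owned"},
    {"age": 95, "tenure": "owned"},
    {"age": 40, "tenure": "rented"},
    {"age": 41, "tenure": "rented"},
    {"age": 39, "tenure": "rented"},
    {"age": 40, "tenure": "caravan"},
]

flagged = flag_outliers(rows, ["age"], ["tenure"], chunk_size=4, threshold=5.0)
print(flagged)  # [3]
```

Because each chunk is scored only against its own records, chunks can be processed independently, for example on separate workers. The trade-off is that an observation which is unusual only relative to the whole dataset may go unflagged, as happens here with the single "caravan" record.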

Reference:

CPS06-002

Session:

Survey Design

Presenter/s:

Alessandra Sozzi

Presentation type:

Oral presentation

Room:

JENK

Chair:

Natalie SHLOMO, The University of Manchester

Date:

Wednesday, 13 March

Time:

14:30 - 15:30

Session times:

14:30 - 15:30