15:45 - 16:45
Invited Paper Session
Room: GASP
Chair:
Michail Skaliotis, Eurostat, Luxembourg
Discussant:
Juri Marcucci, Bank of Italy, Italy
Organiser:
Piet Daas, Statistics Netherlands, Netherlands
IT infrastructure for Big Data and Data Science: Challenges at Statistics Netherlands
Marco Puts
CBS/Statistics Netherlands, Heerlen
Until a few years ago, processing data at national statistical institutes (NSIs) did not differ much from processing data in an administrative environment. In the case of surveys, one needed a database comprising all companies or persons in the country, took a sample of this population and, once the questionnaires were returned, processed the (relatively) small amounts of data to calculate the desired estimates. In the case of register data, the volumes were larger, but still manageable for relational database management systems (RDBMS) or even for files of comma-separated values; most data sets would fit in the memory of a modestly sized desktop. The role of statistical methodology was simple: given one or several datasets, define an algorithm that executes a certain method for estimating the target variables. Processes such as data cleaning could be rule based or could involve an iterative search for the “right” values. Most of the time, these algorithms did not need to be efficient, since even an inefficient algorithm might run for only a couple of hours once every three months or once a year. In the age of big data and data science, this is no longer the case. Nowadays, many questions can be asked and answered using more timely data, and modern statisticians are expected to be data scientists who use modern technology to answer current questions quickly. The transition of the systems in use is not an easy task. The administrative landscape is built around low data throughput, and relatively small latencies are easy to realise there. For data science infrastructures, this is another story: small latencies are not easy to realise, and in certain cases a high data throughput is required that cannot be achieved with more traditional infrastructures.
Whereas in traditional infrastructures vertical scaling, or scaling up, was the usual answer when more compute power was needed, data-intensive infrastructures rely on horizontal scaling, or scaling out: more servers are coupled into clusters, which scale easily with the data [1]. In this paper, we discuss what the transition from “classical” statistics to modern data science means and which challenges lie ahead when implementing such an infrastructure.
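The scale-out pattern referred to above can be sketched as a partition / local-aggregate / reduce pipeline. The sketch below is illustrative and not part of the paper: the function names are invented, and the per-partition step runs sequentially here, whereas in a real cluster framework each partition would be processed on a separate worker node.

```python
def split_into_partitions(records, n_partitions):
    """Distribute the records round-robin over n_partitions, the way a
    cluster distributes data over its worker nodes (illustrative only)."""
    return [records[i::n_partitions] for i in range(n_partitions)]

def local_aggregate(partition):
    # Runs independently per partition: each worker only needs its own
    # slice of the data, so adding workers scales with the data volume.
    return len(partition), sum(partition)

def global_mean(records, n_partitions=4):
    # Map step: small, independent per-partition summaries.
    partials = [local_aggregate(p)
                for p in split_into_partitions(records, n_partitions)]
    # Reduce step: combining the summaries is cheap, regardless of
    # how large the original data set was.
    total_count = sum(count for count, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_count
```

The key property is that only the small per-partition summaries cross the network in the reduce step, which is why this style of computation scales horizontally where a single scaled-up server eventually cannot.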


Reference:
IPS11-002
Session:
Big Data infrastructures within NSIs
Presenter/s:
Marco Puts
Presentation type:
Oral presentation
Room:
GASP
Chair:
Michail Skaliotis, Eurostat, Luxembourg
Date:
Thursday, 14 March
Time:
15:45 - 16:45
Session times:
15:45 - 16:45