15:45 - 16:45
Invited Paper Session
Room: GASP
Chair:
Michail Skaliotis, Eurostat, Luxembourg
Discussant:
Juri Marcucci, Bank of Italy, Italy
Organiser:
Piet Daas, Statistics Netherlands, Netherlands
IT infrastructure for Big Data and Data Science: Challenges at Statistics Netherlands
Marco Puts
CBS/Statistics Netherlands, Heerlen
Until a few years ago, processing data at national statistical institutes (NSIs) did not differ much from processing data in an administrative environment. In the case of surveys, one needed a database comprising all companies or persons in the country, took a sample of this population and, once the questionnaires were returned, processed the (relatively) small amounts of data to calculate the desired estimates. In the case of register data, the volumes were larger, but still manageable for relational database management systems (RDBMS) or even for files of comma-separated values; most data sets would fit in the memory of a modestly sized desktop. The role of statistical methodology was simple: given one or several datasets, define an algorithm that executes a certain method for estimating the target variables. Processes such as data cleaning could be rule based or could involve an iterative search for the “right” values. Most of the time, these algorithms did not need to be efficient, since even an inefficient algorithm might run for only a couple of hours once every three months or once a year. In the age of big data and data science, this is no longer the case. Nowadays, many questions can be asked and answered using more timely data, and modern statisticians are expected to be data scientists who use modern technology to answer current questions quickly. The transition of the systems in use is not an easy task. The administrative landscape is built around low data throughput, and relatively small latencies are easy to realise there. For data science infrastructures, this is another story: small latencies are not easy to realise, and in certain cases a high data throughput is required that cannot be achieved with more traditional infrastructures.
Whereas in traditional infrastructures vertical scaling, or scaling up, was the usual answer when more compute power was needed, data-intensive infrastructures rely on horizontal scaling, or scaling out: more servers are coupled into clusters, which scale easily with the data [1]. In this paper, we discuss what the transition from “classical” statistics to modern data science means and which challenges lie ahead when implementing such an infrastructure.
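The scale-out pattern referred to above can be sketched as a partition / local-aggregate / reduce pipeline. The sketch below is illustrative and not part of the paper: the function names are invented, and the per-partition step runs sequentially here, whereas in a real cluster framework each partition would be processed on a separate worker node.

```python
def split_into_partitions(records, n_partitions):
    """Distribute the records round-robin over n_partitions, the way a
    cluster distributes data over its worker nodes (illustrative only)."""
    return [records[i::n_partitions] for i in range(n_partitions)]

def local_aggregate(partition):
    # Runs independently per partition: each worker only needs its own
    # slice of the data, so adding workers scales with the data volume.
    return len(partition), sum(partition)

def global_mean(records, n_partitions=4):
    # Map step: small, independent per-partition summaries.
    partials = [local_aggregate(p)
                for p in split_into_partitions(records, n_partitions)]
    # Reduce step: combining the summaries is cheap, regardless of
    # how large the original data set was.
    total_count = sum(count for count, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_count
```

The key property is that only the small per-partition summaries cross the network in the reduce step, which is why this style of computation scales horizontally where a single scaled-up server eventually cannot.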


Reference:
IPS11-002
Session:
Big Data infrastructures within NSIs
Presenter/s:
Marco Puts
Presentation type:
Oral presentation
Room:
GASP
Chair:
Michail Skaliotis, Eurostat, Luxembourg
Date:
Thursday, 14 March
Time:
15:45 - 16:45
Session times:
15:45 - 16:45