12:00 - 13:00
We-SPOT03
Chair:
Martin Karlberg (Eurostat, Luxembourg)
Providing large trade data sets for research using Apache Parquet and R Shiny
David Zenz, Oliver Reiter
The Vienna Institute for International Economic Studies

We present a use case of the column-oriented data storage format Apache Parquet for a large database of international trade data, as a fast and efficient alternative to providing data through CSV or Stata files. Using Apache Parquet, which offers efficient data compression and, together with Apache Arrow, enhanced performance, we were able to reduce the size of the UN Comtrade trade database for all classifications from ~900GB to 55GB (a compression ratio of ~16). When extracting subsets of data from the database, we observed a mean speed improvement of 63.4% and a median improvement of 66%.
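
As a quick check of the size figures quoted in the abstract, the compression ratio is the original size divided by the compressed size. The sketch below is only illustrative: it uses the rounded sizes from the abstract, and the function and variable names are assumptions made for this example, not part of the authors' tooling.

```python
def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """Ratio of original to compressed size (higher means stronger compression)."""
    if compressed_gb <= 0:
        raise ValueError("compressed size must be positive")
    return original_gb / compressed_gb


def space_saving(original_gb: float, compressed_gb: float) -> float:
    """Fraction of the original storage that is no longer needed."""
    if original_gb <= 0:
        raise ValueError("original size must be positive")
    return 1.0 - compressed_gb / original_gb


# Approximate sizes quoted in the abstract (rounded, so the results are approximate too).
ORIGINAL_GB = 900.0
PARQUET_GB = 55.0

ratio = compression_ratio(ORIGINAL_GB, PARQUET_GB)
saving = space_saving(ORIGINAL_GB, PARQUET_GB)

print(f"compression ratio: {ratio:.1f}")
print(f"space saving: {saving:.1%}")
```

With these rounded inputs the ratio comes out at roughly 16.4, consistent with the "~16" stated above, and corresponds to a storage saving of about 94%.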