On robustness of the supervised multiclass classifier for autocoding system

Yukako Toko¹, Shinya Iijima¹, Mika Sato-Ilic¹,²

¹ National Statistics Center, Tokyo
² University of Tsukuba, Ibaraki

We developed a supervised multiclass classifier for autocoding based on reliability scores (Toko et al., 2018a). The purpose of this paper is to investigate the robustness of this classifier in coding tasks in official statistics. Survey forms in official statistics sometimes include text response fields, such as fields for occupation, industry, and household income and expenditure. These text responses are usually translated into corresponding classification codes for efficient data processing. Although coding tasks were originally performed manually, automated coding has become increasingly important as computer technology has improved in recent years. Consequently, several studies in official statistics have focused on developing algorithms for autocoding. For example, Hacking and Willenborg (2012) introduced coding methods including autocoding techniques, and Gweon et al. (2017) illustrated methods for automated occupation coding based on statistical learning. We also developed a supervised multiclass classifier for the coding task of the Family Income and Expenditure Survey in Japan. Our original classifier was based on a simple machine learning technique and performed exclusive classification, assigning each object to exactly one class (Toko et al., 2017; Tsubaki et al., 2017; Shimono et al., 2018). However, that classifier assigns incorrect classification codes to some objects with ambiguous information because of semantic problems, interpretation problems, and insufficiently detailed input information. The main reason for these problems is the unrealistic restriction that one object is classified to a single class. We therefore developed a new classifier that allows one object to be assigned to multiple classification codes, using newly defined reliability scores computed with our previously proposed algorithm based on partition entropy (Toko et al., 2018a; Toko et al., 2018b).
Although this technique improved classification accuracy, a classifier should be judged not only on accuracy but also on robustness. A classifier for an autocoding system must be robust so that code assignment is stable, yet the style of text descriptions is not always stable, even within the same survey, because it depends on the respondents. Therefore, this study investigates the robustness of our classifier based on reliability scores with a numerical example using noise-added survey data.
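
To make the idea concrete, the following is a minimal illustrative sketch and not the classifier described in the cited papers. It assumes one plausible form of a reliability score: the relative frequency of a code given a word, down-weighted by the normalised partition entropy of that word's code distribution. The function names, the toy training pairs, and the word-dropping noise model are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def train(pairs):
    """Count how often each word co-occurs with each classification code."""
    counts = defaultdict(Counter)
    for text, code in pairs:
        for word in text.split():
            counts[word][code] += 1
    return counts


def reliability(code_counts, num_classes):
    """Assumed score: relative frequency times (1 - normalised entropy).

    num_classes must be at least 2 because it is used as the log base.
    """
    total = sum(code_counts.values())
    probs = {c: n / total for c, n in code_counts.items()}
    entropy = -sum(p * math.log(p, num_classes) for p in probs.values())
    return {c: p * (1.0 - entropy) for c, p in probs.items()}


def classify(text, counts, num_classes, top=2):
    """Return up to `top` candidate codes ranked by their best word score."""
    scores = Counter()
    for word in text.split():
        if word in counts:
            for code, r in reliability(counts[word], num_classes).items():
                scores[code] = max(scores[code], r)
    return scores.most_common(top)


def add_noise(text, rate, rng):
    """Hypothetical noise model: drop each word with probability `rate`."""
    return " ".join(w for w in text.split() if rng.random() >= rate)


# Toy data, invented for illustration only.
pairs = [
    ("fresh milk", "dairy"),
    ("milk", "dairy"),
    ("soy milk", "beverage"),
    ("green tea", "beverage"),
]
counts = train(pairs)
clean = classify("fresh milk", counts, num_classes=2)
noisy = classify(add_noise("fresh milk", 0.3, random.Random(0)), counts, 2)
print(clean, noisy)
```

Under these assumptions, a word that always appears with one code receives a score of 1, while a word spread evenly over all codes receives a score near 0. Robustness could then be probed by comparing the ranked codes for clean and noise-added descriptions over many random draws.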

Reference: CPS06-001
Session: Survey Design
Presenter/s: Yukako Toko
Presentation type: Oral presentation
Room: JENK
Chair: Natalie SHLOMO, The University of Manchester
Date: Wednesday, 13 March
Time: 14:30 - 15:30
