Record Linkage with Text: Merging Data Sets When Information is Limited

Tom Paskhalis

New York University

Recent years have seen the emergence of new, more scalable ways to link information about different entities across multiple data sources. However, merging data sets remains challenging when only a few variables are available for record linkage. In this paper, I consider the case in which the information is limited to a single multi-token text string. This situation often arises when researchers work with organization names, user accounts, or other short labels. Using Lobbying Disclosure Act data, I illustrate the substantive implications that the choice of record linkage approach can have for empirical research. I review the existing approaches and consider three types of noise typically encountered in this scenario: character-level, word-level, and a combination of the two. I then conduct a simulation study showing how sensitive the existing approaches are to errors occurring at these different levels. The results suggest that the optimal choice of record linkage approach depends on contextual knowledge about the most likely type of noise, and they underscore the need for sensitivity analysis using different record linkage approaches.
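To make the distinction between character-level and word-level noise concrete, the following minimal sketch compares two generic similarity measures on a pair of short strings. It is an illustration only, not necessarily one of the approaches evaluated in the paper: the organization names are hypothetical, and the functions `levenshtein`, `char_similarity`, and `token_jaccard` are names introduced here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def token_jaccard(a: str, b: str) -> float:
    """Word-level similarity: overlap of lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical organization names, chosen only to illustrate the noise types.
clean = "Acme Widget Corporation"
char_noise = "Acme Wigdet Corporation"   # two characters swapped in one word
word_noise = "Corporation Acme Widget"   # same words in a different order

for label, noisy in [("character-level", char_noise), ("word-level", word_noise)]:
    print(f"{label:16s} char={char_similarity(clean, noisy):.2f} "
          f"token={token_jaccard(clean, noisy):.2f}")
```

In this toy example, the character-based measure is barely affected by the typo but penalizes the reordering heavily, whereas the token-based measure ignores the reordering but treats the misspelled word as entirely different. This is the kind of asymmetry that makes the likely type of noise relevant when choosing a linkage approach.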