How to build a corpus out of an “infinite” universe of “long” texts?
P9-S230-4
Presented by: Barak Zur
The rise in the use of large text data has brought to the fore discussions on how to construct an adequate corpus from vast text universes. Two main computational approaches are widely used: keyword-based search and Latent Dirichlet Allocation (LDA)-based methods. This paper aims to (1) propose guidelines for selecting between these approaches based on the characteristics of the text universe and (2) introduce a novel method for building corpora from universes that fit neither of the two prevailing approaches.
We distinguish text universes by two aspects: (1) their degree of "finity" and (2) the length of their text units. We suggest that LDA is an appropriate method for corpus building when a universe is "finite", that is, when it comprises a defined set of sources (e.g., central bank statements). However, LDA is impractical for "infinite" universes with undefined sources (e.g., Tweets) due to financial costs and computational limitations, making keyword searches more suitable. Yet when text units are lengthy or of variable length (e.g., newspaper articles), keyword searches alone fall short, as the mere presence of keywords often fails to ensure relevance.
To address this, we propose a hybrid method for constructing corpora from relatively infinite universes with diverse text lengths. This two-stage process involves (1) creating an initial corpus with computer-assisted keyword search and (2) applying LDA to categorize and filter texts by relevance. Using a newspaper corpus on fiscal and monetary policy from NexisUni, we demonstrate the method's effectiveness in terms of recall, precision, and practicality.
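A minimal sketch of the two-stage pipeline described above is given below. It assumes a list of candidate articles already retrieved from the text universe; the keyword list, topic count, relevant topic indices, and relevance threshold are illustrative placeholders rather than values used in the paper.

```python
# Sketch of the hybrid corpus-building pipeline: (1) keyword filtering,
# (2) LDA-based relevance filtering. All parameters are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

KEYWORDS = {"fiscal", "monetary", "deficit", "interest rate"}  # illustrative query terms

def stage_one_keyword_filter(raw_texts):
    """Keep only texts containing at least one query keyword."""
    return [t for t in raw_texts if any(k in t.lower() for k in KEYWORDS)]

def stage_two_lda_filter(texts, n_topics=10, relevant_topics=(0, 3), threshold=0.3):
    """Fit LDA on the keyword-filtered texts and keep those whose probability
    mass on analyst-identified 'relevant' topics exceeds the threshold.
    The relevant topic indices must be chosen by inspecting the fitted topics."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
    dtm = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(dtm)  # rows sum to 1: per-document topic shares
    keep = doc_topics[:, list(relevant_topics)].sum(axis=1) >= threshold
    return [t for t, k in zip(texts, keep) if k]
```

In this sketch, stage one narrows the "infinite" universe to a tractable candidate corpus, and stage two discards candidates whose topical content is unrelated to the target domain despite containing the keywords.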
Keywords: Infinite Universe, Corpus Building, NexisUni