Text and Data Mining for Chinese Historical Texts

Submission 113

WK01-01

Presented by: Donald Sturgeon

Donald Sturgeon

Durham University

This hands-on workshop introduces participants to a complete text and data mining workflow for material written in classical Chinese, from digital transcription and annotation of premodern works through to computer-assisted extraction of data from their contents. It consists of four parts:

1. Getting started: using the Chinese Text Project crowdsourced editing platform to create and obtain accurate, linked digital transcriptions of premodern Chinese texts.

2. Interactive text mining: extracting and visualizing statistical properties and relationships from transcribed texts. Types of analysis to be introduced include pattern matching of words and phrases, identification of text reuse, and identification of patterns of vocabulary usage; visualizations include summarization via interactive networks, charts, and textual heatmaps. Techniques will be demonstrated using classical Chinese materials from ctext.org; however, they can all be applied equally to materials from other sources and in other languages.

3. Semantic annotation: explicitly disambiguating references in texts to entities (such as names of people, places, and eras) and linking them to those entities, connecting these references to authority databases, extracting knowledge claims about the entities (such as dates of birth, death, or appointment to a particular bureaucratic office), and contributing these claims to a crowdsourced knowledge base.

4. Interactive data mining: extracting and visualizing data from annotated texts and extracted knowledge claims. This includes querying a knowledge base for particular types of information, and summarizing results via networks, charts, and maps. This section will make use of the Chinese Text Project’s Linked Open Data knowledge graph, which contains data on people, places, dates, and many other historical entities covering a period of over 3000 years. A brief introduction to querying using the industry-standard SPARQL language will be provided; this language is also used by many other systems containing relevant data, in particular those of a wide variety of institutions in the GLAM [Galleries, Libraries, Archives, and Museums] sphere, as well as Wikidata and (via Shanghai Library) the China Biographical Database (CBDB).
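
As a rough illustration of the ideas behind part 2, the following Python sketch shows pattern matching with a regular expression and a naive form of text reuse detection based on shared character n-grams. It is not the method used by the workshop's tools, which run in the browser and require no programming; the function names, the n-gram length of 4, and the second sample passage (invented here to imitate a later work quoting the first) are all assumptions made for illustration.

```python
import re

def char_ngrams(text, n):
    """Return the set of character n-grams in text, ignoring punctuation.

    Only characters in the basic CJK Unified Ideographs block are kept, so
    differences in punctuation between editions do not hide shared passages.
    """
    chars = re.findall(r"[\u4e00-\u9fff]", text)
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}

def shared_ngrams(a, b, n=4):
    """Naive text reuse signal: the n-grams that occur in both passages."""
    return char_ngrams(a, n) & char_ngrams(b, n)

def match_pattern(text, pattern):
    """Pattern matching: return everything captured by the regex pattern."""
    return re.findall(pattern, text)

# Opening of the Analects.
source = "子曰:學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
# Hypothetical quoting passage, invented for this example.
quoting = "故曰學而時習之,不亦說乎,此之謂也"

overlap = shared_ngrams(source, quoting, n=4)
print(sorted(overlap))

# Which single character fills the slot in the frame 不亦_乎?
print(match_pattern(source, r"不亦(.)乎"))
```

A longer n reduces accidental matches at the cost of missing short or slightly altered quotations, which is the main design choice in any approach of this kind.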
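
To give a flavour of the SPARQL querying mentioned in part 4, the sketch below builds a small query and flattens a response in the standard SPARQL JSON results format. It targets Wikidata rather than the Chinese Text Project knowledge graph, because Wikidata's identifiers (P31 "instance of", Q5 "human", P569 "date of birth", P570 "date of death") are publicly documented; the function names are hypothetical, the sample response contains placeholder values rather than real data, and no request is actually sent.

```python
# Public Wikidata query endpoint; the wd: and wdt: prefixes used below are
# predefined there. Shown for reference only: this sketch sends no request.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_birth_death_query(limit=10):
    """Build a SPARQL query for people with recorded birth and death dates."""
    return f"""
SELECT ?person ?born ?died WHERE {{
  ?person wdt:P31 wd:Q5 ;
          wdt:P569 ?born ;
          wdt:P570 ?died .
}}
LIMIT {int(limit)}
""".strip()

def rows_from_sparql_json(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts.

    Variables left unbound in a given result are simply omitted from its row.
    """
    variables = payload["head"]["vars"]
    return [
        {v: binding[v]["value"] for v in variables if v in binding}
        for binding in payload["results"]["bindings"]
    ]

# Placeholder response in the standard format (values are NOT real data).
sample_response = {
    "head": {"vars": ["person", "born", "died"]},
    "results": {"bindings": [
        {
            "person": {"type": "uri", "value": "http://example.org/person/1"},
            "born": {"type": "literal", "value": "1000-01-01T00:00:00Z"},
        }
    ]},
}

print(build_birth_death_query(limit=5))
print(rows_from_sparql_json(sample_response))
```

The same two steps, writing a query and reading back tabular bindings, apply to any SPARQL endpoint; only the identifiers for entities and properties differ between knowledge bases.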

This workshop does not assume any prior background in digital methods, and requires only a computer with a web browser. Participants are encouraged to create a free account on ctext.org prior to the workshop (https://ctext.org/account.pl).