Submission 73
Inferring Bon-gwan (本貫) in the Late Joseon Dynasty: Digital Humanities with AI
SP01-06
Presented by: HANA LEE
Historical records are inevitably subject to omission and damage for various reasons. Consequently, key information about historical facts must often be inferred from partial evidence in order to understand the societal context of the time.
In this study, we employ digital humanities methodologies to address the challenges of such information loss. Specifically, we attempt to infer absent historical records by predicting individuals’ bon-gwan (本貫) information using AI models. The bon-gwan (本貫), which refers to the clan's place of origin associated with its progenitor (始祖), was a crucial indicator for understanding a person's lineage, social status, and kinship relations during the late Joseon Dynasty.
1. Data
For this research, we utilized data on eum-gwan (蔭官) officials from the Joseon Dynasty, collected and organized by the Academy of Korean Studies. The term eum-gwan (蔭官) refers to officials who entered government service through the hereditary privilege known as the eum-seo (蔭敍) system. This system granted positions to the sons and descendants of meritorious elites, high-ranking officials, and members of the royal family without requiring the state civil service examination. The dataset contains systematically curated information on these individuals, including names, birth and death years, bon-gwan, fathers’ names, courtesy names (字), pen names (號), official career trajectories, and kinship ties including fathers (父), biological fathers (生父), grandfathers (祖父), maternal grandfathers (外祖父), and fathers-in-law (妻父).
The data was compiled from multiple genealogical and bureaucratic registers such as Eum-an (蔭案), Jinsinbo (搢紳譜), Muneumjinsinbo (文蔭縉紳譜), Mubo (武譜), Jinsinmubo (縉紳武譜), and Muneumbo (文蔭譜). Including variant editions, the information was collected from a total of 11 volumes of records. The dataset comprises 11,122 records, including duplicates, for individuals whose birth years range from 1722 to 1904, placing them in the late Joseon period. Based on this dataset, we designed a predictive model that estimates an individual's bon-gwan from partial biographical information, such as the individual's name and the father's name.
2. Data Preprocessing
The raw dataset contained a large number of missing values and duplicate entries for the same individuals. The amount of information recorded per person also varied widely: for example, the number of recorded relatives per person ranged from 1 to 42, with an average of 6.94. Accordingly, rigorous preprocessing was necessary.
First, the dataset was decomposed into three interrelated tables: (1) basic individual information, (2) kinship information, and (3) officeholding records. A unique identifier was assigned to link these tables relationally.
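The decomposition described above can be illustrated with a minimal sketch. The field names (`name`, `birth_year`, `bon_gwan`, `kin`, `offices`) and the sequential integer `person_id` are assumptions made for illustration only; the abstract does not specify the actual schema or identifier scheme.

```python
from itertools import count


def split_record(raw, person_id):
    """Split one flat record into rows for three linked tables.

    All field names here are hypothetical placeholders.
    """
    person = {
        "person_id": person_id,
        "name": raw.get("name"),
        "birth_year": raw.get("birth_year"),
        "bon_gwan": raw.get("bon_gwan"),
    }
    # One kinship row per recorded relative, keyed by the shared identifier.
    kinship = [
        {"person_id": person_id, "relation": relation, "kin_name": kin_name}
        for relation, kin_name in raw.get("kin", {}).items()
    ]
    # One officeholding row per recorded office title.
    offices = [
        {"person_id": person_id, "office": title}
        for title in raw.get("offices", [])
    ]
    return person, kinship, offices


def decompose(raw_records):
    """Build the three tables, assigning a unique identifier to each record."""
    persons, kinship, offices = [], [], []
    ids = count(1)
    for raw in raw_records:
        person, kin_rows, office_rows = split_record(raw, next(ids))
        persons.append(person)
        kinship.extend(kin_rows)
        offices.extend(office_rows)
    return persons, kinship, offices
```

Keeping kinship and officeholding in separate tables lets a person carry anywhere from one to dozens of relatives without padding the basic-information table with empty columns.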
Second, all input values were converted into a numerical format using label encoding and Hangul (Korean)/Hanja embeddings. Specifically, bon-gwan and kinship categories were label-encoded; names were decomposed into syllables and converted into Unicode code points; and office titles were embedded using word2vec.
Third, the same individuals often appeared multiple times across the dataset, and these duplicate records sometimes contained conflicting values due to recording errors. Because no single matching rule was sufficient to filter these duplicates, it was necessary to determine which combinations of matching values could confirm that two records referred to the same person. After analyzing various data combinations, we established the following six criteria for identifying identical individuals: (1) identical Chinese-character names, birth years, and fathers’ Korean names; (2) identical Korean names, birth years, and fathers’ Korean names; (3) identical Chinese-character names, birth years, bon-gwan, and kin names; (4) identical Chinese-character names, bon-gwan, and fathers’ Korean names; (5) identical Chinese-character names, bon-gwan, and fathers-in-law’s Korean names; (6) identical Korean names, fathers’ Korean names, and kin names.
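One way to apply criteria of this kind is sketched below. This is an illustration under stated assumptions, not the procedure used in the study: the field names are hypothetical, `kin_names` is assumed to be a sorted tuple of relatives' names, a criterion is skipped when any of its fields is missing, and matches are merged transitively with a union-find structure.

```python
# Each tuple mirrors one of the six criteria; field names are hypothetical.
CRITERIA = [
    ("hanja_name", "birth_year", "father_name_ko"),
    ("name_ko", "birth_year", "father_name_ko"),
    ("hanja_name", "birth_year", "bon_gwan", "kin_names"),
    ("hanja_name", "bon_gwan", "father_name_ko"),
    ("hanja_name", "bon_gwan", "father_in_law_name_ko"),
    ("name_ko", "father_name_ko", "kin_names"),
]


def find(parent, i):
    """Return the representative of i's group (with path compression)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def group_duplicates(records):
    """Group record indices that match on at least one criterion."""
    parent = list(range(len(records)))
    for fields in CRITERIA:
        seen = {}  # key of field values -> index of first record with that key
        for i, rec in enumerate(records):
            key = tuple(rec.get(f) for f in fields)
            # Assumption: a missing field can never confirm identity.
            if any(v in (None, "", ()) for v in key):
                continue
            if key in seen:
                parent[find(parent, i)] = find(parent, seen[key])
            else:
                seen[key] = i
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())
```

Each returned group lists the indices of records judged to describe one person. How conflicting values inside a group are reconciled is a separate decision that this sketch does not address.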
After preprocessing, we removed records with missing bon-gwan values (the target variable) and the duplicates identified using the criteria above. We then experimented with various data compositions, such as using the dataset after removing all remaining missing values or using only a subset of features.
3. Modeling and Results
For modeling, we employed gradient-boosted tree models (XGBoost, CatBoost, LightGBM), a support vector machine (SVM), and deep learning architectures (MLP, autoencoder, TabNet, TabTransformer).
The most effective configuration involved 9,289 individuals, using Unicode embeddings of the initial, medial, and final phonetic components of each syllable in both the individual's Korean name and their father's Korean name. Unexpectedly, the inclusion of kinship and career information did not improve model performance. The ability to accurately predict bon-gwan from name data alone is likely attributable to the historical practice of using dollimja (generation names). This tradition involved incorporating a specific character into a personal name based on the clan's generational naming cycle (行列), creating a strong correlation between name patterns and clan affiliation.
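The split of a Hangul syllable into its initial, medial, and final components follows directly from how Unicode lays out precomposed syllables, and can be sketched as follows. The function names, the fixed length of three syllables, and the use of 0 for an absent final or for padding are assumptions made for illustration; the abstract does not state the exact numerical encoding used.

```python
HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable, '가'
HANGUL_LAST = 0xD7A3   # code point of the last precomposed syllable, '힣'
N_MEDIAL, N_FINAL = 21, 28  # number of medials; number of finals incl. "none"


def decompose_syllable(ch):
    """Return jamo code points (initial, medial, final) for one syllable.

    The final is 0 when the syllable has no final consonant.
    """
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset <= HANGUL_LAST - HANGUL_BASE:
        raise ValueError(f"not a precomposed Hangul syllable: {ch!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return (
        0x1100 + initial,                 # conjoining initial consonant
        0x1161 + medial,                  # conjoining vowel
        0x11A7 + final if final else 0,   # conjoining final consonant, if any
    )


def name_features(name, max_syllables=3):
    """Flatten a Korean name into a fixed-length list of jamo code points."""
    feats = []
    for ch in name[:max_syllables]:
        feats.extend(decompose_syllable(ch))
    # Pad short names with zeros so every name yields the same feature count.
    feats.extend([0] * (3 * max_syllables - len(feats)))
    return feats
```

Applying `name_features` to both the individual's name and the father's name and concatenating the results gives a fixed-width numeric vector that a tree-based model can consume directly.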
The best-performing model was XGBoost, which yielded an accuracy of 0.800921, an F1 score of 0.730775, a precision of 0.802985, and a recall of 0.705552. With this model, the probability of the correct bon-gwan being within the top 3 predictions was 0.927503, and within the top 5 was 0.963751. These results were obtained after removing bon-gwan with fewer than 20 instances from the training data.
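The two evaluation details mentioned above, removing rare bon-gwan classes and measuring top-k accuracy, can be sketched in plain Python. The function names and the input layout (one list of class probabilities per individual, aligned with a `classes` list) are assumptions for illustration and do not reflect any particular library's API.

```python
from collections import Counter


def drop_rare_classes(rows, labels, min_count=20):
    """Keep only examples whose label occurs at least min_count times."""
    counts = Counter(labels)
    kept = [(r, y) for r, y in zip(rows, labels) if counts[y] >= min_count]
    return [r for r, _ in kept], [y for _, y in kept]


def top_k_accuracy(prob_rows, true_labels, classes, k):
    """Fraction of examples whose true label is among the k most probable."""
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        # Rank class indices from most to least probable.
        ranked = sorted(range(len(classes)), key=lambda j: probs[j], reverse=True)
        if truth in {classes[j] for j in ranked[:k]}:
            hits += 1
    return hits / len(true_labels)
```

With k = 1 this reduces to ordinary accuracy, so the top-3 and top-5 figures can be read as how often the correct bon-gwan appears in a short candidate list offered to a historian.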
4. Further Analysis and Discussion
To further improve performance, we experimented with tree model ensembling and stacking methods. The best result from these attempts was a stacking model using an MLP with two hidden layers, which achieved an accuracy of 0.796318 and an F1 score of 0.730563, both slightly lower than the standalone XGBoost model.
Among the deep learning models, TabNet yielded the highest accuracy at 0.3349 with an F1 score of 0.1394, indicating that it effectively failed to make meaningful predictions. The poor performance of deep learning models in this study is likely due to the insufficient size of the dataset. Depending on the data composition, the dataset contained 5,901 to 9,289 instances and only 8 to 15 features, which was too small for these complex models to learn effectively.
5. Conclusion and Future Work
This study demonstrates the potential of applying digital humanities methodologies to infer missing information in historical records. By integrating computational prediction with historical inquiry, we show that bon-gwan inference is feasible even with fragmented data. Future research will aim to expand this approach to predict other missing information from the dataset, such as political faction affiliations (黨派) and the surnames and bon-gwan of maternal and paternal relatives.