15:15 - 16:00
Room:
Chair/s: Tanja Burgard
The Merits of Screening Automation for Bibliometric Analyses: The Case of Translational Psychotherapy
Tue-01
Presented by: Claudiu Petrule
Claudiu Petrule*, André Bittermann, Viktoria Ritter, Anke Haberkamp, Winfried Rief

Background: Analyzing emerging fields with traditional bibliometric techniques can be difficult due to inconsistent terminology and vague or porous boundaries. The absence of precise terms for database search queries can hinder efforts to fully cover a given construct of interest and may result in biased or distorted representations. The common remedy of widening the scope of the search can introduce a large number of false positives, making it difficult to distinguish eligible records from noise. As a consequence, the drastically increased number of records makes screening an absolute requirement for building a complete and accurate dataset.

Objectives: To assess the added value of a machine learning (ML)-augmented approach to bibliometric analysis compared to a traditional keyword-based database search.

Research question: Is the additional workload required for semi-automating the screening process justified, or is the customary database search approach sufficient in terms of representativeness?

Method/Approach: ML is employed to semi-automate the screening process, using the emerging research landscape of translational psychotherapy as a use case. We leverage the ML feature in Rayyan to automate the screening process and identify eligible records from a large pool of publications that are relevant to the field but differ in their terminology. The Rayyan classification model is trained on the screened results of a search query using the field’s known terminology (i.e., “translational psychotherapy”). The trained model is then used to predict the inclusion probability of unseen records. We perform two rounds of active learning and screen results until the inclusion rate of the classifier is above 95%. Lastly, we compare bibliometric indicators and impact measurements of the ML-augmented dataset with those of the database search dataset.
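To make the workflow concrete, the sketch below illustrates the general idea of semi-automated screening with active learning, using a TF-IDF text classifier in scikit-learn. It is not the Rayyan implementation used in the study; the record texts, labels, query strategy (uncertainty sampling), and inclusion cut-off are illustrative assumptions only.

```python
# Minimal sketch of semi-automated screening with active learning.
# NOT the Rayyan classifier from the study; all data are placeholders.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Seed data: records from the "translational psychotherapy" keyword
# search, screened manually (1 = include, 0 = exclude).
seed_texts = ["translational approaches to psychotherapy research ...",
              "unrelated pharmacology trial ..."]
seed_labels = [1, 0]

# Unlabeled pool retrieved with a deliberately broadened query.
pool_texts = ["bridging basic emotion research and clinical practice ...",
              "crop yield modelling under drought ..."]

vectorizer = TfidfVectorizer(stop_words="english")
X_labeled = vectorizer.fit_transform(seed_texts)
X_pool = vectorizer.transform(pool_texts)
y_labeled = list(seed_labels)

clf = LogisticRegression(max_iter=1000)

for _ in range(2):                               # two active-learning rounds, as in the abstract
    clf.fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)[:, 1]      # predicted inclusion probability per record
    query = np.argsort(np.abs(proba - 0.5))[:1]  # most uncertain record(s) to screen next
    # In practice a human screener labels the queried records; here the
    # label is a placeholder so the loop runs end to end.
    X_labeled = vstack([X_labeled, X_pool[query]])
    y_labeled += [1 for _ in query]

# Pool records clearing a chosen probability cut-off would be added to
# the ML-augmented dataset used for the bibliometric comparison.
included = np.where(clf.predict_proba(X_pool)[:, 1] >= 0.5)[0]
print("Predicted-include pool indices:", included)
```

Uncertainty sampling is only one common query strategy; the stopping rule reported in the abstract (continuing until the classifier's inclusion rate exceeds 95%) would replace the fixed two-iteration loop in a full pipeline.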

Results/Findings: The ML-augmented dataset differs significantly from the one based on a typical database search that relies only on known vocabulary. This is evident across various bibliometric aspects, including the top authors, journals, countries, and impact, to the extent that the field’s bibliometric profile becomes unrecognizable. Although training the classifier is laborious, the resulting model identifies literature that would otherwise be overlooked, thus justifying the additional effort.

Conclusions and implications: Our study emphasizes the importance of maintaining consistent terminology within research fields. By adhering to standard terminology, researchers can avoid knowledge stagnation and hampered scientific progress. However, there may be instances where maintaining this consistency becomes challenging due to the emergence of new concepts or a lack of standardization across disciplines. In such cases, ML offers several advantages, not least its capacity to sift through the vast and ever-growing amounts of data generated by recent research efforts. Far from being a superfluous or gimmicky cherry on the cake, it opens new horizons for exploratory research and the synthesis of emerging fields.