15:15 - 16:00
Room:
Chair/s: Tanja Burgard
Performance of Semi-Automated Screening Using Rayyan and ASReview – A Retrospective Analysis of Potential Work Reduction and Different Stopping Rules
Tue-02
Presented by: Julian Scherhag
Julian Scherhag*, Tanja Burgard
Background
Although systematic reviews are a pillar of modern science, guiding policy making and evidence-based practice (Borenstein et al., 2009; Cooper et al., 2009), conducting them becomes increasingly difficult as the ever-growing pool of scientific publications (Bornmann et al., 2021) exceeds human information-processing (Robledo et al., 2021) and labor capacities (Borah et al., 2017). In recent years, a variety of semi-automation tools based on machine learning have been developed to address these issues. Their application holds great promise, halving the screening workload while achieving high recall levels of 95% and above (Burgard & Bittermann, in press). Few studies to date have evaluated the performance of Rayyan or ASReview.

Objectives / Research Questions
We were therefore interested in 1) how accurately (precision) and completely (recall) Rayyan and ASReview identify relevant articles, 2) how they perform compared to other semi-automation tools, and 3) how their performance differs under different stopping criteria.

Method
To assess the quality of the tools' relevance predictions and semi-automated screening decisions with regard to work saved over sampling (WSS), achieved recall, and precision, the Systematic Drug Class Review Gold Standard Data (Cohen et al., 2006) were used. For each of the fifteen reviews, the abstract triage and article triage status are given. The bibliographic information was uploaded to the tools, and the actual screening decisions were then fed to the tools successively. The predictive model was retrained with each subsequent user decision and the remaining studies reranked accordingly. This process terminated once all articles had been reviewed.
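A minimal sketch of such a replayed screening simulation, assuming TF-IDF features and a logistic-regression classifier as a stand-in for the tools' internal models (the function names and the seeding strategy are illustrative, not the tools' actual implementation):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def simulate_screening(abstracts, labels):
    """Replay gold-standard screening decisions against a relevance ranker.

    abstracts: list of title/abstract strings
    labels:    0/1 gold-standard inclusion decisions
    Returns the order in which records were screened.
    """
    labels = np.asarray(labels)
    X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
    # Seed with one relevant and one irrelevant record, since the model
    # needs both classes before the first fit (assumed seeding strategy).
    screened = [int(labels.argmax()), int(labels.argmin())]
    remaining = [i for i in range(len(labels)) if i not in screened]
    while remaining:
        model = LogisticRegression(max_iter=1000)
        model.fit(X[screened], labels[screened])
        # Rerank the unscreened pool and "screen" the most likely relevant
        # record next, revealing its gold-standard label (certainty-based
        # prioritization, retraining after every decision).
        scores = model.predict_proba(X[remaining])[:, 1]
        nxt = remaining[int(scores.argmax())]
        remaining.remove(nxt)
        screened.append(nxt)
    return screened
```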

Results / Findings (expected)
We retrospectively analyzed how many screening decisions were necessary to achieve certain recall rates (e.g., 95%, 100%) and how the tools performed at the different stopping points. The tested stopping criteria were 50/100 consecutive negatives, stopping after each quarter screened (25%, 50%, 75%), stopping at the tools' relevance threshold, and stopping after achieving 95% of the estimated recall based on the random training sets. Additionally, we compared the results of this study with prior studies using the Cohen data sets.
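Such criteria can be evaluated directly on a replayed screening order, using the standard definitions recall = TP / (TP + FN) and, following Cohen et al. (2006), WSS@r = (TN + FN) / N - (1 - r). The helper functions below are an illustrative sketch, not the tools' built-in rules:

```python
def stop_after_consecutive_negatives(order, labels, k=50):
    """Cut-off index once k consecutive irrelevant records were screened."""
    streak = 0
    for i, idx in enumerate(order):
        streak = streak + 1 if labels[idx] == 0 else 0
        if streak == k:
            return i + 1
    return len(order)  # criterion never triggered: everything was screened

def metrics_at_cutoff(order, labels, cutoff):
    """Recall, precision, and WSS achieved when stopping after `cutoff` records."""
    n_total = len(order)
    n_relevant = sum(labels)
    found = sum(labels[idx] for idx in order[:cutoff])
    recall = found / n_relevant
    precision = found / cutoff
    wss = (n_total - cutoff) / n_total - (1 - recall)  # WSS at the achieved recall
    return recall, precision, wss
```

For one review, `metrics_at_cutoff(order, labels, stop_after_consecutive_negatives(order, labels, k=100))` would then yield the performance of the 100-consecutive-negatives rule.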

Conclusions and Implications (expected)
We expect to find a trade-off between recall and precision, resulting in desirable levels of achieved recall but at low precision. Furthermore, we anticipate that applying the tools results in greater work savings than unprioritized random sampling. However, based on prior findings, we do not expect Rayyan to outperform its counterparts (Robledo et al., 2021). A tool's individual performance likely depends on the characteristics of the data and on how frequently the predictive model is retrained to adjust the prioritization.