Everything has its Price: Foundations of Cost-Sensitive Machine Learning and its Application in Psychology
Tue-01
Presented by: Philipp Sterner
a) Background
With the increasing availability of large data sets, psychology has seen a growing use of machine learning (ML) methods. One of the most common ML settings is binary classification, in which observations are classified into one of two groups. Off-the-shelf classification algorithms assume that the costs of the two possible misclassifications (false positives and false negatives) are equal. Because this assumption is often unreasonable (e.g., in clinical psychology, where a missed diagnosis may be far more costly than a false alarm), cost-sensitive machine learning (CSL) methods are needed to take different misclassification costs into account.
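To make the role of unequal costs concrete, the standard expected-cost decision rule can be sketched as follows (the symbols $p$, $C_{FP}$, and $C_{FN}$ are our notation, not taken from the abstract): for an estimated probability $p$ that an observation belongs to the positive class, predicting positive incurs expected cost $(1-p)\,C_{FP}$, while predicting negative incurs $p\,C_{FN}$, so the cost-minimizing classifier uses

```latex
\text{predict positive} \iff p \, C_{FN} \ge (1 - p)\, C_{FP}
  \iff p \ge \frac{C_{FP}}{C_{FP} + C_{FN}}
```

which reduces to the familiar threshold of $0.5$ when $C_{FP} = C_{FN}$; with $C_{FN} = 4\,C_{FP}$, for instance, the threshold drops to $0.2$.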
b) Objective
We present the mathematical foundations of CSL and introduce a taxonomy of the most commonly used CSL methods: cost-sensitive hyperparameter tuning, cost-sensitive meta-learning, and cost-sensitive model training. We demonstrate their application and usefulness on a psychological data set, the drug consumption data set (N = 1885) from the UCI Machine Learning Repository, in which the task is to predict drug consumption from personality scores.
c) Results
In a benchmark, we applied different CSL methods to three algorithms of varying complexity: logistic regression, random forest, and gradient boosting. In our example, all demonstrated CSL methods noticeably reduced mean misclassification costs compared to cost-insensitive versions of the same algorithms.
d) Conclusions and Implications
In light of our results as well as previous studies, we explain why researchers should perform small benchmarks of CSL methods for their own practical applications. To assist researchers in this task, our open materials provide template R code demonstrating how CSL methods can be applied within the mlr3 framework (https://osf.io/cvks7/). Additionally, we discuss the important question of how to determine misclassification costs and how certain aspects of open science could facilitate informed decisions regarding these costs.