Analytical formulation of synthetic minority oversampling technique (SMOTE) for imbalanced learning
Abstract
Imbalanced data is an issue that affects many applications in machine learning and data science. The synthetic minority oversampling technique (SMOTE) is a common method for artificially balancing such data. Despite the popularity of SMOTE, little is known about its analytical properties. In this paper, we develop a precise theoretical formulation of the sampling distribution of SMOTE in several important cases, and we uncover surprising connections to other areas, including information theory, Euler's constant, and compound distributions. Finally, we show that the SMOTE-generated distribution Z converges to the true underlying distribution X in mean. The results provide a better understanding of SMOTE and related sampling algorithms.
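
For readers unfamiliar with the mechanism analyzed here, the sketch below illustrates the standard SMOTE generation step: a synthetic point is drawn uniformly on the segment between a minority-class sample and one of its k nearest minority-class neighbors. The function name, parameters, and use of Euclidean distance are illustrative assumptions for this sketch and are not drawn from the paper itself.

    import numpy as np

    def smote_sample(X_min, k=5, rng=None):
        """Generate one synthetic sample via the standard SMOTE interpolation step.

        X_min : (n, d) array of minority-class points (n > k assumed).
        """
        rng = np.random.default_rng(rng)
        n = X_min.shape[0]
        i = rng.integers(n)                     # pick a seed minority point at random
        x_i = X_min[i]
        d = np.linalg.norm(X_min - x_i, axis=1) # distances to all minority points
        d[i] = np.inf                           # exclude the seed point itself
        neighbors = np.argsort(d)[:k]           # indices of its k nearest minority neighbors
        x_j = X_min[rng.choice(neighbors)]      # choose one neighbor uniformly at random
        u = rng.uniform()                       # interpolation weight U ~ Uniform(0, 1)
        return x_i + u * (x_j - x_i)            # synthetic point on the segment [x_i, x_j]

In this notation, the SMOTE-generated variable discussed in the abstract corresponds to Z = X_i + U (X_j - X_i), whose distribution the paper characterizes and compares with that of the underlying variable X.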