Imbalanced Sample Size Problems for Linear Discriminant Analysis
M.C. Owoyi *
Department of Mathematics, Dennis Osadebay University, Asaba, Nigeria.
F.Z. Okwonu
Department of Mathematics, Delta State University, Abraka, Nigeria.
*Author to whom correspondence should be addressed.
Abstract
This study evaluated the behaviour of Linear Discriminant Analysis (LDA) on imbalanced datasets. Two LDA models were considered: the Classical Fisher Linear Discriminant Analysis (CFLDA), which is the same as the LDA, and the Robust Fisher Linear Discriminant Analysis (RFLDA). Monte Carlo simulation was used to compare the performance of the two classifiers, and both were also applied to practical datasets. Violations of the LDA model assumptions were observed, and the performance of the classifiers was consistent with the central limit theorem. The data imbalance present in the practical data, and the impact of balancing the data using the Mean Variance Cloning Techniques (MVCT), were also demonstrated. The analysis showed the comparative performance of the classifiers and indicated the weaknesses of both. The results showed that the performance of the RFLDA is not affected by contamination and alteration in either the training or the validation samples, and that the RFLDA performs better than the CFLDA when the assumptions of normality and homoscedasticity are violated, because the RFLDA can resist noise. In general, the results showed that the RFLDA is not susceptible to contamination and alteration, whereas the CFLDA performs well on imbalanced sample sizes, thereby validating the concept of data dependency and the central limit theorem.
Keywords: FLDA, imbalanced dataset, RFLDA, performance evaluation
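As an illustration of the kind of experiment summarised in the abstract, the following is a minimal sketch of a Monte Carlo evaluation of the classical two-class Fisher rule on imbalanced sample sizes. It is not the authors' code: the bivariate normal populations, the mean shift, the midpoint cutoff, the sample sizes, and all function names (`fit_fisher_lda`, `predict`, `monte_carlo_accuracy`) are assumptions made here for illustration, and neither the RFLDA nor the MVCT is implemented.

```python
import random

def mean_vec(rows):
    """Component-wise mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    """Centred sums of squares and cross-products: (sxx, sxy, syy)."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy

def fit_fisher_lda(x1, x2):
    """Classical Fisher rule for two classes with a pooled covariance."""
    m1, m2 = mean_vec(x1), mean_vec(x2)
    a1, b1, c1 = scatter(x1, m1)
    a2, b2, c2 = scatter(x2, m2)
    dof = len(x1) + len(x2) - 2
    # Pooled covariance [[a, b], [b, c]] (assumes homoscedasticity).
    a, b, c = (a1 + a2) / dof, (b1 + b2) / dof, (c1 + c2) / dof
    det = a * c - b * b
    d0, d1 = m1[0] - m2[0], m1[1] - m2[1]
    # w = S^{-1} (m1 - m2), using the closed-form 2x2 inverse.
    w = [(c * d0 - b * d1) / det, (a * d1 - b * d0) / det]
    # Cutoff at the midpoint of the projected class means (equal priors assumed).
    cutoff = 0.5 * (w[0] * (m1[0] + m2[0]) + w[1] * (m1[1] + m2[1]))
    return w, cutoff

def predict(w, cutoff, x):
    """Return 1 if x is assigned to class 1, otherwise 2."""
    return 1 if w[0] * x[0] + w[1] * x[1] >= cutoff else 2

def sample(rng, n, mu):
    """Draw n points from a bivariate normal with mean (mu, mu), identity covariance."""
    return [[rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)] for _ in range(n)]

def monte_carlo_accuracy(n1, n2, n_reps=200, shift=2.0, seed=0):
    """Mean validation accuracy over n_reps simulated imbalanced datasets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        w, cutoff = fit_fisher_lda(sample(rng, n1, 0.0), sample(rng, n2, shift))
        # Fresh validation samples with the same class imbalance.
        v1, v2 = sample(rng, n1, 0.0), sample(rng, n2, shift)
        correct = sum(predict(w, cutoff, x) == 1 for x in v1)
        correct += sum(predict(w, cutoff, x) == 2 for x in v2)
        total += correct / (n1 + n2)
    return total / n_reps

if __name__ == "__main__":
    for n1, n2 in [(50, 50), (100, 20), (200, 10)]:
        print(n1, n2, round(monte_carlo_accuracy(n1, n2), 3))
```

Varying `n1` and `n2` changes the degree of imbalance; contaminating the output of `sample` (for example, by replacing a fraction of points with outliers) would be one way to probe the sensitivity to contamination that the abstract describes.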