Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data

Mulla, Guhdar; Demir, Yıldırım; Hassan, Masoud

doi:10.17798/bitlisfen.939733

Mulla G. A. A., Demir Y., Hassan M.

Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, vol.10, no.3, pp.558-569, 2021 (TRDizin)

Publication Type: Article / Full Article
Volume: 10 Issue: 3
Publication Date: 2021
DOI: 10.17798/bitlisfen.939733
Journal Name: Bitlis Eren Üniversitesi Fen Bilimleri Dergisi
Journal Indexes: TR DİZİN (ULAKBİM)
Page Numbers: pp.558-569
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Van Yüzüncü Yıl Üniversitesi: Yes

Abstract

Imbalanced data classification is a common issue in data mining, in which classifiers are biased towards the larger (majority) class. Classification of high-dimensional skewed (imbalanced) data is of great interest to decision-makers because it is more difficult. Dimension reduction, a process in which the number of variables is reduced, allows high-dimensional datasets to be interpreted more easily at the cost of some loss of information. In this study, a method combining SMOTE oversampling with principal component analysis (PCA) is proposed to solve the imbalance problem in high-dimensional data. Three classification algorithms (Logistic Regression, K-Nearest Neighbor, and Decision Tree) and two separate datasets were used to evaluate the efficacy of the proposed method and to determine the classifiers' performance. The raw datasets and the datasets transformed by the PCA, SMOTE, and SMOTE+PCA (SMOTE and PCA) methods were each analyzed with these algorithms. Analyses were performed using WEKA. The results suggest that almost all classification algorithms improve their classification performance with the PCA, SMOTE, and SMOTE+PCA methods. However, the SMOTE method gave more efficient results than the PCA and SMOTE+PCA methods for data rebalancing. The experimental results also suggest that the K-Nearest Neighbor classifier provided higher classification performance than the other algorithms.
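
To make the oversampling step concrete, the following is a minimal pure-Python sketch of the interpolation idea behind SMOTE: each synthetic minority point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. It is an illustration only and not the authors' procedure, which was carried out in WEKA. The function name `smote`, the parameters `k` and `seed`, and the toy data are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Hypothetical helper for illustration; not taken from the paper or WEKA.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (assumed for the example): 8 majority points, 3 minority points.
majority = [(float(i), float(i % 3)) for i in range(8)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

After this step both classes have the same number of samples, and every synthetic point lies between existing minority samples rather than duplicating them.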