Performance Evaluation of Machine Learning Models in Cross-Project Defect Prediction Using the PyCaret Library

Rian Hidayat; Agus Subekti

doi:10.31358/techne.v24i2.603

Authors

Rian Hidayat, Universitas Nusa Mandiri
Agus Subekti, Universitas Nusa Mandiri

Keywords:

Cross-Project Defect Prediction (CPDP), Hyperparameter Tuning, Machine Learning, PyCaret, Data Distribution, Random Forest

Abstract

Cross-Project Defect Prediction (CPDP) addresses the shortage of training data in new projects. This study evaluates five machine learning models (Random Forest, CatBoost, Logistic Regression, KNN, and SVM) in a CPDP scenario on the AEEEM dataset, comparing results before and after hyperparameter tuning. The models were tested with a one-to-many CPDP architecture, using Accuracy, AUC, Recall, Precision, and F1-Score as evaluation metrics.
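The setup described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it only enumerates the source/target project pairs and computes four of the listed metrics from binary predictions. The project list assumes the five projects of the public AEEEM dataset (JDT is not named in this abstract), the helper names `cpdp_pairs` and `classification_metrics` are hypothetical, and the toy labels are invented for illustration. AUC is left out because it needs predicted scores rather than hard labels.

```python
from itertools import permutations

# Assumption: the five projects of the public AEEEM dataset.
# JDT is not mentioned in the abstract; the other four abbreviations are.
PROJECTS = ["EQ", "JDT", "LC", "ML", "PDE"]


def cpdp_pairs(projects):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(projects, 2))


def classification_metrics(y_true, y_pred):
    """Compute Accuracy, Recall, Precision and F1 for binary labels (1 = defective)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Each project serves once as the training source for every other project.
all_pairs = cpdp_pairs(PROJECTS)
print(len(all_pairs), "source/target combinations")

# Toy labels, invented purely to exercise the metric function.
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
print(classification_metrics(toy_true, toy_pred))
```

With five projects, the ordered pairs number 5 × 4 = 20, which matches the 20 prediction combinations reported in the results. In a real experiment the toy labels would be replaced by the predictions of a model trained on the source project and applied to the target project.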

Random Forest performed best in 9 of the 20 prediction combinations, followed by CatBoost, which was best in 4 combinations after tuning. KNN won 3 combinations, while SVM and Logistic Regression each won 2. Hyperparameter tuning improved the performance of every model except Logistic Regression; SVM improved the most (6.39%), followed by RF (5.14%), KNN (3.94%), and CatBoost (1.4%).
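As a quick consistency check on these figures, the short sketch below tallies the per-model wins and ranks the reported tuning gains from largest to smallest. The numbers are copied from the abstract; the variable names are illustrative only, and the abstract does not state how the percentage gains were computed, so they are treated here as opaque values.

```python
# Number of source/target combinations in which each model scored best,
# as reported in the abstract.
wins = {
    "Random Forest": 9,
    "CatBoost": 4,
    "KNN": 3,
    "SVM": 2,
    "Logistic Regression": 2,
}

# Reported improvement after hyperparameter tuning, in percent.
# Logistic Regression is omitted because the abstract reports no gain for it.
improvement = {"SVM": 6.39, "KNN": 3.94, "RF": 5.14, "CatBoost": 1.4}

# The wins should account for every one of the 20 combinations.
total_wins = sum(wins.values())
print("combinations covered:", total_wins)

# Rank models by reported gain, largest first.
ranked = sorted(improvement, key=improvement.get, reverse=True)
for model in ranked:
    print(f"{model}: {improvement[model]}%")
```

The tally sums to 20, so every combination has exactly one winning model, and the ranking places RF ahead of KNN in terms of gain from tuning.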

Project combinations such as LC → EQ, ML → EQ, and PDE → EQ performed well, demonstrating the effectiveness of CPDP when projects are similar. In contrast, combinations such as EQ → ML and ML → LC performed poorly due to differences in data distribution. This study concludes that CPDP is effective for software defect prediction when local data is limited, and can be the basis for further research such as transfer learning or project selection based on semantic similarity.
