Comparative Analysis of Bagging and Boosting Algorithms for Predicting Protein-Protein Interactions Using Learned Embeddings
DOI:
https://doi.org/10.55006/biolsciences.2026.6102
Keywords:
Viral infections, Protein-Protein Interactions (PPIs), Word2Vec embedding, Learned embeddings, Ensemble learning
Abstract
Viral infections are a major global health concern, as evidenced by the rapid spread of SARS-CoV-2, which led to a worldwide pandemic. Viruses can manipulate host cell machinery by integrating their genetic material into the host genome, a process facilitated by Protein-Protein Interactions (PPIs). Identifying PPIs between humans and viruses is essential for understanding the mode of infection and host immune responses, and for developing effective treatment regimens. Although experimental methods such as mass spectrometry-based proteomics and yeast two-hybrid assays are widely employed to identify human-virus PPIs, they are often time-consuming, expensive, and labor-intensive. Here, we propose an alternative method that addresses these limitations by using machine learning models to predict human-virus PPIs with improved accuracy and efficiency, with an emphasis on automatic feature extraction and ensemble learning. Protein sequences are encoded with Word2Vec embeddings to extract complex features automatically, which avoids manual feature engineering. The study employs two ensemble learning approaches, boosting and bagging, to train predictive models on the extracted features. Among these, XGBoost, a boosting algorithm, demonstrated superior predictive performance compared with the bagging models. Our findings highlight the potential of combining automated feature extraction with ensemble learning methods to improve the efficiency and accuracy of PPI prediction. This approach enhances our understanding of protein sequences and their interactions and holds promise for accelerating the development of effective antiviral therapies.
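The feature-extraction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the k-mer length (`k=3`), the embedding dimension, the mean pooling, and the helper names (`kmer_tokens`, `toy_embedding`, `sequence_vector`, `pair_features`) are all assumptions made for illustration. In particular, `toy_embedding` is a hypothetical stand-in that returns a deterministic pseudo-random vector per k-mer; a real workflow would look these vectors up in a trained Word2Vec model instead.

```python
import hashlib
import random
from typing import List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers, treated as 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding(token: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a trained Word2Vec lookup (assumption).

    Produces a deterministic pseudo-random vector for each k-mer so the
    sketch runs without any trained model.
    """
    seed = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def sequence_vector(sequence: str, k: int = 3, dim: int = 8) -> List[float]:
    """Mean-pool the k-mer vectors into one fixed-length sequence vector."""
    tokens = kmer_tokens(sequence, k)
    if not tokens:  # sequence shorter than k: fall back to a zero vector
        return [0.0] * dim
    vectors = [toy_embedding(token, dim) for token in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pair_features(human_seq: str, virus_seq: str, k: int = 3, dim: int = 8) -> List[float]:
    """Concatenate the human and viral sequence vectors into one feature row."""
    return sequence_vector(human_seq, k, dim) + sequence_vector(virus_seq, k, dim)


if __name__ == "__main__":
    # Made-up example fragments, not real database entries.
    features = pair_features("MKTAYIAKQR", "MFVFLVLLPL")
    print(len(features))  # 2 * dim values per human-virus pair
```

Each human-virus pair becomes one fixed-length row, which is the kind of input a bagging or boosting classifier such as XGBoost would then be trained on. The classifier step is omitted here.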
License
Copyright (c) 2026 Sini S Raj, Vinod Chandra S S

This work is licensed under a Creative Commons Attribution 4.0 International License.
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

