Integrative machine learning reveals potential signature genes using transcriptomics in colon cancer

DOI:

https://doi.org/10.14295/bjs.v4i9.745

Keywords:

cancer genome atlas, colon cancer, machine learning, transcriptomics

Abstract

Colon cancer is a significant health burden worldwide and the second leading cause of cancer-related deaths. Despite advances in diagnosis and treatment, identifying biomarkers for early detection and therapeutic targets remains challenging. This study used an integrative approach combining transcriptomics and machine learning to identify signature genes and pathways associated with colon cancer. RNA-Seq data from The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) project, comprising 485 samples, were analyzed. Differential gene expression analysis revealed 657 upregulated and 8,566 downregulated genes. Notably, EPB41L3, TSPAN7, and ABI3BP were identified as highly upregulated, while LYVE1, PLPP1, and NFE2L3 were significantly downregulated in tumor samples. Gene Set Enrichment Analysis (GSEA) identified dysregulated pathways, including E2F targets, MYC targets, and the G2M checkpoint, underscoring alterations in cell cycle regulation and metabolic reprogramming in colon cancer. Three machine learning models (Random Forest, Neural Networks, and Logistic Regression) achieved high classification accuracy (97–99%). Key genes identified consistently across these models are candidate biomarkers with potential translational relevance. By integrating differential expression analysis, pathway enrichment, and machine learning, the study lays the groundwork for developing diagnostic and therapeutic strategies, with the identified genes and pathways serving as candidates for further validation and clinical application. This approach illustrates how precision medicine methods can advance colon cancer research and, ultimately, patient outcomes.
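As a rough illustration of what labeling genes as upregulated or downregulated involves, the sketch below computes a log2 fold change between tumor and normal samples and applies a cutoff. The abstract does not state which software, normalization, or thresholds were used, so everything here is an assumption: the function names, the pseudocount, the cutoff of 1.0, the placeholder gene names, and the synthetic values (which are not TCGA data).

```python
import math
from statistics import mean


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression.

    The pseudocount (an assumed value) avoids taking the log of zero.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))


def classify_gene(tumor, normal, lfc_cutoff=1.0):
    """Label a gene 'up', 'down', or 'unchanged' using a hypothetical |log2FC| cutoff."""
    lfc = log2_fold_change(tumor, normal)
    if lfc >= lfc_cutoff:
        return "up"
    if lfc <= -lfc_cutoff:
        return "down"
    return "unchanged"


# Synthetic normalized expression values keyed by placeholder gene names.
toy_expression = {
    "GENE_A": {"tumor": [120, 150, 130], "normal": [30, 25, 35]},
    "GENE_B": {"tumor": [10, 12, 8], "normal": [90, 110, 100]},
    "GENE_C": {"tumor": [50, 55, 45], "normal": [48, 52, 50]},
}

labels = {
    gene: classify_gene(values["tumor"], values["normal"])
    for gene, values in toy_expression.items()
}
print(labels)
```

This toy version deliberately leaves out the statistical testing and multiple-testing correction that a real differential expression analysis needs; dedicated packages such as DESeq2 or edgeR handle those steps.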

Published

2025-06-27

How to Cite

Hamza, M. A., & Islam, S. (2025). Integrative machine learning reveals potential signature genes using transcriptomics in colon cancer. Brazilian Journal of Science, 4(9), 12–23. https://doi.org/10.14295/bjs.v4i9.745