Mathematical patterns associated with genetic recombination in HIV-1 and SARS-CoV-2 via explainability in CNNs: relationship of interpretability scoremaps with viral phylogeny and the role of the third codon nucleotide
Date
2025-01-10
Publisher
Universidad de Deusto
Abstract
To study genetic recombination in viruses, we applied pre-trained CNNs to genomic spectrograms via Transfer Learning and then used explainability tools. Our aim was to understand recombination mechanisms and to characterize genomic sequences. With this methodology, we detected mathematical patterns indicating genetic recombination in complete HIV-1 sequences. The main hot zone lay near f = 1/3 in the 5' LTR and extended into the first half of gag. LTRs regulate gene expression and virus replication. The A and T nucleotides in polyA and polyT stretches make these regions prone to polymerase skipping, which favors genetic recombination. The high adenine content of HIV-1 explains the prominent role of A and, secondarily, of T. We found a high correlation between hot zone similarity and phylogenetic proximity, with subtypes U, G, and O showing unique patterns. The CNN's behavior was consistent with the biological traits of these subtypes. Our phylogenetic analysis and literature review confirmed the microbiological reality for subtype U. For subtype G, our results supported a potential recombinant origin and provided data for future research. Applying Two-Stage Transfer Learning and Three-Step Explainability to spectrograms of complete SARS-CoV-2 genomes, we detected a clear mathematical signature associated with the recombinant feature at f = 1/6 in the Spike protein, a region that is crucial for SARS-CoV-2 evolution and the site of multiple double-crossover genetic recombination events. With the same methodology, we also identified clear mathematical signatures for the non-recombinant SARS-CoV-2 variants pre-VOC, Alpha, Delta, and Omicron. In both recombinant and non-recombinant SARS-CoV-2 sequences, thymine accounted for more than 45% of the third codon nucleotides in the Spike protein. This is statistically significant given the low degree of nucleotide conservation in Spike (0.23%).
We found evidence suggesting that third codon nucleotides, which are less significant for amino acid coding, may carry additional information that characterizes SARS-CoV-2. Periodicities of 3 and its multiples were linked to recombinant features in HIV-1 and SARS-CoV-2, and to specific variants of SARS-CoV-2. This suggests that third-position nucleotides may hold encrypted information affecting viral classification. If validated, this hypothesis could reshape our understanding of the genetic code and spur new research in molecular biology and viral evolution.
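The two quantities the abstract relies on, the nucleotide composition at third codon positions and the spectral power at a frequency such as f = 1/3 or f = 1/6, can be illustrated with a minimal Python sketch. It is not the thesis's pipeline: the binary indicator mapping (1 where a given base occurs, 0 elsewhere), the single-window DFT in place of a full spectrogram, and the function names `third_position_composition` and `indicator_power` are all assumptions made here for illustration.

```python
import cmath
from collections import Counter


def third_position_composition(cds: str) -> dict[str, float]:
    """Fraction of A, C, G, T at the third position of each complete codon.

    Assumes `cds` is an in-frame coding sequence starting at codon position 1.
    """
    thirds = cds.upper()[2::3]  # indices 2, 5, 8, ... are third codon positions
    if not thirds:
        return {}
    counts = Counter(thirds)
    return {base: counts.get(base, 0) / len(thirds) for base in "ACGT"}


def indicator_power(seq: str, base: str, frequency: float) -> float:
    """Normalized DFT power of the binary indicator sequence of `base`.

    `frequency` is in cycles per nucleotide, so 1/3 corresponds to a
    period of 3 nucleotides. The indicator mapping is an assumption of
    this sketch, not necessarily the encoding used in the thesis.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    total = sum(
        cmath.exp(-2j * cmath.pi * frequency * k)
        for k, symbol in enumerate(seq)
        if symbol == base.upper()
    )
    return abs(total) ** 2 / len(seq)


if __name__ == "__main__":
    toy = "GCT" * 30  # toy sequence with T at every third position, not real data
    print(third_position_composition(toy))
    print(indicator_power(toy, "T", 1 / 3), indicator_power(toy, "T", 1 / 5))
```

In the toy sequence, T sits at every third position, so the T indicator has far more power at f = 1/3 than at an unrelated frequency. A spectrogram would repeat this computation over sliding windows along the genome.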
Subjects
Technological Sciences
Biochemical technology