Errors found in the genetic sequences of coronavirus in the world’s largest database

  • Science Park
  • April 2nd, 2025

A study led by the Institute for Integrative Systems Biology (I2SysBio, UV-CSIC) has found that sequences carrying certain mutations of the SARS-CoV-2 virus contain a significantly higher volume of data errors. The errors came to light when information from GISAID (the main database used during the pandemic) was compared with data obtained from direct genome sequencing. The findings, which highlight weaknesses in public databases, will help improve virus surveillance and, consequently, viral detection methodologies and clinical intervention processes.

The research, recently published in the journal Virus Evolution, offers a new perspective on the ability of the SARS-CoV-2 virus to mutate in unusual ways and infect humans. The results indicate that many sequences that appear to show corrected mutations in the virus's spike protein (the most variable part of the viral genome and the virus's main means of entering human cells) were in fact the result of errors introduced during processing in large genetic databases.

According to the study, the computer methods used to analyse millions of viral sequences may lead to errors, creating the false impression that the virus corrects a particular type of mutation more frequently than it actually does. By comparing these processed data with information obtained directly from genome sequencing, the research team has gained a more realistic understanding of the genetic changes occurring in the virus. “If we examine these regions — the part of the spike protein that has been lost — and rely on the processed data, we are overestimating the actual number of mutations present in that DNA sequence”, says Mireia Coscollà Devís, a CSIC researcher at I2SysBio and leader of the project. “We realised that the sequences in the GISAID database were processed differently by each laboratory and contained many distortions for this type of marker”, the scientist adds.

Sharing Genomic Data on Pathogens

The research underscores the importance of carefully examining genetic data to avoid erroneous conclusions. Although the World Health Organization (WHO) advocates a policy of sharing genomic data on pathogens to protect public health, Spain lacks a central repository for human, animal and environmental pathogen sequence data. There is also no policy in place for the anonymised sharing of data between healthcare and scientific institutions. “This makes it more difficult to track and respond to infectious diseases, including monitoring antimicrobial resistance”, explains Fernando González-Candelas, professor of Genetics at the University of Valencia and researcher at the FISABIO foundation, who also participated in the study. Ron Geller, a CSIC researcher at I2SysBio, emphasised the importance of combining computational and evolutionary biology with laboratory experiments to advance knowledge of pathogens.

The study was led by I2SysBio (the Pathogenomics and Viral Biology groups) and also involved the Institute of Biomedicine of Valencia (IBV, CSIC) and the La Fe Health Research Institute (IIS-La Fe).

The work has been funded by the Ministry of Science, Innovation and Universities and by the European Union, with NextGenerationEU/PRTR funds, through the PTI+ Global Health initiative of the CSIC. It is also supported by the Government of Valencia and the European Social Fund through the CIACIF/2022/333 grant. The computational work was carried out on Garnatxa, the high-performance computing (HPC) cluster of the Institute for Integrative Systems Biology.

Reference:

Miguel Álvarez-Herrera, Paula Ruiz-Rodriguez, Beatriz Navarro-Domínguez, Joao Zulaica, Brayan Grau, María Alma Bracho, Manuel Guerreiro, Cristóbal Aguilar‐Gallardo, Fernando González-Candelas, Iñaki Comas, Ron Geller, Mireia Coscollá, Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein, Virus Evolution, 2025; https://doi.org/10.1093/ve/veaf015
