Our new article has been accepted in Computational and Structural Biotechnology Journal:
Sarumi OA, Hahn M, Heider D: NeuralBeds: Neural Embeddings for Efficient DNA Data Compression and Optimized Similarity Search. Computational and Structural Biotechnology Journal 2024, in press. (Link)
The availability of high-throughput sequencing tools, coupled with the declining cost of producing DNA sequences, has led to the generation of enormous amounts of omics data curated in several databases such as NCBI and EMBL. Identifying similar DNA sequences in these databases is one of the fundamental tasks in bioinformatics. It is essential for discovering homologous sequences in organisms, for phylogenetic studies of evolutionary relationships among biological entities, and for detecting pathogens. Improving DNA similarity search is of utmost importance because of the increasing complexity of the ever-growing sequence repositories. Therefore, instead of using the conventional approach of comparing raw sequences, e.g., in FASTA format, a numerical representation of the sequences can be used to calculate their similarities and optimize the search process. In this study, we analyzed different approaches for numerical embeddings, including Chaos Game Representation, hashing, and neural networks, and compared them with classical approaches such as principal component analysis. It turned out that neural networks generate embeddings that capture the similarity between DNA sequences as a distance measure and significantly outperform the other approaches on DNA similarity search.
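To illustrate the general idea of comparing sequences through numerical embeddings rather than raw strings, here is a minimal sketch of one of the approaches named above, Chaos Game Representation, followed by a Euclidean distance between the resulting vectors. It is not the implementation used in the article: the corner assignment, the grid resolution, and the function names `cgr_points`, `cgr_embedding`, and `euclidean` are assumptions made for this example.

```python
import math

# Conventional CGR corner assignment (an assumption; the article does not specify one).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence.

    Starting at the centre of the unit square, each point lies halfway
    between the previous point and the corner of the current base.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_embedding(seq, resolution=4):
    """Flatten a resolution x resolution grid of normalized CGR point counts."""
    grid = [0.0] * (resolution * resolution)
    points = cgr_points(seq)
    for x, y in points:
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        grid[row * resolution + col] += 1.0
    total = len(points)
    return [count / total for count in grid] if total else grid


def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    query = cgr_embedding("ACGTACGTACGT")
    candidate = cgr_embedding("ACGTACGAACGT")
    print(euclidean(query, candidate))
```

Once every database sequence is mapped to a fixed-length vector, similarity search reduces to a nearest-neighbour lookup over those vectors; a learned neural embedding would replace `cgr_embedding` in this sketch while leaving the distance computation unchanged.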