Our new article has been accepted in Bioinformatics:
Löchel HF, Eger D, Sperlea T, Heider D: Deep Learning on Chaos Game Representation for Proteins. Bioinformatics 2019, in press.
Motivation: Classification of protein sequences is a major task in bioinformatics with many applications.
Various machine learning methods are applied to these problems, such as support vector
machines (SVM), random forests (RF), and neural networks (NN). All of these methods have in common
that protein sequences have to be made machine-readable and comparable in the first step, for which
different encodings exist. These encodings are typically based on physical or chemical properties of
the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image
recognition, we used frequency matrix chaos game representation (FCGR) for encoding of protein
sequences into images. In this study, we compare the performance of SVMs, RFs, and DNNs trained
on FCGR-encoded protein sequences. While the original chaos game representation (CGR) has been
used mainly for genome sequence encoding and classification, we modified it to also work for protein
sequences, resulting in an n-flake representation, an image with several icosagons.
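
To make the encoding idea concrete, here is a minimal Python sketch of a frequency matrix built from a chaos game on a regular 20-gon (icosagon), one vertex per amino acid. It is an illustration under stated assumptions, not the implementation from the paper: the vertex ordering, the starting point at the center, the 32 x 32 resolution, and the use of the generic n-flake scaling ratio are all choices made here, and the function names `nflake_ratio` and `fcgr` are hypothetical.

```python
import math

# The 20 standard amino acids; this vertex ordering is an arbitrary assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nflake_ratio(n):
    """Generic n-flake scaling ratio for a regular n-gon (assumed here)."""
    s = sum(math.cos(2 * math.pi * k / n) for k in range(1, n // 4 + 1))
    return 1.0 / (2.0 * (1.0 + s))


def fcgr(sequence, resolution=32):
    """Count chaos-game points of a protein sequence on a resolution x resolution grid."""
    n = len(AMINO_ACIDS)
    # Place one vertex per amino acid on the unit circle.
    vertices = {
        aa: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, aa in enumerate(AMINO_ACIDS)
    }
    r = nflake_ratio(n)
    matrix = [[0] * resolution for _ in range(resolution)]
    x, y = 0.0, 0.0  # start at the center of the polygon
    for aa in sequence.upper():
        if aa not in vertices:
            continue  # skip non-standard residues such as X or B
        vx, vy = vertices[aa]
        # Contract the current point toward the vertex of this residue.
        x = vx + r * (x - vx)
        y = vy + r * (y - vy)
        # Map coordinates from [-1, 1] to grid indices, clamping the upper edge.
        col = min(int((x + 1) / 2 * resolution), resolution - 1)
        row = min(int((1 - y) / 2 * resolution), resolution - 1)
        matrix[row][col] += 1
    return matrix


if __name__ == "__main__":
    image = fcgr("MKVLAAGIVGLLLAQ")
    print(sum(map(sum, image)))  # number of residues that were placed on the grid
```

The resulting count matrix can be treated as a grayscale image, for example after normalizing by sequence length, and passed to any of the classifiers compared in the study.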
Results: We show that all applied machine learning techniques (RF, SVM, and DNN) achieve promising
results compared to state-of-the-art methods on our benchmark datasets, with DNNs outperforming
the other methods. We also show that FCGR is a promising new encoding method for protein sequences.
Supplementary information: Supplementary data are available at Bioinformatics online.