The challenge in finding genes in eukaryotic organisms using computational methods is an ongoing problem in the biology. Based on various genomic signals found in eukaryotic genomes, this problem can be divided into many different sub-problems such as identification of transcription start sites, translation initiation sites, splice sites, poly (A) signals, etc. Each sub-problem deals with a particular type of genomic signals and various computational methods are used to solve each sub-problem. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle – building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on the statistical analysis of the basic nucleotide sequence string composition. 4 Our study is based on a novel approach to predict genomic signals in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The compression of the physicochemical properties is made using principal component analysis transformation. Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems.
|Date made available
|KAUST Research Repository