Promoters are key regions that are involved in differential transcription regulation
of protein-coding and RNA genes. The gene-specific architecture of promoter
sequences makes it extremely difficult to devise a general strategy for their computational
identification. Accurate prediction of promoters is fundamental for interpreting
gene expression patterns, and for constructing and understanding genetic regulatory
networks. In the last decade, the genomes of many organisms have been sequenced and
their gene content has mostly been identified. Promoters and transcriptional start sites
(TSSs), however, remain largely undetermined, and efficient software able to accurately
predict promoters in newly sequenced genomes is not yet available in the
public domain. Although there have been many attempts to develop computational promoter
identification methods, reliable tools for analyzing long genomic sequences are still lacking.
In this dissertation, I present the methods I have developed for prediction of promoters
for different organisms. The first two methods, TSSPlant and PromCNN,
achieved state-of-the-art performance for discriminating promoter and non-promoter
sequences for plant and eukaryotic promoters, respectively. For TSSPlant, a large
number of features were crafted and evaluated to train an optimal classifier.
PromCNN was built using a deep learning approach that extracts features from the data
automatically. The trained model demonstrated the ability of a deep learning approach
to grasp complex promoter sequence characteristics.
For the latest method, DeeReCT-PromID, I focus on predicting the exact positions
of TSSs within eukaryotic genomic sequences, testing every possible location.
This is a more difficult task, requiring not only an accurate classifier but also
appropriate selection of unique predictions among multiple overlapping high-scoring
genomic segments. The new method significantly outperforms previous promoter
prediction programs by considerably reducing the number of false positive predictions.
Specifically, to reduce the false positive rate, the models are adaptively and
iteratively trained by changing the distribution of samples in the training set based
on the false positive errors made in the previous iteration.
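The step of selecting unique predictions among overlapping high-scoring segments can be illustrated with a minimal sketch. It assumes a greedy, non-maximum-suppression style rule: keep the highest-scoring position, discard any candidate closer than a minimum distance to a position already kept, and repeat. The function name `select_tss_peaks` and the parameters `threshold` and `min_distance` are hypothetical and chosen for illustration; the abstract does not state which selection rule DeeReCT-PromID actually uses.

```python
from typing import List, Sequence


def select_tss_peaks(
    scores: Sequence[float],
    threshold: float = 0.5,   # assumed score cutoff, not from the source
    min_distance: int = 3,    # assumed minimum spacing between predictions
) -> List[int]:
    """Greedily pick non-overlapping high-scoring positions.

    scores[i] is a classifier score for the window centered at position i.
    """
    # Keep only positions whose score passes the cutoff.
    candidates = [(s, i) for i, s in enumerate(scores) if s >= threshold]
    # Visit candidates from highest to lowest score (ties: leftmost first).
    candidates.sort(key=lambda c: (-c[0], c[1]))

    kept: List[int] = []
    for _, pos in candidates:
        # Accept a position only if it is far enough from every accepted one.
        if all(abs(pos - other) >= min_distance for other in kept):
            kept.append(pos)
    return sorted(kept)


if __name__ == "__main__":
    window_scores = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.85, 0.3]
    print(select_tss_peaks(window_scores))
```

In this toy input, the two clusters of adjacent high scores each collapse to a single predicted position, which is the behavior the selection step is meant to provide.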
The new methods are used to gain insights into the design principles of core
promoters. Using model analysis, I have identified the most important core promoter
elements and their effect on promoter activity. Furthermore, the importance of
each position inside the core promoter was analyzed and validated using a large single
nucleotide polymorphism data set. I have developed a novel general approach to
detect long-range interactions in the input of a deep learning model, which was used
to find related positions inside the promoter region. The final model was applied
to the genomes of different species without a significant drop in performance,
demonstrating the high generality of the developed method.
|Date of Award||Mar 2 2020|
|Original language||English (US)|
- Computer, Electrical and Mathematical Sciences and Engineering
|Supervisor||Xin Gao (Supervisor)|
- deep learning