TY - GEN
T1 - Digital signal processing for potential promoter prediction
AU - Zhang, Xuejuan
AU - Kassim, Ashraf
AU - Bajic, Vladimir B.
PY - 2004
Y1 - 2004
N2 - We evaluate the suitability of three domain transforms, the discrete Fourier transform (DFT), discrete cosine transform (DCT) and discrete wavelet transform (DWT), for recognition of human promoter sequences. We use genomic segments covering [-512,+512] relative to transcription start sites (TSSs), together with non-promoter sequences of the same length. The promoter set contains 14,001 sequences with TSS locations determined from experimental transcript data; the sequences were extracted from the human genome using the PromoSer and FIE2 tools. The non-promoter set has the same number of sequences. As features, we use the total counts of mono-, di- and tri-nucleotides in the sequences, as well as the coefficients of the domain transforms. The promoters and non-promoters are divided into 22 disjoint groups based on their GC-content. Feature selection is applied separately to the data for each group, and we use the 30 best-ranked features. In each group, the positive and negative data are first randomly ordered and divided into training and test sets, and then further divided into two sets. Linear discriminant analysis is used to classify sequences as promoters (positive) or non-promoters (negative). Three general observations can be made from the experiments: i) the ability to recognize promoters degrades as GC-content decreases, ii) there are no significant differences in prediction performance whichever single transform is used, and iii) the best performance is achieved by combining all three transforms. We show that the use of domain transforms in predicting human promoters is promising and should be combined with predictions of biological features for even better performance.
UR - http://www.scopus.com/inward/record.url?scp=28244482594&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:28244482594
SN - 0780386655
T3 - 2004 IEEE International Workshop on Biomedical Circuits and Systems
SP - S2.7.INV-16
EP - S2.7.INV-19
BT - 2004 IEEE International Workshop on Biomedical Circuits and Systems
T2 - 2004 IEEE International Workshop on Biomedical Circuits and Systems
Y2 - 1 December 2004 through 3 December 2004
ER -