TY - GEN
T1 - Modeling 5' regions of histone genes using Bayesian networks
AU - Chowdhary, Rajesh
AU - Ali, R. Ayesha
AU - Bajic, Vladimir B.
PY - 2005
Y1 - 2005
N2 - Histones constitute a rich protein family that is evolutionarily conserved across species. They play important roles in chromosomal functions in the cell, such as chromosome condensation, recombination, replication, and transcription. We have modeled histone gene 5' end segments covering [-50,+500] relative to transcription start sites (TSSs). These segments contain parts of the coding regions in most of the genes that we studied. We determined characteristics of these segments for 116 mammalian (human, mouse, rat) histone genes based on the distribution of DNA motifs obtained from MEME-MAST. We found that all five mammalian histone types (H1, H2A, H2B, H3, H4) have mutually distinct, prominent, and strongly conserved properties downstream of the TSS, and that these properties are reasonably well conserved across the analyzed species. We then transformed the primary-level motif data for each sequence into a higher-order motif arrangement that involved only features such as the presence of a motif, its position, its strand orientation, and the mutual spacer length between motifs. We have built a Bayesian network model based on these features and used the higher-order motif arrangement data for its training and testing. When tested on classification among the five histone groups using leave-one-out cross-validation, the Bayesian model correctly classified 100% of histone H1 sequences, 100% of histone H2A sequences, 96.9% of histone H2B sequences, 94.4% of histone H3 sequences, and 95.8% of histone H4 sequences. Overall, the model correctly classified 97.4% of all histone sequences. Our Bayesian model has the advantage of having a small number of trainable parameters, and it produces very few false positives. The model could be used to scan the genome for discovery of genes whose products are similar to histones.
AB - Histones constitute a rich protein family that is evolutionarily conserved across species. They play important roles in chromosomal functions in the cell, such as chromosome condensation, recombination, replication, and transcription. We have modeled histone gene 5' end segments covering [-50,+500] relative to transcription start sites (TSSs). These segments contain parts of the coding regions in most of the genes that we studied. We determined characteristics of these segments for 116 mammalian (human, mouse, rat) histone genes based on the distribution of DNA motifs obtained from MEME-MAST. We found that all five mammalian histone types (H1, H2A, H2B, H3, H4) have mutually distinct, prominent, and strongly conserved properties downstream of the TSS, and that these properties are reasonably well conserved across the analyzed species. We then transformed the primary-level motif data for each sequence into a higher-order motif arrangement that involved only features such as the presence of a motif, its position, its strand orientation, and the mutual spacer length between motifs. We have built a Bayesian network model based on these features and used the higher-order motif arrangement data for its training and testing. When tested on classification among the five histone groups using leave-one-out cross-validation, the Bayesian model correctly classified 100% of histone H1 sequences, 100% of histone H2A sequences, 96.9% of histone H2B sequences, 94.4% of histone H3 sequences, and 95.8% of histone H4 sequences. Overall, the model correctly classified 97.4% of all histone sequences. Our Bayesian model has the advantage of having a small number of trainable parameters, and it produces very few false positives. The model could be used to scan the genome for discovery of genes whose products are similar to histones.
UR - http://www.scopus.com/inward/record.url?scp=84856980638&partnerID=8YFLogxK
U2 - 10.1142/9781860947322_0028
DO - 10.1142/9781860947322_0028
M3 - Conference contribution
AN - SCOPUS:84856980638
SN - 1860944779
SN - 9781860944772
T3 - Series on Advances in Bioinformatics and Computational Biology
SP - 283
EP - 288
BT - Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, APBC 2005
PB - Imperial College Press
T2 - 3rd Asia-Pacific Bioinformatics Conference, APBC 2005
Y2 - 17 January 2005 through 21 January 2005
ER -