Motivation: The significance of long non-coding RNAs (lncRNAs) in many biological processes and diseases has attracted intense interest over the past several years. However, computational identification of lncRNAs in a wide range of species remains challenging: it requires either prior knowledge of well-established sequences and annotations or species-specific training data, yet only a limited number of species have high-quality sequences and annotations.
Results: Here we first characterize lncRNAs in contrast to protein-coding RNAs based on feature relationship and find that the relationship between ORF (open reading frame) length and GC content diverges substantially and consistently between lncRNAs and protein-coding RNAs, as observed in a broad variety of species. Based on this feature relationship, we further present LGC, a novel algorithm for identifying lncRNAs that is able to accurately distinguish lncRNAs from protein-coding RNAs in a cross-species manner without any prior knowledge. As validated on large-scale empirical datasets, comparative results show that LGC outperforms existing algorithms by achieving higher accuracy and well-balanced sensitivity and specificity, and that it is robustly effective (>90% accuracy) in discriminating lncRNAs from protein-coding RNAs across diverse species that range from plants to mammals. To our knowledge, this study is the first to differentially characterize lncRNAs and protein-coding RNAs based on feature relationship and to apply that relationship to the computational identification of lncRNAs. Taken together, our study represents a significant advance in the characterization and identification of lncRNAs, and LGC thus bears broad potential utility for computational analysis of lncRNAs in a wide range of species.
Availability: The LGC web server is publicly available at http://bigd.big.ac.cn/lgc/calculator. The scripts and data can be downloaded at http://bigd.big.ac.cn/biocode/tools/BT000004.
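As a purely illustrative sketch of the two features named in the Results (ORF length and GC content), the following Python computes both for a single transcript. It is not the LGC algorithm, which is not specified in this abstract. The function names, the use of ATG as the only start codon, and the restriction to the forward strand are all assumptions made for the example.

```python
# Illustrative only: computes the two features named in the abstract.
# This is NOT the LGC scoring method; names and conventions are assumed.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons, DNA alphabet


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ATG-to-stop ORF on the forward
    strand, stop codon included. Returns 0 if no complete ORF is found.
    Assumption: only ATG starts an ORF and only complete ORFs count."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    best = 0
    for frame in range(3):
        start = None  # index of the first in-frame ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


if __name__ == "__main__":
    transcript = "CCATGAAATGGTAACC"  # toy sequence, not real data
    print(longest_orf_length(transcript), gc_content(transcript))
```

Any classifier built on these features would additionally need a rule relating the two values, which this sketch deliberately leaves out.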
Supplementary information:Supplementary data are available at Bioinformatics online.
Bibliographical note: KAUST Repository Item: Exported on 2020-10-01
Acknowledged KAUST grant number(s): BAS/1/1606-01-01
Acknowledgements: We thank Shuhui Song, Lili Hao and Shixiang Sun for valuable comments on this work.
Funding: This work was supported by Strategic Priority Research Programme of the Chinese Academy of Sciences [XDB13040500 and XDA08020102 to Z.Z.]; National Key Research and Development Programme of China [2017YFC0907502 and 2015AA020108 to Z.Z.; 2016YFE0206600 to Y.B.]; International Partnership Programme of the Chinese Academy of Sciences [153F11KYSB20160008]; National Natural Science Foundation of China [31200978 to L.M.]; The 100-Talent Programme of Chinese Academy of Sciences to Z.Z. and Y.B.; The Open Biodiversity and Health Big Data Initiative of IUBS [Y.B.]; The 13th Five-year Informatization Plan of Chinese Academy of Sciences [XXH13505-05 to Y.B.]; The King Abdullah University of Science and Technology (KAUST) Base Research Funds [BAS/1/1606-01-01 to V.B.B.]. Funding for open access charge: Strategic Priority Research Programme of the Chinese Academy of Sciences.