Long-read sequencing technology has enabled significant progress in de novo genome assembly. However, the high error rate and wide error distribution of raw reads lead to a large number of errors in the assembly. Polishing is a procedure that fixes errors in the draft assembly and improves the reliability of genomic analysis. Existing methods treat all regions of the assembly equally, although the error distributions of these regions differ fundamentally, and achieving very high accuracy in genome assembly remains a challenging problem. Motivated by the uneven distribution of errors across regions of the assembly, we propose a novel polishing workflow named BlockPolish. In this method, we divide contigs into low-complexity (trivial) and high-complexity (complex) blocks according to statistics of aligned nucleotide bases. Multiple sequence alignment is applied to realign raw reads in complex blocks and optimize the alignment result. Because error rates are distributed differently in trivial and complex blocks, two multitask bidirectional long short-term memory (LSTM) networks are proposed to predict the consensus sequences. In whole-genome assemblies of NA12878 assembled by Wtdbg2 and Flye using Nanopore data, BlockPolish achieves higher polishing accuracy than other state-of-the-art methods, including Racon, Medaka and MarginPolish & HELEN. In all assemblies, errors are predominantly indels, and BlockPolish performs well in correcting them. In addition to the Nanopore assemblies, we further demonstrate that BlockPolish can also reduce errors in PacBio assemblies. The source code of BlockPolish is freely available on GitHub (https://github.com/huangnengCSU/BlockPolish).
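The block-division step described in the abstract can be illustrated with a minimal sketch. This is not the BlockPolish implementation: the abstract does not say which statistics of aligned bases are used, so the per-position majority-base fraction, the `threshold` value of 0.8, and the names `majority_fraction` and `split_blocks` are all assumptions made for illustration. The input is a hypothetical pileup, one dictionary of base counts per contig position, with `-` standing for a gap.

```python
from itertools import groupby


def majority_fraction(counts):
    """Fraction of aligned bases at one position that match the most common base.

    Assumed statistic for illustration only; an empty column returns 0.0.
    """
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


def split_blocks(pileup, threshold=0.8):
    """Group consecutive contig positions into (start, end, label) blocks.

    `end` is exclusive. A position is labeled "trivial" when its majority
    fraction reaches `threshold` (a hypothetical cutoff), else "complex".
    """
    labels = [
        "trivial" if majority_fraction(c) >= threshold else "complex"
        for c in pileup
    ]
    blocks, pos = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        blocks.append((pos, pos + length, label))
        pos += length
    return blocks


# Toy pileup: base counts observed at five consecutive contig positions.
pileup = [
    {"A": 30},
    {"C": 29, "T": 1},
    {"G": 12, "-": 10, "A": 8},
    {"G": 15, "-": 15},
    {"T": 30},
]
blocks = split_blocks(pileup)
print(blocks)
```

In this toy input, the two middle positions have no dominant base, so they form one complex block flanked by two trivial blocks. In a workflow like the one described, only such complex blocks would be passed on to realignment.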
|Original language|English (US)|
|Journal|Briefings in Bioinformatics|
|State|Published - Oct 7 2021|
Bibliographical note
KAUST Repository Item: Exported on 2021-10-12
Acknowledged KAUST grant number(s): FCC/1/1976-26-01, OSR, REI/1/4473-01-01, REI/1/4742-01, URF/1/3412-01, URF/1/4098-01-01
Acknowledgements: This work was supported in part by the National Natural Science Foundation of China (Nos. U1909208 and 61772557); the 111 Project (No. B18059); the Hunan Provincial Science and Technology Program (No. 2018wk4001 to J.W.); the US National Institute of Food and Agriculture (NIFA) (No. 2017-70016-26051 to F.L.); the US National Science Foundation (NSF) (No. ABI-1759856 to F.L.); and the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) (Award Nos. FCC/1/1976-26-01, URF/1/3412-01-01, URF/1/4098-01-01, REI/1/4742-01-01 and REI/1/4473-01-01 to X.G.).
ASJC Scopus subject areas
- Molecular Biology
- Information Systems