Building extraction in pixel level from aerial imagery with a deep encoder-decoder network 基于编解码网络的航空影像像素级建筑物提取

Kaiqiang Chen, Xin Gao, Menglong Yan, Yue Zhang, Xian Sun

Research output: Contribution to journal › Article › peer-review

10 Scopus citations


Building extraction plays a significant role in land-use analysis tasks such as urban planning. Classical methods based on hand-crafted features fail to produce satisfactory building extraction results because of the limited representation capacity of such features. In this paper, we achieve pixel-level building extraction with a deep Convolutional Neural Network (CNN) that has an encoder-decoder structure. In contrast to hand-crafted features, which require expert knowledge and have poor representation capacity, convolutional neural networks have high representation capacity and can learn highly abstract and discriminative features from data. The encoder derives a spatially compressed representation of the raw input image. This compressed representation, also called a feature of the input image, is assumed to be abstract and discriminative. The decoder takes the feature as input and recovers the spatial resolution to the size of the input image. The encoder-decoder network thus achieves pixel-wise building extraction in an end-to-end manner, from the raw image to the building extraction result. Applying the encoder-decoder network to building extraction causes a Marginal Phenomenon (MP): the prediction accuracy near the edges of a patch is usually lower than that near the central area, which reduces building extraction accuracy. To alleviate this effect, we propose the Field of View Enhancement (FoVE) method. FoVE has two parts: enlarging the patch size and cropping patches with overlaps when making predictions. It therefore has two hyper-parameters: the patch size and the overlap size.
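The overlapped-cropping part of FoVE can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the image is assumed single-channel, and `predict_patch` is a hypothetical black-box stand-in for the trained encoder-decoder network. Each patch's unreliable margin (half the overlap on each side) is discarded, and only the trusted central core is stitched into the output.

```python
import numpy as np

def fove_predict(image, predict_patch, patch_size, overlap):
    """Tile `image` (H, W) into overlapping patches, run `predict_patch`
    on each, and keep only the central core of every prediction so the
    unreliable marginal pixels are discarded. `overlap` is assumed even."""
    h, w = image.shape
    margin = overlap // 2
    stride = patch_size - 2 * margin
    # Reflect-pad so every image pixel, including border pixels,
    # falls inside the trusted central core of some patch.
    ph = -(-h // stride) * stride + 2 * margin   # ceil division
    pw = -(-w // stride) * stride + 2 * margin
    padded = np.pad(image, ((margin, ph - h - margin),
                            (margin, pw - w - margin)), mode="reflect")
    out = np.zeros((h, w), dtype=float)
    for top in range(0, h, stride):
        for left in range(0, w, stride):
            patch = padded[top:top + patch_size, left:left + patch_size]
            pred = predict_patch(patch)
            # Discard the margin; keep only the central core of the patch.
            core = pred[margin:patch_size - margin,
                        margin:patch_size - margin]
            hh = min(stride, h - top)
            ww = min(stride, w - left)
            out[top:top + hh, left:left + ww] = core[:hh, :ww]
    return out
```

With an identity predictor the stitched output reproduces the input exactly, which is a quick sanity check that the tiling and cropping indices line up.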
Extensive experiments on two building extraction datasets are conducted to analyze the impact of the two hyper-parameters through Precision-Recall Curves (PRC), and several conclusions are drawn from the analysis: (1) enlarging the input patch size at prediction time effectively improves building extraction performance, although the improvement saturates as the overlap size increases; (2) cropping patches with an overlap at prediction time improves building extraction performance, although the improvement saturates as the input patch size increases; (3) FoVE effectively improves building extraction accuracy, but the improvement has a limit; (4) the convolutional neural network itself plays the key role in building extraction, so further attention should be focused on network design. Beyond the numerical analysis of the FoVE experimental results, we attempt to explain why FoVE works and why its benefit is limited. We attribute both to the Field of View (FoV), which is why the method is called FoVE. The FoV plays an important role in building extraction, and a larger FoV is beneficial. First, the marginal phenomenon is caused by the lack of context information for marginal pixels; FoVE improves overall accuracy by discarding their unreliable predictions. Second, enlarging the input patches enlarges the FoV of each pixel and thus improves accuracy. Third, the improvement from FoVE has a limit because, once the field of view is large enough, the gain from additional contextual information becomes negligible.
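The saturation argument rests on the network's receptive field: a pixel's FoV grows with depth, kernel size, and striding, and once it covers enough context, further enlargement adds little. A small sketch of the standard receptive-field recurrence (not taken from the paper; the layer list is an arbitrary example) makes this concrete:

```python
def receptive_field(layers):
    """Receptive field (field of view, in input pixels) of one output
    pixel, for a stack of layers given as (kernel_size, stride) pairs.
    Uses the standard recurrence: rf += (k - 1) * jump; jump *= s,
    where `jump` is the distance in input pixels between adjacent
    output pixels at the current depth."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three 3x3 stride-1 convolutions see a 7x7 input window,
# the same FoV as a single 7x7 convolution.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(7, 1)]))                  # 7
```

Strided layers (pooling or strided convolution) multiply `jump`, so the FoV of an encoder grows rapidly with depth; a pixel whose FoV already extends past the patch boundary is exactly a "marginal pixel" whose context is cut off.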
Original language: English (US)
Pages (from-to): 1134-1142
Number of pages: 9
Journal: Yaogan Xuebao/Journal of Remote Sensing
Issue number: 9
State: Published - Sep 25 2020
Externally published: Yes

Bibliographical note

Generated from Scopus record by KAUST IRTS on 2023-09-21


