TY - JOUR
T1 - Multiscale Multiinteraction Network for Remote Sensing Image Captioning
AU - Wang, Yong
AU - Zhang, Wenkai
AU - Zhang, Zhengyuan
AU - Gao, Xin
AU - Sun, Xian
N1 - Generated from Scopus record by KAUST IRTS on 2023-09-21
PY - 2022/1/1
Y1 - 2022/1/1
N2 - Much of the recent work in remote sensing image captioning is influenced by natural image captioning. These methods tend to improve on previous work by fixing defects in the model architecture, but pay little attention to the differences between remote sensing images and natural images. By considering these differences, we propose a multiscale multiinteraction remote sensing image captioning model. As shown in Fig. 1(a), the targets in remote sensing images have a wide range of scales, whereas natural images are generally taken close-up, resulting in a similar scale for the foreground targets. Because of this difference in shooting methods, a model pretrained on close-up natural images cannot capture multiscale remote sensing targets well. To alleviate this problem, we propose a two-stage multiscale structure for feature representation, in which we first fine-tune the CNN backbone on remote sensing images for domain adaptation and then collect features from different stages as the multiscale feature representation. Moreover, because of the shooting distance, the height information of targets in remote sensing images is greatly weakened, so some objects such as low plants and grasses become difficult to identify, as shown in Fig. 1(b). We therefore further propose a multiinteraction feature representation module, in which information flows within the same layer and across different layers can interact effectively. By calculating similarity scores among features, we fuse features with high similarity and increase the distance between features of different categories, thereby enhancing their distinguishability. Results on RSICD, Sydney-Captions, and UCM-Captions show a clear improvement over the compared methods.
AB - Much of the recent work in remote sensing image captioning is influenced by natural image captioning. These methods tend to improve on previous work by fixing defects in the model architecture, but pay little attention to the differences between remote sensing images and natural images. By considering these differences, we propose a multiscale multiinteraction remote sensing image captioning model. As shown in Fig. 1(a), the targets in remote sensing images have a wide range of scales, whereas natural images are generally taken close-up, resulting in a similar scale for the foreground targets. Because of this difference in shooting methods, a model pretrained on close-up natural images cannot capture multiscale remote sensing targets well. To alleviate this problem, we propose a two-stage multiscale structure for feature representation, in which we first fine-tune the CNN backbone on remote sensing images for domain adaptation and then collect features from different stages as the multiscale feature representation. Moreover, because of the shooting distance, the height information of targets in remote sensing images is greatly weakened, so some objects such as low plants and grasses become difficult to identify, as shown in Fig. 1(b). We therefore further propose a multiinteraction feature representation module, in which information flows within the same layer and across different layers can interact effectively. By calculating similarity scores among features, we fuse features with high similarity and increase the distance between features of different categories, thereby enhancing their distinguishability. Results on RSICD, Sydney-Captions, and UCM-Captions show a clear improvement over the compared methods.
UR - https://ieeexplore.ieee.org/document/9720234/
UR - http://www.scopus.com/inward/record.url?scp=85125332888&partnerID=8YFLogxK
U2 - 10.1109/JSTARS.2022.3153636
DO - 10.1109/JSTARS.2022.3153636
M3 - Article
SN - 2151-1535
VL - 15
SP - 2154
EP - 2165
JO - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
JF - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
ER -