studies cannot be directly applied. Therefore, how to make
full use of context information in infrared small target
detection is an urgent problem to be addressed.
In our research, we propose a data-driven approach
called the attention-guided pyramid context network
(AGPCNet). For the contextual information of feature pixels,
we propose an attention-guided context block (AGCB), which
perceives the target location within patches via local
semantic association (LSA) and suppresses structured clutter.
The correlation between patches is estimated by global
context attention (GCA), which further suppresses point-like
highlighted noise by drawing on context information.
For the multiscale representation of deep semantics, we
propose a context pyramid module (CPM), which collates
AGCBs at multiple scales into a feature pyramid and fuses
them with the original feature map. In the upsampling stage,
we propose an asymmetric fusion module (AFM), which takes
both low-level and deep semantics as input and fuses the
extracted attention with the feature map, preserving as much
information as possible about small targets.
The main contributions of this work can be summarized
as follows:
1) We propose AGPCNet for infrared small target
detection. Contextual information for infrared small
targets is integrated and explored through CPM.
AFM fuses low-level and deep semantics in the upsampling
stage to retain more useful information.
2) We propose AGCB for perceiving contextual information
of infrared small targets at a local scale. It
uses LSA and GCA to estimate the correlation of
pixels within and between patches, which highlights
targets and suppresses the background.
3) Experiments on three datasets and six complex
scenes demonstrate the effectiveness of each module.
Compared with state-of-the-art methods, AGPCNet
achieves better and more robust performance under
complex backgrounds.
The rest of this article is organized as follows. Section II
reviews related work; Section III describes AGPCNet and
its individual modules in detail; Section IV confirms the
effectiveness of the proposed network with systematic
experiments; finally, Section V concludes this article.
II. RELATED WORK
Context Modules: Since the introduction of the nonlocal
(NL) network [22], context modules have been widely used for
tasks such as semantic segmentation and target detection
due to their excellent performance and ease of embedding.
Subsequent work set out to improve their performance
[25], [26]. Dual attention network (DANet) [23] enhances
feature representation by simultaneously estimating
the correlation between channels and between pixels. Global
context network (GCNet) [24] combines pixel correlation with
channel attention [27] to achieve promising performance.
Point-wise spatial attention network (PSANet) [28] divides
the correlation between two pixels into “collection” and
“distribution” and computes the two processes independently.
Criss-cross network (CCNet) [29] addresses the high
complexity of NL by restricting the computation from global
to criss-cross paths and using recurrent operations to reach
the global correlation, greatly improving computational
efficiency.
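To make the efficiency argument concrete, the following minimal sketch counts the pairwise affinities that each scheme computes on an H × W feature map. It is a back-of-the-envelope illustration, not code from [22] or [29]: the function names are hypothetical, the channel dimension is ignored, and the `passes` argument (the number of recurrent criss-cross passes) is an assumption of this sketch.

```python
def nonlocal_pairs(h: int, w: int) -> int:
    """Affinities in a full nonlocal block: every pixel against every pixel."""
    n = h * w
    return n * n


def criss_cross_pairs(h: int, w: int, passes: int = 1) -> int:
    """Affinities in criss-cross attention.

    Each pixel attends only to its own row and column, i.e. h + w - 1
    positions (the pixel itself is counted once). `passes` is an assumed
    number of recurrent passes used to propagate information globally.
    """
    return passes * h * w * (h + w - 1)


if __name__ == "__main__":
    h, w = 60, 60  # illustrative feature-map size, not taken from the paper
    full = nonlocal_pairs(h, w)
    cross = criss_cross_pairs(h, w, passes=2)
    print(f"nonlocal:    {full} affinities")
    print(f"criss-cross: {cross} affinities (2 passes)")
    print(f"reduction:   {full / cross:.1f}x")
```

The count grows quadratically in H·W for the nonlocal block but only as H·W·(H + W − 1) per criss-cross pass, which is where the efficiency gain comes from.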
Infrared Small Target Detection Networks: In visible
image data, small target detection is gradually receiving
more attention [30], [31], [32], [33], [34]. However, the
unique infrared imaging mechanism leads to a great
difference between infrared images and visible images in
terms of background and target. Meanwhile, infrared small
target detection technology has been studied for decades
as a key component of modern information systems. In
recent years, the release of public datasets has facilitated
the development of neural networks. The miss detection
false alarm (MDFA) dataset [20] contains 10 000 training
images and 100 test images, and the single-frame infrared
small target (SIRST) dataset [19] has 427 images in total.
The infrared small target detection task has also been
intensively studied as a target pixel segmentation task.
Scholars have proposed solutions from a variety of
perspectives, such as generative adversarial networks [20],
cross-layer feature fusion, and feature tensor local contrast
[21]. These data-driven approaches have achieved promising
results, but they ignore the correlation between feature
pixels in neural networks, which can be exploited as a key
cue to improve performance.
III. ATTENTION-GUIDED PYRAMID CONTEXT NETWORK
We present the details of our overall network architecture
in this section. CPM, AGCB, and AFM are described
individually. In particular, the AGCB contains a global
context attention (GCA) submodule and a local semantic
association (LSA) submodule.
A. Network Architecture and CPM
Fig. 1(a) shows the whole network pipeline. Given an input
image, a feature map X of size H × W × C is generated by
a deep convolutional neural network (e.g., ResNet [35]).
To reduce the feature loss caused by downsampling and
preserve small-target information, we remove the max-pooling
layer and set the stride of the first convolutional layer to
one. The last three convolutional blocks perform
downsampling, so the spatial size of the feature map X is
1/8 of that of the input image.
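The stride arithmetic behind the 1/8 figure can be checked with a short sketch. The stage names below follow a common four-stage ResNet layout and are assumptions of this sketch, as is the exact assignment of strides to stages; the paper states only that the max-pooling layer is removed, the first convolution uses stride one, and the last three blocks downsample.

```python
from math import prod

# Strides of a standard ImageNet-style ResNet (overall stride 32).
STANDARD_RESNET = {
    "conv1": 2, "maxpool": 2,
    "layer1": 1, "layer2": 2, "layer3": 2, "layer4": 2,
}

# Assumed layout of the modified backbone: no max-pooling, stride-1 stem,
# and three downsampling blocks.
MODIFIED_BACKBONE = {
    "conv1": 1,
    "layer1": 1, "layer2": 2, "layer3": 2, "layer4": 2,
}


def overall_stride(strides: dict[str, int]) -> int:
    """Product of all per-stage strides."""
    return prod(strides.values())


def feature_size(height: int, width: int, strides: dict[str, int]) -> tuple[int, int]:
    """Spatial size of the output map, assuming 'same'-style padding so that
    each stride-s layer maps n pixels to ceil(n / s)."""
    s = overall_stride(strides)
    return -(-height // s), -(-width // s)


if __name__ == "__main__":
    for name, cfg in [("standard", STANDARD_RESNET), ("modified", MODIFIED_BACKBONE)]:
        print(name, "stride:", overall_stride(cfg),
              "output for 256x256:", feature_size(256, 256, cfg))
```

Under these assumptions, a target only a few pixels wide still covers a fraction of a feature-map cell at stride 8, whereas at stride 32 it would be far smaller than one cell.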
Then, we pass the feature map X through the CPM to
obtain the integrated feature map C. CPM feeds X into
AGCBs at multiple scales in parallel, where the scales are
denoted as $S \in \{S_1, \ldots, S_n\}$. For each scale $S_i$,
AGCB integrates contextual information through the operations
in Fig. 1(b), retaining key information for small targets and
obtaining the feature map $A_{S_i}$. The details of AGCB
and GCA are described in the next section. In the next step,
we concatenate the $\{A_{S_i}\}$ obtained by AGCB at
multiple scales with the feature map X. Finally, we fuse the