Asymmetric Contextual Modulation for Infrared Small Target Detection
Yimian Dai¹, Yiquan Wu¹, Fei Zhou¹, Kobus Barnard²
¹ College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics
² Department of Computer Science, University of Arizona
Abstract
Single-frame infrared small target detection remains a challenge not only due to the scarcity of intrinsic target characteristics but also because of the lack of a public dataset. In this paper, we first contribute an open dataset with high-quality annotations to advance the research in this field. We also propose an asymmetric contextual modulation module specially designed for detecting infrared small targets. To better highlight small targets, besides a top-down global contextual feedback, we supplement a bottom-up modulation pathway based on point-wise channel attention for exchanging high-level semantics and subtle low-level details. We report ablation studies and comparisons to state-of-the-art methods, where we find that our approach performs significantly better. Our dataset and code are available online¹.
1. Introduction
Infrared small target detection is a key technique for applications including early warning systems, precision-guided weapons, and maritime surveillance systems. In many cases, the traditional assumption of a static background does not apply [17]. Therefore, researchers have recently started to pay more attention to the single-frame detection problem [10].
The prevalent idea from the signal processing community is to directly build models that measure the contrast between the infrared small target and its neighborhood context [2, 10]. Potential targets are then segmented out by applying a threshold to the final saliency map. Despite being learning-free and computationally friendly, these model-driven methods suffer from the following shortcomings:
1. The target hypotheses of globally unique saliency, sparsity, or high contrast do not hold in real-world images. Real dim targets can be inconspicuous and low-contrast, whereas many background distractors satisfy these hypotheses, resulting in many false alarms.
2. Many hyper-parameters, such as λ in [10] and h in [4], are sensitive and highly dependent on the image content, so these methods are not robust enough for highly variable scenes.
¹ https://github.com/YimianDai/open-acm
In short, these methods are handicapped because they lack a high-level understanding of the holistic scene, making them incapable of detecting extremely dim targets and removing salient distractors. Hence, it is necessary to embed high-level contextual semantics into models for better detection.
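To make the contrast-then-threshold pipeline discussed above concrete, the following is a minimal sketch of a generic center-surround contrast map followed by a global threshold. It is not the specific measure of any cited method; the function names, the window sizes `inner` and `outer`, and the threshold factor `k` are all assumptions chosen for illustration.

```python
import numpy as np

def box_mean(img, k):
    """Mean over a k x k window (k odd) centred on each pixel, edge-padded."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    window_sum = (ii[k:k + h, k:k + w] - ii[:h, k:k + w]
                  - ii[k:k + h, :w] + ii[:h, :w])
    return window_sum / (k * k)

def contrast_saliency(img, inner=3, outer=9):
    """Centre mean minus surrounding-ring mean, clipped at zero (hypothetical measure)."""
    img = img.astype(np.float64)
    n_in, n_out = inner * inner, outer * outer
    centre = box_mean(img, inner)
    # Ring mean = (outer window sum - inner window sum) / number of ring pixels.
    ring = (box_mean(img, outer) * n_out - centre * n_in) / (n_out - n_in)
    return np.maximum(centre - ring, 0.0)

def segment(saliency, k=4.0):
    """Keep pixels above mean + k * std; k is an assumed, scene-dependent factor."""
    return saliency > saliency.mean() + k * saliency.std()

# Synthetic example: Gaussian background noise plus one bright 3 x 3 target.
rng = np.random.default_rng(0)
img = rng.normal(0.0, 1.0, size=(64, 64))
img[19:22, 29:32] += 10.0

saliency = contrast_saliency(img)
mask = segment(saliency)
print("target centre detected:", bool(mask[20, 30]))
print("pixels above threshold:", int(mask.sum()))
```

In this sketch the result depends directly on `inner`, `outer`, and `k`: a dimmer target or a cluttered background would require different values, which is the kind of hyper-parameter sensitivity the second shortcoming describes.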
1.1. Motivation
It is well known that deep networks can provide high-level semantic features [12], and attention modules can further boost the representation power of CNNs by capturing long-range contextual interactions [9]. However, despite the great success of convolutional neural networks in object detection and segmentation [36], very few deep learning approaches have been studied in the field of infrared small target detection. We suggest the principal reasons are as follows:
1. Lack of a public dataset so far. Deep learning is data-hungry. However, until now, there has been no public infrared small target dataset with high-quality annotations for the single-frame detection scenario, on which various new approaches can be trained, tested, and compared.
2. Minimal intrinsic information. SPIE defines the infrared small target as having a total spatial extent of less than 80 pixels (roughly 9 × 9) of a 256 × 256 image [34]. The lack of texture or shape characteristics makes purely target-centered representations inadequate for reliable detection. In deep networks especially, small targets can be easily overwhelmed by complex surroundings.
3. Contradiction between resolution and semantics. Infrared small targets are often submerged in complicated backgrounds with low signal-to-clutter ratios. For networks, detecting these dim targets with low false alarms needs both a high-level semantic understanding of the whole infrared image and a fine-resolution prediction map, which is an endogenous contradiction of deep networks since they learn more semantic representations by gradually attenuating the feature size [14].
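The scale of the second and third problems can be illustrated with a short calculation. The sketch below takes the 80-pixel bound, the 9-pixel side, and the 256 × 256 image size from the text; the list of cumulative strides is an assumption standing in for a typical classification backbone, not a description of any particular network.

```python
IMAGE_SIDE = 256      # image side length quoted in the text
TARGET_PIXELS = 80    # upper bound on target area quoted in the text
TARGET_SIDE = 9       # approximate target side length quoted in the text

# Fraction of the image area that a maximal small target can occupy.
fraction = TARGET_PIXELS / (IMAGE_SIDE * IMAGE_SIDE)
print(f"target area fraction: {fraction:.4%}")

# Assumed cumulative downsampling strides of a generic backbone.
strides = [1, 2, 4, 8, 16, 32]

# Target side length measured in feature-map cells at each stride.
footprints = {s: TARGET_SIDE / s for s in strides}
for s, side in footprints.items():
    fmap = IMAGE_SIDE // s
    print(f"stride {s:>2}: feature map {fmap:>3} x {fmap:<3} target side ~ {side:.2f} cells")
```

Under these assumptions the largest admissible target covers about 0.12% of the image, and from stride 16 onward it spans less than one feature-map cell, so the deeper, more semantic layers can no longer localize it on their own.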
In addition, these state-of-the-art networks are designed for generic image datasets [15, 19]. Directly using them for infrared small target detection can fail catastrophically due to the large difference in the data distribution. This requires re-customizing the network in multiple aspects including
2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
978-1-6654-0477-8/21/$31.00 ©2021 IEEE
DOI 10.1109/WACV48630.2021.00099