Asymmetric Contextual Modulation for Infrared Small Target Detection
Yimian Dai
Yiquan Wu
Fei Zhou
Kobus Barnard
College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics
Department of Computer Science, University of Arizona
Single-frame infrared small target detection remains a challenge
not only because intrinsic target characteristics are scarce but
also because a public dataset has been lacking. In this paper, we
first contribute an open dataset with high-quality annotations to
advance research in this field. We also propose an asymmetric
contextual modulation module specially designed for detecting
infrared small targets. To better highlight small targets, we
supplement the top-down global contextual feedback with a bottom-up
modulation pathway based on point-wise channel attention, so that
high-level semantics and subtle low-level details are exchanged. We
report ablation studies and comparisons to state-of-the-art methods,
in which our approach performs significantly better. Our dataset and
code are available online.
1. Introduction
Infrared small target detection is a key technique for applications
including early warning systems, precision-guided weapons, and
maritime surveillance systems. In many cases, the traditional
assumption of a static background does not apply [ ]. Therefore,
researchers have recently started to pay more attention to the
single-frame detection problem [ ].
The prevalent idea from the signal processing community is to
directly build models that measure the contrast between the infrared
small target and its neighborhood context [ ]. By applying a
threshold on the final saliency map, the potential targets are then
segmented out. Despite being learning-free and computationally
friendly, these model-driven methods suffer from the following
shortcomings:
• The target hypotheses of globally unique saliency, sparsity, or
  high contrast do not hold in real-world images. Real dim targets
  can be inconspicuous and low-contrast, whereas many background
  distractors satisfy these hypotheses, resulting in many false
  alarms.
• Many hyper-parameters, such as those in [ ] and in [ ], are
  sensitive and highly dependent on the image content, so these
  methods are not robust enough for highly variable scenes.
In short, these methods are handicapped because they lack a
high-level understanding of the holistic scene, making them
incapable of detecting extremely dim targets and of removing salient
distractors. Hence, it is necessary to embed high-level contextual
semantics into models for better detection.
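The contrast-then-threshold pipeline described above can be sketched in a few lines of Python. This is a generic centre-minus-surround illustration written for this section, not any of the cited methods: the window radii `inner` and `outer`, the threshold factor `k`, and the synthetic frame built by `make_scene` are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def local_contrast_map(img, inner=1, outer=3):
    """Centre-minus-surround contrast for every pixel of a 2-D list image.

    `inner` and `outer` are hypothetical window radii, not values from the paper.
    """
    h, w = len(img), len(img[0])
    saliency = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            centre, ring = [], []
            for dy in range(-outer, outer + 1):
                for dx in range(-outer, outer + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue  # ignore pixels outside the image
                    if max(abs(dy), abs(dx)) <= inner:
                        centre.append(img[yy][xx])
                    else:
                        ring.append(img[yy][xx])
            if ring:
                # Only pixels brighter than their surround count as salient.
                saliency[y][x] = max(0.0, mean(centre) - mean(ring))
    return saliency


def segment(saliency, k=3.0):
    """Binary mask from an adaptive threshold: mean + k * std of the saliency map."""
    flat = [v for row in saliency for v in row]
    threshold = mean(flat) + k * pstdev(flat)
    return [[1 if v > threshold else 0 for v in row] for row in saliency]


def make_scene(size=15, background=10, target=50):
    """Synthetic frame: flat background with one 3 x 3 bright target in the middle."""
    img = [[background] * size for _ in range(size)]
    mid = size // 2
    for y in range(mid - 1, mid + 2):
        for x in range(mid - 1, mid + 2):
            img[y][x] = target
    return img


mask = segment(local_contrast_map(make_scene()))
```

On this clean synthetic frame the mask marks the target centre and leaves the flat background empty. The sketch also makes both shortcomings concrete: any bright, compact background structure would pass the same test, and the result depends directly on the hand-picked `inner`, `outer`, and `k`.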
1.1. Motivation
It is well known that deep networks can provide high-level semantic
features [ ], and attention modules can further boost the
representation power of CNNs by capturing long-range contextual
interactions [ ]. However, despite the great success of
convolutional neural networks in object detection and
segmentation [ ], very few deep learning approaches have been
studied in the field of infrared small target detection. We suggest
the principal reasons are as follows:
1. Lack of a public dataset so far. Deep learning is data-hungry.
However, until now, there has been no public infrared small target
dataset with high-quality annotations for the single-frame detection
scenario, on which various new approaches can be trained, tested,
and compared.
2. Minimal intrinsic information. SPIE defines the infrared small
target as having a total spatial extent of less than 80 pixels
(9 × 9) of a 256 × 256 image [34]. The lack of texture or shape
characteristics makes purely target-centered representations
inadequate for reliable detection. In deep networks in particular,
small targets can be easily overwhelmed by complex surroundings.
3. Contradiction between resolution and semantics. Infrared small
targets are often submerged in complicated backgrounds with low
signal-to-clutter ratios. For networks, detecting these dim targets
with few false alarms requires both a high-level semantic
understanding of the whole infrared image and a fine-resolution
prediction map. This is an inherent contradiction in deep networks,
since they learn more semantic representations by gradually reducing
the feature-map size [14].
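The sizes quoted in reasons 2 and 3 can be made concrete with a little arithmetic. The sketch below is only illustrative: the 80-pixel, 9 × 9, and 256 × 256 figures come from the text, while the stride set (1 to 32) is an assumption modelled on common classification backbones, and both function names are hypothetical.

```python
def target_fraction(target_pixels, image_side):
    """Share of the image area occupied by the target."""
    return target_pixels / (image_side * image_side)


def footprint_per_stride(target_side, image_side, strides=(1, 2, 4, 8, 16, 32)):
    """(stride, feature-map side, target side in feature cells) for each stride.

    The stride set is an assumption, not a configuration taken from the paper.
    """
    return [(s, image_side // s, target_side / s) for s in strides]


print(f"target share of image: {target_fraction(80, 256):.4%}")
for stride, fmap_side, extent in footprint_per_stride(9, 256):
    print(f"stride {stride:2d}: {fmap_side:3d} x {fmap_side:<3d} map, "
          f"target spans {extent:.2f} cells")
```

An 80-pixel target covers roughly 0.12% of a 256 × 256 image. Under the assumed strides, a 9-pixel-wide target spans less than one feature cell from stride 16 onward, which is where the most semantic features usually sit.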
In addition, state-of-the-art networks are designed for generic
image datasets [15, 19]. Directly using them for infrared small
target detection can fail catastrophically because of the large
difference in data distribution. Applying them to this task requires
a re-customization of the network in multiple aspects including
