Efficient Spatialtemporal Context Modeling for Action Recognition

Cao, Congqi; Lu, Yue; Zhang, Yifan; Jiang, Dongmei; Zhang, Yanning

arXiv:2103.11190 (cs)

[Submitted on 20 Mar 2021 (v1), last revised 6 Apr 2021 (this version, v2)]

Abstract: Contextual information plays an important role in action recognition. Local operations have difficulty modeling the relation between two elements separated by a long distance. However, directly modeling the contextual information between any two points incurs a huge cost in computation and memory, especially for action recognition, where there is an additional temporal dimension. Inspired by the 2D criss-cross attention used in segmentation tasks, we propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range spatiotemporal contextual information in video for action recognition. The global context is factorized into sparse relation maps. Each time, we model the relationship between points on the same line along the horizontal, vertical, and depth directions, which forms a 3D criss-cross structure. We then repeat the same operation with a recurrent mechanism to propagate the relation between points from a line to a plane and finally to the whole spatiotemporal space. Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25% and 30%, respectively, for video context modeling. We evaluate the performance of RCCA-3D with two recent action recognition networks on three datasets and thoroughly analyze the architecture, obtaining the optimal way to factorize and fuse the relation maps. Comparisons with other state-of-the-art methods demonstrate the effectiveness and efficiency of our model.
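
The line-to-plane-to-volume propagation described in the abstract can be illustrated with a small counting sketch. It is not the authors' implementation: it contains no learned attention weights and only tracks which positions can exchange information. The feature-map shape, the function names, and the per-position count of T + H + W - 2 are assumptions made for illustration, and the relation counts are not the parameter or FLOP figures reported in the abstract.

```python
from itertools import product

def criss_cross_neighbors(p, shape):
    """Positions on the same line as p along the T, H, or W axis (p included)."""
    neighbors = {p}
    for axis, size in enumerate(shape):
        for v in range(size):
            q = list(p)
            q[axis] = v
            neighbors.add(tuple(q))
    return neighbors

def reachable_after(p, shape, steps):
    """Positions whose information can reach p after `steps` recurrent passes."""
    reached = {p}
    for _ in range(steps):
        # Each pass extends every reached position along its three lines.
        reached = set().union(*(criss_cross_neighbors(q, shape) for q in reached))
    return reached

def relation_counts(shape):
    """Pairwise relations in one pass: dense (all pairs) vs. criss-cross."""
    t, h, w = shape
    n = t * h * w
    dense = n * n                  # every position relates to every position
    sparse = n * (t + h + w - 2)   # every position relates to its three lines
    return dense, sparse

shape = (4, 6, 6)   # hypothetical T, H, W of a feature map
origin = (0, 0, 0)
total = len(list(product(*(range(s) for s in shape))))
for steps in (1, 2, 3):
    covered = len(reachable_after(origin, shape, steps))
    print(f"passes={steps}: {covered} of {total} positions reachable")
print("dense vs. criss-cross relations per pass:", relation_counts(shape))
```

Under these assumptions, one pass covers the three lines through a position, two passes cover the three planes through it, and three passes cover the whole volume, while each pass stores far fewer relations than a dense all-pairs map.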

Comments:	16 pages, 7 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2103.11190 [cs.CV]
	(or arXiv:2103.11190v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2103.11190

Submission history

From: Yue Lu
[v1] Sat, 20 Mar 2021 14:48:12 UTC (1,243 KB)
[v2] Tue, 6 Apr 2021 04:40:12 UTC (1,362 KB)
