StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Li, Yulin; Qian, Yuxi; Yu, Yuchen; Qin, Xiameng; Zhang, Chengquan; Liu, Yan; Yao, Kun; Han, Junyu; Liu, Jingtuo; Ding, Errui

arXiv:2108.02923 (cs)

[Submitted on 6 Aug 2021 (v1), last revised 8 Nov 2021 (this version, v3)]

Abstract: Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Because content and layout in VRDs are complex, structured text understanding remains a challenging task. Most existing studies decouple the problem into two sub-tasks, entity labeling and entity linking, which require a thorough understanding of document context at both the token and segment levels. However, little work has addressed how to efficiently extract structured data across these levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, building on the transformer, we introduce a segment-token aligned encoder to handle entity labeling and entity linking at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at the segment level and the token level and show that it significantly outperforms state-of-the-art counterparts on the FUNSD, SROIE, and EPHOIE datasets.
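The abstract names the two new pre-training tasks but does not say how their targets are built. The sketch below is one plausible way to derive self-supervised labels for them, not the authors' implementation. It assumes that Sentence Length Prediction targets the number of tokens in each text segment, and that Paired Boxes Direction targets a discretized angle between the centers of two segment boxes. The function names, the `(x0, y0, x1, y1)` box format, and the default of 8 direction buckets are all hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in image coordinates,
# where y grows downward as is usual for images.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def direction_bucket(src: Box, dst: Box, num_buckets: int = 8) -> int:
    """Assumed Paired Boxes Direction label: the angle from the center of
    `src` to the center of `dst`, discretized into `num_buckets` equal
    sectors over the full circle. Bucket 0 starts at the positive x axis."""
    sx, sy = box_center(src)
    dx, dy = box_center(dst)
    # atan2 returns a value in [-pi, pi]; shift it into [0, 2*pi).
    angle = math.atan2(dy - sy, dx - sx) % (2 * math.pi)
    sector = 2 * math.pi / num_buckets
    # The final modulo guards against rounding that lands exactly on 2*pi.
    return int(angle // sector) % num_buckets


def segment_length_labels(segments: List[List[str]]) -> List[int]:
    """Assumed Sentence Length Prediction label: token count per segment."""
    return [len(tokens) for tokens in segments]
```

For example, `direction_bucket((0, 0, 2, 2), (10, 1, 12, 3))` places a box that lies almost directly to the right in bucket 0, and `segment_length_labels([["Total", ":"], ["42"]])` gives one length per segment. Both labels come from OCR output alone, which is what makes such tasks self-supervised.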

Comments:	ACM Multimedia 2021. 9 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2108.02923 [cs.CV]
	(or arXiv:2108.02923v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2108.02923

Submission history

From: Xiameng Qin
[v1] Fri, 6 Aug 2021 02:57:07 UTC (8,273 KB)
[v2] Tue, 10 Aug 2021 03:44:20 UTC (8,273 KB)
[v3] Mon, 8 Nov 2021 11:29:21 UTC (8,273 KB)
