Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling

He, Qing; Xiu, Zhiping; Koehler, Thilo; Wu, Jilong

arXiv:2104.00705 (cs)

[Submitted on 1 Apr 2021]

Abstract: Typical high-quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate an encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high-quality speech, they often incur an $O(L)$ increase in both latency and real-time factor (RTF) with respect to input length $L$. In other words, longer inputs lead to longer delays and slower synthesis, limiting the use of these models in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to ground truth 4.48), low latency, and low RTF at the same time. Moreover, both the latency and the RTF of the proposed system stay constant regardless of input length, making it ideal for real-time applications.
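To make the scaling claim in the abstract concrete, the following is a minimal sketch of a toy cost model, not the paper's implementation or measurements. RTF is taken in its usual sense (synthesis time divided by the duration of the audio produced). Every constant and the function name `toy_profile` are hypothetical, and the premise that the non-streaming model encodes the whole input before emitting audio and attends over all $L$ inputs for each frame is an illustrative assumption about where the $O(L)$ growth could come from.

```python
# Toy cost model; all constants are hypothetical, chosen only for illustration.
ENCODE_COST = 0.0005      # seconds of compute per input symbol
DECODE_COST = 0.002       # fixed seconds of compute per spectral frame
ATTEND_COST = 0.00001     # seconds per attended position, per frame
FRAMES_PER_INPUT = 5      # spectral frames produced per input symbol
FRAME_SECONDS = 0.0125    # audio duration covered by one frame
FIXED_WINDOW = 10         # inputs needed before the first frame when streaming


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def toy_profile(input_length: int, cost_grows_with_input: bool) -> tuple[float, float]:
    """Return (latency_seconds, rtf) under the toy model."""
    frames = input_length * FRAMES_PER_INPUT
    audio_seconds = frames * FRAME_SECONDS
    if cost_grows_with_input:
        # Assumed: whole input encoded before the first frame,
        # and every frame attends over all inputs.
        latency = ENCODE_COST * input_length
        per_frame = DECODE_COST + ATTEND_COST * input_length
    else:
        # Assumed: only a fixed window is needed before the first frame,
        # and every frame attends over a fixed-size context.
        latency = ENCODE_COST * FIXED_WINDOW
        per_frame = DECODE_COST + ATTEND_COST * FIXED_WINDOW
    total = ENCODE_COST * input_length + per_frame * frames
    return latency, real_time_factor(total, audio_seconds)


if __name__ == "__main__":
    for length in (50, 100, 200, 400):
        print(length, toy_profile(length, True), toy_profile(length, False))
```

Under these assumptions the first profile's latency and RTF both grow linearly with the input length, while the second profile's stay flat, which is the qualitative behavior the abstract contrasts.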

Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2104.00705 [cs.SD]
	(or arXiv:2104.00705v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2104.00705

Submission history

From: Qing He
[v1] Thu, 1 Apr 2021 18:15:30 UTC (2,969 KB)
