cuFalcon: An Adaptive Parallel GPU Implementation for High-Performance Falcon Acceleration

Wenqian Li; Hanyu Wei; Shiyu Shen; Hao Yang; Wangchen Dai; Yunlei Zhao

Paper 2025/249

Wenqian Li, Fudan University

Hanyu Wei, Fudan University

Shiyu Shen, City University of Hong Kong

Hao Yang, City University of Hong Kong

Wangchen Dai, Sun Yat-sen University

Yunlei Zhao, Fudan University, State Key Laboratory of Cryptology

Abstract

The rapid advancement of quantum computing has ushered in a new era of post-quantum cryptography, urgently demanding quantum-resistant digital signatures to secure modern communications and transactions. Among NIST-standardized candidates, Falcon, a compact lattice-based signature scheme, stands out for its suitability in size-sensitive applications. In this paper, we present cuFalcon, a high-throughput GPU implementation of Falcon that addresses its computational bottlenecks through adaptive parallel strategies. At the operational level, we optimize Falcon's key components for GPU architectures through memory-efficient FFT, adaptive parallel ffSampling, and a compact computation mode. At the signature level, we implement three versions of cuFalcon: the raw key version, the expanded key version, and the balanced version, which trades off efficiency against memory usage. Additionally, we design batch processing, streaming mechanisms, and memory pooling to handle multiple signature tasks efficiently. Performance evaluations show significant improvements, with the raw key version achieving 172k signatures per second and the expanded key version reaching 201k. The balanced version achieves 7% higher throughput than the raw key version while using 70% less memory than the expanded key version. Furthermore, our raw key implementation outperforms the reference implementation by 36.75$\times$ and achieves a 2.94$\times$ speedup over the state-of-the-art GPU implementation.
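
The relative figures in the abstract can be turned into approximate absolute throughputs with a few lines of arithmetic. The sketch below is illustrative only: it assumes every quoted ratio is a throughput ratio measured against the raw key version on the same parameter set, which the abstract does not state explicitly. The function and constant names are hypothetical, and the derived values are not results reported by the authors.

```python
# Figures quoted in the abstract (thousands of signatures per second).
RAW_KEY_KSIGS = 172.0
EXPANDED_KEY_KSIGS = 201.0

# Relative figures quoted in the abstract.
BALANCED_GAIN_OVER_RAW = 0.07   # balanced vs. raw key throughput
SPEEDUP_VS_REFERENCE = 36.75    # raw key vs. reference implementation
SPEEDUP_VS_PRIOR_GPU = 2.94     # raw key vs. prior GPU implementation


def implied_figures():
    """Derive throughputs the abstract implies but does not state.

    Assumption (not confirmed by the source): each ratio compares
    throughput against the raw key version on the same parameter set.
    """
    return {
        "balanced_ksigs": RAW_KEY_KSIGS * (1 + BALANCED_GAIN_OVER_RAW),
        "reference_ksigs": RAW_KEY_KSIGS / SPEEDUP_VS_REFERENCE,
        "prior_gpu_ksigs": RAW_KEY_KSIGS / SPEEDUP_VS_PRIOR_GPU,
        "expanded_gain_over_raw": EXPANDED_KEY_KSIGS / RAW_KEY_KSIGS - 1,
    }


if __name__ == "__main__":
    for name, value in implied_figures().items():
        print(f"{name}: {value:.2f}")
```

Under that assumption, the balanced version would land at roughly 184k signatures per second, between the raw key and expanded key versions, and the expanded key version would be about 17% faster than the raw key version.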

Metadata

Available format(s): PDF
Category: Implementation
Publication info: Preprint.
Keywords: Post-Quantum Cryptography, Falcon, Fast Fourier Sampling, GPU acceleration
Contact author(s): liwq24 @ m fudan edu cn
hywei24 @ m fudan edu cn
crypto @ sher1e dev
crypto @ d4rk dev
daiwch @ mail sysu edu cn
ylzhao @ fudan edu cn
History: 2025-02-18: approved; 2025-02-17: received
Short URL: https://ia.cr/2025/249
License: CC BY

BibTeX

@misc{cryptoeprint:2025/249,
      author = {Wenqian Li and Hanyu Wei and Shiyu Shen and Hao Yang and Wangchen Dai and Yunlei Zhao},
      title = {{cuFalcon}: An Adaptive Parallel {GPU} Implementation for High-Performance Falcon Acceleration},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/249},
      year = {2025},
      url = {https://eprint.iacr.org/2025/249}
}