Paper 2025/249

cuFalcon: An Adaptive Parallel GPU Implementation for High-Performance Falcon Acceleration

Wenqian Li, Fudan University
Hanyu Wei, Fudan University
Shiyu Shen, City University of Hong Kong
Hao Yang, City University of Hong Kong
Wangchen Dai, Sun Yat-sen University
Yunlei Zhao, Fudan University, State Key Laboratory of Cryptology
Abstract

The rapid advancement of quantum computing has ushered in a new era of post-quantum cryptography, urgently demanding quantum-resistant digital signatures to secure modern communications and transactions. Among NIST-standardized candidates, Falcon, a compact lattice-based signature scheme, stands out for its suitability in size-sensitive applications. In this paper, we present cuFalcon, a high-throughput GPU implementation of Falcon that addresses its computational bottlenecks through adaptive parallel strategies. At the operational level, we optimize Falcon's key components for GPU architectures through a memory-efficient FFT, adaptive parallel ffSampling, and a compact computation mode. At the signature level, we implement three versions of cuFalcon: the raw key version, the expanded key version, and the balanced version, which trades off efficiency against memory usage. Additionally, we design batch processing, streaming mechanisms, and memory pooling to handle multiple signature tasks efficiently. Performance evaluations show that the raw key version achieves 172k signatures per second and the expanded key version reaches 201k. The balanced version improves throughput by 7% over the raw key version and reduces memory usage by 70% relative to the expanded key version. Furthermore, our raw key version outperforms the reference implementation by 36.75$\times$ and achieves a 2.94$\times$ speedup over the state-of-the-art GPU implementation.
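The ratios quoted in the abstract can be turned into rough absolute figures. The sketch below is a back-of-envelope calculation only: the constants come from the abstract, while the derived values (balanced-version throughput, reference and prior-GPU throughput) are not stated by the authors and assume that every ratio applies to the rounded 172k and 201k figures under the same measurement conditions. The function name `implied_figures` is hypothetical.

```python
# Figures quoted in the abstract (signatures per second and ratios).
RAW_KEY_SIGS_PER_S = 172_000
EXPANDED_KEY_SIGS_PER_S = 201_000
BALANCED_GAIN_OVER_RAW = 0.07      # balanced vs. raw key throughput
BALANCED_MEMORY_REDUCTION = 0.70   # balanced vs. expanded key memory
SPEEDUP_OVER_REFERENCE = 36.75     # raw key vs. reference implementation
SPEEDUP_OVER_PRIOR_GPU = 2.94      # raw key vs. prior GPU implementation


def implied_figures():
    """Derive values the abstract implies but does not state.

    Assumption: all ratios are relative to the rounded throughput
    figures above, so the results are estimates, not reported data.
    """
    return {
        "balanced_sigs_per_s": RAW_KEY_SIGS_PER_S * (1 + BALANCED_GAIN_OVER_RAW),
        "reference_sigs_per_s": RAW_KEY_SIGS_PER_S / SPEEDUP_OVER_REFERENCE,
        "prior_gpu_sigs_per_s": RAW_KEY_SIGS_PER_S / SPEEDUP_OVER_PRIOR_GPU,
        "expanded_gain_over_raw": EXPANDED_KEY_SIGS_PER_S / RAW_KEY_SIGS_PER_S - 1,
        "balanced_memory_fraction": 1 - BALANCED_MEMORY_REDUCTION,
    }


if __name__ == "__main__":
    for name, value in implied_figures().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the balanced version would land near 184k signatures per second, between the raw key and expanded key versions, while using roughly 30% of the expanded key version's memory.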

Metadata
Available format(s)
PDF
Category
Implementation
Publication info
Preprint.
Keywords
Post-Quantum Cryptography, Falcon, Fast Fourier Sampling, GPU acceleration
Contact author(s)
liwq24 @ m fudan edu cn
hywei24 @ m fudan edu cn
crypto @ sher1e dev
crypto @ d4rk dev
daiwch @ mail sysu edu cn
ylzhao @ fudan edu cn
History
2025-02-18: approved
2025-02-17: received
Short URL
https://ia.cr/2025/249
License
Creative Commons Attribution
CC BY

BibTeX

@misc{cryptoeprint:2025/249,
      author = {Wenqian Li and Hanyu Wei and Shiyu Shen and Hao Yang and Wangchen Dai and Yunlei Zhao},
      title = {{cuFalcon}: An Adaptive Parallel {GPU} Implementation for High-Performance Falcon Acceleration},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/249},
      year = {2025},
      url = {https://eprint.iacr.org/2025/249}
}