Netinfo Security (信息网络安全) ›› 2020, Vol. 20 ›› Issue (6): 36-43. doi: 10.3969/j.issn.1671-1122.2020.06.005

• Technical Research •

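Parallel Implementation of SM4 Algorithm on GPU

LI Xiuying (李秀滢)1,2, JI Chenhao (吉晨昊)1,2, DUAN Xiaoyi (段晓毅)1, ZHOU Changchun (周长春)1,2

  1. Department of Electronic and Communication Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
  2. State Key Laboratory of Cryptology, Beijing 100878, China
  • Received: 2020-04-11 Online: 2020-06-10 Published: 2020-10-21
  • Contact: DUAN Xiaoyi E-mail: duanxiaoyi@besti.edu.cn
  • About the authors: LI Xiuying (born 1975), female, from Hebei, associate professor, master's degree; research interests: cryptography and information security, parallel computing, and on-device artificial intelligence | JI Chenhao (born 1996), male, from Hebei, master's student; research interests: cryptography and information security, and parallel computing | DUAN Xiaoyi (born 1979), male, from Guizhou, lecturer, Ph.D.; research interests: cryptography and information security, and electronic engineering | ZHOU Changchun (born 1963), male, from Jilin, professor-level senior engineer, bachelor's degree; research interests: cryptography and communication engineering
  • Supported by:
    National Key Research and Development Program of China (2017YFB0801803); Fundamental Research Funds for the Central Universities (328201914); Open Project of the State Key Laboratory of Cryptology (MMKFKT201804)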

Abstract:

The speed of a cryptographic algorithm is proportional to the available computing power, and researchers have sought higher speeds by raising CPU performance or by using hardware encryption cards. With the wide adoption of graphics processing units (GPUs) in high-performance parallel computing, researchers in China and abroad have studied GPU-accelerated cryptographic operations. Most of this work targets internationally published algorithms such as DES and AES, and studies of SM4, the Chinese commercial cryptographic algorithm, remain scarce. Building on an in-depth study of the GPU parallel computing mechanism, this paper examines how plaintext data block size, GPU memory type, and thread block size affect the speedup ratio of SM4 encryption. Taking the characteristics of both the CPU and the GPU into account, it proposes an optimal encryption and decryption scheme for parallel SM4 on the GPU. The results show that the speedup ratio (Ep) is less than 1 when the plaintext data block is smaller than 8 KB, begins to rise sharply at 64 KB, and reaches its maximum at 256 KB. Using constant memory to hold intermediate data improves the encryption speed, a gain that matters for workloads with large data volumes and high-speed requirements. The optimal thread block size is 128 to 512 threads and must be a multiple of 32. In the experimental environment, the optimal GPU encryption scheme implemented in the paper runs at 26 times the speed of an ordinary CPU encryption scheme.

Key words: GPU, parallel computing, CUDA, SM4
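
The thread-block constraint and the speedup ratio mentioned in the abstract can be made concrete with some host-side arithmetic. The sketch below is illustrative only and is not the paper's implementation. It assumes one GPU thread per 16-byte SM4 block, independent blocks (as in ECB mode), and a speedup ratio defined as CPU time divided by GPU time; the function names `launch_config` and `speedup_ratio` are hypothetical.

```python
SM4_BLOCK_BYTES = 16   # SM4 operates on 128-bit blocks
WARP_SIZE = 32         # CUDA schedules threads in warps of 32


def launch_config(plaintext_bytes, threads_per_block=256):
    """Return (sm4_blocks, grid_size), assuming one thread per SM4 block."""
    if threads_per_block <= 0 or threads_per_block % WARP_SIZE != 0:
        raise ValueError("threads_per_block must be a positive multiple of 32")
    if plaintext_bytes % SM4_BLOCK_BYTES != 0:
        raise ValueError("plaintext must be padded to a multiple of 16 bytes")
    sm4_blocks = plaintext_bytes // SM4_BLOCK_BYTES
    # Ceiling division so a partially filled last thread block is still launched.
    grid_size = (sm4_blocks + threads_per_block - 1) // threads_per_block
    return sm4_blocks, grid_size


def speedup_ratio(cpu_seconds, gpu_seconds):
    """Assumed definition: CPU time over GPU time; below 1 means the GPU is slower."""
    return cpu_seconds / gpu_seconds


if __name__ == "__main__":
    for size_kb in (8, 64, 256):
        blocks, grid = launch_config(size_kb * 1024)
        print(f"{size_kb} KB -> {blocks} SM4 blocks, {grid} thread blocks of 256")
```

Under these assumptions, an 8 KB input with 256 threads per block yields only two thread blocks. Such a small amount of parallel work may not offset kernel-launch and data-transfer overhead, which is one plausible reading of the sub-1 speedup reported for small inputs, not a claim made in the abstract.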

CLC Number: