Get Ready for NTT

6 min readDec 15, 2024

Within lattice cryptography, such as in areas of post-quantum cryptography (PQC), Zero Knowledge Proofs (ZKPs) and homomorphic encryption, we have the core concept of a Number-Theoretic Transform (NTT). Overall, in these fields, NTTs can account for over 90% of the FHE (Fully Homomorphic Encryption) computation time [1] and for ZKPs for over 80% of the processing time. With NTT, we have inputs of x_0, x_1 … x_{n-1} and outputs of y_0, y_1 … y_{n-1}, and where the values range from 0 to m-1. alpha_n is the n-th primitive root of unity:

Jung [2] outlines the parameters of HEAAN HE for the Arithmetic of Approximate Numbers method, and also known as CKKS [3]:

L is the level and is the number of multiplications that can be applied before we lose data.
p is the rescaling factor.
Q is the maximum ciphertext modulus.
N is a power-of-two integer.

Overall, we have a ciphertext which has a modulus of Q and which is two polynomials of c.ax and c.bx. These are in a ring of x^N+1. We can then perform a multiplication and then a rescaling process. For ciphertexts of c1 and c2, it is relatively simple to add the coefficients of the polynomials. With multiplication, we compute a tensor product (Figure 1):

c3.ax = c1.ax · c2.ax, c3.cx = c1.bx · c2.bx,
c3.bx = c1.ax · c2.bx + c1.bx · c2.ax

Figure 1 : The overall flow of HE Multiplication in HEAAN [2]

AVX-512

In 2013, Intel proposed 512-bit extensions for the Advanced Vector Extensions SIMD instruction set of the x86 architecture. It was released with the Intel Xeon Phi x200 (Knights Landing) processor and is now supported on a range of devices. AVX uses sixteen registers to perform a single instruction on multiple elements of data, including 32-bit integers or four 64-bit floating-point values. Overall, AVX-512 is capable of dealing with 256-bit vector sizes.

Fu et al. [4] used AVX-512 to implement FHE and with a focus on the Number Theoretic Transform (NTT). For this, the research team implemented modular addition, subtraction, and multiplication for parallel operations. This involves a scalar algorithm with 64-bit integers and then translating each arithmetic and bitwise operation to an AVX-512 equivalent. Figure 2 and Figure 3 outline the subtraction method with AVX-512. Test results show that the vectorized code ran around 35 times faster than code on a GMP. When compared with the SPIRAL NTTX package [5], the code ran 2.2 times faster (Figure 4).

Figure 2: Modular subtraction C code using AVX-512 [4]

Figure 3: Overview of modular subtraction using AVX-512 [4]

Figure 4: Runtime per butterfly for various NTT sizes [4]

Jung et al. [6] used AVX-512 (AVX-MT-24) to compare performance with a generalised GPU and found an approximately 40-fold reduction in processing times (Figure 5).

Figure 5: Comparing the execution time of HE Multiplication [6]

GPU processing

To enhance the processing of homomorphic encryption, we can use GPUs. Ozcan et al. [7] performed an analysis using two GPUs: RTX306Ti (4,864 cores) and GTX 1080 (2,560 cores) — Figure 6. Both have 8GB of RAM, and are compared with the Ryzen7 3800X CPU. The evaluation involved the assessment of add, multiply, re-linearisation and rotation for a range of n and log_2 (q) values (Figure 7). We can see there is a speed increase of around 10 to 50 times faster for the operations. In terms of power consumption, the GPU actually used less power for BFV multiplications (Figure 8).

The libraries used implement Number Theoretic Transform (NTT) and inverse NTT and which optimize GPU kernel function calls. The integrations with the Microsoft SEAL library. For a ring dimension $2^{14}$ and a modulus bit size of 438, it provides a 63.4 times speed-up comparing GPU implementations over a CPU. The assessment of using NTT operations is given in Figure 9.

Figure 7: Comparison results of SEAL BFV scheme operations and literature with our GPU implementations of BFV scheme operations

Figure 8: Power consumption of BFV multiplication on CPU and GPU

Figure 9: Timings of GPU implementation of NTT and inverse NTT operations

FHE-specific processors

While GPUs and SIMD extensions can speed up the process, some researchers outline methods that build architectures which will optimize the processing of homomorphic encryption. Figures 10 and 11 show the F1 architecture from Samardzic et al. Figure 11 and Figure 12 shows the operation of vector multiplication. The team have since advanced their architecture on with Craterlake [9] (Figure 13).

In terms of chip design Soni et al. [10] outline the RPU architecture (Figure 14) and which is optimized for Ring Learning With Errors solutions. Overall, they found that the RPU could fit on a 20.5mm² chip and provided a 1,485 times speed-up over a CPU running a 64k, 128-bit NTT workload.

Conclusions

Whether it’s machine learning, post-quantum cryptography or homomorphic encryption, we are building a new world with Learning With Errors. NTT is a core part of this process.

References

[1] Fan, S., Wang, Z., Xu, W., Hou, R., Meng, D., & Zhang, M. (2023, February). Tensorfhe: Achieving practical computation on encrypted data using gpgpu. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (pp. 922–934). IEEE.

[2] Jung, W., Lee, E., Kim, S., Kim, N., Lee, K., Min, C., … & Ahn, J. H. (2021, March). Accelerating Fully Homomorphic Encryption through Microarchitecture-Aware Analysis and Optimization. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 237–239). IEEE.

[3] Cheon, J. H., Kim, A., Kim, M., & Song, Y. (2017). Homomorphic encryption for arithmetic of approximate numbers. In Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, December 3–7, 2017, Proceedings, Part I 23 (pp. 409–437). Springer International Publishing.

[4] Fu, S., Zhang, N., & Franchetti, F. Accelerating High-Precision Number Theoretic Transforms using Intel AVX-512.

[5] Zhang, N., & Franchetti, F. (2023). Generating number theoretic transforms for multi-word integer data types. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[6] Jung, W., Lee, E., Kim, S., Kim, N., Lee, K., Min, C., … & Ahn, J. H. (2021, March). Accelerating Fully Homomorphic Encryption through Microarchitecture-Aware Analysis and Optimization. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 237–239). IEEE.

[7] Özcan, A. Ş., Ayduman, C., Türkoğlu, E. R., & Savaş, E. (2023). Homomorphic encryption on GPU. IEEE Access, 11, 84168–84186.

[8] Samardzic, N., Feldmann, A., Krastev, A., Devadas, S., Dreslinski, R., Peikert, C., & Sanchez, D. (2021, October). F1: A fast and programmable accelerator for fully homomorphic encryption. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 238–252).

[9] Samardzic, N., Feldmann, A., Krastev, A., Manohar, N., Genise, N., Devadas, S., … & Sanchez, D. (2022, June). Craterlake: a hardware accelerator for efficient unbounded computation on encrypted data. In Proceedings of the 49th Annual International Symposium on Computer Architecture (pp. 173–187).

[10] Soni, D., Neda, N., Zhang, N., Reynwar, B., Gamil, H., Heyman, B., … & Reagen, B. (2023, April). Rpu: The ring processing unit. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 272–282). IEEE.

[11] Zhao, H., Ding, D., Wang, F., Hua, P., Wang, N., Wu, Q., & Chai, Z. (2022). Hardware acceleration of number theoretic transform for zk‐SNARK. Engineering Reports, e12639.