Abstract
Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed-precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have begun to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low-precision elements into one high-precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to write hardware intrinsics manually, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating a separate compiler for each instruction, which requires excessive effort when many tensorized instructions must be supported. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction, which makes the integration of new instructions easy and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves a 1.3x speedup over Intel oneDNN on an x86 CPU, a 1.75x speedup over Nvidia cuDNN on an Nvidia GPU, and a 1.13x speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU.
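The computing idiom described in the abstract, in which several low-precision elements are reduced into one high-precision element, can be illustrated with a small emulation. The following is only a sketch under stated assumptions: the lane count, the group size, and the names `dot_accumulate`, `matvec_reference`, and `matvec_tensorized` are hypothetical and are not taken from the paper or from any vendor instruction set. It shows a plain loop nest and a version whose loops are tiled so that the innermost body becomes one call to the emulated instruction.

```python
# Hypothetical emulation of a dot-product-style tensorized instruction.
# LANES and GROUP are illustrative assumptions, not values from the paper.
LANES = 4  # number of high-precision accumulator lanes (assumed)
GROUP = 4  # low-precision elements reduced into each lane (assumed)


def dot_accumulate(acc, a, b):
    """Emulated instruction: acc[i] += sum_j a[i*GROUP + j] * b[i*GROUP + j].

    Python integers do not overflow, so saturation or wrap-around behaviour
    of real hardware is deliberately not modelled here.
    """
    assert len(acc) == LANES
    assert len(a) == len(b) == LANES * GROUP
    return [
        acc[i] + sum(a[i * GROUP + j] * b[i * GROUP + j] for j in range(GROUP))
        for i in range(LANES)
    ]


def matvec_reference(weights, x):
    """Plain loop nest: out[i] = sum_k weights[i][k] * x[k]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def matvec_tensorized(weights, x):
    """Same computation, with loops tiled to match the emulated instruction."""
    rows, cols = len(weights), len(x)
    assert rows % LANES == 0 and cols % GROUP == 0
    out = []
    for i0 in range(0, rows, LANES):        # outer loop over row tiles
        acc = [0] * LANES
        for k0 in range(0, cols, GROUP):    # outer loop over reduction tiles
            # Pack one LANES x GROUP tile of weights in lane-major order.
            a = [weights[i0 + i][k0 + j]
                 for i in range(LANES) for j in range(GROUP)]
            # Broadcast the same GROUP input elements to every lane.
            b = x[k0:k0 + GROUP] * LANES
            acc = dot_accumulate(acc, a, b)  # rewritten loop body
        out.extend(acc)
    return out
```

In this sketch the tiling and the packing of operands are written by hand; the abstract states that UNIT performs the applicability detection, loop transformation, and loop-body rewriting automatically.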
Original language | English (US) |
---|---|
Title of host publication | CGO 2021 - Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization |
Editors | Jae W. Lee, Mary Lou Soffa, Ayal Zaks |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 77-89 |
Number of pages | 13 |
ISBN (Electronic) | 9781728186139 |
State | Published - Feb 27 2021 |
Event | 19th IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021 - Virtual, Republic of Korea. Duration: Feb 27 2021 → Mar 3 2021 |
Publication series
Name | CGO 2021 - Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization |
---|
Conference
Conference | 19th IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021 |
---|---|
Country/Territory | Korea, Republic of |
City | Virtual, Korea |
Period | 02/27/21 → 03/03/21 |
Bibliographical note
Publisher Copyright: © 2021 IEEE.
ASJC Scopus subject areas
- Signal Processing
- Software
- Control and Optimization