UNIT: Unifying Tensorized Instruction Compilation

Jian Weng*, Animesh Jain, Jie Wang*, Leyuan Wang, Yida Wang, Tony Nowatzki*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

21 Scopus citations

Abstract

Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed-precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have begun to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low-precision elements into one high-precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating a compiler for each instruction. This requires excessive effort when there are many tensorized instructions to support. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction, which makes the integration of new instructions easy and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3x speedup over Intel oneDNN on an x86 CPU, 1.75x speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13x speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU.
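
The idiom described in the abstract, where several low-precision elements are reduced into one high-precision element, can be illustrated with a small sketch. The code below is not UNIT's implementation and does not model any specific vendor instruction exactly: `dot4_accumulate` is a hypothetical stand-in for a 4-way dot-product instruction, and `matvec_tensorized` shows by hand the kind of loop reorganization and loop-body rewrite that a compiler would have to perform to use it. All function names and the lane count are assumptions made for illustration.

```python
def dot4_accumulate(acc, a, b):
    """Emulate a hypothetical 4-way mixed-precision dot-product instruction.

    acc holds one wide accumulator per lane; a and b hold four narrow
    (e.g. 8-bit) values per lane. Each lane adds the sum of its four
    products to its accumulator. Python integers are unbounded, so
    32-bit wraparound or saturation is NOT modelled here.
    """
    lanes = len(acc)
    assert len(a) == len(b) == 4 * lanes
    out = []
    for i in range(lanes):
        s = acc[i]
        for j in range(4):
            s += a[4 * i + j] * b[4 * i + j]
        out.append(s)
    return out


def matvec_reference(w, x):
    """Plain matrix-vector product: the tensor operation before rewriting."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def matvec_tensorized(w, x, lanes=4):
    """The same operation with its loops tiled to fit dot4_accumulate."""
    rows, k = len(w), len(x)
    # Applicability check: the loop extents must divide the instruction shape.
    assert rows % lanes == 0 and k % 4 == 0
    y = [0] * rows
    for r0 in range(0, rows, lanes):      # output rows tiled by lane count
        acc = [0] * lanes
        for k0 in range(0, k, 4):         # reduction axis tiled by 4
            a = [w[r0 + i][k0 + j] for i in range(lanes) for j in range(4)]
            # The same four x values are broadcast to every lane.
            b = [x[k0 + j] for _ in range(lanes) for j in range(4)]
            acc = dot4_accumulate(acc, a, b)
        y[r0:r0 + lanes] = acc
    return y
```

Calling `matvec_tensorized(w, x)` on a matrix whose row count is a multiple of `lanes` and whose column count is a multiple of 4 should give the same result as `matvec_reference(w, x)`. The assertion at the top of `matvec_tensorized` plays the role of the applicability check, and the two tiled loops correspond to the loop transformation step.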

Original language: English (US)
Title of host publication: CGO 2021 - Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization
Editors: Jae W. Lee, Mary Lou Soffa, Ayal Zaks
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 77-89
Number of pages: 13
ISBN (Electronic): 9781728186139
DOIs
State: Published - Feb 27 2021
Event: 19th IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021 - Virtual, Korea, Korea, Republic of
Duration: Feb 27 2021 to Mar 3 2021

Publication series

Name: CGO 2021 - Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization

Conference

Conference: 19th IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021
Country/Territory: Korea, Republic of
City: Virtual, Korea
Period: 02/27/21 to 03/3/21

Bibliographical note

Publisher Copyright:
© 2021 IEEE.

ASJC Scopus subject areas

  • Signal Processing
  • Software
  • Control and Optimization
