TY - JOUR
T1 - IMCA: An Efficient In-Memory Convolution Accelerator
AU - Yantir, Hasan Erdem
AU - Eltawil, Ahmed
AU - Salama, Khaled N.
N1 - KAUST Repository Item: Exported on 2021-01-21
PY - 2021
Y1 - 2021
AB - Traditional convolutional neural network (CNN) architectures suffer from two bottlenecks: computational complexity and memory access cost. In this study, an efficient in-memory convolution accelerator (IMCA) is proposed based on associative in-memory processing to alleviate these two problems directly. In the IMCA, the convolution operations are directly performed inside the memory as in-place operations. The proposed memory computational structure allows for a significant improvement in computational metrics, namely, TOPS/W. Furthermore, due to its unconventional computation style, the IMCA can take advantage of many potential opportunities, such as constant multiplication, bit-level sparsity, and dynamic approximate computing, which, while supported by traditional architectures, require extra overhead to exploit, thus reducing any potential gains. The proposed accelerator architecture exhibits a significant efficiency in terms of area and performance, achieving around 0.65 GOPS and 1.64 TOPS/W at 16-bit fixed-point precision with an area less than 0.25 mm².
UR - http://hdl.handle.net/10754/666958
UR - https://ieeexplore.ieee.org/document/9324735/
U2 - 10.1109/TVLSI.2020.3047641
DO - 10.1109/TVLSI.2020.3047641
M3 - Article
SN - 1557-9999
SP - 1
EP - 14
JO - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
JF - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
ER -