Vision and System Design Lab

Co-Design Digital Accelerators

Mixed-Bit Sparse CNN Accelerator

Regarding pruning, unstructured pruning improves the compression ratio, but the resulting irregularity hurts parallel computing performance. A sparse model not only requires decoding, but also causes weight-loading imbalance and makes input activations difficult to read. Weight compression and input activation scratching are the two most important processes. Since activations are scratched on the fly, which requires considerable hardware, the key design goal of the scratch pad (SPad) is to reduce hardware resources. Thus, an efficient data fetch method is essential for an accelerator that supports sparse CNNs. Regarding quantization, fixed-point arithmetic reduces hardware resources, such as DSPs, LUTs, and BRAM, compared with 32-bit floating point. Generally speaking, mixed-bit quantization requires less computation than a single bit width while maintaining network accuracy. However, the hardware architecture may become extremely complex when supporting mixed-bit computation, because the data flow, memory structure, and PEs must be reconfigured online for multiple bit widths. Thus, it is important to design an architecture that supports multiple bit-width computation efficiently.
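To make the decoding and activation-selection burden concrete, below is a minimal Python sketch assuming a simple bitmap (binary mask) encoding; the encoding actually used by the co-design toolchain described later may differ. The index stream produced by the decoder is exactly what makes activation reading irregular.

```python
import numpy as np

def bitmap_decode(nz_weights, mask):
    """Decode a bitmap-compressed sparse weight vector (illustrative).

    nz_weights: the surviving non-zero weight values, in order
    mask: one bit per original weight position (1 = non-zero)
    Returns the dense weight vector and, for each non-zero weight,
    the index of the input activation it must be paired with; this
    index stream is what drives the SPad selection logic.
    """
    dense = np.zeros(len(mask), dtype=np.int8)
    act_indices = np.flatnonzero(mask)   # activations actually needed
    dense[act_indices] = nz_weights
    return dense, act_indices

# Example: 8 weight positions, 3 survive pruning.
mask = np.array([0, 1, 0, 0, 1, 0, 0, 1], dtype=np.uint8)
dense, idx = bitmap_decode(np.array([3, -2, 5], dtype=np.int8), mask)
# idx == [1, 4, 7]: only these activations are fetched, and their
# irregular spacing is why sparse activation reading is hard.
```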

Unstructured pruning requires a customized hardware accelerator, and the main issues for such an accelerator are how to scratch the needed input activations and compress the weights efficiently. (b) shows a traditional architecture for weight sparsity. Each PE is independent and computes asynchronously, so it has a local SPad to fetch activations and FIFOs to pre-store the weights and selection signals. We refer to this traditional data fetch method as the single-input single-output scratch pad (SISO SPad). In this brief [1], a new fetch method, the single-input multiple-output (SIMO) SPad, is proposed to reduce hardware overhead, as shown in (a); a behavioral sketch follows below. Besides the savings from sharing the SPad, the FIFOs that pre-store weights and selection signals are removed, further reducing hardware resources. Compared with the SISO SPad method, the SIMO SPad method reduces hardware significantly, saving 13.34% of CLB LUTs and 46.24% of CLB registers while cutting power by 10%.

Mixed-bit CNNs are better than single-bit-width CNNs in terms of accuracy, model size, memory access, and performance. However, few accelerators support both mixed bit widths and sparsity, because it is challenging for an 8-bit architecture to support 4-bit sparse computation efficiently. We design a DSP-based PE that can be configured to perform either one signed INT8 or two signed INT4 convolution operations, as shown in (d). Since the lower-bit multiplication would corrupt the upper-bit multiplication when a DSP performs two signed multiplications at once, this work introduces a method of multiplying absolute values and then recovering the sign bits for the accumulate operation.
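As a rough behavioral model (not RTL) of the storage difference between the two fetch methods, the following Python sketch contrasts per-PE SPads plus FIFOs against one shared SPad with a single write port and per-PE reads; the class names and depth parameter are illustrative, not taken from the brief.

```python
# Hypothetical behavioral model of the two fetch methods; class names,
# the depth parameter, and port behavior are illustrative only.

class SisoSPads:
    """Traditional scheme: every PE owns a private SPad plus FIFOs
    that pre-store its weights and selection signals."""
    def __init__(self, num_pes, depth):
        self.pads = [[0] * depth for _ in range(num_pes)]  # one SPad per PE
        self.weight_fifos = [[] for _ in range(num_pes)]
        self.select_fifos = [[] for _ in range(num_pes)]

class SimoSPad:
    """SIMO scheme: activations are written once into a single shared
    SPad (one input port) and every PE reads the entries it needs
    (multiple output ports); the per-PE FIFOs disappear entirely."""
    def __init__(self, num_pes, depth):
        self.pad = [0] * depth     # one shared copy of the activations
        self.num_pes = num_pes

    def write(self, addr, value):  # single input port
        self.pad[addr] = value

    def read_all(self, selects):   # one selected read per PE
        return [self.pad[sel] for sel in selects]
```

Keeping one shared copy of the activations instead of one copy per PE, and dropping the per-PE FIFOs, is the source of the LUT, register, and power savings quoted above.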
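The absolute-value trick can be illustrated in a few lines of Python. This is a minimal sketch assuming two INT4 weights are packed against one INT8 activation with a 12-bit product field; the actual operand widths and field placement inside the DSP may differ from the brief.

```python
def dual_int4_mac(w_hi, w_lo, act):
    """Emulate one multiplier computing two signed INT4 x INT8
    products at once, using the absolute-value trick.

    Multiplying magnitudes keeps the low product free of sign
    extension, so it cannot corrupt the high product's bit field;
    the signs are recovered afterwards, before accumulation.
    """
    FIELD = 12                                  # |INT4| * |INT8| < 2**12 (assumed field width)
    sign_hi = -1 if (w_hi < 0) != (act < 0) else 1
    sign_lo = -1 if (w_lo < 0) != (act < 0) else 1
    packed = (abs(w_hi) << FIELD) | abs(w_lo)   # one wide operand
    product = packed * abs(act)                 # single hardware multiply
    p_lo = product & ((1 << FIELD) - 1)         # low field:  |w_lo| * |act|
    p_hi = product >> FIELD                     # high field: |w_hi| * |act|
    return sign_hi * p_hi, sign_lo * p_lo       # signed partial products

# Quick checks against direct multiplication:
assert dual_int4_mac(-7, 5, -128) == (-7 * -128, 5 * -128)
assert dual_int4_mac(-8, -8, 127) == (-8 * 127, -8 * 127)
```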

We further propose a compression-hardware co-design approach, which includes a developed toolchain and an implemented accelerator architecture. The toolchain comprises three modules: NN Compression, Compilation, and Hardware DSE. The NN Compression module performs co-design pruning and hardware (HW)-aware quantization. The Compilation module handles co-design encoding and converts the NN model into binary (Bin) files, which can be executed by the implemented accelerator as well as by the Function Simulation of the Hardware DSE module. The Hardware DSE module, which comprises Hardware Modeling and Function Simulation, evaluates PPA and NN accuracy simultaneously for given hardware hyper-parameters and NN compression schemes in the design space. The Hardware DSE can adopt search strategies, such as greedy and tabu algorithms, to identify the architecture hyper-parameters, such as the PE array dimensions, that deliver the best PPA. We propose a four-dimensional CNN accelerator architecture hierarchically organized as N core elements, W computing cores, H sparse PEs (SPEs), and M PEs, where N, W, H, and M map to the input channel number, output feature map width, output feature map height, and output channel number, respectively. The accelerator computes a W x H x M block of the output feature maps in parallel. We take a target accelerator of 512 PEs as an example and search the hyper-parameters (N, W, H, M).
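As an illustration of this hyper-parameter search, the sketch below exhaustively enumerates power-of-two (N, W, H, M) factorizations of 512 PEs and ranks them with a toy cost function; the real Hardware DSE scores candidates with its PPA model and Function Simulation, and can use greedy or tabu search instead of brute force.

```python
from itertools import product

TOTAL_PES = 512  # target PE count from the example above

def ppa_cost(n, w, h, m):
    # Placeholder cost favoring balanced dimensions, as a stand-in for
    # buffer/wiring overhead. The real DSE evaluates PPA via hardware
    # modeling and NN accuracy via function simulation of the Bin files.
    return n + w + h + m

candidates = [
    (n, w, h, m)
    for n, w, h, m in product([1, 2, 4, 8, 16, 32], repeat=4)
    if n * w * h * m == TOTAL_PES
]
best = min(candidates, key=lambda c: ppa_cost(*c))
print(f"searched {len(candidates)} configurations, best (N, W, H, M) = {best}")
```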

Reference
  1. Xianghong Hu, Xuejiao Liu, Yu Liu, Haowei Zhang, Xijie Huang, Xihao Guan, Luhong Liang, Chi Ying Tsui, Xiaoming Xiong, and Kwang-Ting Cheng, "A Tiny Accelerator for Mixed-Bit Sparse CNN Based on Efficient Fetch Method of SIMO SPad," IEEE Transactions on Circuits and Systems II: Express Briefs, 2022.