
Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization

Memory is a critical design consideration in current data-intensive DNN accelerators, profoundly affecting energy consumption, bandwidth requirements, and area costs. With the growing complexity of structural design, the on-chip memory capacity …

A Scalable Multi‑Chiplet Deep Learning Accelerator with Hub‑Side 2.5D Heterogeneous Integration

With the slowdown of Moore's law, the growing diversity of specialized-computing scenarios, and the rapid development of application algorithms, efficient chip design requires modularization, flexibility, and scalability. In this study, we propose a …

PHEP: Paillier Homomorphic Encryption Processors for Privacy Preserving Applications in Cloud Computing

We introduce PHEP, Paillier Homomorphic Encryption Processors for cloud-based privacy-preserving applications. PHEP is built on two Paillier acceleration chips, Paillier engine-1 and Paillier engine-2, both produced on the same wafer. Paillier …

A 28nm 68MOPS 0.18μJ/Op Paillier Homomorphic Encryption Processor with Bit-Serial Sparse Ciphertext Computing

Paillier homomorphic encryption enables secure outsourcing of ciphertext to cloud computing services at the expense of a huge computation overhead. Therefore, we introduce a 28nm 500MHz 23~68MOPS Paillier Homomorphic Encryption Processor with …
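The scheme these processors accelerate can be illustrated with a textbook Paillier sketch. This is a toy software model with tiny illustrative primes, not the hardware design described above; real deployments use primes of 1024 bits or more.

```python
# Minimal textbook Paillier sketch (toy key size, illustrative only).
import math
import random

def keygen(p=293, q=433):          # toy primes; real keys use >=1024-bit primes
    n = p * q
    lam = math.lcm(p - 1, q - 1)   # Carmichael function of n
    g = n + 1                      # common simplified choice of generator
    mu = pow(lam, -1, n)           # with g = n+1, mu = lam^{-1} mod n
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    while True:
        r = random.randrange(1, n) # random blinding factor coprime to n
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    n2 = n * n
    L = (pow(c, lam, n2) - 1) // n  # L(x) = (x - 1) / n
    return (L * mu) % n

pk, sk = keygen()
c1, c2 = encrypt(pk, 7), encrypt(pk, 35)
# Additive homomorphism: E(m1) * E(m2) mod n^2 decrypts to m1 + m2 mod n
total = decrypt(pk, sk, (c1 * c2) % (pk[0] ** 2))
```

The modular exponentiations in `encrypt` and `decrypt` dominate the cost, which is exactly the arithmetic such accelerators target.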

YOLoC: Deploy Large-Scale Neural Networks by ROM-based Computing-in-Memory using Residual Branch on a Chip

Computing-in-memory (CiM) is a promising technique to achieve high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to the limited SRAM capacity, existing SRAM-based CiM needs …

Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks

Quantized neural networks typically require smaller memory footprints and lower computation complexity, which is crucial for efficient deployment. However, quantization inevitably leads to a distribution divergence from the original network, which …
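The distribution divergence this abstract refers to can be seen in a minimal sketch of uniform symmetric quantization (the function names and bit-widths below are illustrative, not the paper's method):

```python
# Sketch: uniform symmetric "fake quantization" of weights, showing how
# reconstruction error (one measure of distribution divergence) grows as
# bit-width shrinks.
import numpy as np

def quantize(w, bits=4):
    # Map float weights to a signed integer grid of 2^bits levels, then back.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale               # dequantized ("fake-quantized") weights

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 10_000)       # stand-in for a layer's weight distribution
errs = {}
for bits in (8, 4, 2):
    errs[bits] = np.mean((w - quantize(w, bits)) ** 2)
    print(f"{bits}-bit MSE vs. original weights: {errs[bits]:.6f}")
```

Lower bit-widths give coarser grids and larger divergence, which is why sub-distribution selection matters for low-bit deployment.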

NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators

In *Proceedings of Annual International Symposium on Computer Architecture (ISCA)*.

A 400MHz NPU with 7.8 TOPS²/W High-Performance-Guaranteed Efficiency in 55nm for Multi-Mode Pruning and Diverse Quantization Using Pattern-Kernel Encoding and Reconfigurable MAC Units

Deployment of DNNs for edge devices significantly relies on pruning and quantization. For pruning, prior works only exploited unstructured or coarse-grained pruning. For quantization, UNPU studied various bit-width but not for diverse quantization …

PCNN: Pattern-based Fine-Grained Regular Pruning Towards Optimizing CNN Accelerators

Weight pruning is a powerful technique to realize model compression. We propose PCNN, a fine-grained regular 1D pruning method. A novel index format called Sparsity Pattern Mask (SPM) is presented to encode the sparsity in PCNN. Leveraging SPM with …
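The pattern-mask idea can be sketched as follows. This is a hedged toy model: the pattern set, row length, and selection rule below are assumptions for illustration, not PCNN's actual SPM encoding.

```python
# Sketch of pattern-based 1D pruning: each weight row keeps entries only at
# positions given by one of a small shared pattern set, so the per-row index
# is just a pattern ID (an SPM-style compact encoding).
import numpy as np

PATTERNS = np.array([[1, 0, 1, 0, 1],    # toy pattern set; a real method
                     [0, 1, 0, 1, 1],    # derives patterns from the weights
                     [1, 1, 0, 0, 1]], dtype=bool)

def prune_rows(weights):
    """Pick, per row, the pattern preserving the most weight magnitude."""
    pruned = np.zeros_like(weights)
    pattern_ids = np.empty(len(weights), dtype=int)
    for i, row in enumerate(weights):
        kept = np.abs(row) * PATTERNS            # (num_patterns, row_len)
        best = kept.sum(axis=1).argmax()         # max retained magnitude
        pattern_ids[i] = best                    # 2-bit index per row here
        pruned[i] = row * PATTERNS[best]
    return pruned, pattern_ids

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 5))
pw, ids = prune_rows(w)
```

Because every row's sparsity is one of a few fixed patterns, the index overhead is tiny and the nonzero layout stays regular enough for hardware to exploit.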

SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models

Remarkable achievements have been attained by deep neural networks in various applications. However, the increasing depth and width of such models also lead to explosive growth in both storage and computation, which has restricted the deployment of …