<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI chip | Zhanhong Tan's Homepage</title><link>https://www.zhanhongtan.com/tag/ai-chip/</link><atom:link href="https://www.zhanhongtan.com/tag/ai-chip/index.xml" rel="self" type="application/rss+xml"/><description>AI chip</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© 2023 · Powered by Zhanhong Tan @ Archip Lab, Tsinghua University</copyright><lastBuildDate>Thu, 10 Sep 2020 00:00:00 +0000</lastBuildDate><image><url>https://www.zhanhongtan.com/media/icon_hub6b6a6edbd02033d3022f322418b318b_1218349_512x512_fill_lanczos_center_2.png</url><title>AI chip</title><link>https://www.zhanhongtan.com/tag/ai-chip/</link></image><item><title>QIMING 920 AI Chip (MUSEv2 Architecture)</title><link>https://www.zhanhongtan.com/project/musev2/</link><pubDate>Thu, 10 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.zhanhongtan.com/project/musev2/</guid><description>&lt;p>To further improve the dataflow and flexibility of QIMING-910, I proposed the QIMING-920 architecture in the summer of 2019. During the fabrication of QIMING-910, I found that its instruction design was not user-friendly, which made it hard to deploy larger models. On top of that, the bit-width of the partial-sum register was insufficient, leading to a significant accuracy drop on the ImageNet dataset. A next-generation architecture was therefore needed for QIMING-920.&lt;/p>
&lt;p>First, I improved the pattern-based pipeline to overlap data loading with computation as much as possible. For the ALU design, I generalized the computation flow to cover three quantization modes and widened the partial-sum register to avoid overflow. I also modified the activation buffer to support kernel-wise pruning.&lt;/p>
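The partial-sum overflow problem that motivated the wider register can be illustrated with a small sketch. This is plain Python and purely illustrative; the register widths and product values below are my own assumptions, not those of the actual chip:

```python
# Illustrative sketch (not the MUSE hardware) of why partial-sum register
# bit-width matters: accumulating many products into a register that is
# too narrow wraps around in two's complement, corrupting the result.

def accumulate(products, bits):
    """Accumulate into a signed two's-complement register of `bits` bits."""
    lo, span = -(1 << (bits - 1)), 1 << bits
    acc = 0
    for p in products:
        acc = (acc + p - lo) % span + lo  # wrap like a fixed-width register
    return acc

# 512 products of value 100 -> true sum is 51200
products = [100] * 512
narrow = accumulate(products, bits=16)  # 2^15 - 1 = 32767 < 51200 -> overflow
wide = accumulate(products, bits=24)    # wide enough: result is exact
print(narrow, wide)
```

With a 16-bit register the sum wraps into a negative value, while the 24-bit register returns the exact 51200, which is why widening the accumulator removes the accuracy drop.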
&lt;p>Based on MUSEv2, QIMING-920 is a true &amp;ldquo;MUlti-grained Sparsity Engine&amp;rdquo; that supports channel-wise, kernel-wise, and pattern-wise pruning, as well as linear, power-of-two, and mixed-power-of-two quantization. This work was presented at &lt;a href="https://doi.org/10.1109/CICC51472.2021.9431519" rel="noopener noreferrer" target="_blank">CICC 2021&lt;/a>.&lt;/p></description></item><item><title>QIMING 910 AI Chip (MUSEv1 Architecture)</title><link>https://www.zhanhongtan.com/project/musev1/</link><pubDate>Tue, 24 Dec 2019 00:00:00 +0000</pubDate><guid>https://www.zhanhongtan.com/project/musev1/</guid><description>&lt;p>QIMING-910 was my first solo tape-out project, completed at Archip Lab during my senior year. I was in charge of all of the development, from the architecture design to the final RTL netlist, which was then handed off to UMC for backend design and fabrication in the summer of 2019. Finally, with the help of some engineers on our team, we successfully verified the chip in December 2019! This experience gave me insight into IC development, and part of this work was presented at &lt;a href="https://doi.org/10.1109/DAC18072.2020.9218498" rel="noopener noreferrer" target="_blank">DAC 2020&lt;/a>.&lt;/p>
&lt;p>The name of the architecture, MUSE, stands for MUlti-grained Sparsity Engine. The architecture leverages ternary quantization (-1, 0, 1) for weights, which lets us implement multiplication with simple MUX-based ALUs instead of full multiplier logic. Moreover, ternary quantization reduces memory overhead by 16$\times$ compared to a single-precision floating-point baseline. It also dramatically increases the percentage of zero weights, which gives us the opportunity to skip redundant multiplications when the weight is zero. We therefore propose a zero-aware processing element to exploit this sparsity.&lt;/p>
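To illustrate the idea, here is a minimal Python sketch of ternary quantization, MUX-based multiplication, and zero-skipping. It is not the actual MUSE RTL; the function names and the magnitude threshold are my own assumptions:

```python
# Hypothetical sketch, not the MUSE hardware: with weights in {-1, 0, +1},
# each "multiplication" reduces to a select (MUX), and zero weights can be
# skipped entirely by a zero-aware processing element.

def ternarize(weights, threshold):
    """Map real-valued weights to {-1, 0, +1} using a magnitude threshold."""
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]

def mux_mac(activations, ternary_weights):
    """Accumulate products with a MUX instead of a multiplier.

    For w in {-1, 0, +1}, the product w*a is just a select:
    +1 -> pass a through, -1 -> negate a, 0 -> skip (zero-aware PE).
    """
    acc = 0
    for a, w in zip(activations, ternary_weights):
        if w == 0:
            continue                    # zero-aware skip: no work done
        acc += a if w == 1 else -a      # MUX: select a or -a
    return acc

weights = [0.8, -0.05, -0.9, 0.02, 0.6]
tw = ternarize(weights, threshold=0.1)  # -> [1, 0, -1, 0, 1]
acts = [3, 7, 2, 5, 4]
print(mux_mac(acts, tw))                # 3 - 2 + 4 = 5
```

Each ternary weight needs only 2 bits versus 32 bits for single-precision floating point, which is where the 16$\times$ memory saving comes from.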
&lt;p>Fabricated in the UMC 55nm SP CMOS process, QIMING-910 achieves a peak performance of 204.8 GOPS and a power efficiency of 3.16 TOPS/W. Thanks to the ternary quantization, multiplier-free ALU design, and zero-aware processing, its power efficiency is 2$\times$~10$\times$ that of state-of-the-art NPUs. We designed a testing board connected to a Xilinx Virtex-7 VC707 FPGA via FMC; the VC707 serves as a controller rather than a processor, transferring data from the PC to our chip. The overall system accomplishes image classification with VGG-16 on the CIFAR-10 dataset.&lt;/p></description></item></channel></rss>