Show HN: Direct-to-silicon DLinear AI accelerator on the Sky130 open-source node

Source: hackernews · Summarized and analyzed by Genesis Park

Summary

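A developer implemented DLinear, a time-series model, directly in silicon using the open-source Sky130 PDK. After initially struggling with timing failures, the developer refactored the architecture in Chisel and applied optimizations such as pipelining and bit shifts, arriving at a chip design with more than 86,000 cells that runs at a 100MHz clock. The developer estimates that the design could reach 1.2GHz on a more advanced process, and presents it as evidence that hardware accelerators can overcome the speed limits of software in high-speed networking and control loops.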

Body

[ Build: Passing ] [ Timing: 100MHz @ 130nm ] [ License: Apache 2.0 ]

This repository contains an RTL implementation of the DLinear model optimized for extremely low-latency environments. Unlike traditional AI accelerators, our architecture eliminates the instruction layer, turning the neural network into a physical dataflow circuit.

- Latency: 3.3ns - 4.2ns (deterministic, 4 clock cycles @ 1.2GHz).
- Throughput: 1 prediction per clock cycle (fully pipelined).
- Area (Est. 7nm):

Subtraction -> Multiplier -> 6-level Adder Tree. We have implemented a 3-stage pipeline (Triple-Stage Pipeline):

- Stage 1: Latching the decomposed vector (Trend/Seasonal).
- Stage 2: Isolation of the multipliers (MAC). The signal now "rests" in registers immediately after being multiplied by the weights.
- Stage 3: Summation.

Optimization of the Adder Tree (Binary Adder Tree)
Instead of sequential addition (a+b+c+d...), which creates a linear delay, we implemented a balanced binary tree.

- Result: The delay is reduced to O(log2 N). For a window with 64 inputs, this is only 6 logic levels instead of a chain of 63 adders.
- Retiming: We allowed OpenLane to move registers automatically (Buffer Re-pipelining) inside the tree to balance the load between levels.

Zero-Delay Math (Bit-Shift Trick)
In classical processors, computing average = sum/64 requires a division unit, which introduces a large delay.

- We fixed the window size at 2^6.
- In silicon, this made it possible to replace the division with a static bit shift (wire shift). This is literally a "soldering of wires" with zero power consumption and 0.00ns delay.

Fight against Hold Violations
After fixing the setup slack, we had hold violations (-1.95ns): the signal was arriving too fast.

- We applied DPL (Detailed Placement) with increased CELL_PADDING. This spread the cells out on the chip, giving the OpenROAD tool room to automatically insert delay buffers that slowed the too-fast signals down to a safe level.
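The adder-tree and bit-shift points above can be checked with a small software model. The following is a minimal Python sketch, not the project's RTL: the function names, the integer data type, the use of a single window mean as the trend, and the per-sample weights are all assumptions made for illustration. It follows the Subtraction -> Multiplier -> Adder Tree order named in the README.

```python
from typing import List, Tuple

WINDOW = 64   # fixed window size, 2**6, as stated in the README
SHIFT = 6     # log2(WINDOW): dividing by 64 becomes a right shift by 6


def adder_tree(values: List[int]) -> Tuple[int, int]:
    """Sum values pairwise, level by level; return (sum, number of levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        if len(level) % 2:            # pad an odd level with a zero input
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


def decompose(window: List[int]) -> Tuple[int, List[int]]:
    """Assumed decomposition: trend = window mean (via shift), seasonal = x - trend."""
    total, _ = adder_tree(window)
    trend = total >> SHIFT            # same result as total // 64 for Python ints
    seasonal = [x - trend for x in window]
    return trend, seasonal


def dataflow(window: List[int], weights: List[int]) -> int:
    """Hypothetical datapath: subtract, multiply by weights, then tree-sum."""
    _, seasonal = decompose(window)
    products = [w * s for w, s in zip(weights, seasonal)]
    total, _ = adder_tree(products)
    return total


if __name__ == "__main__":
    samples = list(range(WINDOW))
    total, depth = adder_tree(samples)
    print("sum:", total, "levels:", depth)
    print("trend:", decompose(samples)[0])
    print("output:", dataflow(samples, [1] * WINDOW))
```

For a 64-sample window the tree reports 6 levels, matching the log2(64) figure in the text. In hardware each level would be a row of parallel adders, so the depth, not the number of additions, sets the combinational delay.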
Final Metric (PPA)
Thanks to these changes, the 130nm design is now STA clean (setup and hold both pass). On that basis we expect that after migrating to 7nm (FinFET), where the gate delay is 15-20 times lower, our architecture will clear the 1.5GHz barrier.

- Hardware: Chisel 6.0 (Scala), SystemVerilog.
- Simulation: Verilator (C++ models), cocotb (Python verification).
- Physical Flow: OpenLane / OpenROAD (RTL-to-GDSII).
- Analysis: Surfer / Scansion (cycle-accurate waveform auditing).
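The frequency and latency figures quoted in the README can be reproduced with first-order arithmetic. The Python sketch below assumes that clock frequency scales linearly with the claimed gate-delay ratio, which ignores wire delay and other effects that do not scale the same way, so it is an upper-bound style estimate rather than a timing result. The constant and function names are illustrative.

```python
BASE_CLOCK_HZ = 100e6        # STA-clean clock at 130nm, from the README
GATE_SPEEDUP = (15, 20)      # claimed gate-delay ratio, 130nm -> 7nm
PIPELINE_CYCLES = 4          # "4 clock cycles" from the latency bullet


def scaled_clock_hz(base_hz: float, speedup: float) -> float:
    """Naive estimate: clock scales linearly with the gate-delay ratio."""
    return base_hz * speedup


def latency_ns(cycles: int, clock_hz: float) -> float:
    """Pipeline latency in nanoseconds for a given clock."""
    return cycles / clock_hz * 1e9


if __name__ == "__main__":
    low, high = (scaled_clock_hz(BASE_CLOCK_HZ, s) for s in GATE_SPEEDUP)
    print(f"estimated 7nm clock: {low / 1e9:.1f} - {high / 1e9:.1f} GHz")
    print(f"latency at 1.2GHz: {latency_ns(PIPELINE_CYCLES, 1.2e9):.2f} ns")
    print(f"latency at 100MHz: {latency_ns(PIPELINE_CYCLES, BASE_CLOCK_HZ):.0f} ns")
```

Under this assumption, a 15x ratio gives the 1.5GHz figure, and 4 cycles at 1.2GHz give roughly the 3.3ns lower end of the quoted latency range. At the 100MHz clock actually closed on Sky130, the same 4 cycles take 40ns.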

This analysis was written by the Genesis Park editorial team with the help of AI. The original post is available via the source link.
