Show HN: Open-source megakernel matching M5 Max tok/W at 2x the throughput on an RTX 3090
hackernews
📦 Open Source
#cuda
#deltanet
#llama
#llm
#megakernel
#rtx 3090
#hardware/semiconductors
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
The first "megakernel" has been developed that runs the entire forward pass of Qwen 3.5-0.8B, a model with a hybrid DeltaNet-plus-attention architecture, in a single CUDA dispatch. Existing frameworks launch roughly 100 kernels per generated token, wasting power on CPU round-trips and redundant memory traffic; this work instead fuses every operation into one kernel and eliminates that overhead. The result: an RTX 3090, a GPU released in 2020 and power-limited to 220 W, processes 411 tokens per second at 1.87 tok/J, surpassing the energy efficiency of Apple's latest M5 Max chip (1.76 tok/J) while delivering 1.8x the throughput. This shows the efficiency gap was never a hardware limitation but a lack of software optimization, and that with optimized kernels an older GPU can compete with the newest chips.
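The headline efficiency figures above are plain arithmetic (1 W = 1 J/s, so tok/J is just tok/s divided by watts); a quick check using the throughput and power numbers quoted in the post:

```python
# tok/J = tok/s ÷ W, checking the headline figures from the post.
def tok_per_joule(tok_per_s, watts):
    return tok_per_s / watts  # watts are J/s, so the units cancel to tok/J

rtx3090_capped = tok_per_joule(411, 220)   # megakernel, power-limited to 220 W
m5_max         = tok_per_joule(229, 130)   # M5 Max at the ~130 W figure quoted
throughput_ratio = 411 / 229               # RTX 3090 vs M5 Max decode speed

print(round(rtx3090_capped, 2))    # 1.87 tok/J
print(round(m5_max, 2))            # 1.76 tok/J
print(round(throughput_ratio, 1))  # 1.8x
```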
Full text
The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers of Qwen 3.5-0.8B in a single CUDA dispatch. 1.87 tok/J on a 2020 GPU, matching Apple's latest silicon at 2x the throughput.

Blog post · Benchmarks · Discord · lucebox.com

| | Prefill | Decode | tok/J |
|---|---|---|---|
| Megakernel (RTX 3090) | 37,800 | 413 | 1.87 @220W |
| llama.cpp (RTX 3090) | 11,247 | 267 | 0.76 |
| Apple M5 Max | - | 229 | 1.76 |

The efficiency gap between NVIDIA and Apple isn't inherent to the silicon. It's an artifact of running generic software on capable hardware.

Conventional wisdom says NVIDIA GPUs are fast but power hungry, and Apple Silicon is slower but efficient. On paper, that checks out: llama.cpp on an RTX 3090 gets 267 tok/s at 350W (0.76 tok/J), while an M5 Max gets 229 tok/s at ~130W (1.76 tok/J). NVIDIA is faster, but 2.3x worse on efficiency.

Qwen 3.5-0.8B uses a hybrid DeltaNet + Attention architecture (linear attention interleaved with standard attention). No fused kernel existed for this pattern. This is the first. Inspired by Hazy Research's megakernel work on Llama-1B, we asked: can the same idea work for hybrid DeltaNet/Attention models on consumer GPUs?

We thought the problem was never the hardware. The RTX 3090 has 936 GB/s memory bandwidth and 142 TFLOPS FP16 compute. Extracting only 267 tok/s from that is a software problem.

The culprit: ~100 kernel launches per token. Each layer boundary returns control to the CPU, dispatches the next kernel, re-fetches weights from global memory, and synchronizes threads. For 24 layers, those microseconds add up, and each one burns power doing nothing useful. So we fused everything into one kernel.

| Method | Prefill pp520 (tok/s) | Decode tg128 (tok/s) |
|---|---|---|
| Megakernel | 37,800 | 413 |
| llama.cpp BF16 | 11,247 | 267 |
| PyTorch HuggingFace | 7,578 | 108 |

3.4x faster prefill and 1.55x faster decode than llama.cpp, and 3.8x faster decode than PyTorch. Same hardware, same model, same weights.
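The quoted speedups follow directly from the table's raw throughput numbers:

```python
# Deriving the quoted speedups from the benchmark table's raw numbers.
prefill = {"megakernel": 37800, "llama.cpp": 11247, "pytorch": 7578}
decode  = {"megakernel": 413,   "llama.cpp": 267,   "pytorch": 108}

print(round(prefill["megakernel"] / prefill["llama.cpp"], 1))  # 3.4  (prefill vs llama.cpp)
print(round(decode["megakernel"] / decode["llama.cpp"], 2))    # 1.55 (decode vs llama.cpp)
print(round(decode["megakernel"] / decode["pytorch"], 1))      # 3.8  (decode vs PyTorch)
```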
| Power Limit | Clock | Draw | tok/s | tok/J | vs Stock |
|---|---|---|---|---|---|
| 420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline |
| 300W | 1935 MHz | 299W | 432 | 1.44 | 99.8% speed, 5% less power |
| 220W | 1635 MHz | 220W | 411 | 1.87 | 95% speed, 30% less power |
| 150W | 405 MHz | 150W | 194 | 1.29 | too aggressive |

Sweet spot at 220W: 95% of the speed, 30% less power. The curve is nonlinear: tight execution converts directly into saved watts, until you starve the GPU too aggressively.

| Metric | RTX 3090 (llama.cpp) | M5 Max | RTX 3090 (Megakernel @220W) |
|---|---|---|---|
| tok/s | 267 | 229 | 411 |
| Power | 350W | ~130W | 220W |
| tok/J | 0.76 | 1.76 | 1.87 |
| GPU price | ~$700 | $2,499+ (system) | ~$700 |

A $700 GPU from 2020, power-limited to 220W, matches Apple's latest chip on efficiency while delivering 1.8x the throughput.

A single persistent CUDA kernel processes the entire Qwen 3.5-0.8B forward pass in one dispatch. No CPU round-trips between layers.

Architecture: Qwen 3.5-0.8B is a hybrid model, with 18 DeltaNet layers (linear attention with learned recurrence) and 6 full attention layers, in a 3:1 ratio. DeltaNet scales linearly with context length vs. quadratically for standard attention. It's an emerging pattern in next-gen models (Qwen3-Next, Kimi Linear), but no framework had optimized kernels for it.

Kernel specs:
- 82 blocks, 512 threads, all SMs on the RTX 3090 kept occupied
- BF16 weights and activations, FP32 accumulation where it matters
- DeltaNet recurrence via warp-cooperative state updates in FP32 registers
- Full attention with online softmax (fused QKV, RoPE, causal mask, output projection)
- Cooperative grid sync between layers instead of kernel launches (zero inter-layer overhead)
- KV cache updates in-kernel
- Weights loaded directly from HuggingFace

What traditional frameworks do: launch ~100 separate kernels per token, each one paying the cost of CPU dispatch, weight re-fetch, and thread synchronization.
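To get a rough sense of what those ~100 dispatches cost, here is a back-of-envelope model. The per-launch overhead figure below is an assumed, illustrative value (a few microseconds is typical for CPU dispatch plus synchronization), not a measurement from the post:

```python
# Back-of-envelope model of per-token dispatch overhead in a launch-per-op
# framework. LAUNCH_OVERHEAD_S is an ASSUMED illustrative figure, not a
# measurement from the post; KERNELS_PER_TOKEN is the post's quoted count.
LAUNCH_OVERHEAD_S = 5e-6   # assumed cost of one CPU dispatch + sync
KERNELS_PER_TOKEN = 100    # ~100 kernel launches per token (from the post)

def overhead_fraction(tok_per_s, kernels_per_token=KERNELS_PER_TOKEN):
    token_budget = 1.0 / tok_per_s                      # wall time per token
    overhead = kernels_per_token * LAUNCH_OVERHEAD_S    # time lost to dispatch
    return overhead / token_budget

# At llama.cpp's 267 tok/s, ~100 launches/token at 5 us each would eat
# roughly 13% of every token's wall time doing nothing useful:
print(f"{overhead_fraction(267):.0%}")
```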
The megakernel eliminates all of that.

Standard transformers have years of kernel optimization behind them: FlashAttention, PagedAttention, continuous batching. Hybrid DeltaNet/Attention architectures are newer, and the kernel ecosystem is immature:
- MLX: no native DeltaNet kernels
- llama.cpp: generic DeltaNet support, no fusion
- vLLM/SGLang: Triton kernels via flash-linear-attention, but no megakernel fusion

As more models go hybrid (and they will, because linear attention scales better), what you run them on matters less than how you run them. When you write a kernel that actually uses what the GPU offers (tensor cores, shared memory, cooperative grid launches, register-resident state), a five-year-old GPU matches Apple's latest chip.

grid.sync() inside loops will deadlock silently. We tried synchronizing all blocks within the per-token DeltaNet recurrence loop. No error message, just a hang. The fix: synchronize between layers, not within them.

Register pressure kills performance quietly. We attempted S_TILE=16 for more instruction-level parallelism. Silent crash, no CUDA error: registers spilled to local memory.
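For intuition about the state the kernel keeps register-resident, here is a minimal pure-Python sketch of the delta-rule recurrence that DeltaNet-style linear attention uses, in the error-correcting form S ← S + β(v − Sk)kᵀ, o = Sq. Dimensions and inputs are toy values, and this is a reference loop, not the kernel's actual warp-cooperative implementation:

```python
# Reference (unfused) sketch of a delta-rule linear-attention recurrence:
#   S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
#   o_t = S_t q_t
# Toy dimensions; the megakernel keeps S in FP32 registers instead.

def matvec(S, x):
    return [sum(S[i][j] * x[j] for j in range(len(x))) for i in range(len(S))]

def delta_step(S, k, v, beta):
    pred = matvec(S, k)                                  # what S predicts for k
    err = [vi - pi for vi, pi in zip(v, pred)]           # v - S k
    return [[S[i][j] + beta * err[i] * k[j]              # rank-1 correction
             for j in range(len(k))] for i in range(len(S))]

def deltanet_decode(tokens, d):
    """tokens: list of (q, k, v, beta); returns one output vector per token."""
    S = [[0.0] * d for _ in range(d)]   # recurrent state, carried across tokens
    outs = []
    for q, k, v, beta in tokens:
        S = delta_step(S, k, v, beta)
        outs.append(matvec(S, q))
    return outs
```

Because the update only corrects the prediction error, writing the same key/value pair twice with beta=1 is a no-op the second time, which is the "learned recurrence" behavior the post alludes to; the per-token cost is O(d²) regardless of context length, which is where the linear scaling comes from.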
This analysis was written by the Genesis Park editorial team with AI assistance. The original article is available via the source link.