Show HN: Mamba SSM in Rust – training and inference with custom CUDA kernels

hackernews | 📦 Open source
#cuda #gpu-acceleration #mamba-ssm #rust #semiconductors #hardware/semiconductors
Source: hackernews · Summarized and analyzed by Genesis Park

Summary

This Mamba SSM library, implemented in Rust, supports GPU acceleration with no framework dependency and provides a full backward pass with backpropagation through time (BPTT) through the recurrent state, using hand-derived gradients. On the GH200's Grace (ARM) CPU, single-step inference latency ranges from 21 μs on the small config to about 1.2 ms on the large one, and custom CUDA kernels supply the fused operations and efficient memory management needed for training. The implementation features TF32 precision and CUDA Graph capture compatibility, and unlike existing Python implementations it adopts a standalone runtime and a hand-rolled training path.

Body

Mamba SSM (Selective State Space Model) implementation in Rust with optional CUDA GPU acceleration. Supports both inference and training, including a full backward pass with BPTT through the recurrent SSM state, with custom CUDA kernels for GPU-accelerated forward and backward passes. Reference: Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023).

- Inference — zero-allocation single-step recurrent forward pass
- GPU Inference — CUDA kernels with optional CUDA Graph capture
- Training — full backward pass with BPTT through SSM hidden state
- Burn-in — warm up recurrent state from history before the training window
- CUDA — custom kernels for SSM recurrence, conv1d, fused activations
- Modular — 3-level API: MambaLayer (pure mixer) / MambaBlock (norm+residual) / MambaBackbone (full)
- Serialization — save/load weights via safetensors (HuggingFace standard)
- Standalone — no framework dependency (no PyTorch, no Burn, no Candle)
- f32 — native single precision, TF32 Tensor Cores on Ampere/Hopper

Installation:

```toml
[dependencies]
mamba-rs = "0.1"
```

Basic usage:

```rust
use mamba_rs::{MambaConfig, MambaState, MambaStepScratch, MambaWeights, mamba_step};

let cfg = MambaConfig::default();  // d_model=128, 3 layers
let weights = load_weights();      // your weight loading
let mut state = MambaState::zeros(cfg.n_layers, cfg.d_inner(), cfg.d_state, cfg.d_conv);
let mut scratch = MambaStepScratch::new(&cfg);
let mut output = vec![0.0f32; cfg.d_model];

// single-step inference (recurrent, O(1) per step)
mamba_step(&input, &mut output, &weights, &mut state.layers, &mut scratch, &cfg, input_dim);

// reset state on sequence boundary
state.reset();
```

With GPU support:

```toml
[dependencies]
mamba-rs = { version = "0.1", features = ["cuda"] }
```

Requires an NVIDIA GPU and the CUDA toolkit. Kernels are compiled at runtime via NVRTC.
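To make the SSM recurrence in the diagram concrete, here is a minimal, self-contained sketch of one discretized selective-SSM step for a single channel. This is an illustration of the math (`h = A*h + B*x`, `y = C*h + D*x` after discretizing with the step size `dt`), not the crate's internal kernel; the function name `ssm_step` and all parameter values are hypothetical.

```rust
// Illustrative single-channel selective-SSM step (NOT mamba-rs internals).
// Discretization: a_bar = exp(dt * a), b_bar = dt * b;
// recurrence: h[s] = a_bar * h[s] + b_bar * x, output: y = c.h + d*x.
fn ssm_step(
    h: &mut [f32], // hidden state, length d_state
    a: &[f32],     // per-state decay parameters (negative for stability)
    b: &[f32],     // input projection (selective: derived from x via x_proj)
    c: &[f32],     // output projection (selective)
    d: f32,        // skip connection weight
    dt: f32,       // softplus-activated step size (selective)
    x: f32,        // input for this channel at this time step
) -> f32 {
    let mut y = 0.0f32;
    for s in 0..h.len() {
        let a_bar = (dt * a[s]).exp();       // discretized state transition
        h[s] = a_bar * h[s] + dt * b[s] * x; // h = A*h + B*x
        y += c[s] * h[s];                    // y = C*h
    }
    y + d * x // + D*x skip term
}

fn main() {
    // toy parameters; the real model derives b, c, dt from the input (selectivity)
    let d_state = 4;
    let mut h = vec![0.0f32; d_state];
    let a = vec![-1.0f32; d_state];
    let b = vec![0.5f32; d_state];
    let c = vec![1.0f32; d_state];
    let (d, dt) = (1.0f32, 0.1f32);
    let mut y = 0.0;
    for _ in 0..3 {
        y = ssm_step(&mut h, &a, &b, &c, d, dt, 1.0);
    }
    println!("{y:.4}"); // state decays toward a fixed point under constant input
}
```

Because the state update is elementwise per `(channel, state)` pair, each step is O(d_inner × d_state) with no sequence-length dependence, which is what makes the single-step recurrent path O(1) per token.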
Architecture:

```
input [B, T, input_dim]
        |
  input_proj (linear + bias)
        |
+---------- x N layers ----------+
| residual                       |
|   |                            |
| RmsNorm                        |
|   |                            |
| in_proj ------+------ gate     |
|   |           |                |
| conv1d        |                |
|   |           |                |
| SiLU         SiLU              |
|   |           |                |
| x_proj        |                |
|  /  |  \      |                |
| dt  B   C     |                |
|  |            |                |
| dt_proj       |                |
|  |            |                |
| softplus      |                |
|  |            |                |
| SSM recurrence                 |
|   h = A*h + B*x                |
|   y = C*h + D*x                |
|   |           |                |
|   +-- gate * -+                |
|   |                            |
| out_proj                       |
|   |                            |
| + residual                     |
+--------------------------------+
        |
  norm_f (RmsNorm)
        |
output [B, T, d_model]
```

Default config: d_model=128, 3 layers, d_inner=256, d_state=16, 366K params.

GPU single-step inference latency:

| Batch | GH200 (H100) | GH200 + CUDA Graph | Intel (RTX 6000 Ada) | Ada + CUDA Graph |
|---|---|---|---|---|
| B=1 | 155 μs | 115 μs | 124 μs | 79 μs |
| B=4 | 193 μs | 140 μs | 145 μs | 99 μs |
| B=16 | 200 μs | 147 μs | 148 μs | 102 μs |
| B=64 | 201 μs | 152 μs | 156 μs | 115 μs |
| B=128 | 212 μs | 164 μs | 176 μs | 138 μs |

CUDA Graph eliminates kernel launch overhead (~40 μs saved).

GPU training:

| | GH200 (H100) | Intel (RTX 6000 Ada) |
|---|---|---|
| Forward | 788 μs | 706 μs |
| Forward + Backward | 2,176 μs | 1,739 μs |

CPU single-step inference:

| Config | d_model | layers | params | GH200 (Grace, 72 cores) | Intel (Xeon Gold 5412U) |
|---|---|---|---|---|---|
| small | 64 | 2 | 70K | 21 μs | 26 μs |
| default | 128 | 3 | 366K | 76 μs | 86 μs |
| medium | 256 | 4 | 1.8M | 261 μs | 349 μs |
| large | 512 | 6 | 10.4M | 1,226 μs | 2,651 μs |

CPU training:

| Config | d_model | layers | params | GH200 fwd/bwd/total | Intel fwd/bwd/total |
|---|---|---|---|---|---|
| small | 64 | 2 | 70K | 1,096 / 1,901 / 2,998 μs | 1,223 / 1,915 / 3,153 μs |
| default | 128 | 3 | 366K | 3,746 / 10,156 / 13,915 μs | 4,252 / 12,024 / 16,276 μs |
| medium | 256 | 4 | 1.8M | 12,333 / 68,020 / 80,429 μs | 14,249 / 54,683 / 68,931 μs |
| large | 512 | 6 | 10.4M | 49,325 / 428,821 / 478,146 μs | 65,911 / 877,676 / 943,588 μs |

Multi-threaded CPU training — GH200 Grace (72 cores):

| Batch | Forward | Backward | Total | Samples/sec | Speedup |
|---|---|---|---|---|---|
| B=16 | 3,990 μs | 13,588 μs | 17,583 μs | 910 | 12.6x |
| B=64 | 5,139 μs | 20,898 μs | 26,037 μs | 2,458 | 34.3x |
| B=128 | 8,976 μs | 31,151 μs | 40,217 μs | 3,183 | 44.3x |

Intel (Xeon Gold 5412U, 24 cores / 48 threads):

| Batch | Forward | Backward | Total | Samples/sec | Speedup |
|---|---|---|---|---|---|
| B=16 | 10,536 μs | 23,466 μs | 34,002 μs | 471 | 7.6x |
| B=64 | 21,970 μs | 65,521 μs | 87,491 μs | 728 | 11.9x |
| B=128 | 38,216 μs | 103,329 μs | 141,545 μs | 903 | 14.7x |

Thread-local gradient accumulation with epoch-based reduce. Parallel efficiency: GH200 61% (44x/72), Intel 61% (15x/24).

| | GPU speedup vs CPU |
|---|---|
| Inference B=1 (CUDA Graph) | 3.3x (GH200), 4.4x (Intel) |
| Training Fwd+Bwd T=32 | 5.5x (GH200), 8.6x (Intel) |

Zero heap allocations per inference step; all buffers are pre-allocated. The GPU path uses TF32 Tensor Cores (10-bit mantissa, ~1e-3 per-op precision). Validated against CPU f32:

| Check | Tolerance | Actual max diff |
|---|---|---|
| GPU vs CPU inference (20 steps) | 1e-2 | 0.003 |
| GPU vs CPU training forward (T=8) | 1e-2 | 0.003 |
| GPU vs CPU training backward (33 weight groups) | 0.15 | 0.124 |
| CUDA Graph vs non-graph | 1e-5 | < 1e-6 |
| CPU finite-diff gradient check | 5e-2 | < 1e-2 |

All 26 correctness checks pass.
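The last row of the validation table refers to a finite-difference gradient check, the standard way to validate hand-derived gradients like the ones this crate uses for BPTT. The crate's actual test harness is not shown in the post; the sketch below demonstrates the technique on a toy quadratic loss, where `loss`, `analytic_grad`, and `finite_diff_grad` are hypothetical names for illustration.

```rust
// Central-difference gradient check on a toy loss (illustration only;
// NOT the mamba-rs test harness).
fn loss(w: &[f32]) -> f32 {
    // toy quadratic loss: L(w) = sum(w_i^2)
    w.iter().map(|v| v * v).sum()
}

fn analytic_grad(w: &[f32]) -> Vec<f32> {
    // hand-derived gradient: dL/dw_i = 2*w_i
    w.iter().map(|v| 2.0 * v).collect()
}

fn finite_diff_grad(w: &[f32], eps: f32) -> Vec<f32> {
    // perturb one parameter at a time: g_i ≈ (L(w+eps*e_i) - L(w-eps*e_i)) / (2*eps)
    let mut g = vec![0.0f32; w.len()];
    let mut w2 = w.to_vec();
    for i in 0..w.len() {
        w2[i] = w[i] + eps;
        let lp = loss(&w2);
        w2[i] = w[i] - eps;
        let lm = loss(&w2);
        w2[i] = w[i]; // restore before next parameter
        g[i] = (lp - lm) / (2.0 * eps);
    }
    g
}

fn main() {
    let w = vec![0.3f32, -1.2, 2.0];
    let ga = analytic_grad(&w);
    let gn = finite_diff_grad(&w, 1e-3);
    let max_diff = ga
        .iter()
        .zip(&gn)
        .map(|(a, n)| (a - n).abs())
        .fold(0.0f32, f32::max);
    // central differences agree with a smooth analytic gradient to roughly O(eps^2),
    // limited in practice by f32 round-off in the loss evaluations
    assert!(max_diff < 5e-2);
    println!("max diff = {max_diff:.6}");
}
```

For a recurrent model the same idea applies per weight group, with the loss evaluated through the full unrolled sequence; the f32 round-off explains why the README's tolerance (5e-2) is loose compared to the quadratic-toy case.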

This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.
