Show HN: I built a second-order PyTorch optimizer for LLMs that runs on a 16GB GPU

hackernews | April 29, 2026 21:19 | 📰 News

#perplexity #opensource

Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

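Hi HN, I'm Danilo. I've been struggling with the limitations of AdamW when fine-tuning LLMs locally. Second-order optimizers (like Shampoo or SOAP) offer significantly better step-convergence by exploiting Kronecker-factored curvature. The problem? They require O(d^2) memory and O(d^3) compute per layer, which immediately OOMs consumer hardware like a 16GB T4 or RTX 3090. I wanted Shampoo-quality preconditioning on my home setup, so I built SCAO (Sparse Curvature-Aware Optimizer).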

Body

Hi HN, I'm Danilo. I've been struggling with the limitations of AdamW when fine-tuning LLMs locally. Second-order optimizers (like Shampoo or SOAP) offer significantly better step-convergence by exploiting Kronecker-factored curvature. The problem? They require O(d^2) memory and O(d^3) compute per layer, which immediately OOMs consumer hardware like a 16GB T4 or RTX 3090.

I wanted Shampoo-quality preconditioning on my home setup, so I built SCAO (Sparse Curvature-Aware Optimizer).

It's a PyTorch optimizer that acts as a drop-in replacement for AdamW, but it implements a few strict architectural changes to survive on consumer cards:

1. Adaptive Rank Selection: Instead of full-rank Kronecker factors, it truncates the eigenspace to retain >=95% of spectral mass.
2. Int8 EMA Quantization: The curvature accumulators are stored in symmetric int8, which yields a 4x memory reduction with zero degradation in perplexity.
3. Quantization Stability: Standard Shampoo usually crashes at step 1 during 4-bit QLoRA fine-tuning due to SVD ill-conditioning in quantized spaces. SCAO exploits sparse approximations to bypass this.
4. Fused CUDA kernels: I wrote custom kernels to fix an O(k * m^2 * n) complexity bottleneck in the naive projection implementation.

The Benchmark:
I recently ran a head-to-head benchmark on a single T4 (16GB VRAM) fine-tuning Qwen2.5-3B (4-bit QLoRA, rank 16):
- Shampoo: Failed at Step 1 (SVD mathematical collapse).
- SCAO: 100% stability, peaked at exactly 7.14 GB VRAM, with a smooth loss descent.

It is pip-installable (pip install scao).

I've written a technical report detailing the regret bounds, ablation studies, and scaling laws (published on Zenodo), but I really wanted to get this community's eyes on the CUDA kernels and the PyTorch implementation.

GitHub: https://github.com/whispering3/scao
Technical Report (DOI): https://doi.org/10.5281/zenodo.19870556

I'd love any feedback, code roasts, or questions about the math behind it!
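
For readers who want to see items 1 and 2 of the list concretely, here is a minimal NumPy sketch of the two ideas. It is not SCAO's actual code: the function names (select_rank, quantize_int8, dequantize_int8, ema_update) are hypothetical, "spectral mass" is assumed to mean the sum of eigenvalues, the int8 scale is assumed to be per-tensor, and the dequantize-update-requantize pattern with beta=0.99 is just one plausible way to keep an EMA in int8. NumPy is used instead of PyTorch/CUDA only to keep the math visible.

```python
from typing import Tuple

import numpy as np


def select_rank(factor: np.ndarray, mass: float = 0.95) -> int:
    """Smallest k whose top-k eigenvalues hold at least `mass` of the spectrum.

    `factor` is assumed to be a symmetric PSD curvature factor.
    """
    # eigvalsh returns eigenvalues in ascending order; flip to descending.
    eigvals = np.linalg.eigvalsh(factor)[::-1]
    # Guard against tiny negative values caused by round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    total = eigvals.sum()
    if total == 0.0:
        return 1  # degenerate factor: keep a single direction
    cumulative = np.cumsum(eigvals) / total
    # First index where the cumulative share reaches `mass`, as a count.
    k = int(np.searchsorted(cumulative, mass, side="left")) + 1
    return min(k, len(eigvals))


def quantize_int8(x: np.ndarray) -> Tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: x is approximated by scale * q."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to float32."""
    return q.astype(np.float32) * scale


def ema_update(
    q: np.ndarray, scale: float, stat: np.ndarray, beta: float = 0.99
) -> Tuple[np.ndarray, float]:
    """One EMA step on an int8-stored accumulator (assumed update pattern)."""
    acc = dequantize_int8(q, scale)
    acc = beta * acc + (1.0 - beta) * stat
    return quantize_int8(acc)
```

With a float32 accumulator, the int8 copy uses one byte per entry instead of four, which is where a 4x reduction would come from. The rounding error of this scheme is at most half a quantization step (scale / 2) per entry; whether that is harmless for curvature statistics in practice is something the technical report's ablations would have to show.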


This analysis was written by the Genesis Park editorial team with the help of AI. The original text is available via the source link.
