TurboQuant model weight compression support added to llama.cpp
📦 Open Source
#claude
#cuda
#llama
#llamacpp
#model optimization
#review
#turboquant
#weight compression
Source: hackernews · Summarized and analyzed by Genesis Park
Summary
According to the post, TurboQuant's TQ4_1S and TQ3_1S formats, which compress the weights of large language models, have been added to llama.cpp, shrinking model size by 27–37% while keeping the rise in perplexity (PPL) to 1.0–1.9%. Quantization and dequantization are supported across CUDA, Metal, and CPU, and all regression tests passed with no inference slowdown or functional breakage for existing uncompressed models. On Mac (Metal), the formats run at 93–100% of the baseline Q8_0 speed, and on CUDA, optimization work has been narrowing the performance gap for compressed models, securing stable multi-platform compatibility.
Full text
Conversation

Adds CUDA dequantization for TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv), the WHT-rotated weight compression types. These achieve 27–37% model size reduction at +1.0–1.9% PPL on Qwen/Phi families. Base types + Metal + CPU quantize/dequant come from TheTom's PR TheTom#45.

CUDA additions:
- turbo-quant.cuh: weight centroids (N(0,1) Lloyd-Max, 16/8 levels), sign array for the 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S in all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS, excluded from mmvq/mmq (uses the cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. Future optimization: pre-rotate activations via a warp-shuffle WHT (same pattern as the KV cache Q rotation) to eliminate the per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context)

---

Regression Test Results — PR #45

Verified that the TQ4_1S weight compression PR does NOT break existing TurboQuant KV cache functionality or standard inference on non-compressed models. Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB).

- Speed — no regressions. All speeds normal or improved.
- PPL — no regressions (full wikitext-2 runs). All PPL values match known-good. MUL_MAT_ID (MoE path) verified working.

Verdict: ALL TESTS PASS. 5 models, 2 hardware platforms, 4 KV configs.

---

turbo4 SET_ROWS was using turbo3's shared template with the wrong 2+1-bit packing. New dedicated kernel_set_rows_turbo4 with correct 3-bit packed indices + QJL signs. PPL: 679 → 6.19. Also added turbo4 prefill FA kernel instantiations (non-vec path).

QJL ablation finding: disabling QJL improves PPL from 6.1894 to 6.1756 (identical to turbo3). The QJL correction hurts quality in the attention context. Consistent with scos-lab issue #45.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context)

---

Update: rebased on upstream master + regression test

Branch force-pushed; now rebased on the latest upstream master.

Upstream conflict: activation rotation (commit 744c0c7). Upstream added graph-level Hadamard rotation for KV cache quantization (…). Fix: disabled the upstream rotation by default in our fork; users can re-enable it with …

Regression test (M5 Max, rebased branch cb8bddc): all tests pass, no regressions, Phi-4 crash resolved.

---

CUDA port available on our branch: signalnine/llama-cpp-turboquant

What's implemented: …

Results (Qwen2.5-7B TQ4_1S, RTX 5090): …

The fused kernel pre-rotates the activation vector once per mul_mat via … Happy to iterate on this if you have ideas for closing the CUDA gap further.

---

This is great work, thank you for turning this around so fast. PPL matching between cuBLAS and fused confirms correctness. One question before we merge: can you confirm that uncompressed models (q8_0, q4_0, etc.) show no decode regression on this branch? I.e. the new code paths only activate for TQ4_1S/TQ3_1S, and existing quant types run at the same speed as before the PR.

---

CUDA kernel review — performance improvement opportunities

Nice work on the V8 pre-rotation approach. PPL matching confirms correctness. Here's what I see for closing the gap from 39% to 70–85% of q8_0:

- High priority (biggest decode wins): …
- Medium priority: …
- Skip / low value: …

Realistic ceiling, per architecture with full tuning (NR0 + load dedup + vectorized + batch): …

The 39% → 70–85% gap is primarily data reuse, not math precision. The pre-rotation design is correct — it just needs the activation tile shared across more rows per CTA.
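The dequantize.cuh entry above describes the inverse transform as "5 butterfly stages + normalize + unsign". For 32-element blocks that is a standard fast Walsh–Hadamard transform; here is a minimal sketch, assuming an orthonormal 1/√32 convention and a ±1 sign array (function name and sign representation are illustrative, not the PR's actual code):

```cuda
// Minimal sketch of the per-block inverse transform: a 32-point fast
// Walsh-Hadamard transform (5 butterfly stages), then normalization and
// undoing the per-element sign flips. Illustrative names, not the PR's code.
__device__ void inverse_wht32(float v[32], const int8_t sign[32]) {
    // log2(32) = 5 butterfly stages at strides 1, 2, 4, 8, 16.
    for (int stride = 1; stride < 32; stride <<= 1) {
        for (int i = 0; i < 32; i += 2 * stride) {
            for (int j = i; j < i + stride; ++j) {
                const float a = v[j];
                const float b = v[j + stride];
                v[j]          = a + b;
                v[j + stride] = a - b;
            }
        }
    }
    // Assuming the orthonormal convention (1/sqrt(32) on both the forward
    // rotation and this inverse); then "unsign" with the stored +/-1 signs.
    const float norm = rsqrtf(32.0f);
    for (int k = 0; k < 32; ++k) {
        v[k] *= norm * (float) sign[k];
    }
}
```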
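Both the PR description's "future optimization" note and the fused-kernel comment refer to pre-rotating activations with a warp-shuffle WHT. Because the rotation is orthogonal, rotating the activation once lets the rotated weight blocks be used directly, with no per-block inverse. A sketch of that pattern with one float per lane (illustrative, not the branch's actual kernel):

```cuda
// Warp-shuffle WHT sketch: with one activation element per lane, each
// butterfly stage is a single XOR shuffle, so a 32-point WHT costs
// 5 shuffles and no shared memory.
__device__ float warp_wht32(float x) {
    const int lane = threadIdx.x & 31;  // lane id within the warp
    #pragma unroll
    for (int mask = 1; mask < 32; mask <<= 1) {
        const float other = __shfl_xor_sync(0xffffffff, x, mask);
        // Lanes with the stage bit clear take a + b; their partners take a - b.
        x = ((lane & mask) == 0) ? (x + other) : (other - x);
    }
    return x * rsqrtf(32.0f);           // orthonormal scaling (assumed)
}
```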
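The "N(0,1) Lloyd-Max, 16/8 levels" centroids in turbo-quant.cuh can be derived with the classic Lloyd-Max iteration for a standard Gaussian: decision boundaries are midpoints between centroids, and each centroid is the conditional mean of its interval. The PR presumably ships these as precomputed constants; this self-contained derivation and its names are assumptions for illustration:

```cpp
// Lloyd-Max centroids for a standard normal source (e.g. 16 levels for
// TQ4_1S, 8 for TQ3_1S). Sketch only; the PR likely stores a fixed table.
#include <cmath>
#include <vector>

static double pdf(double x) {            // standard normal density
    return std::exp(-0.5 * x * x) / std::sqrt(2.0 * std::acos(-1.0));
}
static double cdf(double x) {            // standard normal CDF
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

std::vector<double> lloyd_max_centroids(int levels, int iters = 500) {
    std::vector<double> c(levels), t(levels + 1);
    for (int i = 0; i < levels; ++i)     // crude initial spread over [-3, 3]
        c[i] = -3.0 + 6.0 * (i + 0.5) / levels;
    for (int it = 0; it < iters; ++it) {
        t[0] = -INFINITY;
        t[levels] = INFINITY;
        for (int i = 1; i < levels; ++i) // boundaries: centroid midpoints
            t[i] = 0.5 * (c[i - 1] + c[i]);
        for (int i = 0; i < levels; ++i) // centroids: conditional means
            c[i] = (pdf(t[i]) - pdf(t[i + 1])) / (cdf(t[i + 1]) - cdf(t[i]));
    }
    return c;                            // e.g. lloyd_max_centroids(16)
}
```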
Hardware / quantize tool verification

M5 Max — uncompressed weights + TurboQuant KV:
- Qwen2.5-1.5B Q8_0
- Phi-4 14B Q8_0 (crash fix verification): no crash; upstream attn_rot disabled by default (commit cb8bddc)
- Qwen3.5-27B Q8_0
- Qwen3.5-35B MoE Q8_0 (MUL_MAT_ID path)

M5 Max — TQ4_1S weight compression:
- Qwen2.5-1.5B Config I (1.28 GiB, 6.20 BPW)

Mac Mini M2 Pro:
- Qwen2.5-7B Q4_K_M

Summary: all tests pass. PR is safe for review.

---

Cherry-picked the load-time conversion (daf0484 + dca057a). Incredible turnaround @signalnine. Asked for it this morning, shipped by evening, 105 t/s matching native q8_0. That is production-quality work.

Metal sanity test on M5 Max: all pass, no regressions.

TQ4_1S is now a dual-mode format: runtime on Metal (93% of Q8_0), storage + load-convert on CUDA (100% of Q8_0). 40% smaller downloads. Best of both worlds.

---

Qwen3.5-27B TQ4_1S + Turbo4 KV (Metal)

Hardware: Apple Silicon (MTLGPUFamilyMetal4), MacBook M1 Max, 64 GB RAM

File sizes: …
llama-bench -p 512 -n 128: …
Observations: …

Build: 045316d (8758)

---

Windows CUDA regression check on current PR head vs current upstream. Result: standard quant types remain within normal measurement range on this Windows CUDA setup; the new TQ code paths do not appear to introduce a decode regression for …

---

Windows CUDA TurboKV validation on current PR head. This gives a Windows CUDA check on the latest head after the turbo4 reference WHT fix.

---

Windows CUDA refreshed TQ4_1S results on current PR head: …

---

Windows …
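A quick sanity check on the size figures quoted in the thread, assuming the baseline is Q8_0 at its standard 8.5 bits per weight (32 int8 values plus an fp16 scale per block):

5.0 bpv (TQ4_1S) ÷ 8.5 bpw (Q8_0) ≈ 0.59, i.e. ≈41% smaller per compressed tensor, consistent with the "40% smaller downloads" claim above.

6.20 BPW (whole-model, Qwen2.5-1.5B Config I) ÷ 8.5 bpw ≈ 0.73, i.e. ≈27% smaller, the low end of the quoted 27–37% range; whole-model savings are smaller because embedding and output tensors typically stay at higher precision.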
This analysis was produced by the Genesis Park editorial team with AI assistance. The original can be found via the source link.