Quansloth, using Google's TurboQuant, breaks the local-LLM "VRAM wall"

hackernews | 📦 Open Source
#google #llama #local-llms #quansloth #turboquant #vram-wall
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Quansloth, an implementation of Google Research's TurboQuant algorithm, tackles the VRAM shortage that hits when running large language models (LLMs) locally. This open-source AI server compresses the model's "memory" (the KV cache) from 16-bit to 4-bit, cutting cache usage by up to 75%, so contexts of 32k tokens or more, which would normally call for an expensive 24GB RTX 4090, can run stably on a consumer GPU such as a 6GB RTX 3060. It also provides real-time hardware monitoring to prevent system crashes during long document analysis, and ships a fully private interface supported on both Windows and Linux.
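To put the 75% figure in perspective, here is a back-of-the-envelope estimate of how large the KV cache (the "memory" being compressed) grows at a 32k-token context. The model dimensions below are illustrative assumptions for a mid-size Llama-style model, not numbers taken from Quansloth or the TurboQuant paper.

```python
# Back-of-the-envelope KV-cache sizing for a hypothetical Llama-style model.
# All dimensions below are illustrative assumptions, not Quansloth's numbers.
n_layers = 32        # transformer layers
n_kv_heads = 8       # key/value heads (grouped-query attention)
head_dim = 128       # dimension per head
context = 32_768     # target context length in tokens

def kv_cache_bytes(bits_per_value: int) -> int:
    # 2x for keys and values; one entry per layer, head, head-dim, and token.
    # Ignores the small per-block scale/zero-point overhead real quantizers add.
    return 2 * n_layers * n_kv_heads * head_dim * context * bits_per_value // 8

print(f"fp16 cache:  {kv_cache_bytes(16) / 2**30:.1f} GiB")  # ~4.0 GiB
print(f"4-bit cache: {kv_cache_bytes(4) / 2**30:.1f} GiB")   # ~1.0 GiB, i.e. 75% less
```

Under these assumptions the cache alone eats roughly 4 GiB at fp16, which is most of a 6GB card before the model weights are even counted; at 4-bit it drops to about 1 GiB.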

Body

[ Quansloth ASCII banner | POWERED BY TURBOQUANT+ | NVIDIA CUDA ]

Breaking the VRAM Wall: based on the implementation of Google's TurboQuant (ICLR 2026), Quansloth brings elite KV cache compression to local LLM inference.

Quansloth is a fully private, air-gapped AI server that runs massive-context models natively on consumer hardware (like an RTX 3060). By bridging a custom Gradio Python frontend with a highly optimized llama.cpp CUDA backend, Quansloth achieves extreme memory compression, saving up to 75% of VRAM.

Standard LLM inference often hits a "memory wall" when processing long documents: as the context grows, the GPU runs out of memory (OOM) and the system crashes. Quansloth prevents these crashes with:

- 75% Cache Shrink: compressing the "memory" of the AI from 16-bit to 4-bit (TurboQuant).
- Massive Context on Budget GPUs: run 32k+ token contexts on a 6GB RTX 3060 that would normally require a 24GB RTX 4090.
- Hardware-Level Stability: the interface monitors the CUDA backend to keep the model within your GPU's physical limits, allowing stable, long-form document analysis without fear of a system hang.

📸 Interface Preview

Platform support

- Windows 10/11: fully supported (via WSL2 Ubuntu); features a 1-click .bat launcher.
- Linux: fully supported (native).
- macOS: not officially supported out of the box (the backend is optimized for NVIDIA CUDA GPUs).

Features

- TurboQuant Cache Compression: run 8,192+ token contexts natively on 6GB GPUs without Out-Of-Memory (OOM) crashes.
- Live Hardware Analytics: the UI intercepts the C++ engine logs to report your exact VRAM allocation and savings in real time.
- Context Injector: upload long documents (PDF, TXT, CSV, MD) directly into the chat stream to test the AI's memory limits.
- Dual-Routing: auto-scan your local models/ folder, or enter a custom absolute path to load any .gguf file.
- Cyberpunk UI: a sleek, fully responsive dark-mode dashboard built for power users.

Requirements

- Windows with WSL2 (Ubuntu) OR native Linux
- NVIDIA GPU with updated drivers
- Miniconda or Anaconda installed

Installation

    conda create -n quansloth python=3.10 -y
    conda activate quansloth
    git clone https://github.com/PacifAIst/Quansloth.git
    cd Quansloth
    pip install -r requirements.txt
    chmod +x install.sh
    ./install.sh

Download .gguf models (e.g., Llama 3 8B) and place them in the models/ folder.

Launch

- Windows: use Launch_Quansloth.bat; double-clicking it auto-launches WSL, Conda, and the server.
- Linux: conda activate quansloth, then python quansloth_gui.py
- Open http://127.0.0.1:7860 in your browser.

Tips

- Symmetric (Turbo3): best overall compression.
- Asymmetric (Q8/Turbo4): better for Q4_K_M models (e.g., Qwen).
- Monitor Hardware Stats for real-time VRAM savings.

Credits

- License: this project is licensed under the Apache 2.0 License.
- Core Technology: built upon the TurboQuant+ implementation developed by TheTom (@TheTom).
- Research & Algorithms: the underlying algorithm is based on research from Google Research (arXiv:2504.19874).
- CUDA Kernels: special thanks to Gabe Ortiz (signalnine) for porting the CUDA kernels.

👤 Author

Dr. Manuel Herrador
📧 [email protected]
University of Jaén (UJA), Spain

Made with ❤️ for the Local AI Community by PacifAIst
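A few editorial sketches may help make the README's features concrete. First, the "Live Hardware Analytics" feature is described as intercepting the C++ engine's logs. A rough illustration of that pattern: launch the backend as a subprocess and scan its output for buffer-size lines. The log format assumed in the regular expression is a guess at a llama.cpp-style line, not the exact format Quansloth parses, and the server binary and flags in the usage comment are hypothetical.

```python
import re
import subprocess

# Assumed llama.cpp-style log line, e.g.
#   "llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB"
# The real format (and what Quansloth actually parses) may differ.
BUFFER_LINE = re.compile(r"buffer size\s*=\s*([\d.]+)\s*MiB", re.IGNORECASE)

def monitor_backend(cmd: list[str]) -> float:
    """Launch the engine and sum the buffer allocations it reports in its logs."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    total_mib = 0.0
    for line in proc.stdout:
        match = BUFFER_LINE.search(line)
        if match:
            total_mib += float(match.group(1))
            print(f"[analytics] VRAM reported so far: {total_mib:.0f} MiB")
    proc.wait()
    return total_mib

# Hypothetical usage:
# monitor_backend(["./llama-server", "-m", "models/llama3-8b.gguf", "-c", "32768"])
```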
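Second, the "Dual-Routing" model picker can be pictured as a small path-resolution helper: scan the local models/ folder by default, or accept an absolute path to any .gguf file. The function below is a hypothetical sketch of that behavior, not Quansloth's actual code.

```python
from pathlib import Path

MODELS_DIR = Path("models")  # default folder named in the README

def discover_models(custom_path: str | None = None) -> list[Path]:
    """Hypothetical dual-routing: scan models/ by default, or take an absolute .gguf path."""
    if custom_path:
        p = Path(custom_path).expanduser()
        if p.is_file() and p.suffix == ".gguf":
            return [p]
        raise FileNotFoundError(f"No .gguf file found at {p}")
    # Auto-scan route: every .gguf under models/, newest first.
    return sorted(MODELS_DIR.glob("*.gguf"),
                  key=lambda f: f.stat().st_mtime, reverse=True)

if __name__ == "__main__":
    for model in discover_models():
        print(model.name)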
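Finally, the tip contrasting "Symmetric (Turbo3)" and "Asymmetric (Q8/Turbo4)" maps onto two standard quantization schemes. The NumPy sketch below shows the textbook 4-bit variants on a fake cache slice for intuition only; it is not TurboQuant's algorithm or Quansloth's kernels.

```python
import numpy as np

def quantize_symmetric_4bit(x: np.ndarray):
    """Symmetric: a single scale, zero maps to zero; codes land in [-8, 7]."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_asymmetric_4bit(x: np.ndarray):
    """Asymmetric: scale plus zero-point; codes land in [0, 15]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 15.0
    zero = round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

# Compare reconstruction error on a skewed, fake cache slice.
rng = np.random.default_rng(0)
x = rng.normal(0.05, 0.1, 4096).astype(np.float32)

q_s, s = quantize_symmetric_4bit(x)
q_a, sa, za = quantize_asymmetric_4bit(x)
err_s = np.abs(x - q_s.astype(np.float32) * s).mean()
err_a = np.abs(x - (q_a.astype(np.float32) - za) * sa).mean()
print(f"symmetric  mean abs error: {err_s:.5f}")
print(f"asymmetric mean abs error: {err_a:.5f}")
```

On skewed data like this, the asymmetric variant usually reconstructs more accurately because its zero-point shifts the 16 available levels onto the actual value range, which is consistent with the README's suggestion to prefer it for certain model families.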

This analysis was produced by the Genesis Park editorial team with the help of AI. The original can be found via the source link.
