Show HN: SwiftLM – Qwen Chat on iPhone, 100B+ MoE on M5 Pro 64GB (Native Swift)
hackernews
📦 Open Source
#iphone
#llm
#moe
#qwen
#swift
#llama
#mistral
#mlx
#openai
#openai api
#review
#swiftlm
Source: hackernews · Summarized and analyzed by Genesis Park
Summary
SwiftLM, an open-source project implemented in pure Swift, lets you run Qwen chat models on an iPhone. Its headline feature is running giant Mixture-of-Experts (MoE) models with more than 100 billion parameters natively on an Apple M5 Pro machine with 64 GB of unified memory. This significantly expands what on-device AI models can do on high-end hardware.
Full Text
A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API. No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.

- 🍎 100% Native Apple Silicon: powered natively by Metal and Swift.
- 🔌 OpenAI-compatible: drop-in replacement for OpenAI SDKs (/v1/chat/completions, streaming, etc.).
- 🧠 Smart Model Routing: loads HuggingFace-format models directly, with native Safetensors parsing.
- ⚡️ TurboQuantization Integrated: custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out of the box.
- 💾 SSD Expert Streaming: experimental zero-copy streaming that swaps Mixture-of-Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without thrashing macOS Unified Memory (prevents Watchdog OS kernel panics on 122B+ models).
- 🎛️ Granular Memory Control: integrated layer partitioning (--gpu-layers) and Wisdom Auto-Calibration for squeezing massive models into RAM.

TurboQuant KV cache compression

SwiftLM implements a hybrid V2+V3 TurboQuant architecture for on-the-fly KV cache compression. At roughly ~3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs. FP16 with near-zero accuracy loss. Recent reproductions of the TurboQuant algorithm (e.g., turboquant-mlx) revealed two distinct paths:

- V2 (Hardware-Accelerated): fast, but uses linear affine quantization, which degrades quality at 3-bit.
- V3 (Paper-Correct): excellent quality using non-linear Lloyd-Max codebooks, but painfully slow due to software dequantization.

We built the "Holy Grail" hybrid: we ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path and run the dequantization natively in fused Metal (bggml-metal) shaders. This achieves V3 quality at V2 speeds, completely detached from Python overhead.

K-cache (3-bit PolarQuant + 1-bit QJL) = 4.25 bits/dim (a minimal Swift sketch of these steps follows at the end of this section):

- Extract the L2 norm and normalize: x̂ = x / ‖x‖.
- Apply a Fast Walsh-Hadamard Transform (WHT) rotation to distribute outliers evenly.
- Quantize each coordinate using 3-bit non-linear Lloyd-Max centroids.
- Compute the residual error between the original vector and the quantized approximation.
- Project the residual via a random Johnson-Lindenstrauss (QJL) matrix and store the 1-bit signs.

(Why QJL? QJL acts as an additional regularizer that prevents centroid resolution loss from degrading the attention dot-product.)

V-cache (3-bit PolarQuant) = 3.125 bits/dim

Because the V-cache matrix is not used for inner-product attention scoring, the QJL error correction provides no benefit. We cleanly disable QJL for the V-cache, extracting an additional 25% memory savings without sacrificing quality.

Reference implementations: turboquant-mlx | turboquant_plus | Paper: TurboQuant (Google, arXiv:2504.19874)

Benchmark hardware

To reliably run massive 122B-parameter MoE models over SSD streaming, SwiftLM was designed and benchmarked natively on the following hardware:

- Machine: MacBook Pro, Apple M5 Pro
- Memory: 64 GB Unified Memory
- Model: Qwen3.5-122B-A10B-4bit
- SSD: internal Apple NVMe (zero-copy streaming)

⚠️ Quantization Disclaimer: while heavier quantization shrinks the required memory footprint, 4-bit quantization remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars, routinely producing broken keys like `\name\` instead of `"name"`, which systematically breaks OpenAI-compatible tool calling.
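The fractional bits/dim figures come from per-vector metadata stored alongside the raw codes. The source does not break them down; one plausible accounting that reproduces the stated numbers, assuming an FP16 norm per 64-coordinate key vector and an 8-bit scale per 64-coordinate value vector (both our assumptions, not confirmed by the source), is:

$$\text{K-cache: } 3 + 1 + \tfrac{16}{64} = 4.25 \ \text{bits/dim}, \qquad \text{V-cache: } 3 + \tfrac{8}{64} = 3.125 \ \text{bits/dim}$$

This is also consistent with the quoted ~25% extra V-cache savings: dropping the 1-bit QJL signs takes 4.25 down to 3.125 bits/dim, roughly a quarter of the K-cache footprint.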
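To make the K-cache pipeline concrete, here is a minimal Swift sketch of the five steps listed above. The 8-entry codebook values, the QJL width, and all names here are illustrative assumptions, not SwiftLM's actual implementation, which runs in the native C++/Metal path:

```swift
import Foundation

/// In-place Fast Walsh-Hadamard Transform; `x.count` must be a power of two.
func fwht(_ x: inout [Float]) {
    var h = 1
    while h < x.count {
        for i in stride(from: 0, to: x.count, by: h * 2) {
            for j in i..<(i + h) {
                let (a, b) = (x[j], x[j + h])
                x[j] = a + b
                x[j + h] = a - b
            }
        }
        h *= 2
    }
    let scale = 1 / Float(x.count).squareRoot()   // orthonormal rotation
    for i in x.indices { x[i] *= scale }
}

struct QuantizedKey {
    let norm: Float       // stored L2 norm (the "polar" radius)
    let codes: [UInt8]    // one 3-bit Lloyd-Max index per coordinate
    let qjlSigns: [Bool]  // 1-bit signs of the JL-projected residual
}

// Placeholder 8-entry (3-bit) codebook. A real Lloyd-Max table is fitted to
// the post-rotation coordinate distribution, roughly Gaussian after the WHT.
let centroids: [Float] = [-1.75, -1.0, -0.5, -0.15, 0.15, 0.5, 1.0, 1.75]

func quantizeKey(_ key: [Float], qjl: [[Float]]) -> QuantizedKey {
    // 1. Extract the L2 norm and normalize: x̂ = x / ‖x‖.
    let norm = key.map { $0 * $0 }.reduce(0, +).squareRoot()
    var x = key.map { $0 / max(norm, Float.leastNormalMagnitude) }

    // 2. WHT rotation spreads outliers evenly across coordinates.
    fwht(&x)

    // 3. Nearest-centroid (Lloyd-Max) 3-bit code per coordinate.
    let codes = x.map { v -> UInt8 in
        UInt8(centroids.indices.min { abs(centroids[$0] - v) < abs(centroids[$1] - v) }!)
    }

    // 4. Residual between the rotated vector and its quantized approximation.
    let residual = zip(x, codes).map { $0.0 - centroids[Int($0.1)] }

    // 5. Random JL projection of the residual; keep only the 1-bit signs.
    let signs = qjl.map { row in
        zip(row, residual).map { $0.0 * $0.1 }.reduce(0, +) >= 0
    }
    return QuantizedKey(norm: norm, codes: codes, qjlSigns: signs)
}

// Example: compress one 64-dim key with a 16-row random ±1 QJL matrix.
let (dim, m) = (64, 16)
let qjl = (0..<m).map { _ in (0..<dim).map { _ in Float(Bool.random() ? 1 : -1) } }
let key = (0..<dim).map { _ in Float.random(in: -1...1) }
let q = quantizeKey(key, qjl: qjl)
print(q.norm, q.codes.prefix(8), q.qjlSigns.prefix(8))
```

Dequantization reverses this: look up centroids[code], undo the rotation (the orthonormal WHT is its own inverse), and rescale by the stored norm; per the description above, SwiftLM runs that step inside fused Metal shaders.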
SwiftLM Chat (iOS companion app)

A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.

- Tab UI: Chat · Models · Settings
- Live download progress with a speed indicator and circular progress ring
- Model catalog: Qwen3, Phi-3.5, Mistral, Llama, with on-device RAM fit indicators
- HuggingFace search: find any mlx-community model by name
- Context-aware empty states: downloading ring, loading spinner, idle prompt
- iOS lifecycle hardened: model unload only fires on a true backgrounding event (not notification banners); 30-second grace period on app switch

cd SwiftLMChat
python3 generate_xcodeproj.py   # Generates SwiftLMChat.xcodeproj
open SwiftLMChat.xcodeproj

Then in Xcode:

- Select the SwiftLMChat target → Signing & Capabilities
- Set your Team (your Apple Developer account)
- Select your iPhone as the run destination
- ⌘R to build and run

Note for contributors: the .xcodeproj is git-ignored (it contains your personal Team ID). Run generate_xcodeproj.py after cloning to regenerate it locally. Your Team ID is never committed.

Running the macOS server

The absolute fastest way to get started is to download the latest pre-compiled macOS binary directly from the Releases page. Just extract it and run!

swift build -c release
.build/release/SwiftLM \
  --model Qwen3.5-122B-A10B-4bit \
  --stream-experts true \
  --port 5413

(Note: add --stream-experts=true if you are attempting to run oversized MoE models like Qwen3.5 122B to bypass macOS virtual memory sw…)
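Once the server is up, any OpenAI SDK pointed at http://localhost:5413/v1 should work as a drop-in, per the compatibility claim above. As a dependency-free sanity check, here is a minimal Swift client for the /v1/chat/completions endpoint; the port and model name assume the launch command above:

```swift
import Foundation

// Minimal chat-completions request against a local SwiftLM server.
// Run as a script (e.g. `swift client.swift`) while the server is listening.
let url = URL(string: "http://localhost:5413/v1/chat/completions")!
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")

let payload: [String: Any] = [
    "model": "Qwen3.5-122B-A10B-4bit",
    "messages": [["role": "user", "content": "Say hello in one sentence."]],
    "stream": false   // set true for server-sent events, as with OpenAI
]
request.httpBody = try! JSONSerialization.data(withJSONObject: payload)

URLSession.shared.dataTask(with: request) { data, _, error in
    if let data, let body = String(data: data, encoding: .utf8) {
        print(body)   // standard OpenAI-style chat.completion JSON
    } else {
        print("request failed:", error?.localizedDescription ?? "unknown error")
    }
    exit(0)
}.resume()

RunLoop.main.run()   // keep the script alive until the response arrives
```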
This analysis was produced by the Genesis Park editorial team with the help of AI. The original post is available via the source link.