A simple L7 proxy for vLLM that manages LoRA adapter storage via NVMe

#ai-models #ai-infrastructure #l7-proxy #llama #lora #nvme #openai #vllm
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

loraplex is an L7 proxy that sits between clients and a vLLM cluster and routes each request to a specific node via consistent hashing. It fetches LoRA adapter files on demand from HuggingFace, S3, and other origins, stores them on a shared NVMe disk, and lets vLLM read them independently to run inference. The hash key can also be taken from a request header, enabling session pinning or document-based routing for RAG workloads, and in Kubernetes (K8s) it discovers peers automatically via the Endpoints API while bounding disk usage with LRU eviction.
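To make the summary concrete, here is a minimal client sketch, assuming the OpenAI Python SDK and a loraplex instance on localhost:9090 in front of a vLLM OpenAI-compatible server; the adapter name and the X-Session-ID header are placeholders, and this is illustrative rather than code from the project.

```python
# Minimal client sketch: talk to loraplex exactly like a vLLM OpenAI endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9090/v1",  # point at loraplex, not vLLM directly
    api_key="unused",                      # fine if vLLM was started without --api-key
)

resp = client.chat.completions.create(
    model="your-org/your-lora-adapter",    # LoRA adapter name; loraplex fetches it on demand
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
    max_tokens=50,
    # If loraplex is configured to hash on a header (e.g. header:X-Session-ID),
    # this value pins all requests for the session to the same node.
    extra_headers={"X-Session-ID": "session-1234"},
)
print(resp.choices[0].message.content)
```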

Full text

A simple L7 proxy for vLLM that manages LoRA adapter storage via NVMe, routes requests, and pins workloads to nodes.

loraplex sits between your clients and vLLM. It routes requests across a cluster using consistent hashing, manages LoRA adapter files on disk (fetching on demand from HuggingFace, S3, or HTTP, with LRU eviction), and provides node affinity through configurable hash keys. By default it hashes on the adapter name, but it can hash on any request header, enabling session pinning for prefix cache reuse, document-based routing for RAG workloads, or tenant isolation. vLLM's lora_filesystem_resolver reads adapter files from the same directory loraplex writes to.

Contents: Quick Start · How It Works · API · Examples and Deployment Modes · Architecture Details · Reference · Development · License

```
go install github.com/shayonj/loraplex/cmd/loraplex@latest
```

Or build from source:

```
git clone https://github.com/shayonj/loraplex.git
cd loraplex
make build
```

```
# Start vLLM with filesystem resolver pointed at loraplex's storage directory
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=true
export VLLM_PLUGINS=lora_filesystem_resolver
export VLLM_LORA_RESOLVER_CACHE_DIR=/mnt/nvme/loraplex
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora --max-loras 200 --max-lora-rank 64

# Start loraplex (writes adapter files to the same directory vLLM reads from)
./bin/loraplex \
  --listen :9090 \
  --vllm-url http://localhost:8000 \
  --dir /mnt/nvme/loraplex
```

Point clients at localhost:9090 instead of localhost:8000. Base model requests pass through unchanged. LoRA adapter requests trigger loraplex to fetch and store the adapter files, then proxy to vLLM, which loads them from the shared directory.

```
curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-org/your-lora-adapter",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'
```

For a multi-node cluster, pass --self and --peers on each node:

```
# Node 1
./bin/loraplex --listen :9090 --self 10.0.0.1:9090 \
  --vllm-url http://localhost:8000 \
  --peers 10.0.0.1:9090,10.0.0.2:9090,10.0.0.3:9090

# Node 2
./bin/loraplex --listen :9090 --self 10.0.0.2:9090 \
  --vllm-url http://localhost:8000 \
  --peers 10.0.0.1:9090,10.0.0.2:9090,10.0.0.3:9090
```

In K8s, loraplex discovers peers automatically via the Endpoints API, so you don't need to list peers manually. See examples/k8s/ for full manifests including RBAC, headless Service, and Deployment.

```
kubectl apply -f examples/k8s/manifests.yaml
```

Key requirements:
- A headless Service (clusterIP: None) selecting loraplex pods so the Endpoints API lists their IPs.
- A ServiceAccount with RBAC permission to get Endpoints in the namespace.
- Both the vLLM and loraplex containers must share the storage directory via an emptyDir volume.
- The vLLM environment variables (VLLM_ALLOW_RUNTIME_LORA_UPDATING, VLLM_PLUGINS, VLLM_LORA_RESOLVER_CACHE_DIR) must be set on the vLLM container.

Use --config to load a YAML file instead of CLI flags. See config.example.yaml for all options. CLI flags override config file values.

loraplex and vLLM share a directory on disk. loraplex writes adapter files there, and vLLM's filesystem resolver reads from there. loraplex does not modify requests to vLLM or inject paths. It ensures the files exist before proxying, and vLLM's resolver independently discovers them.
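That shared-directory contract can be sketched in a few lines of Python. This is illustrative only: loraplex itself is written in Go, and the function name and the HuggingFace-only fetch path below are assumptions made for the example.

```python
# Illustrative sketch of the shared-directory contract (names are made up;
# loraplex's real implementation is Go). The proxy's job at this layer is only
# to make sure the adapter directory exists before the request is forwarded.
from pathlib import Path
from huggingface_hub import snapshot_download

SHARED_DIR = Path("/mnt/nvme/loraplex")  # same path as VLLM_LORA_RESOLVER_CACHE_DIR


def ensure_adapter_on_disk(adapter: str) -> Path:
    """Fetch adapter files from HuggingFace on a miss; no-op if already cached."""
    target = SHARED_DIR / adapter
    if (target / "adapter_config.json").exists():
        return target  # already on disk, nothing to fetch
    # Download just the PEFT adapter files vLLM's resolver looks for.
    snapshot_download(
        repo_id=adapter,
        local_dir=target,
        allow_patterns=["adapter_config.json", "adapter_model*.safetensors"],
    )
    return target


# After this returns, the request can be proxied unchanged; vLLM's
# lora_filesystem_resolver finds the files under the same directory.
ensure_adapter_on_disk("your-org/your-lora-adapter")
```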
```
loraplex writes ──▶ /mnt/nvme/loraplex/{adapter}/adapter_config.json
                    /mnt/nvme/loraplex/{adapter}/adapter_model.safetensors
vLLM reads      ◀── VLLM_LORA_RESOLVER_CACHE_DIR=/mnt/nvme/loraplex
```

vLLM manages adapters in GPU slots and CPU memory with its own LRU cache (--max-loras, --max-cpu-loras). loraplex manages the layer below that: files on disk. It ensures adapter files are present in the shared directory before proxying requests to vLLM, handles on-demand fetching from remote origins, bounds disk usage with LRU eviction, and routes requests to the node that already has the adapter stored.

```
Client request (model: "acme/summarizer-lora")
  │
  ▼
loraplex (:9090)
  ├─ base model? ──▶ passthrough to vLLM (no storage needed)
  └─ consistent hash ──▶ this node owns it?
        │ no  ──▶ forward to owner node (owner runs the same flow)
        │ yes ──▶ ensure adapter files exist in shared dir
        │           ├─ on disk      ──▶ proxy to vLLM
        │           └─ not on disk  ──▶ fetch from HuggingFace/S3,
        │                               write to shared dir,
        │                               proxy to vLLM
        ▼
vLLM (:8000)
  filesystem resolver finds adapter in shared dir
  loads adapter weights into CPU/GPU memory
  runs inference
```

The diagram above shows the default behavior, where hash_on is set to model. When hash_on uses a request header (e.g., header:X-Session-ID), the same consistent hashing applies to all requests, including base model requests. No adapter files are fetched for base model requests, but the routing still pins the request to a deterministic node. This is useful for workloads where landing on the same vLLM instance matters, like multi-turn conversations that benefit from vLLM's prefix cache.
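To show why a given adapter name or header value always lands on the same node, here is a small consistent-hash ring with virtual nodes in Python; it is an illustrative stand-in under those assumptions, not loraplex's actual routing code.

```python
# Toy consistent-hash ring: a key maps to the nearest clockwise virtual node,
# so the same key always resolves to the same node, and adding or removing a
# node only remaps the keys that fell on that node's slice of the ring.
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=64):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._owners = [n for _, n in ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def owner(self, key: str) -> str:
        # Wrap around to the first virtual node when the key hashes past the end.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[idx]


ring = HashRing(["10.0.0.1:9090", "10.0.0.2:9090", "10.0.0.3:9090"])
# Default routing key: the adapter name from the request's "model" field.
print(ring.owner("acme/summarizer-lora"))
# With hash_on set to a header, the header value becomes the key instead.
print(ring.owner("session-1234"))
```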

This analysis was produced by the Genesis Park editorial team with AI assistance. The original article is available via the source link.
