An Essential Language for Orchestrating AI Workloads: Terradev CLI v4.0.11

hackernews | 🔧 Dev Tools
#llama #openai #perplexity #hardware/semiconductors #ai deals #anthropic #ci/cd #claude
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Terradev is a cross-cloud compute optimization platform that compresses and distributes datasets, provisioning instances 3-5x faster than the sequential approach. You can configure a range of cloud providers such as RunPod, AWS, and VastAI to compare prices and pick the best GPU. It fixes the topology problems that go unoptimized by default to get the most out of your GPUs, and your API keys are stored safely on your local machine.

Full Text

Cross-Cloud Compute Optimization Platform with Migration & Evaluation

Terradev is a cross-cloud compute-provisioning CLI that compresses + stages datasets, provisions optimal instances + nodes, and deploys 3-5x faster than sequential provisioning.

🎯 Production-Grade Automation: Triggers, Environments & Lineage

The three critical missing pillars that transform Terradev from a CLI tool into an enterprise-grade ML platform:

- Zero-touch automation: Dataset lands → auto-train, Model drifts → auto-retrain
- Schedule-based: Cron jobs for weekly evaluations and maintenance (see the sketch after this list)
- Condition-based: Drift scores, performance thresholds, cost limits
- 19-Provider Support: Works across all cloud providers
- Manual override: Full control when needed
- Dev → Staging → Prod: Proper lifecycle management
- Approval workflow: Request → Approve → Execute with audit trail
- Environment isolation: Separate artifacts and configurations
- Promotion history: Complete audit trail for compliance
- Automatic lineage: Links artifacts across environments
- Zero manual tagging: Automatic artifact tracking on every execution
- Complete provenance: Data → Model → Deployment chain
- Execution diffing: Compare any two pipeline runs
- Compliance export: JSON/CSV for auditors and regulators
- Checkpoint tracing: Work backwards from any artifact
- Smart auto-selection: Training → on-demand, Inference → spot
- Cost transparency: Real-time savings calculations (60-80%)
- Manual override: --spot and --on-demand flags
- Safety features: Automatic state checkpointing and recovery
- Model Evaluation: terradev eval --model model.pth --dataset test.json
- Endpoint Testing: terradev eval --endpoint http://localhost:8000 --metrics latency
- Baseline Comparison: Automatic improvement/regression detection
- A/B Model Testing: Side-by-side comparison with winner determination
- Multiple Metrics: Accuracy, perplexity, latency, throughput, cost
- Train → Eval → Deploy: Full workflow now supported
- Risk Assessment: Confidence scoring and migration warnings
- Cost Optimization: Multi-hop data transfer routing
- Production Planning: Detailed downtime and cost estimates
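
Where the schedule-based triggers mention weekly evaluations, the same effect can be approximated with plain cron and the eval command shown above — a minimal sketch, not Terradev's own trigger mechanism, with a hypothetical log path:

```bash
# Illustrative only: a plain crontab entry that runs the evaluation command shown above
# every Monday at 02:00. Terradev's schedule-based triggers are configured through the
# CLI itself; this is just the generic cron equivalent. The log path is a placeholder.
0 2 * * 1 terradev eval --model model.pth --dataset test.json >> "$HOME/terradev-eval.log" 2>&1
```
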
Critical Provider Bug Fixes

Fixed 6 critical bugs across 20 cloud providers:

- 🔴 Alibaba - Fixed missing return in get_instance_quotes (prevented quotes)
- 🔴 RunPod - Fixed dead code + volume_id NameError in provisioning
- 🔴 TensorDock - Fixed info["model"] KeyError (should be info["v0Name"])
- 🔴 Hetzner - Fixed quote["server_id"] KeyError (should be quote["instance_type"])
- 🔴 GCP - Fixed lambda closure bug in zone availability checking
- 🔴 CoreWeave - Fixed $0.00 pricing when no API key configured

Complete SGLang Optimization Stack (v4.0.8)

Revolutionary workload-specific auto-optimization for SGLang serving with 7 workload types:

- Agentic/Multi-turn Chat: LPM + RadixAttention + cache-aware routing (75-90% cache hit rate)
- High-Throughput Batch: FCFS + CUDA graphs + FP8 quantization (maximum tokens/sec)
- Low-Latency/Real-Time: EAGLE3 + Spec V2 + capped concurrency (30-50% TTFT improvement)
- MoE Models: DeepEP auto + TBO/SBO + EPLB + redundant experts (up to 2x throughput)
- PD Disaggregated: Separate prefill/decode configurations with production optimizations
- Structured Output/RAG: xGrammar + FSM optimization (10x faster structured output)
- Hardware-Specific: H100/H200, H20, GB200, AMD MI300X optimizations

```bash
# Auto-optimize any model for workload type
terradev sglang optimize deepseek-ai/DeepSeek-V3

# Detect workload from description
terradev sglang detect meta-llama/Llama-2-7b-hf --user-description "Real-time API"

# Multi-replica cache-aware routing
terradev sglang router meta-llama/Llama-2-7b-hf --dp-size 8
```

- Agentic Chat: 1.9x throughput with multi-replica, 95-98% GPU utilization
- Batch Inference: Maximum tokens/second with pre-compiled CUDA graphs
- Low Latency: 30-50% TTFT improvement, 20-40% TPOT improvement
- MoE Models: Up to 2x throughput with Two-Batch Overlap
- Cache-Aware Routing: 3.8x higher cache hit rate
- H100/H200: FlashInfer + FP8 KV cache optimization
- H20: FA3 + MoE→QKV→FP8 stacking + swapAB runner
- GB200 NVL72: Rack-scale TP + NUMA-aware placement
- AMD MI300X: Triton backend + ROCm EPLB tuning

Performance and scalability improvements for enterprise deployments. Revolutionary passive CUDA Graph optimization automatically analyzes and optimizes GPU topology for maximum graph performance:

```bash
# Automatic CUDA Graph optimization - no configuration needed
terradev provision -g H100 -n 4

# NUMA-aware endpoint selection happens automatically
# CUDA Graph compatibility is detected passively
# Warm pool prioritizes graph-compatible models
```

- 2-5x speedup for CUDA Graph workloads with optimal NUMA topology
- 30-50% bandwidth penalty eliminated through automatic GPU/NIC alignment
- Zero configuration - everything runs passively in the background
- Model-aware optimization - different strategies for transformers vs MoE models
- PIX (Same PCIe Switch): Optimal for CUDA Graphs (1.0 score)
- PXB (Same Root Complex): Very good (0.8 score)
- PHB (Same NUMA Node): Good (0.6 score)
- SYS (Cross-Socket): Poor for graphs (0.3 score)
- Transformers: Highest priority (0.9 base score) - benefit most from graphs
- CNNs: Moderate priority (0.7 base score) - benefit moderately
- MoE Models: Lower priority (0.4 base score) - dynamic routing challenges
- Auto-detection: Model types identified automatically from model IDs
- Passive Analysis: Runs automatically every 5 minutes
- Warm Pool Enhancement: CUDA Graph models get higher priority
- Endpoint Selection: Routes to NUMA-optimal endpoints automatically
- Performance Tracking: Monitors graph capture time and replay speedup

Install the CLI:

```bash
pip install terradev-cli
```

For all cloud provider SDKs and ML integrations:

```bash
pip install terradev-cli[all]
```

Verify and list commands:

```bash
terradev --help
```

Terradev supports 19 GPU cloud providers. Start with one; RunPod is the fastest to set up:

```bash
terradev setup runpod --quick
```

This shows you where to get your API key. Then configure it:

```bash
terradev configure --provider runpod
```

Paste your API key when prompted. It's stored locally at ~/.terradev/credentials.json and never sent to a Terradev server. Add more providers later:

```bash
terradev configure --provider vastai
terradev configure --provider lambda_labs
terradev configure --provider aws
```

The more providers you configure, the better your price coverage. Check pricing across every provider you've configured:

```bash
terradev quote -g A100
```

Output is a table sorted cheapest-first: price/hour, provider, region, spot vs. on-demand. Try different GPUs:

```bash
terradev quote -g H100
terradev quote -g L40S
terradev quote -g RTX4090
```

Most clouds hand you GPUs with suboptimal topology by default. Your GPU and NIC end up on different NUMA nodes, RDMA is disabled, and the kubelet Topology Manager is set to none. That's a 30-50% bandwidth penalty on every distributed operation, and you'll never see it in nvidia-smi.
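
Plain nvidia-smi output won't surface this, but the placement is visible through standard NVIDIA and Linux tooling if you want to spot-check a box yourself. Neither command below is a Terradev command, and the PCI address is a placeholder:

```bash
# Show GPU-to-GPU and GPU-to-NIC link types (PIX/PXB/PHB/SYS) plus CPU/NUMA affinity
nvidia-smi topo -m

# List GPU PCI addresses, then read the NUMA node for one of them.
# The address below is a placeholder; substitute one from the query output.
nvidia-smi --query-gpu=pci.bus_id --format=csv
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
```
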
When you provision through Terradev, topology optimization is automatic:

```bash
terradev provision -g H100 -n 4 --parallel 6
```

What happens behind the scenes:

- NUMA alignment — GPU and NIC forced to the same NUMA node
- GPUDirect RDMA — nvidia_peermem loaded, zero-copy GPU-to-GPU transfers
- CPU pinning — static CPU manager policy, no core migration
- SR-IOV — virtual functions created per GPU for isolated RDMA paths
- NCCL tuning — InfiniBand enabled, GDR_LEVEL=PIX, GDR_READ=1 (an approximate env-var sketch appears at the end of this section)

You don't configure any of this. It's applied automatically. To preview the plan without launching:

```bash
terradev provision -g A100 -n 2 --dry-run
```

To set a price ceiling:

```bash
terradev provision -g A100 --max-price 2.50
```

Option A — Run a command on your provisioned instance:

```bash
terradev execute -i -c "nvidia-smi"
terradev execute -i -c "python train.py"
```

Option B — One command that provisions, deploys a container, and runs:

```bash
terradev run --gpu A100 --image pytorch/pytorch:latest -c "python train.py"
```

Option C — Keep an inference server alive:

```bash
terradev run --gpu H100 --image vllm/vllm-openai:latest --keep-alive --port 8000
```

Managing instances and spend:

```bash
# See all running instances and current cost
terradev status --live

# Stop (keeps allocation)
terradev manage -i -a stop

# Restart
terradev manage -i -a start

# Terminate and release
terradev manage -i -a terminate

# View spend over the last 30 days
terradev analytics --days 30

# Find cheaper alternatives for running instances
terradev optimize
```

Now that your nodes have correct topology, distributed training actually runs at full bandwidth:

```bash
# Validate GPUs, NCCL, RDMA, and drivers before launching
terradev preflight

# Launch training on the nodes you just provisioned
terradev train --script train.py --from-provision latest

# Watch GPU utilization and cost in real time
terradev monitor --job my-job

# Check status
terradev train-status

# List checkpoints when done
terradev checkpoint list --job my-job
```

The --from-provision latest flag auto-resolves IPs from your last provision command. Supports torchrun, DeepSpeed, Accelerate, and Megatron.
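
For reference, the NCCL tuning applied at provision time (the "What happens behind the scenes" list above) corresponds roughly to the following standard NCCL environment variables. This is an approximation for orientation only — Terradev sets its own values internally, and the exact variables may differ:

```bash
# Approximate NCCL environment implied by the topology bullets above (illustrative only)
export NCCL_IB_DISABLE=0         # keep the InfiniBand/RDMA transport enabled
export NCCL_NET_GDR_LEVEL=PIX    # allow GPUDirect RDMA when GPU and NIC share a PCIe switch
export NCCL_NET_GDR_READ=1       # allow GPUDirect reads as well as writes
```
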
If you're serving a model with vLLM, there are six settings most teams leave at defaults — and each one costs throughput:

| Knob | Default | Optimized | Impact |
|---|---|---|---|
| max-num-batched-tokens | 2048 | 16384 | 8x throughput |
| gpu-memory-utilization | 0.90 | 0.95 | 5% more VRAM |
| max-num-seqs | 256/1024 | 512-2048 | Prevent queuing |
| enable-prefix-caching | OFF | ON | Free throughput win |
| enable-chunked-prefill | OFF | ON | Better prefill |
| CPU Cores | 2 + #GPUs | Optimized | Prevent starvation |

Auto-tune all six from your workload profile:

```bash
terradev vllm auto-optimize -s workload.json -m meta-llama/Llama-2-7b-hf -g 4
```

Or analyze a running server:

```bash
terradev vllm analyze -e http://localhost:8000
```

Benchmark:

```bash
terradev vllm benchmark -e http://localhost:8000 -c 10
```
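
If you prefer to set these knobs by hand instead of using the auto-optimizer, the first five rows map onto upstream vLLM server flags — the flag names below are vLLM's, not Terradev's, and the values are just the starting points from the table rather than universal optima:

```bash
# Manually launching a vLLM OpenAI-compatible server with the tuned values from the table.
# Flag names come from upstream vLLM; adjust the numbers to your model and GPU memory.
vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1024 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --tensor-parallel-size 4
```

The sixth row (CPU cores) is a property of the host the server runs on rather than a vLLM flag.
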
For large Mixture-of-Experts models (GLM-5, Qwen 3.5, DeepSeek V4), Terradev's MoE templates ship with every optimization auto-applied — KV cache offloading, speculative decoding, sleep mode, expert load balancing:

```bash
terradev provision --task clusters/moe-template/task.yaml \
  --set model_id=Qwen/Qwen3.5-397B-A17B
```

Or a smaller model:

```bash
terradev provision --task clusters/moe-template/task.yaml \
  --set model_id=Qwen/Qwen3.5-122B-A10B --set tp_size=4 --set gpu_count=4
```

What's auto-applied (no flags needed):

- KV cache offloading — spills to CPU DRAM, up to 9x throughput
- MTP speculative decoding — up to 2.8x faster generation
- Sleep mode — idle models hibernate to CPU RAM, 18-200x faster than cold restart
- Expert load balancing — rebalances routing at runtime
- LMCache — distributes KV cache across instances via Redis

Disaggregated prefill/decode (P/D) serving separates inference into two GPU pools optimized for each phase:

- Prefill (compute-bound) — processes the input prompt, wants high FLOPS
- Decode (memory-bound) — generates tokens, wants high HBM bandwidth

The KV cache transfers between them via NIXL — zero-copy GPU-to-GPU over RDMA. This is why getting the NUMA topology right in Step 4 matters: NIXL only runs at full speed when the GPU and NIC share a PCIe switch.

```bash
terradev ml ray --deploy-pd \
  --model zai-org/GLM-5-FP8 \
  --prefill-tp 8 --decode-tp 1 --decode-dp 24
```

Terradev's inference router automatically uses sticky routing: once a prefill GPU hands off a KV cache to a decode GPU, future requests with the same prefix go to that same decode GPU, avoiding redundant transfers.

For production, create a topology-optimized K8s cluster:

```bash
terradev k8s create my-cluster --gpu H100 --count 8 --prefer-spot
```

This auto-configures Karpenter NodePools with a NUMA-aligned kubelet Topology Manager, GPUDirect RDMA, and PCIe locality enforcement.

```bash
# List clusters
terradev k8s list

# Get cluster info
terradev k8s info my-cluster

# Tear down
terradev k8s destroy my-cluster
```

Each step builds on the one before it:

- Step 4: NUMA / RDMA / SR-IOV topology ← foundation
- Step 8: Distributed training at full BW ← depends on topology
- Step 9: vLLM knob tuning ← depends on correct memory layout
- Step 10: KV cache offloading + sleep mode ← depends on CPU bus not saturated
- Step 11: Disaggregated P/D ← depends on RDMA for KV transfer

If the provisioning layer is wrong, every optimization above it underperforms. A disaggregated P/D setup with a cross-NUMA KV transfer is slower than a monolithic setup with correct topology. Terradev handles the foundation automatically so the rest of the stack works the way it's supposed to.

Example workflows:

```bash
#!/bin/bash
# Complete LLM deployment workflow

# 1. Find cheapest GPU
terradev quote -g A100 --quick

# 2. Provision with auto-optimization
terradev provision -g A100 -n 2 --parallel 4

# 3. Deploy optimized vLLM
terradev ml vllm --start --instance-ip $(terradev status --json | jq -r '.[0].ip') \
  --model meta-llama/Llama-2-7b-hf --tp-size 2

# 4. Set up monitoring
terradev monitor --endpoint llama-api --live

# 5. Add customer adapter
terradev lora add -e http://$(terradev status --json | jq -r '.[0].ip'):8000 \
  -n customer-a -p ./adapters/customer-a
```

```bash
#!/bin/bash
# GLM-5 production deployment

# 1. Deploy MoE cluster
terradev provision --task clusters/moe-template/task.yaml \
  --set model_id=zai-org/GLM-5-FP8 --set tp_size=8

# 2. Deploy monitoring
terradev k8s monitoring-stack --cluster glm-5-cluster

# 3. Set up warm pool for bursty traffic
terradev ml warm-pool --configure --strategy traffic_based --max-warm-models 5 --endpoint glm-5-api

# 4. Test failover
terradev inferx failover --endpoint glm-5-api --test-load 5000
```

```bash
#!/bin/bash
# Production deployment with cold start failover and multi-tenant LoRA adapters

echo "🚀 Deploying InferX + LoRA Hybrid Inference Service"

# 1. Deploy baseline reserved GPUs for steady traffic
echo "📍 Step 1: Provision reserved baseline capacity"
terradev provision -g H100 -n 2 --parallel 4 \
  --tag baseline-llm \
  --max-price 2.50

BASELINE_IP=$(terradev status --json | jq -r '.[] | select(.tags[] | contains("baseline-llm")) | .ip' | head -1)

# 2. Deploy optimized vLLM with LoRA support on baseline
echo "📍 Step 2: Deploy vLLM with LoRA adapter support"
terradev ml vllm --start \
  --instance-ip $BASELINE_IP \
  --model meta-llama/Llama-2-7b-hf \
  --tp-size 2 \
  --enable-lora \
  --enable-kv-offloading \
  --enable-sleep-mode \
  --port 8000

# 3. Load customer-specific LoRA adapters
echo "📍 Step 3: Load multi-tenant LoRA adapters"
terradev lora add -e http://$BASELINE_IP:8000 \
  -n customer-enterprise-a \
  -p ./adapters/customer-enterprise-a
terradev lora add -e http://$BASELINE_IP:8000 \
  -n customer-startup-b \
  -p ./adapters/customer-startup-b
terradev lora add -e http://$BASELINE_IP:8000 \
  -n customer-internal \
  -p ./adapters/customer-internal

# 4. Configure InferX for cold start and burst handling
echo "📍 Step 4: Configure InferX for serverless burst capacity"
terradev inferx deploy \
  --endpoint burst-llm-api \
  --model-id meta-llama/Llama-2-7b-hf \
  --baseline-endpoint http://$BASELINE_IP:8000 \
  --cold-start-threshold 100 \
  --burst-capacity 10 \
  --failover-strategy active-passive

# 5. Set up intelligent routing with semantic awareness
echo "📍 Step 5: Configure semantic routing for multi-tenant requests"
cat >
```
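
Once the baseline vLLM server from the last script is up, a quick smoke test against its OpenAI-compatible API is a reasonable sanity check. The endpoint shape below is standard vLLM rather than a Terradev command, and addressing a LoRA adapter by its name assumes the server was started with LoRA support as in the script above:

```bash
# Smoke-test the base model through vLLM's OpenAI-compatible API
curl http://$BASELINE_IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 16}'

# Route a request to one tenant's adapter by passing the LoRA adapter name as the model
curl http://$BASELINE_IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "customer-enterprise-a", "prompt": "Hello", "max_tokens": 16}'
```
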

This analysis was written by the Genesis Park editorial team with the help of AI. The original post is available via the source link.
