A $500 GPU outperforms Claude Sonnet on coding benchmarks.

hackernews | 📦 Open source
#anthropic #atlas #claude #gpt-5 #livecodebench #llama #openai #test-time learning #machine learning/research #model compression #consumer gpu
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

ATLAS V3 reaches 74.6% on LiveCodeBench using a test-time learning approach on a single consumer GPU, demonstrating better cost-efficiency than commercial models such as Claude Sonnet. It applies structured generation, energy-based verification, and self-verified iterative repair to a frozen 14B-parameter model, removing the dependence on external APIs without any fine-tuning. The system runs entirely locally on an RTX 5060 Ti 16GB; inference takes longer, but it delivers competitive performance for the cost of electricity alone, with no data leaving the machine.

Body

Adaptive Test-time Learning and Autonomous Specialization (A.T.L.A.S.) achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen 14B model on a single consumer GPU -- up from 36-41% in V2 -- through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure -- structured generation, energy-based verification, self-verified repair -- and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted: no data leaves the machine, no API keys required, no usage metering. One GPU, one box.

Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)

| Benchmark | Score | Tasks | Method |
|---|---|---|---|
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair (V3 score) |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning (V2 score) |
| SciCode | 14.7% (sub-problems) | 341 | k=1, cross-domain scientific coding (V2 score) |

*pass@1-v(k=3): one solution is submitted per task, but it is generated via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation, so it is not plain pass@1. See the methodology notes below.

V3 ablation breakdown

| Condition | Configuration | Pass Rate | Delta |
|---|---|---|---|
| A | Baseline (no V3) | 54.9% | -- |
| B | + Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | + Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | + Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |

Phase 3 uses self-generated test cases for internal verification -- the model never sees the answer key during repair. PR-CoT rescues 36 of 42 tasks (85.7% of Phase 3 rescues). Full report: V3_ABLATION_STUDY.md

| System | LCB pass@1 | Est. cost/task | Notes |
|---|---|---|---|
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |

Methodology notes & sources

ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model, reported as pass@1-v(k=3). Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head. API costs assume ~2,000 input + ~4,000 output tokens per task at current pricing. ATLAS cost is electricity at $0.12/kWh (~165 W GPU, ~1h 55m for 599 tasks). ATLAS trades latency for cost -- the pipeline takes longer per task than a single API call, but no data leaves the machine.
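To make the reported pass@1-v(k=3) protocol concrete, here is a minimal Python sketch of a best-of-k-with-repair loop under the same rules: k candidates are generated, they are ranked by an energy score, candidates are checked against self-generated tests, and failures go through an iterative repair step before exactly one solution is submitted. The helper names (generate_candidates, lens_energy, run_self_tests, pr_cot_repair) are hypothetical placeholders for illustration, not the ATLAS codebase.

```python
# Illustrative sketch of a pass@1-v(k=3)-style loop: generate k candidates,
# rank them by an energy score, verify against self-generated tests only,
# and repair on failure. All helper callables are hypothetical stand-ins.

def solve_task(task, generate_candidates, lens_energy, run_self_tests,
               pr_cot_repair, k: int = 3, max_repairs: int = 2) -> str:
    # Phase 1: k diverse candidate solutions from the frozen model.
    candidates = generate_candidates(task, k=k)

    # Phase 2: rank candidates by energy score (lower = better in this sketch).
    candidates.sort(key=lambda code: lens_energy(task, code))

    # Submit the first candidate that passes the model-generated I/O pairs.
    for code in candidates:
        if run_self_tests(task, code):
            return code

    # Phase 3: every candidate failed the self-tests; iteratively repair the
    # best-ranked one via multi-perspective chain-of-thought.
    code = candidates[0]
    for _ in range(max_repairs):
        code = pr_cot_repair(task, code)
        if run_self_tests(task, code):
            break
    return code  # exactly one solution is submitted per task
```

Only the single returned solution is ever scored against the real benchmark tests, which is why the result is reported as pass@1-v(k=3) rather than pass@3.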
Sources: Artificial Analysis LCB Leaderboard | AA Benchmarking Methodology | LiveCodeBench Paper (arXiv) | LCB Dataset (HuggingFace) | Pricing: OpenAI, Anthropic, DeepSeek

```mermaid
flowchart LR
    subgraph Phase1["Phase 1: Generate"]
        PS["PlanSearch<br/>Constraint extraction<br/>+ diverse plans"]
        BF["Budget Forcing<br/>Thinking token control"]
    end
    subgraph Verify["Score + Test"]
        GL["Geometric Lens<br/>C(x) energy scoring<br/>5120-dim self-embeddings"]
        SB["Sandbox<br/>Code execution"]
    end
    subgraph Phase3["Phase 3: Repair"]
        ST["Self-Test Gen<br/>Model-generated I/O pairs"]
        PR["PR-CoT Repair<br/>Multi-perspective chain-of-thought"]
    end
    PS --> BF
    BF -->|k=3 candidates| GL
    GL -->|energy-sorted| SB
    SB -->|all fail| ST
    ST --> PR
    PR -->|repaired code| SB
    style GL fill:#2d5016,color:#fff
    style PS fill:#1a3a5c,color:#fff
    style BF fill:#1a3a5c,color:#fff
    style SB fill:#2d5016,color:#fff
    style ST fill:#5c3a1a,color:#fff
    style PR fill:#5c3a1a,color:#fff
```

A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s) and 5120-dim self-embeddings for Lens scoring (a minimal client sketch appears at the end of this section). The Geometric Lens C(x) energy field selects the best candidate (87.8% accuracy on mixed-result tasks). Failed tasks enter Phase 3, where the model generates its own test cases and iteratively repairs solutions via PR-CoT -- real tests are used only for final scoring. Full architecture: docs/ARCHITECTURE.md

Before you begin: ATLAS was developed and tested on specific hardware. Read the Hardware & Reproduction section below to check compatibility and tune variables for your setup before running.

```bash
git clone https://github.com/itigges22/ATLAS.git && cd ATLAS
cp atlas.conf.example atlas.conf   # set MODEL_PATH, DATA_DIR, GPU device
sudo ./scripts/install.sh
./scripts/verify-install.sh

# Run V3 benchmark
python3 benchmark/v3_runner.py
```

See docs/SETUP.md for full installation instructions.

| Resource | Minimum | Tested |
|---|---|---|
| GPU VRAM | 16 GB | RTX 5060 Ti 16 GB |
| System RAM | 14 GB | 16 GB |
| Python | 3.10+ | 3.11 |
| OS | RHEL 9 / Ubuntu 24 | RHEL 9 (Proxmox VM) |

Reproduction details

V3 results were produced on RHEL 9 running as a Proxmox VM with an RTX 5060 Ti 16GB passed through via VFIO.
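For a feel of the single-server setup described above (one patched llama-server providing both generation and self-embeddings), the sketch below shows how a client might call a local OpenAI-compatible llama-server for a completion and an embedding. The port, endpoint paths, and model name are assumptions for illustration; the patched ATLAS server and its configuration may differ.

```python
# Minimal client sketch, assuming a local llama-server exposing the standard
# OpenAI-compatible endpoints on port 8080. Port, paths, and model name are
# assumptions, not the project's documented interface.
import requests

BASE = "http://localhost:8080"  # hypothetical local llama-server address

def generate(prompt: str) -> str:
    resp = requests.post(f"{BASE}/v1/chat/completions", json={
        "model": "qwen3-14b",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def embed(text: str) -> list[float]:
    resp = requests.post(f"{BASE}/v1/embeddings", json={
        "model": "qwen3-14b",
        "input": text,
    })
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]  # e.g. a 5120-dim vector

if __name__ == "__main__":
    code = generate("Write a Python function that reverses a string.")
    print(len(embed(code)), "embedding dimensions")
```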

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
