Cutting Token Costs for Claude Code, Cursor, and Codex

hackernews | 📦 Open Source
#claude #claude code #cursor #gemini #gpt-4 #mcp server #openai #review #context compression #token cost savings
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Entroly is a context engine that compresses an entire codebase at variable resolution, letting AI coding tools reference every file at once instead of the usual 5-10 while cutting API token usage by 70-95%. Using information theory and a knapsack algorithm, it compresses files into three tiers by importance: full code, signatures, and references. On key benchmarks it demonstrated 100% retention of information-retrieval and code-generation accuracy. A single install command and a proxy setting apply it immediately to Claude Code, Cursor, Copilot, and other tools, and it runs locally with under 10ms of overhead, optimizing safely without sending private data off the machine.

Full Text

The Token-Saving MCP Server & Context Compression Engine

Stop paying for useless LLM tokens. Entroly is a zero-config Context Engine (with native HTTP proxy support) that compresses codebase context, reducing Claude, Cursor, and OpenAI API costs by 80% without losing visibility.

```
pip install entroly && entroly go              # Python
npm install entroly-wasm && npx entroly-wasm   # Node.js
```

Problem

Every AI coding tool — Cursor, Claude Code, GitHub Copilot, Windsurf, Cody — has the same fatal flaw: your AI can only see 5-10 files at a time. The other 95% of your codebase is invisible. This causes:

- Hallucinated function calls — the AI invents APIs that don't exist
- Broken imports — it references modules it can't see
- Missed dependencies — it changes auth.py without knowing about auth_config.py
- Wasted tokens — raw-dumping files burns your budget on boilerplate and duplicates
- Wrong answers — without full context, even GPT-4/Claude give incomplete solutions

You've felt this. You paste code manually. You write long system prompts. You pray it doesn't hallucinate. There's a better way.

Solution

Entroly compresses your entire codebase into the context window at variable resolution.

| What changes | Before Entroly | After Entroly |
|---|---|---|
| Files visible to AI | 5-10 files | All files (variable resolution) |
| Tokens per request | 186,000 (raw dump) | 9,300-55,000 (70-95% reduction) |
| Cost per 1K requests | ~$560 | $28-$168 |
| AI answer quality | Incomplete, hallucinated | Correct, dependency-aware |
| Setup time | Hours of prompt engineering | 30 seconds |
| Overhead | N/A | < 10ms |

Critical files appear in full. Supporting files appear as signatures. Everything else appears as references. Your AI sees the whole picture — and you pay 70-95% less.

| | RAG (vector search) | Entroly (context engineering) |
|---|---|---|
| What it sends | Top-K similar chunks | Entire codebase at optimal resolution |
| Handles duplicates | No — sends same code 3x | SimHash dedup in O(1) |
| Dependency-aware | No | Yes — auto-includes related files |
| Learns from usage | No | Yes — RL optimizes from AI response quality |
| Needs embeddings API | Yes (extra cost + latency) | No — runs locally |
| Optimal selection | Approximate | Mathematically proven (knapsack solver) |

Demo

```
pip install entroly && entroly demo   # see savings on YOUR codebase
```

Open the interactive demo for the animated experience.

Install

Python:

```
pip install entroly[full]
entroly go
```

Node.js / TypeScript:

```
npm install entroly-wasm
npx entroly-wasm serve      # MCP server
npx entroly-wasm optimize   # CLI optimizer
npx entroly-wasm demo       # see savings on YOUR codebase
```

The WASM package runs the full Rust engine in Node.js via WebAssembly — no Python required.

That's it. entroly go (Python) or npx entroly-wasm serve (Node.js) auto-detects your IDE, starts the engine, and begins optimizing. Point your AI tool to http://localhost:9377/v1.
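Any tool that accepts an OpenAI-compatible base URL can be pointed at the proxy. A minimal sketch with the official openai Python client (the model name and key are placeholders; which models are available depends on your upstream provider):

```python
from openai import OpenAI

# Route requests through the local Entroly proxy instead of the provider's API.
# Entroly compresses the context, then forwards the request upstream.
client = OpenAI(
    base_url="http://localhost:9377/v1",  # Entroly's OpenAI-compatible endpoint
    api_key="sk-...",                     # your real upstream key (placeholder)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain the auth module"}],
)
print(response.choices[0].message.content)
```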
```
# Python
pip install entroly                # core engine
entroly init                       # detect IDE + generate config
entroly proxy --quality balanced   # start proxy

# Node.js
npm install entroly-wasm           # WASM engine, zero dependencies
npx entroly-wasm serve             # start MCP server
```

| Package | What you get |
|---|---|
| npm install entroly-wasm | Full Rust engine via WebAssembly — MCP server, CLI, autotune, health |
| npm install @ebbiforge/entroly-mcp | Bridge to Python engine (requires pip install entroly) |

| Package | What you get |
|---|---|
| pip install entroly | Core — MCP server + Python engine |
| pip install entroly[proxy] | + HTTP proxy mode |
| pip install entroly[native] | + Rust engine (50-100x faster) |
| pip install entroly[full] | Everything |

Docker:

```
docker pull ghcr.io/juyterman1000/entroly:latest
docker run --rm -p 9377:9377 -p 9378:9378 -v .:/workspace:ro ghcr.io/juyterman1000/entroly:latest
```

Integrations

| AI Tool | Setup | Method |
|---|---|---|
| Cursor | entroly init | MCP server |
| Claude Code | claude mcp add entroly -- entroly | MCP server |
| VS Code + Copilot | entroly init | MCP server |
| Windsurf | entroly init | MCP server |
| Cline | entroly init | MCP server |
| OpenClaw | See below | Context Engine |
| Cody | entroly proxy | HTTP proxy |
| Any LLM API | entroly proxy | HTTP proxy |

"I stopped manually pasting code into Claude. Entroly just works."

- Zero config — entroly go handles everything. No YAML, no embeddings, no prompt engineering.
- Instant results — see the difference on your first request. No training period.
- Privacy-first — everything runs locally. Your code never leaves your machine.
- Battle-tested — 436 tests, crash recovery, connection auto-reconnect, cross-platform file locking.
- Built-in security — 55 SAST rules catch hardcoded secrets, SQL injection, and command injection across 8 CWE categories.
- Codebase health grades — clone detection, dead code finder, god file detection. Get an A-F grade.

When developers search for "token saving proxy" or "context compression", Entroly offers distinct advantages over standard alternatives:

| Feature | Entroly | Basic Proxies |
|---|---|---|
| Setup | Zero-config (entroly go) | Requires YAML/embedding setup |
| Codebase intelligence | Deep (dead code, god files) | Proxy transport only |
| Security | 55 SAST rules (catches hardcoded secrets) | None built in |
| Savings strategy | Information-theoretic knapsack (retains 100% visibility) | Standard reduction techniques |
| Primary use case | Context compression for AI agents | Basic token reduction |

OpenClaw users get the deepest integration — Entroly plugs in as a Context Engine:

| Agent Type | What Entroly Does | Token Savings |
|---|---|---|
| Main agent | Full codebase at variable resolution | ~95% |
| Heartbeat | Only loads changes since last check | ~90% |
| Subagents | Inherited context + Nash bargaining budget split (see the sketch below) | ~92% |
| Cron jobs | Minimal context — relevant memories + schedule | ~93% |
| Group chat | Entropy-filtered messages — only high-signal kept | ~90% |

```python
from entroly.context_bridge import MultiAgentContext

ctx = MultiAgentContext(workspace_path="~/.openclaw/workspace")
ctx.ingest_workspace()
sub = ctx.spawn_subagent("main", "researcher", "find auth bugs")
```
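The "Nash bargaining budget split" row above refers to dividing a shared token budget among subagents. As a rough illustration only (not Entroly's actual solver; the agent names and minimum needs are invented), the symmetric Nash solution with linear utilities gives each agent its minimum viable context plus an equal share of the surplus:

```python
def nash_bargain_budgets(total_budget: int, min_needs: dict[str, int]) -> dict[str, int]:
    """Symmetric Nash bargaining over a divisible token budget.

    Maximizing the product of surpluses (allocation_i - min_need_i),
    subject to allocations summing to the total budget, gives every
    agent its minimum plus an equal share of what is left over.
    """
    surplus = total_budget - sum(min_needs.values())
    assert surplus >= 0, "budget cannot cover every agent's minimum context"
    share = surplus // len(min_needs)
    return {agent: need + share for agent, need in min_needs.items()}

# Hypothetical minimum context sizes for three subagents.
print(nash_bargain_budgets(128_000, {"researcher": 20_000, "coder": 40_000, "reviewer": 10_000}))
# -> {'researcher': 39333, 'coder': 59333, 'reviewer': 29333}
```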
Does compression hurt accuracy? We proved it doesn't. Entroly dynamically compresses context without losing the information your LLM needs. We measure accuracy retention across industry-standard benchmarks:

| Benchmark | What it tests | Baseline | Entroly | Retention |
|---|---|---|---|---|
| NeedleInAHaystack | Info retrieval from long context | 100% | 100% | 100% |
| HumanEval | Code generation | 13.3% | 13.3% | 100% |
| GSM8K | Math reasoning | 86.7% | 80.0% | 92% |
| SQuAD 2.0 | Reading comprehension | 93.3% | 86.7% | 92% |

Results are fully validated on rigorous token budgets via bench/accuracy.py. Note: extensive testing confirmed that retention holds across both "mid" and "mini" model tiers (e.g., gpt-4o-mini, gemini-1.5-flash).

| Benchmark | Status inside bench/accuracy.py | Validated Results (gpt-4o-mini) |
|---|---|---|
| NeedleInAHaystack | Implemented | 100% retention |
| HumanEval | Implemented | 100% retention |
| GSM8K | Implemented | 92% retention |
| SQuAD 2.0 | Implemented | 92% retention |

```
pip install entroly[full] matplotlib

# Export your API key
export OPENAI_API_KEY="sk-..."

# Run the full validation suite
python -m bench.accuracy --benchmark all --model gpt-4o-mini --samples 15

# Generate the NeedleInAHaystack heatmap
python -m bench.needle_heatmap --model gpt-4o-mini
```

Architecture

| Stage | What | Result |
|---|---|---|
| 1. Ingest | Index codebase, build dependency graph, fingerprint fragments | Complete map in <2s |
| 2. Score | Rank by information density — high-value code up, boilerplate down | Every fragment scored |
| 3. Select | Mathematically optimal subset fitting your token budget | Proven optimal (knapsack; see the sketch below) |
| 4. Deliver | 3 resolution levels: full → signatures → references | 100% coverage |
| 5. Learn | Track which context produced good AI responses | Gets smarter over time |

"If you compress my codebase by 80%, how do I know you didn't strip the code my AI actually needs?"

Fair question. Here's the honest answer: Entroly never "strips" code from files the LLM needs. It uses three resolution levels:

| Resolution | What the LLM sees | When used |
|---|---|---|
| Full (100%) | Complete source code — every line, every comment | Files that directly match your query |
| Signatures | Function/class signatures with types + docstrings | Tangential imports your query doesn't target |
| Reference | File path + 1-line summary | Files the LLM should know exist, but doesn't need to read |

Critical guarantee: if you ask about worker.ts, the LLM gets the complete worker.ts. The savings come from compressing node_modules/lodash/fp.js to a signature and README.md to a reference — files you'd never paste manually anyway.

Every optimized request includes a visible report inside the LLM context:

```
[Entroly: worker.ts (Full), schema.prisma (Full), types.ts (Full), 8 files (Signatures only), 12 files (Reference only). 8,777 tokens. GET /explain for details.]
```

Your AI sees this. You can see this. No hidden truncation. After any request, call GET localhost:9377/explain to see:

- Included — every included file with its resolution level and why it was included
- Excluded — every excluded file and why it was dropped
- Summary — exact resolution breakdown (e.g., 5 Full, 8 Skeleton, 12 Reference)
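To make the Select stage's "proven optimal (knapsack)" claim concrete, here is a minimal sketch of the underlying technique: a multiple-choice knapsack in which every file must appear at exactly one resolution (full, signatures, or reference) and the solver maximizes total information value within a token budget. This is an illustration, not Entroly's actual code; the file names, scores, and token costs are invented:

```python
# Per file: (token_cost, information_value) at full / signatures / reference.
FILES = {
    "worker.ts":     [(4000, 9.0), (600, 4.0), (40, 0.5)],
    "schema.prisma": [(2500, 7.0), (400, 3.5), (40, 0.5)],
    "README.md":     [(3000, 1.0), (500, 0.6), (40, 0.3)],
}
LEVELS = ["full", "signatures", "reference"]

def select_resolutions(files: dict, budget: int):
    """Multiple-choice knapsack by dynamic programming over tokens spent.

    Each DP state maps tokens-spent -> (best value, resolution choices),
    and every file contributes exactly one option, so coverage is 100%.
    """
    dp = {0: (0.0, [])}
    for name, options in files.items():
        new_dp = {}
        for used, (value, picks) in dp.items():
            for level, (cost, gain) in enumerate(options):
                spent = used + cost
                if spent <= budget and (spent not in new_dp or value + gain > new_dp[spent][0]):
                    new_dp[spent] = (value + gain, picks + [(name, level)])
        dp = new_dp
    return max(dp.values(), key=lambda state: state[0])

value, picks = select_resolutions(FILES, budget=5000)
for name, level in picks:
    print(f"{name}: {LEVELS[level]}")  # worker.ts stays full, the rest get compressed
```

Greedy density heuristics are only approximate; exhausting the DP table is what lets a tool claim the selection is provably optimal for the given scores.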
| Claim | What it actually means |
|---|---|
| 50-80% token savings | Measured across real codebases (Langfuse, VSCode). Varies by query specificity. |
| 100% code visibility | Every file in your codebase is represented at some resolution. Nothing is invisible. |
| < 10ms latency | The Rust engine adds < 10ms. Network latency to the LLM API is unchanged. |

We don't claim 95% savings because that's only achievable on trivial queries against massive codebases. Real-world savings on complex monorepo queries are 50-80%. If the ~40-token overhead of the context report bothers you:

```
export ENTROLY_CONTEXT_REPORT=0
```

"The LLM is the CPU, the context window is RAM."

| Layer | What it solves |
|---|---|
| Documentation tools | Give your agent up-to-date API docs |
| Memory systems | Remember things across conversations |
| RAG / retrieval | Find relevant code chunks |
| Entroly (optimization) | Makes everything fit — optimally compresses codebase + docs + memory into the token budget |

These layers are complementary. Entroly is the optimization layer that ensures everything fits without waste.

While Entroly was built for codebases, its core relies on Shannon entropy and knapsack mathematics, so it is completely agnostic to the text it compresses. Entroly is widely used as a universal context compressor for:

| Text Type | The Problem | How Entroly Compresses It |
|---|---|---|
| Massive server logs | 100K lines of identical INFO logs bury the one ERROR stack trace. | Drops repetitive logs (low entropy), strictly retains exceptions and novel timestamps (see the sketch after the command reference below). |
| Agent memory | Multi-agent swarms fill up the context window with conversational fluff. | Extracts only the high-signal, decision-making paragraphs to pass to the next agent. |
| Legal/financial docs | RAG systems retrieve 50 pages of PDFs, blowing the token budget. | Scans the retrieved paragraphs, isolates the exact clauses answering the query, drops the boilerplate. |

In our NeedleInAHaystack benchmark, Entroly compressed 128,000 tokens of Paul Graham essays (pure English text) down to 2,000 tokens while maintaining a 100% retrieval success rate.

| Command | What it does |
|---|---|
| entroly go | One command — auto-detect, init, proxy, dashboard |
| entroly wrap claude | Start proxy + launch Claude Code in one command |
| entroly wrap codex | Start proxy + launch Codex CLI |
| entroly wrap aider | Start proxy + launch Aider |
| entroly wrap cursor | Start proxy + print Cursor config |
| entroly demo | Before/after comparison with dollar savings on YOUR project |
| entroly dashboard | Live metrics: savings trends, health grade, PRISM weights |
| entroly doctor | 7 diagnostic checks — finds problems before you do |
| entroly health | Codebase health grade (A-F): clones, dead code, god files |
| entroly benchmark | Competitive benchmark: Entroly vs raw context vs top-K |
| entroly role | Weight presets: frontend, backend, sre, data, fullstack |
| entroly autotune | Auto-optimize engine parameters |
| entroly learn | Analyze session for failure patterns, write to CLAUDE.md |
| entroly digest | Weekly summary: tokens saved, cost reduction |
| entroly status | Check running services |

```
entroly wrap claude   # Starts proxy + launches Claude Code
entroly wrap codex    # Starts proxy + launches Codex CLI
entroly wrap aider    # Starts proxy + launches Aider
entroly wrap cursor   # Starts proxy + prints Cursor config
```

Entroly starts the proxy, sets the base URL environment variable, and launches your tool. Zero configuration.
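As promised in the server-log row above, here is a rough sketch of the entropy idea (an illustration, not Entroly's implementation): score each line by the Shannon self-information of its digit-masked template, so repeated INFO spam scores near zero bits while a one-off ERROR line scores high and survives.

```python
import math
import re
from collections import Counter

def surprisal_filter(lines: list[str], min_bits: float = 3.0) -> list[str]:
    """Keep only 'surprising' log lines, scored by Shannon self-information.

    Lines whose template is common (repeated INFO spam) score near 0 bits;
    a one-off ERROR stack frame scores high and is retained.
    """
    # Mask digits so timestamps and request ids collapse into one template.
    templates = [re.sub(r"\d+", "#", line) for line in lines]
    counts = Counter(templates)
    total = len(templates)
    # Self-information of each line's template: -log2 P(template).
    scores = [-math.log2(counts[t] / total) for t in templates]
    return [line for line, s in zip(lines, scores) if s >= min_bits]

logs = [f"INFO request ok id={i}" for i in range(999)]
logs.insert(500, "ERROR NullPointerException at auth.py:42")
print(surprisal_filter(logs))  # ['ERROR NullPointerException at auth.py:42']
```

Masking digits is what makes 100K near-identical lines register as low-information: timestamps and request IDs collapse into a single frequent template.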
Python SDK

```python
from entroly import compress

result = compress(messages, budget=50_000)
response = client.messages.create(model="claude-sonnet-4-5-20250929", messages=result)
```

Or compress any content directly:

```python
from entroly.universal_compress import universal_compress

compressed = universal_compress(huge_json_blob)  # auto-detects JSON
compressed = universal_compress(log_output)      # auto-detects logs
compressed = universal_compress(csv_data)        # auto-detects CSV
```

Content-type auto-detection routes each input to the best compressor — JSON, logs, code, CSV, XML, stacktraces, tables.

| Your setup | Add Entroly | One-liner |
|---|---|---|
| Any Python app | compress() | result = compress(messages, budget=50_000) |
| Any app (proxy) | entroly proxy | Point base URL at localhost:9377 |
| LangChain | EntrolyCompressor | chain = compressor \| llm |
| Multi-agent | MultiAgentContext | ctx = MultiAgentContext(...) |
| Claude Code | entroly wrap claude | One command |
| Codex / Aider | entroly wrap codex | One command |
| MCP tools | entroly init | Auto-config |

LangChain:

```python
from langchain_openai import ChatOpenAI
from entroly.integrations.langchain import EntrolyCompressor

llm = ChatOpenAI(model="gpt-4o")
compressor = EntrolyCompressor(budget=30000)
chain = compressor | llm
result = chain.invoke("Explain the auth module")
```

Multi-agent:

```python
from entroly.context_bridge import MultiAgentContext

ctx = MultiAgentContext(workspace_path="~/.agent/workspace", token_budget=128_000)
ctx.ingest_workspace()

# NKBE allocates budget optimally across agents
budgets = ctx.allocate_budgets(["researcher", "coder", "reviewer"])

# Spawn subagent with inherited context
sub = ctx.spawn_subagent("main", "researcher", "find auth bugs")

# Schedule cron jobs with minimal context
ctx.schedule_cron("monitor", "check error rates", interval_seconds=900)
```

Entroly never permanently discards data. When a fragment is compressed to a skeleton, the original is stored in the Compressed Context Store. The LLM can retrieve the full original on demand:

```
# List all retrievable fragments
curl localhost:9377/retrieve

# Get full original of a compressed file
curl localhost:9377/retrieve?
```
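The same listing call from Python, as a minimal sketch (the text above documents the GET /retrieve endpoint but not its response format, so this just prints the raw body):

```python
import requests

# List all fragments currently retrievable from the Compressed Context Store.
resp = requests.get("http://localhost:9377/retrieve", timeout=5)
resp.raise_for_status()
print(resp.text)
```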

This analysis was produced by the Genesis Park editorial team with the help of AI. The original can be found via the source link.
