AI Agent Proxy to help reduce token usage (Anthropic Only)
hackernews
📦 Open Source
#ai agent
#anthropic
#api proxy
#claude
#cost reduction
#review
#token usage
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
A proxy tool has been released that reduces token usage and cost on the Anthropic API without any changes to agent code. It caches identical requests in a local SQLite database and answers them instantly at no cost, and it provides a routing feature that automatically downgrades simple tasks to a cheaper model (Haiku) based on prompt complexity. Users only need to point their existing code at the local proxy address (localhost:8000) instead of the usual API endpoint, and a live dashboard offers detailed monitoring of call history, cost savings, and more.
Body
A lightweight, drop-in proxy for the Anthropic API that reduces token usage and cost for AI agents, with zero changes to your agent code.

When you build an agent using the Anthropic API, every call costs tokens. Repeated calls, over-specified models, and bloated context all add up. agentic-proxy sits between your agent and the Anthropic API and optimizes each request transparently:

- Caching: identical requests are served from a local cache instead of hitting the API. Cache hits cost nothing.
- Model routing: each prompt is classified as simple, moderate, or complex. Simple prompts are automatically downgraded to a cheaper model (Haiku) instead of burning Sonnet or Opus.
- Live dashboard: a real-time view of every call, token usage, cost, cache hit rate, and savings broken down by module, across sessions.

How it works

Your agent makes the same API calls it always did. It just points at localhost:8000 instead of api.anthropic.com.

```
Agent → agentic-proxy → Anthropic API
          ↓
  [cache check]     → cache hit: return immediately, $0 cost
  [model router]    → classify prompt, downgrade model if appropriate
  [forward request] → send to Anthropic, stream response back
  [store in cache]  → save for future identical requests
  [log]             → record tokens, cost, latency, routing decision
```

All of this is transparent to the agent. It sends a request and gets a response back; it has no idea any of this happened.

Installation

```
git clone https://github.com/yourname/agentic-proxy.git
cd agentic-proxy
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
```

Open .env and add your Anthropic API key:

```
ANTHROPIC_API_KEY=your_api_key_here
```

Then start the proxy:

```
python main.py
```

The proxy starts on http://localhost:8000.

Usage

The only change you need to make to your agent is the base URL:

```python
import anthropic

client = anthropic.Anthropic(
    api_key="any-value",  # proxy handles auth from your .env
    base_url="http://localhost:8000",
)
```

That's it. All your existing agent code works unchanged. While the proxy is running, open http://localhost:8000/dashboard to see live stats, cache hit rate, cost savings, complexity distribution, and a full call log, updated every 2 seconds.

Configuration

All settings are controlled via .env. Copy .env.example to get started.

| Variable | Default | Description |
|---|---|---|
| ANTHROPIC_API_KEY | — | Required. Your Anthropic API key. |
| CACHE_ENABLED | true | Enable or disable the cache module. |
| ROUTER_ENABLED | true | Enable or disable model routing. |
| LOGGER_ENABLED | true | Enable or disable the session logger. |
| CACHE_TTL_HOURS | 24 | How long cache entries live before expiring. |
| CACHE_MAX_ENTRIES | 1000 | Maximum cache entries before LRU eviction kicks in. |

Caching

Responses are stored in a local SQLite database keyed by a SHA-256 hash of the full request body. On a cache hit the response is returned immediately with no API call.

- TTL expiry: entries expire after CACHE_TTL_HOURS hours.
- LRU eviction: when the cache hits CACHE_MAX_ENTRIES, the least recently used entries are evicted first.
- Streaming support: cached responses are replayed as proper SSE events, so streaming agents receive a correctly formatted stream.
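To make the keying concrete, here is a minimal sketch of the lookup flow described above, assuming a simple single-table schema; the function names, columns, and TTL handling are illustrative assumptions, not the project's actual code.

```python
import hashlib
import json
import sqlite3
import time

SCHEMA = """CREATE TABLE IF NOT EXISTS cache (
    key TEXT PRIMARY KEY, response TEXT, created_at REAL, last_used REAL
)"""

def cache_key(body: dict) -> str:
    # Canonical JSON so byte-identical keys come out of identical requests.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def lookup(conn: sqlite3.Connection, body: dict, ttl_hours: float = 24):
    conn.execute(SCHEMA)
    key = cache_key(body)
    row = conn.execute(
        "SELECT response, created_at FROM cache WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None                     # miss: forward to the Anthropic API
    response, created_at = row
    if time.time() - created_at > ttl_hours * 3600:
        conn.execute("DELETE FROM cache WHERE key = ?", (key,))
        return None                     # expired under the TTL policy
    # Touch the entry so LRU eviction treats it as recently used.
    conn.execute(
        "UPDATE cache SET last_used = ? WHERE key = ?", (time.time(), key)
    )
    return json.loads(response)         # hit: return immediately, $0 cost
```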
Model routing

Before forwarding a request, the router sends the prompt to a Haiku classifier that returns one of three complexity tiers:

| Tier | Examples | Model |
|---|---|---|
| SIMPLE | Summarize, translate, format, extract | claude-haiku-4-5 |
| MODERATE | Write, explain, review, general coding | claude-sonnet-4-6 |
| COMPLEX | Debug, architect, deep reasoning | claude-opus-4-6 |

The router only downgrades; it never upgrades a model beyond what your agent specified. If your agent already uses Haiku, routing is skipped entirely. The classifier prompt is truncated to 500 characters before being sent to keep classification cost minimal.
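The downgrade-only rule amounts to a rank comparison between the requested model and the classifier's suggestion. Below is a minimal sketch under the assumption that the classifier is a callable returning one of the three tiers; the dictionaries and function are illustrative, not the project's actual implementation.

```python
from typing import Callable

# Tier-to-model mapping taken from the table above.
TIER_MODEL = {
    "SIMPLE": "claude-haiku-4-5",
    "MODERATE": "claude-sonnet-4-6",
    "COMPLEX": "claude-opus-4-6",
}
# Cheapest-to-most-capable ordering used for the downgrade check.
RANK = {"claude-haiku-4-5": 0, "claude-sonnet-4-6": 1, "claude-opus-4-6": 2}

def route(requested_model: str, prompt: str,
          classify: Callable[[str], str]) -> str:
    """Return the model to forward with; never upgrade past the request."""
    if RANK.get(requested_model, 0) == 0:
        return requested_model          # already Haiku: routing is skipped
    tier = classify(prompt[:500])       # classifier sees at most 500 chars
    candidate = TIER_MODEL.get(tier, requested_model)
    # Downgrade only: keep the requested model if the candidate costs more.
    if RANK[candidate] < RANK[requested_model]:
        return candidate
    return requested_model
```

Keeping the comparison one-directional is what guarantees the proxy never silently spends more than the agent asked for.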
Session logging

Every request is logged to SQLite with:

- Timestamp, cache hit/miss, complexity, routing decision
- Input/output token counts
- Actual cost vs. what it would have cost without the proxy
- API latency and router latency

Logs persist across server restarts. The dashboard lets you view individual sessions or aggregate everything under Overall.

Streaming

agentic-proxy fully supports streaming requests (stream: true). Chunks are forwarded to the agent immediately as they arrive; the proxy buffers in the background to cache and log the response after the stream completes. Cache hits for streaming requests are replayed as proper SSE events, so the agent receives a correctly formatted stream regardless of whether the response came from the cache or the API.

Dashboard

Open http://localhost:8000/dashboard while the proxy is running.

Summary stats
- Total calls, cache hit rate, API calls
- Total saved, cache savings, routing savings
- Average API latency, average router latency
- Current cache size vs. maximum

Charts
- Cache distribution (hits vs. misses)
- Savings breakdown (cache savings vs. routing savings)
- Complexity distribution (simple / moderate / complex)
- Routing decisions (downgraded / kept / skipped)

Call log
Every request with full detail: model, routing decision, complexity, latency, cost, and per-call savings.

Session selector
Switch between individual sessions or view aggregated stats across all sessions under Overall.

Clear cache
A button in the topbar wipes the cache without restarting the server.

Project structure

```
agentic-proxy/
├── main.py            # FastAPI app, routes
├── proxy.py           # Forwards requests to Anthropic, handles streaming
├── pipeline.py        # Orchestrates modules in order per request
├── config.py          # Loads and validates settings from .env
├── requirements.txt
├── .env.example
├── modules/
│   ├── cache.py       # SQLite cache with TTL and LRU eviction
│   ├── router.py      # Prompt classifier and model routing
│   ├── logger.py      # Persistent session log and stats
│   └── dashboard.py   # Live HTML dashboard
└── demo/
    ├── agent.py           # Standard demo agent (non-streaming)
    └── streaming_agent.py # Streaming demo agent
```

Demo

Make sure the proxy is running first, then in a separate terminal:

```
# Standard agent: 43 calls with a mix of simple, moderate, complex, and duplicate prompts
python demo/agent.py

# Streaming agent: demonstrates streaming support and cache replay
python demo/streaming_agent.py
```

Limitations

- Router accuracy: the classifier is good but not perfect. A misclassified prompt may be downgraded to a model that produces a lower-quality response. Set ROUTER_ENABLED=false in .env to disable routing if reliability is a concern.
- Router overhead: every non-cached request incurs a small Haiku call to classify the prompt. This is visible in the dashboard as router latency. For very short-lived agents this overhead may outweigh the routing savings.
- Single instance: the proxy is designed to run locally for a single developer. It is not designed for multi-user or production deployments.
- In-memory session ID: the current session ID resets on server restart, though all logged data persists in SQLite.
- Hard-coded prices: per-token prices are hard-coded, so cost calculations may drift as pricing changes, though any discrepancy is unlikely to be large.

Roadmap

- Semantic caching: cache near-identical prompts using embedding similarity, not just exact match.
- Context trimming: summarize and compress long conversation histories before forwarding.
- Router confidence threshold: only downgrade when the classifier is above a configurable confidence level.
- Docker support: single-command setup with docker-compose up.

License

MIT
This analysis was produced by the Genesis Park editorial team with the help of AI. The original post is available via the source link.