Claude Code Cache Bug Analysis

Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Two client-side cache bugs in Claude Code prevent the API server from recognizing the cached conversation prefix, forcing a full rebuild on every turn and inflating token consumption by 10-20x. According to community reverse engineering, the causes are an incorrect string substitution in the standalone binary and a message-structure mismatch on session resume (--resume); as a result, paid-plan rate limits are exhausted in minutes rather than hours. Cache hit rates in affected sessions drop as low as 4.3%, but applying workarounds that use only official tools restores them to 95-99.5%, enabling efficient token use. Several behaviors unrelated to the cache bugs, such as hidden background calls and loading large context files, were also found to accelerate token consumption.

Full Text

Measured analysis of two cache bugs in Claude Code that cause 10-20x token inflation, leading to rapid rate-limit exhaustion on paid plans (Max 5/20). Both are client-side bugs that cause the Anthropic API server to miss cached conversation prefixes, forcing a full rebuild on every turn and exhausting rate limits in minutes instead of hours. They were identified through community reverse engineering (Reddit analysis), and workarounds using only official tools are documented below.

Bug 1: Sentinel substitution in the standalone binary

GitHub Issue: anthropics/claude-code#40524

The standalone binary ships with a custom Bun fork that contains a cch=00000 sentinel replacement mechanism. When conversation content includes certain internal strings, the sentinel in messages gets incorrectly substituted, breaking the cache prefix.

Result: full cache rebuild on every API call.
Scope: standalone binary only. The npm version (npx @anthropic-ai/claude-code) does not contain this logic.

Bug 2: Message structure mismatch on session resume

GitHub Issue: anthropics/claude-code#34629

Starting from v2.1.69, deferred_tools_delta was introduced in the message structure. When resuming a session (--resume), the first message's structure doesn't match what the server cached, resulting in a complete cache miss.

Impact: on a 500K-token conversation, a single resume costs ~$0.15 in quota.

When both bugs are active, the cache hit rate drops to near 0%. Every token on every turn is billed at full price (under Anthropic's published prompt-caching prices, cache reads bill at roughly one-tenth of the base input rate), which is why rate limits are exhausted in minutes instead of hours.

Measurement setup: I set up a transparent local monitoring proxy using ANTHROPIC_BASE_URL (an official environment variable, documented by Anthropic for proxy/gateway routing) to log cache_creation_input_tokens and cache_read_input_tokens from each API response. This is a pass-through proxy that does not modify requests or responses; it only reads the usage metadata from each response for logging. A sketch of such a proxy follows. Existing sessions were additionally audited offline from their session JSONL files using the same cache_creation_input_tokens / cache_read_input_tokens fields (see the audit sketch after the proxy example); the per-session results are tabulated below.
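The post describes the proxy but does not include its code. Below is a minimal stdlib-only Python sketch of such a pass-through logger; the port, class name, and buffered response handling are my assumptions. Real Claude Code traffic is typically streamed SSE, where usage arrives in message_start / message_delta events rather than one JSON body, and upstream error handling is omitted here.

```python
# Minimal pass-through proxy: forwards POSTs to the Anthropic API unmodified
# and logs cache usage from non-streaming JSON responses. Sketch only;
# assumed port, no handling of streamed SSE bodies or upstream HTTP errors.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "https://api.anthropic.com"

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Forward the request as-is, minus headers we must recompute.
        headers = {k: v for k, v in self.headers.items()
                   if k.lower() not in ("host", "content-length", "accept-encoding")}
        req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                     headers=headers, method="POST")
        with urllib.request.urlopen(req) as resp:
            payload, status, resp_headers = resp.read(), resp.status, resp.getheaders()
        self.send_response(status)
        for k, v in resp_headers:
            if k.lower() not in ("transfer-encoding", "content-length"):
                self.send_header(k, v)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
        # Read-only logging of the usage metadata described in the post.
        try:
            usage = json.loads(payload).get("usage", {})
            print(f"cache_creation={usage.get('cache_creation_input_tokens')} "
                  f"cache_read={usage.get('cache_read_input_tokens')}")
        except (json.JSONDecodeError, AttributeError):
            pass  # streamed SSE bodies need event-by-event parsing instead

if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 8082), LoggingProxy).serve_forever()
```

Pointing Claude Code at the proxy then uses only the documented variable: ANTHROPIC_BASE_URL=http://127.0.0.1:8082 claude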
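For the offline audit, a script along these lines reproduces the per-session figures. The record layout (a message.usage block per JSONL line) is my assumption about the session file format, not a documented schema.

```python
# Sketch: per-session cache read ratio from Claude Code session JSONL files.
# The message.usage field path is an assumed layout, not a documented schema.
import json
import sys
from pathlib import Path

def cache_read_ratio(session: Path) -> float:
    """Cache-read tokens as a share of all input tokens in one session."""
    read = other = 0
    for line in session.read_text(encoding="utf-8").splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        message = record.get("message", {})
        usage = message.get("usage", {}) if isinstance(message, dict) else {}
        read += usage.get("cache_read_input_tokens") or 0
        other += ((usage.get("cache_creation_input_tokens") or 0)
                  + (usage.get("input_tokens") or 0))
    total = read + other
    return read / total if total else 0.0

# Usage: python audit.py <directory containing session .jsonl files>
for path in sorted(Path(sys.argv[1]).glob("*.jsonl")):
    print(f"{path.name}: {cache_read_ratio(path):.1%} cache read")
```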
| Session | Turns | Cache read ratio | Status |
|---|---|---|---|
| Session A | 168 | 4.3% | poor |
| Session B | 89 | 22.6% | poor |
| Session C | 233 | 34.6% | poor |

At a 4.3% cache read ratio, nearly every token is billed at full price, roughly 20x the expected cost per turn. These "poor" sessions were the primary cause of rate-limit exhaustion. For comparison, a healthy session with the workarounds applied shows per-request cache behavior like this:

| Request | Cache creation | Cache read | Read ratio |
|---|---|---|---|
| 1 (cold start) | 13,535 | 21,125 | 60.9% |
| 2 | 3,827 | 34,660 | 90.1% |
| 3 | 693 | 38,487 | 98.2% |
| 4 | 4,839 | 39,180 | 89.0% |
| 5 | 1,270 | 44,019 | 97.2% |
| 6 | 247 | 45,289 | 99.5% |
| 7 | 443 | 45,536 | 99.0% |
| 8 | 257 | 45,979 | 99.4% |
| 9 | 1,025 | 46,236 | 97.8% |
| 10 | 659 | 47,261 | 98.6% |
| 11 | 655 | 47,920 | 98.7% |
| 12 | 1,335 | 48,575 | 97.3% |
| 13 | 1,417 | 49,910 | 97.2% |
| 14 | 2,467 | 51,327 | 95.4% |
| 15 | 2,538 | 53,794 | 95.5% |

After the cold-start warmup, the cache read ratio stabilizes at 95-99%. This is normal behavior: the server caches the conversation prefix and only bills the delta on each turn.

| Metric | Before (affected) | After (workarounds) |
|---|---|---|
| Cache read ratio | 4.3% - 34.6% | 89% - 99.5% |
| Effective token cost per turn | ~10-20x inflated | ~1x (normal) |
| Rate limit (Max 20) | 100% consumed in ~70 min | Stable: 14% used after an extended session |

Beyond the cache bugs, several Claude Code behaviors significantly accelerate token consumption. These apply regardless of whether the cache fix is in place.

| Behavior | Why | Measured impact |
|---|---|---|
| --resume | Replays the entire conversation history as billable input tokens; opaque thinking-block signatures (base64) are included in the replay | 500K+ tokens burned on a single resume of a long session (#42260) |
| /dream | Triggers background API calls that consume tokens without visible output | Silent drain, difficult to detect |
| /insights | Same as /dream: hidden background token consumption | Reported to cause "insane token usage" (#40438) |
| v2.1.89 (latest) | Cache prefix bug still present, plus a terminal content rendering regression on Linux/IntelliJ | All token inflation issues persist, plus broken UI (#42244) |

| Behavior | Why | Measured data |
|---|---|---|
| Sub-agents (Agent tool, Haiku) | Each Haiku sub-agent call creates a fresh context with 0% cache read; no cache sharing with the parent session | 317K input tokens across 31 sub-agent calls (measured via proxy) |
| Multiple terminals | Each terminal is an independent session with its own context; no shared quota pacing | ~2x drain rate with 2 active terminals |
| Large CLAUDE.md / context files | Sent as input on every single turn; with a broken cache, billed at full price each time | 30KB CLAUDE.md = 30KB × N turns, fully billed |
| Slash commands that rewrite files | Trigger large context rebuilds mid-session | 20-27% of session budget per invocation reported |
| Session start / compaction | cache_creation spikes are structural and unavoidable at these boundaries | Normal; budget for it |
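As a rough sanity check on the inflation figures above: under Anthropic's published prompt-caching prices (cache reads at about 0.1x the base input rate, 5-minute cache writes at about 1.25x; these multipliers are my assumption, not from the post), the per-token effect of the measured cache read ratios can be estimated directly:

```python
# Back-of-envelope per-token price multiplier vs. uncached input, assuming
# cache reads bill at ~0.1x base and cache writes at ~1.25x (assumed
# published multipliers; treats every non-read input token as a cache write).
def effective_input_multiplier(read_ratio: float) -> float:
    return 0.1 * read_ratio + 1.25 * (1.0 - read_ratio)

healthy = effective_input_multiplier(0.995)  # ~0.11x base input price
broken = effective_input_multiplier(0.043)   # ~1.20x base input price
print(f"per-token inflation: ~{broken / healthy:.0f}x")  # ~11x
```

This per-token factor alone lands near the low end of the measured 10-20x range; the remainder plausibly comes from --resume replays and the other behaviors above rebuilding ever-larger prefixes.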

This analysis was produced by the Genesis Park editorial team using AI. The original can be found via the source link.
