Analysis of the Claude Code cache bugs causing 10–20x token inflation

hackernews | 📦 Open source
#anthropic #claude
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Nine bugs remain unfixed as of v2.1.101 (the latest release, eight releases later). The "Output efficiency" system prompt (P3) has been quietly removed; this was confirmed by scanning 353 local session files.

Full text

🇰🇷 Korean version | 🔧 Quick fix guide → (skip the analysis, just fix it)

Claude Code Hidden Problem Analysis

TL;DR: Claude Code has 11 confirmed client-side bugs (B1-B5, B8, B8a, B9, B10, B11, B2a) plus 4 preliminary findings (P1-P4). Cache bugs (B1-B2) are fixed in v2.1.91. Nine remain unfixed as of v2.1.101 (latest, 8 releases later). The "Output efficiency" system prompt (P3) has been quietly removed, confirmed by scanning 353 local session files. Proxy data now covers 27,708 requests over 13 days, with fallback-percentage: 0.5 on every single one. 79% of new sessions still start with a full cache miss on the first turn, even post-fix. Anthropic acknowledged B11 (adaptive thinking zero-reasoning) on HN but has not followed up.

Last updated: April 13, 2026. See the changelog cross-reference and 08_UPDATE-LOG.md.

Caught up on v2.1.98 and v2.1.101 (v2.1.99/100 don't exist; they were skipped in the public changelog). Two more releases, still zero fixes for B3–B11. v2.1.98 was mostly security patches (Bash permission bypasses). v2.1.101 fixed resume and MCP bugs; B2a (SendMessage cache miss) may be fixed via the CLI resume path, but the Agent SDK code path is unconfirmed. Changelog cross-reference →

The "Output efficiency" system prompt section (P3) appears to be gone. Scanned all 353 local JSONL session files: every session after April 10 shows zero occurrences of the "straight to the point" / "do not overdo" text. The April 8-9 boundary is messy (mixed PRESENT/ABSENT on the same day, likely from running two CC versions concurrently), but after April 10 it's clean. First noticed by @wjordan via system prompt archive diffing. P3 update →

Extended the fallback-percentage dataset from 3,702 to 20,083 requests (April 4–13, 10 days). Still 0.5 on every single request, zero variance.
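The session scan behind the P3 finding can be sketched roughly as follows. This is a minimal sketch, not the investigators' actual tooling: the marker phrases come from the article's description of the removed prompt, and the directory layout of local session files is an assumption.

```python
from pathlib import Path

# Marker phrases from the removed "Output efficiency" prompt (P3), as quoted
# in the article; treat the exact wording as an assumption.
P3_PHRASES = ("straight to the point", "do not overdo")

def session_has_p3(path: Path, phrases=P3_PHRASES) -> bool:
    """True if any line of a JSONL session file contains a P3 phrase."""
    with path.open(encoding="utf-8", errors="replace") as fh:
        return any(p in line for line in fh for p in phrases)

def scan_sessions(root: Path) -> dict[str, bool]:
    """Map each *.jsonl session file under root to PRESENT (True) / ABSENT."""
    return {p.name: session_has_p3(p) for p in sorted(root.glob("**/*.jsonl"))}
```

Grouping the results by each session's start date would reproduce the PRESENT/ABSENT boundary the article reports around April 8-10.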
Community researchers @cnighswonger (11,502 calls, Max 5x US) and @0xNightDev (Max 5x EU) also report 0.5 — but with a notable difference: our account and cnighswonger's show overage-status: allowed, while 0xNightDev's shows rejected. Same plan, different org-level flags. What fallback-percentage actually means remains undocumented. Full data →

Also measured first-turn cache performance across 143 sessions (≥3 requests each): 79% start with cache_read=0 on the first API call, even on v2.1.91+ where B1/B2 are fixed. This is structural — skills and CLAUDE.md land in messages[0] instead of the system[] prefix, breaking prefix-based caching for new sessions. Newer versions are improving this (community data shows ~29% on v2.1.104), but it's still a significant first-turn cost. Details →

5 new bugs + 4 preliminary findings from community-wide issue/comment analysis and fact-checking (April 6-9):

| Bug | What | Evidence | Details |
|---|---|---|---|
| B8a | JSONL non-atomic write → session corruption | ~10+ duplicates in #21321 | 01_BUGS.md |
| B9 | /branch context inflation (6%→73%) | 3 duplicate issues | 01_BUGS.md |
| B10 | TaskOutput deprecation → 21x context injection → fatal | has repro | 01_BUGS.md |
| B11 | Adaptive thinking zero-reasoning → fabrication | Anthropic acknowledged (HN) | 01_BUGS.md |
| B2a | SendMessage resume: cache_read=0 (even system prompt) | cnighswonger confirmed | 01_BUGS.md |

Preliminary findings (MODERATE): P1/P2 cache TTL dual tiers — two triggers for the 1h→5m downgrade: telemetry disabled (has repro) and quota exceeded. P3 "Output efficiency" system prompt (v2.1.64). P4 third-party detection gap. See 01_BUGS.md — Preliminary Findings.

Changelog cross-reference (v2.1.92–v2.1.97): Six releases shipped zero fixes for the nine unfixed bugs. See 01_BUGS.md — Changelog Cross-Reference.
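The first-turn measurement above boils down to a simple check over per-session request logs. A sketch under stated assumptions: the usage field name matches the Anthropic Messages API (`cache_read_input_tokens`), while the session/record shapes are simplified stand-ins for the real JSONL structure.

```python
def first_turn_miss_rate(sessions, min_requests=3):
    """Fraction of sessions whose first API call reads nothing from cache.

    `sessions` is a list of sessions; each session is a chronological list
    of per-request usage dicts. Sessions with fewer than `min_requests`
    requests are excluded, mirroring the article's >=3-request filter.
    """
    eligible = [s for s in sessions if len(s) >= min_requests]
    if not eligible:
        return 0.0
    misses = sum(1 for s in eligible
                 if s[0].get("cache_read_input_tokens", 0) == 0)
    return misses / len(eligible)
```

On the dataset described above, this style of check yields roughly 0.79 across the 143 eligible sessions.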
cc-relay proxy database now covers 17,610 requests across 129 sessions (April 1-8), with automated bug detection across 532 JSONL files (158.3 MB):

| Metric | Previous (Apr 3) | Current (Apr 1-8) | Change |
|---|---|---|---|
| Budget enforcement (B5) | 261 events | 72,839 events | 279x |
| Microcompact (B4) | 327 events | 3,782 events (15,998 items) | 12x |
| B8 inflation (bulk scan) | 2.87x (1 session) | 2.37x avg (10 sessions, max 4.42x) | Universal |
| Synthetic rate limit (B3) | 24 entries / 6 days | 183/532 files (34.4%) with model entries | Pervasive |
| Context growth rate | +575 tok/turn | median 1,845 tok/min (53 sessions) | Statistical |

New findings:
- Request rate: Mean 2.72 req/min across 78 sessions. Sustained max 8.04 req/min (60+ min sessions). Two very short sessions (2-3 min) averaged 12+ req/min; burst peak 86 req/60s from subagent fan-out.
- Per-request cost scales with session length: 0-30min: $0.20/req → 5hr+: $0.33/req (structural, not version-specific)
- Cache efficiency stable: 98-99% across all session lengths on v2.1.91 (Bugs 1-2 fully fixed)
- Subagent gap: Haiku 58.1% cache vs Opus 98.8% — 40pp gap persists
- Microcompact intensifies: 1.6 items/event at <10 messages → 6.6 items/event at 200+ messages

Transparent proxy (cc-relay) captured anthropic-ratelimit-unified-* headers across 27,708 requests (April 1-13), revealing the server-side quota architecture.

Dual sliding window system:
- Two independent counters: 5-hour (5h-utilization) and 7-day (7d-utilization)
- representative-claim = five_hour in 100% of requests; the 5h window is always the bottleneck
- 5h windows reset on roughly 5-hour intervals; 7d resets weekly (April 10, 12:00 KST for this account)

Per-1% utilization cost (measured across 5 active windows on Max 20x / $200/mo):

| Metric | Range | Note |
|---|---|---|
| Output per 1% | 9K-16K | Visible output only (thinking excluded) |
| Cache Read per 1% | 1.5M-2.1M | 96-99% of visible token volume |
| Total Visible per 1% | 1.5M-2.1M | Output + Cache Read + Input |
| 7d accumulation ratio | 0.12-0.17 | 7d_delta relative to 5h_peak |

Thinking token blind spot: Extended thinking tokens are not included in the output_tokens field from the API. At 9K-16K visible output per 1%, a full 5h window (100%) = only 0.9M-1.6M visible output tokens — low for several hours of Opus work. The gap is consistent with thinking tokens being counted against the quota, but the exact mechanism can't be confirmed from the client side. Thinking-disabled isolation test planned for the week of April 6.

Community cross-validation:
- @fgrosswig: 64x budget reduction — dual-machine 18-day JSONL forensics (Mar 26: 3.2B tokens no limit → Apr 5: 88M at 90%)
- @Commandershadow9: 34-143x capacity reduction — cache fix confirmed, capacity drop independent of cache bug, thinking token hypothesis

v2.1.89 separation: The cache regression (Mar 28 - Apr 1) is a separate, resolved issue. The capacity reduction exists independently — clean comparison: golden period (Mar 23-27, cache 98-99%) vs post-fix (Apr 2+, cache 84-97%), both with healthy cache. Data collection ongoing through April 10 (full 7d cycle).

Bug status (12 identified, verified through v2.1.101):
- Fixed (B1, B2): 2
- Unfixed (B3-B5, B8-B11, B8a): 8
- Possibly Fixed (B2a): 1
- By Design (Server): 1

Cache regression (v2.1.89) is fixed in v2.1.90-91. Eight client-side bugs remain unfixed through v2.1.101 (latest, 8 releases later). B2a (SendMessage resume) possibly fixed in v2.1.101 (CLI resume path fixed, SDK path unconfirmed). P3 ("Output efficiency" prompt) observed removed (self-verified). Changelog cross-reference: 01_BUGS.md § Changelog Cross-Reference.
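A minimal sketch of reading the unified rate-limit headers described above. The `anthropic-ratelimit-unified-` prefix is quoted in the article; the full header names (`...-5h-utilization`, `...-7d-utilization`, `...-representative-claim`) are inferred from the field names it reports and should be treated as assumptions, not documented API.

```python
def read_unified_limits(headers: dict) -> dict:
    """Extract 5h/7d utilization and the representative claim from a
    response-header dict (header names assumed, values as percentages)."""
    prefix = "anthropic-ratelimit-unified-"
    utilization = {
        window: float(headers[f"{prefix}{window}-utilization"])
        for window in ("5h", "7d")
        if f"{prefix}{window}-utilization" in headers
    }
    return {
        "utilization": utilization,
        # The article observed this set to "five_hour" on 100% of requests.
        "representative_claim": headers.get(f"{prefix}representative-claim"),
    }
```

Logging this per request is enough to reproduce the dual-window picture: whichever window's utilization climbs fastest between resets is the effective bottleneck.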
| Bug | What It Does | Impact | Status | Details |
|---|---|---|---|---|
| B1 Sentinel | Standalone binary corrupts cache prefix | 4-17% cache read (v2.1.89) | Fixed | 01_BUGS.md |
| B2 Resume | --resume replays full context uncached | Full cache miss per resume | Fixed | 01_BUGS.md |
| B2a SendMessage | Agent SDK SendMessage resume: full cache miss including system prompt | cache_read=0 on first resume | Possibly Fixed | 01_BUGS.md |
| B3 False RL | Client blocks API calls with fake error | Instant "Rate limit reached" | Unfixed | 01_BUGS.md |
| B4 Microcompact | Tool results silently cleared mid-session | 5,500 events, 18,858 items cleared | Unfixed | 01_BUGS.md |
| B5 Budget cap | 200K aggregate limit on tool results | 167,818 events, 100% truncation | Unfixed | 01_BUGS.md |
| B8 Log inflation | Extended thinking duplicates JSONL entries | 2.37x avg (max 4.42x), universal | Unfixed | 01_BUGS.md |
| B8a JSONL corruption | Concurrent tool execution drops tool_result → permanent 400 | ~10+ duplicates in #21321 | Unfixed | 01_BUGS.md |
| B9 /branch inflation | Message duplication/un-compaction on branch | 6%→73% context in one message | Unfixed | 01_BUGS.md |
| B10 TaskOutput thrash | Deprecation message triggers 21x context injection → fatal | 87K vs 4K, triple autocompact | Unfixed | 01_BUGS.md |
| B11 Zero reasoning | Adaptive thinking emits zero reasoning → fabrication | Anthropic acknowledged | Investigating | 01_BUGS.md |
| Server | Quota architecture + thinking token accounting | Reduced effective capacity | By design | 02_RATELIMIT-HEADERS.md |

- Update to v2.1.91+ — fixes the cache regression (worst drain). v2.1.92–101 add no bug fixes for issues tracked here but are safe to use
- npm or standalone — both fine on v2.1.91 (Sentinel gap closed)
- Don't use --resume or --continue — replays full context as billable input
- Start fresh sessions periodically — the 200K tool result cap (B5) silently truncates older results
- Avoid /dream and /insights — background API calls that drain silently

See 09_QUICKSTART.md for setup guide and self-diagnosis. Full proxy dataset: 13_PROXY-DATA.md.

Even with cache at 95-99%, drain persists. At least four server-side issues contribute:

1. Server-side accounting change: Old Docker versions (v2.1.74, v2.1.86 — never updated) started draining fast recently, proving the issue isn't purely client-side (#37394).
2. 1M context billing regression: A late-March regression causes the server to incorrectly classify Max plan 1M context requests as "extra usage." Debug logs show a 429 error at only ~23K tokens (#42616).
3. Dual-window quota + thinking token blind spot: 5h + 7d independent windows. Visible output only 9K-16K per 1% — the gap is likely thinking tokens counted against quota but invisible to clients. Full analysis: 02_RATELIMIT-HEADERS.md.
4. Org-level quota sharing: Accounts under the same organization share rate limit pools. passesEligibilityCache and overageCreditGrantCache are keyed by organizationUuid, not accountUuid. Originally discovered by @dancinlife through client-side analysis of the obfuscated JavaScript bundle.

See 09_QUICKSTART.md for the full list of behaviors to avoid and adopt, including /branch, /release-notes, and environment variable recommendations.

On April 1, 2026, my Max 20 plan ($200/mo) hit 100% usage in ~70 minutes during normal coding. JSONL analysis showed the session averaging 36.1% cache read (min 21.1%) where it should have been 90%+. Every token was being billed at full price. Downgrading from v2.1.89 to v2.1.68 immediately recovered cache to 97.6% — confirming the regression was version-specific.
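The 36.1% figure above is a cache-read share computed from per-request usage records. A sketch, assuming the standard Anthropic Messages API usage field names (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`); how the article's authors aggregated their JSONL may differ:

```python
def cache_read_share(usages) -> float:
    """Share of prompt-side tokens served from cache across a session.

    Each item in `usages` is the `usage` object of one API response.
    Tokens written into the cache count as uncached (billed) input.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = read + sum(
        u.get("input_tokens", 0) + u.get("cache_creation_input_tokens", 0)
        for u in usages
    )
    return read / total if total else 0.0
```

A healthy long session should sit near 0.90 or above; the regressed v2.1.89 session described above averaged about 0.36, meaning nearly two-thirds of prompt tokens were re-billed at full price.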
I set up a transparent monitoring proxy (cc-relay) to capture per-request data going forward. What started as personal debugging quickly expanded. Dozens of users were reporting the same symptoms across what became 91+ GitHub issues. Community members — @Sn3th, @rwp65, @fgrosswig, @Commandershadow9, and 12 others — independently found different pieces of the puzzle.

The investigation timeline:

| Date | What happened |
|---|---|
| Apr 1 | 70-minute 100% drain → v2.1.89 regression confirmed, proxy setup |
| Apr 2 | Bugs 3-4 discovered (false rate limiter, silent microcompact). Anthropic's Lydia Hallie posts on X |
| Apr 3 | Bug 5 discovered (200K budget cap). v2.1.91 benchmark: cache fixed, 4 other active bugs persist (B3-B5, B8). 06_TEST-RESULTS-0403.md |
| Apr 4-6 | cc-relay captures 3,702 requests with rate limit headers. Community analysis continues |
| Apr 6 | Dual-window quota analysis published. Community cross-validation (fgrosswig 64x, Commandershadow9 34-143x). 02_RATELIMIT-HEADERS.md |

Full 14-month chronicle (Feb 2025 – Apr 2026): 07_TIMELINE.md

Lydia Hallie (Anthropic, Product) posted on X: "Peak-hour limits are tighter and 1M-context sessions got bigger, that's most of what you're feeling. We fixed a few bugs along the way, but none were over-charging you." She recommended using Sonnet as default, lowering effort level, starting fresh instead of resuming, and capping context with CLAUDE_CODE_AUTO_COMPACT_WINDOW=200000.

Where our data diverges from this assessment:
- "None were over-charging you" — Bug 5 silently truncates tool results to 1-49 chars after a 200K aggregate threshold. Users paying for 1M context effectively have a 200K tool result budget for built-in tools. 261 truncation events measured in a single session.
- "We fixed a few bugs" — Cache bugs (B1-B2) are fixed, but Bugs 3-5 and B8 remain active in v2.1.91. Client-side false rate limiter (B3) generated 151 synthetic "Rate limit reached" errors across 65 sessions on our setup — zero API calls made.
- "Peak-hour limits are tighter" — Our April 6 proxy data shows the bottleneck is always the 5h window (representative-claim = five_hour in 100% of 3,702 requests), regardless of time of day. Weekend and off-peak data shows the same pattern.
- Thinking token accounting — Extended thinking tokens don't appear in output_tokens from the API, yet visible output alone explains less than half the observed utilization cost. If thinking tokens are counted against quota at output-token rate, this is a significant invisible cost that users have no way to monitor or control.

GitHub response: bcherny posted 6 comments on #42796 (April 6 only, triggered by HN virality), then went silent. Zero responses on all other 90+ issues including #38335 (478 comments, 15 days). See 10_ISSUES.md for full history.

@luongnv89 documented that idle gaps of 13+ hours cause a full cache rebuild. Anthropic documents a 5-minute TTL, though our data shows 5-26 minute gaps sometimes maintaining 96%+ cache — the actual TTL may be longer in practice. Not a bug, but worth knowing about.
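The TTL behavior above can be checked per session by measuring idle gaps between consecutive requests. A sketch assuming ISO-8601 request timestamps (e.g. pulled from session JSONL or proxy logs) and the documented 5-minute TTL as the default threshold:

```python
from datetime import datetime, timedelta

def idle_gaps_over_ttl(timestamps, ttl=timedelta(minutes=5)):
    """Return gaps between consecutive requests that exceed `ttl`.

    `timestamps` is a chronologically sorted list of ISO-8601 strings.
    Each returned gap is a candidate point where the prompt cache may
    have expired and forced a rebuild on the next request.
    """
    times = [datetime.fromisoformat(t) for t in timestamps]
    return [later - earlier
            for earlier, later in zip(times, times[1:])
            if later - earlier > ttl]
```

Cross-referencing each flagged gap with the next request's cache_read tokens would show whether a given 5-26 minute gap actually forced a rebuild, which is how the "TTL may be longer in practice" observation could be tested systematically.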
| File | What | Updated |
|---|---|---|
| 01_BUGS.md | All 11 bugs (B1-B11, B2a, B8a) + 4 preliminary (P1-P4) + changelog cross-reference (v2.1.92-101) | Apr 13 |
| 09_QUICKSTART.md | Quick fix guide — Option A (v2.1.91+) vs Option B (v2.1.63 downgrade), npm vs standalone, diagnosis | Apr 9 |
| 07_TIMELINE.md | 14-month chronicle (Phase 1-9) + April 6-9 community acceleration + Anthropic response | Apr 9 |
| 08_UPDATE-LOG.md | Daily investigation log + changelog cross-reference | Apr 9 |
| 10_ISSUES.md | 91+ tracked issues + community tools + contributors | Apr 9 |
| 13_PROXY-DATA.md | Full-week proxy dataset (17,610 requests, 129 sessions) with Mermaid visualizations | Apr 8 |
| 02_RATELIMIT-HEADERS.md | Dual 5h/7d window architecture, per-1% cost, thinking token blind spot, fallback-percentage extended data | Apr 13 |
| 03_JSONL-ANALYSIS.md | Session log analysis: PRELIM inflation, subagent costs, lifecycle curve, proxy cross-validation | Apr 6 |

This analysis was produced by the Genesis Park editorial team with AI assistance. The original can be found via the source link.

