Claude prompt cache writes may not be immediately visible to the next request

hackernews | 📰 News
#anthropic #claude #opensource
Source: hackernews · Summarized and analyzed by Genesis Park

Summary

When consecutive requests are sent with an identical cache-controlled system prompt, the second request has been reported to miss the cache roughly 40% of the time and re-write the same prefix. The issue, confirmed with a ~30-line standalone reproducer, occurs even after the first response has fully returned, and inserting a 2-second delay between the requests was found to eliminate it.
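For readers unfamiliar with the feature, a "cache-controlled system prompt" is a system block carrying a cache_control marker in a Messages API request. The following is a minimal sketch of that request shape (the helper name, prompt text, and parameter values are illustrative, not from the original report):

```python
def cached_system_request(system_text, user_text, model="claude-sonnet-4-5"):
    """Build kwargs for client.messages.create() with a cacheable system prefix.

    The cache_control marker tells the API to write everything up to and
    including this system block into the prompt cache, so a follow-up request
    with the same prefix can read it back instead of re-processing it.
    """
    return {
        "model": model,
        "max_tokens": 16,
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_text}],
    }
```

A caller would pass this straight through, e.g. `client.messages.create(**cached_system_request(long_prompt, "ping"))`, and then compare `usage.cache_creation_input_tokens` (prefix written) against `usage.cache_read_input_tokens` (prefix reused) on each response.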

Body

Two back-to-back client.messages.create() calls with the same cached system prompt produce intermittent cache misses on the second call

Summary

When two requests are sent back-to-back with an identical cache_control system prompt, the second request misses the cache that the first request just wrote ~40% of the time and re-writes the same prefix. Sleeping 2 seconds between the two requests reliably eliminates the miss. Both observations come from a ~30-line standalone reproducer (no tools, no multi-turn, no beta endpoints, stable Sonnet model). Each "BUG" trial wrote cache_creation_input_tokens=1215 on both requests: the same exact prefix billed twice, despite both requests sharing an identical cached system block.

Mitigation: sleep 2 s before R2

    # Same script as above, with the following inserted between R1 and R2:
    import time
    time.sleep(2)

Output (one 20-trial run with the sleep)

    [ 1/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 2/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 3/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 4/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 5/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 6/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 7/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 8/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [ 9/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [10/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [11/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [12/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [13/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [14/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [15/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [16/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [17/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [18/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [19/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    [20/20] R1cc=1217 R2cc=0 R2cr=1217 → OK
    Bug reproduced: 0/20

Context

The official prompt-caching docs (docs.claude.com/en/docs/build-with-claude/prompt-caching) acknowledge a related issue for concurrent requests: "For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests." This guidance does not cover the sequential form documented above, where the second request fires after the first response has fully returned and still misses ~40% of the time.

anthropics/claude-code#38356 (closed, unresolved) reported matching symptoms: three requests 762–800 ms apart, each independently writing identical cache_creation_input_tokens=10,553. The reporter framed them as "parallel API calls during a user turn," but with 750+ ms inter-call gaps these were almost certainly sequential requests hitting the same cache-visibility race described here. The issue was auto-closed by the duplicate-detection bot after 30 days of inactivity; the root cause was never addressed.

Cost impact

Each redundant write is billed at the full cache_creation_input_tokens rate. At typical agent prompt sizes (10–100K tokens), the cost adds up across any multi-turn workload that doesn't apply the workaround.

Environment

anthropic Python SDK 0.97.0 (also reproduced on 0.96.0; the bug is API-side, SDK-agnostic)
claude-sonnet-4-5
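The original ~30-line reproducer is not reproduced verbatim above, but based on the description (two back-to-back client.messages.create() calls with an identical cache_control system block, comparing usage counters), a sketch along these lines should exhibit the same pattern. The prompt text, trial count, and function names here are assumptions, not the author's exact script; it needs the anthropic Python SDK and ANTHROPIC_API_KEY set in the environment.

```python
import time


def classify_trial(r1_usage, r2_usage):
    """R2 should read the prefix R1 just wrote; a second write is the bug."""
    if r2_usage["cache_read_input_tokens"] > 0:
        return "OK"     # cache hit: prefix billed once
    if r2_usage["cache_creation_input_tokens"] > 0:
        return "BUG"    # same prefix written (and billed) twice
    return "UNKNOWN"


def run_trials(n_trials=20, sleep_s=0.0):
    import anthropic                 # pip install anthropic
    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY

    # System prompt long enough to cross the minimum cacheable prefix size.
    system = [{
        "type": "text",
        "text": "You are a meticulous assistant. " * 400,
        "cache_control": {"type": "ephemeral"},
    }]

    def send():
        resp = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=16,
            system=system,
            messages=[{"role": "user", "content": "ping"}],
        )
        return {
            "cache_creation_input_tokens": resp.usage.cache_creation_input_tokens,
            "cache_read_input_tokens": resp.usage.cache_read_input_tokens,
        }

    bugs = 0
    for i in range(1, n_trials + 1):
        r1 = send()
        if sleep_s:
            time.sleep(sleep_s)      # the 2 s workaround goes between R1 and R2
        r2 = send()
        verdict = classify_trial(r1, r2)
        bugs += verdict == "BUG"
        print(f"[{i:2}/{n_trials}] R1cc={r1['cache_creation_input_tokens']} "
              f"R2cc={r2['cache_creation_input_tokens']} "
              f"R2cr={r2['cache_read_input_tokens']} -> {verdict}")
    print(f"Bug reproduced: {bugs}/{n_trials}")


if __name__ == "__main__":
    run_trials(sleep_s=0.0)  # set sleep_s=2.0 to apply the workaround
```

With sleep_s=0.0 this corresponds to the "BUG" runs described above; with sleep_s=2.0 it corresponds to the mitigated 0/20 run.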

This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
