Show HN: Warp_cache – SIEVE cache in Rust for Python, 25x faster than cachetools

hackernews | 🔬 Research
#sieve #cache #performance #python #review #rust
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

`warp_cache` is a Rust-backed caching library positioned as a replacement for Python's `functools.lru_cache`. Instead of LRU it uses the SIEVE eviction algorithm, which reduces cache misses by up to 21.6%, and its Rust extension reaches 16-23 million operations per second on a single thread. It is thread-safe by default and supports `mmap`-based shared memory for cross-process use, making it roughly 25x faster than `cachetools`.

Body

A thread-safe Python caching decorator built in Rust. It uses SIEVE eviction and GIL-conditional locking: no overhead under the GIL, a per-shard RwLock when the GIL is disabled.

I needed a single `@cache()` decorator that works across sync functions, async functions, and threads, with TTL, without juggling separate solutions. Existing options each had gaps:

- `functools.lru_cache` - not thread-safe (needs a manual `Lock`), no TTL, no async awareness
- `cachetools` - has TTL and multiple strategies, but pure Python (slow) and not thread-safe
- `cachebox` / `moka-py` - Rust-backed and thread-safe, but not designed for GIL-conditional locking or shared memory

warp_cache fills this gap: one decorator, thread-safe by default, async-aware (cache hits return without awaiting), TTL support, and a shared-memory backend for cross-process use. With free-threaded Python (3.13+) removing the GIL, thread-safe caching stops being optional. warp_cache handles both cases via conditional compilation (`GilCell` under the GIL, `RwLock` under no-GIL).

```python
from warp_cache import cache

@cache()
def expensive(x, y):
    return x + y

expensive(1, 2)  # computes and caches
expensive(1, 2)  # returns cached result
```

If you're already using `functools.lru_cache`, switching is a one-line change. Unlike `lru_cache`, this is thread-safe out of the box:

```diff
-from functools import lru_cache
+from warp_cache import cache

-@lru_cache(maxsize=128)
+@cache(max_size=128)
 def expensive(x, y):
     return x + y
```

Like `lru_cache`, all arguments must be hashable. See the usage guide for details.

The entire cache lookup happens in a single Rust `__call__` - no Python wrapper function, no serialization, no key allocation on hits:

```
Python: fn(42)
└─ tp_call (PyO3) ─────────────────────────────── one FFI crossing
   ├─ hash(args)       via ffi::PyObject_Hash (raw FFI, no PyO3 wrapper)
   ├─ shard select     hash & shard_mask (power-of-2 bitmask)
   ├─ GilCell::read()  zero-cost under GIL (UnsafeCell)
   ├─ HashMap lookup   hashbrown + passthrough hasher (no re-hash)
   ├─ equality check   via ffi::PyObject_RichCompareBool (borrowed pointer)
   ├─ SIEVE visited=1  AtomicBool store, lock-free
   └─ return cached value
```

On a cache hit, the lookup uses `BorrowedArgs` (a raw pointer plus a precomputed hash) via hashbrown's `Equivalent` trait, so there is no `CacheKey` allocation and no refcount churn. A `CacheKey` is only created on a miss, when the entry needs to be stored.

Two backends:

- Memory (default) - sharded `hashbrown::HashMap` with a passthrough hasher and `GilCell`/`RwLock` locking. Everything stays in-process.
- Shared (`backend="shared"`) - mmap'd shared memory with a seqlock. Reads take no locks (optimistic seqlock path). Serialization uses a fast path for primitives (serde), with a pickle fallback for complex types.

Both backends use SIEVE (NSDI'24) for eviction. The main reason: cache hits don't need a write lock. LRU has to reorder a linked list on every hit, which means a write lock (or CAS loop) on every read. SIEVE replaces this with a single store (`visited = 1`), which needs no lock on either backend. On eviction, a "hand" scans the entry list: visited entries get a second chance (bit cleared), unvisited entries get evicted.

Why not the others:

- LRU - write lock on every hit, which defeats the point of `GilCell`
- ARC - two lists plus ghost entries, much more complex for small gains
- TinyLFU - frequency-counting overhead, bloom-filter maintenance

The hit rate is also better than LRU's; the measured numbers follow the sketch below.
Measured with Zipf-distributed keys (1M requests, benchmarks/bench_sieve.py):

| Workload | SIEVE | LRU | Miss Reduction |
|---|---|---|---|
| Zipf, 10% cache | 74.5% | 67.5% | +21.6% |
| Scan resistance (70% hot) | 69.9% | 63.5% | +17.6% |
| One-hit wonders (25% unique) | 53.9% | 43.7% | +18.1% |
| Working set shift | 75.5% | 69.7% | +16.6% |

See the eviction quality benchmarks for the full breakdown.

Python 3.13.2, Apple M-series (arm64), cache size 256, Zipf-distributed keys (2000 unique), median of 3 rounds. Source: benchmarks/

If you need thread-safe caching, warp_cache is the fastest option available. If you don't need thread safety, `lru_cache` is faster: it's C code in CPython with no FFI boundary to cross.

| Library | ops/s | Notes |
|---|---|---|
| lru_cache | 31.0M | C code, no FFI, no safety overhead |
| warp_cache | 20.4M | Rust via PyO3, thread-safe, SIEVE |
| moka_py | 3.9M | Rust (moka), thread-safe |
| cachebox | 1.5M | Rust, thread-safe |
| cachetools | 826K | Pure Python, not thread-safe |

`lru_cache` wins by about 1.5x single-threaded. The gap comes from PyO3 call dispatch (~5ns) and refcount management (~3ns) - the price of crossing an FFI boundary with a safe wrapper.

| Threads | warp_cache | lru_cache + Lock | cachetools + Lock |
|---|---|---|---|
| 1 | 20.7M | 12.6M | 767K |
| 4 | 20.8M | 12.5M | 788K |
| 8 | 20.4M | 12.6M | 793K |
| 16 | 19.5M | 11.9M | 795K |

With multiple threads, warp_cache is about 1.6x faster than `lru_cache + Lock`. `lru_cache` itself isn't thread-safe, so real multi-threaded code needs a `threading.Lock()` on every call, and that lock overhead is what the `lru_cache + Lock` column above reflects.
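For reference, the `lru_cache + Lock` baseline above corresponds to the usual pattern for making `functools.lru_cache` safe for concurrent callers: wrapping every call in a `threading.Lock`. A minimal sketch of that baseline follows; the `locked_lru_cache` helper name is illustrative, not from the post.

```python
import threading
from functools import lru_cache, wraps

def locked_lru_cache(maxsize=128):
    """lru_cache wrapped so every call (hit or miss) goes through one
    threading.Lock - the "lru_cache + Lock" baseline above."""
    def decorator(fn):
        cached = lru_cache(maxsize=maxsize)(fn)
        lock = threading.Lock()

        @wraps(fn)
        def wrapper(*args, **kwargs):
            with lock:   # serializes all callers, even on cache hits
                return cached(*args, **kwargs)
        return wrapper
    return decorator

@locked_lru_cache(maxsize=128)
def expensive(x, y):
    return x + y
```

Every call, hit or miss, pays the lock acquire and release, and that serialization is what the multi-threaded comparison is measuring.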

This analysis was produced by the Genesis Park editorial team with the help of AI. The original post is available via the source link.
