I Tried to Invent a Better Replication Policy. It Failed.
🔬 Research
#review
#databases
#data-replication
#replication-policy
#distributed-systems
#failure-report
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
The author built Nearsight, a distributed key-value store based on "time-decaying consistency", in which data is replicated less frequently as it ages, but admits the technique failed to outperform existing approaches. The Rust implementation produced a stable, coherent architecture, yet quantitative benchmarking showed none of the theoretically expected network-traffic savings: Nearsight actually sent more data than the plain baseline. The author concludes by acknowledging the gap between an idea that can be implemented and a system that is actually better, and declares the project a failure.
Body
A while ago I had a very smart-sounding distributed systems idea: older data matters less, so why are we still replicating it like it's on fire?

The intuition felt obvious. Fresh writes matter. Cold data often doesn't. So maybe a system should replicate aggressively at first, then back off as data ages. Less cross-AZ traffic, lower cost, acceptable consistency tradeoffs for workloads like telemetry, logs, and sessions.

That idea became Nearsight, a distributed key-value store built around what I called time-decaying consistency. I thought I might be onto a better replication policy. I was not.

The pitch

The core policy was:

\[P(\mathrm{sync}(k,t)) = e^{-\lambda \cdot \mathrm{age}(k,t)}\]

In plain English: the older a key gets, the less likely it is to be synchronized across AZs at any given point in time. So synchronization effort decays exponentially with key age. Fresh data is replicated aggressively, and old data gradually falls onto slower cross-AZ schedules.

In practice I approximated this with age buckets rather than continuously recomputing a rate for every key:

- Hot: sync every ~1s
- Warm: sync every ~10s
- Cool: sync every ~60s
- Cold: sync every ~10m
- Frozen: sync every ~1h
- Glacial: sync every ~24h

Fresh writes were replicated aggressively. If a key stopped changing, it naturally cooled into slower and slower replication intervals.

Cross-AZ replication was also batched. Nodes compared age-bucketed Merkle summaries first, then only transferred divergent keys for buckets that didn't match. So the intended effect was:

- fewer x-AZ syncs for cold data
- batched transfers instead of constant per-write chatter
- lower steady-state bandwidth for mostly cooling datasets

It sounded pretty plausible. Which, in distributed systems, is usually when you should get worried.

So I built the thing

This did not stay at the level of a sketch.
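Before going further: the bucketed approximation above can be sketched in a few lines of Rust. This is a minimal illustration, not Nearsight's actual code; the bucket thresholds are my own guesses (the post only gives the sync intervals), and all names are hypothetical.

```rust
use std::time::Duration;

/// Age buckets from the post. Thresholds are illustrative assumptions,
/// not taken from Nearsight's implementation.
#[derive(Debug, PartialEq)]
enum Bucket { Hot, Warm, Cool, Cold, Frozen, Glacial }

/// Map a key's age (time since its last write) to a bucket.
fn bucket_for(age: Duration) -> Bucket {
    match age.as_secs() {
        0..=9        => Bucket::Hot,     // sync every ~1s
        10..=59      => Bucket::Warm,    // sync every ~10s
        60..=599     => Bucket::Cool,    // sync every ~60s
        600..=3599   => Bucket::Cold,    // sync every ~10m
        3600..=86399 => Bucket::Frozen,  // sync every ~1h
        _            => Bucket::Glacial, // sync every ~24h
    }
}

/// Sync interval per bucket, as listed in the post.
fn sync_interval(b: &Bucket) -> Duration {
    match b {
        Bucket::Hot     => Duration::from_secs(1),
        Bucket::Warm    => Duration::from_secs(10),
        Bucket::Cool    => Duration::from_secs(60),
        Bucket::Cold    => Duration::from_secs(600),
        Bucket::Frozen  => Duration::from_secs(3600),
        Bucket::Glacial => Duration::from_secs(86400),
    }
}

fn main() {
    let age = Duration::from_secs(120);
    let b = bucket_for(age);
    println!("{:?} -> sync every {:?}", b, sync_interval(&b));
}
```

The point of the bucketing is that a key's sync cadence only changes at a handful of boundaries, so the scheduler never has to re-evaluate the continuous decay function per key.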
Nearsight ended up with:

- age-bucketed replication
- Merkle-based divergence detection
- WAL, snapshots, tombstones, compaction
- hinted handoff
- topology-aware routing
- quorum/local read and write policies
- QUIC transport
- membership and failure handling
- admin APIs, audit logging, HTTPS/auth
- chaos tooling
- dashboard
- Rust client SDK
- benchmark harnesses and reports

Under the hood, I built it in Rust using in-memory storage with WAL-backed durability, HLC-style timestamps for last-write-wins conflict resolution, per-age-bucket Merkle trees for divergence detection, and QUIC for node-to-node communication. Cross-AZ sync was periodic and batched rather than write-through, with read policies ranging from local-only to quorum.

At some point this stopped being "I'm exploring an idea" and became "I guess I'm building a distributed database now."

The good news

The good news is that the idea was real enough to implement. A lot of distributed systems ideas die the minute they run into deletes, retries, stale reads, compaction, failure recovery, topology, or the fact that production systems are mostly unpleasant corner cases. This one didn't.

The architecture held together. The storage layer worked. The replication logic worked. The system was coherent enough to test, benchmark, and reason about. That alone felt like a meaningful result.

The bad news

The bad news is that "coherent enough to implement" is not the same thing as "actually better". In early, simple benchmark framing, the idea looked promising. But once I put together more structured benchmarking, the story changed. Nearsight did not beat the baseline on bytes sent.

That was the point where the project stopped being "I may have found something" and became "I need to understand why this lost."

What the benchmarks actually said

This is the part that killed the original pitch.
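One architectural detail worth pinning down before the numbers: the HLC-style last-write-wins conflict resolution mentioned above. The sketch below shows the general shape of such a scheme under my own assumptions; the struct fields and function names are hypothetical, not Nearsight's API.

```rust
/// A minimal hybrid-logical-clock timestamp: wall-clock millis plus a
/// logical counter to order events within the same millisecond, plus a
/// node id as a final tie-breaker so the ordering is total.
/// Derived Ord compares fields lexicographically, top to bottom.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    wall_ms: u64,
    logical: u32,
    node_id: u16,
}

#[derive(Clone, Debug, PartialEq)]
struct Versioned {
    value: String,
    ts: Hlc,
}

/// Last-write-wins merge: keep whichever version carries the larger
/// HLC timestamp. Ties are impossible because node_id makes Hlc total.
fn lww_merge(local: Versioned, remote: Versioned) -> Versioned {
    if remote.ts > local.ts { remote } else { local }
}

fn main() {
    let a = Versioned { value: "v1".into(), ts: Hlc { wall_ms: 100, logical: 0, node_id: 1 } };
    let b = Versioned { value: "v2".into(), ts: Hlc { wall_ms: 100, logical: 1, node_id: 2 } };
    println!("winner: {}", lww_merge(a, b).value);
}
```

The appeal of LWW for a system like this is that divergent replicas can merge deterministically without coordination, which is what makes the slow, batched cross-AZ sync schedules viable at all.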
I benchmarked nearsight_age_tiered against three alternatives in a small 2-node, 4-scenario harness: eventual_constant, bounded_staleness, and leader_hot_follower_cold. The main metric was total bytes sent.

- eventual_constant was the baseline: ordinary eventual consistency with a fixed replication schedule, regardless of data age.
- bounded_staleness allowed more drift in exchange for lower traffic.
- leader_hot_follower_cold was a more asymmetric policy that favored hot-path writes and delayed colder follower sync.

Nearsight was supposed to reduce cross-AZ bytes by letting old data cool into slower replication schedules. It didn't. In the recorded smoke benchmark, it used more bytes than the eventual_constant baseline in every scenario. That's the whole problem in one sentence.

The more interesting nuance is that Nearsight still kept stale-read rates very close to the baseline. So it preserved freshness reasonably well, but failed on the thing it was supposed to optimize: bandwidth. By contrast, bounded_staleness actually did save bytes, but with noticeably worse stale-read behavior.

Overall comparison

| Policy | Avg Bytes Sent | Avg Reduction vs eventual_constant | Avg Stale Read Rate | Avg Wal
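The "Avg Reduction vs eventual_constant" column above is presumably the usual relative-savings metric. A quick sketch of how such a figure is computed (my own formulation, with illustrative numbers rather than the post's actual results):

```rust
/// Percent reduction in bytes sent relative to the eventual_constant
/// baseline. A negative value means the policy sent MORE bytes than
/// the baseline, which is what happened to Nearsight.
fn reduction_vs_baseline(policy_bytes: u64, baseline_bytes: u64) -> f64 {
    100.0 * (1.0 - policy_bytes as f64 / baseline_bytes as f64)
}

fn main() {
    // Illustrative numbers only, not measured results.
    // A policy that sent fewer bytes shows a positive reduction...
    println!("{:.1}%", reduction_vs_baseline(80, 100));
    // ...while one that sent more bytes shows a negative one.
    println!("{:.1}%", reduction_vs_baseline(120, 100));
}
```

Under this metric, Nearsight landing below zero in every scenario is exactly the "used more bytes than the baseline" outcome described above.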
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.