Show HN: A formally verified leaderless log protocol for Kafka
Original source: Hacker News · Summarized and analyzed by Genesis Park

Summary

The leaderless log protocol used in 'Ursa', StreamNative's lakehouse-native streaming engine, has been published. The protocol is designed so that stateless workers can append data concurrently, with no leader election and no embedded consensus algorithm: all coordination is delegated to an external linearizable store. It has been formally verified with TLA+ and Fizzbee, satisfying both safety and liveness properties, and the repository includes an example message queue that runs on S3-compatible storage alone, with no external coordination store.

Full text
A formally verified distributed append-only log protocol with concurrent writers, compaction, and readers — built on the coordination-delegated pattern, where stateless workers offload all coordination to an external linearizable store.

**Leaderless Log Protocol**

The Leaderless Log Protocol implements a distributed append-only log where any writer can append without being elected leader. There is no Raft, no Paxos, no leader election. Instead, writers are stateless processes that delegate ordering and index management to an external linearizable coordination store:

- Monotonic offsets are assigned via the store's AtomicIncrement (a sequence counter)
- Log index is maintained as linearizable key-value entries with atomic CAS
- Compaction rewrites ranges of WAL entries into compacted files via a 3-step CAS update
- Fencing prevents stale writers from corrupting the log after a takeover

The protocol models three interacting sub-protocols — writer append path, compaction index update, and reader path — and is verified in both TLA+ and Fizzbee against safety and liveness properties (see Expected Results). Full canonical specification: `1-leaderless-log-protocol.md`.

**Ursa**

The Leaderless Log Protocol formalized in this repo is the protocol used in Ursa, the lakehouse-native streaming engine from StreamNative. The formal spec here is the distilled, model-checked core of the protocol that Ursa runs in production.

- VLDB 2025 paper — Ursa: A Lakehouse-Native Streaming System: https://www.vldb.org/pvldb/vol18/p5184-guo.pdf
- Ursa for Kafka (engineering blog): https://streamnative.io/blog/ursa-for-kafka-native-apache-kafka-service-on-lakestream

**S3-Queue**

S3-Queue is a distributed message queue built entirely on S3-compatible object storage — no external coordination store required. It instantiates the Leaderless Log Protocol with S3 conditional writes as the coordination primitive.

| | |
|---|---|
| Protocol | Leaderless Log |
| System SPEC | `examples/s3-queue/SPEC.md` — 1,100+ lines covering the full design |
| Implementation | Rust CLI (`s3q`) — generated by a coding agent from the SPEC alone |

Try it yourself — tell your favorite coding agent:

```
Implement S3-Queue according to the following spec:
https://github.com/lakestream-io/leaderless-log-protocol/blob/main/examples/s3-queue/SPEC.md
```

See `examples/` for the full list of examples.

**The coordination-delegated pattern**

The Leaderless Log Protocol is one instance of a broader architectural pattern: coordination-delegated distributed systems. In this pattern, worker nodes carry no consensus logic — they are stateless processes that offload all coordination decisions (ordering, locking, leader election, failure detection) to an external linearizable store. The store becomes the single source of truth for shared state, while workers focus purely on data-path operations; a sketch of the resulting append path follows at the end of this section.

```
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Worker A │   │ Worker B │   │ Worker C │
│(stateless)│  │(stateless)│  │(stateless)│
└────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │
     ▼              ▼              ▼
┌──────────────────────────────────────────┐
│    Coordination Store (linearizable)     │
│  CAS · AtomicIncrement · Ephemeral keys  │
└──────────────────────────────────────────┘
```

Unlike embedded consensus (Raft, Paxos), where every node participates in the consensus protocol, coordination-delegated systems separate the coordination plane from the data plane entirely. See `0-coordination-delegated-pattern.md` for the full treatment of the pattern, its primitives, and its limitations.
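To make the append path concrete, here is a minimal, hedged Rust sketch of a coordination-delegated writer. This is not code from the repo: `CoordinationStore`, `MemStore`, the key names, and the error handling are illustrative assumptions, and the in-memory store merely stands in for a real linearizable service.

```rust
use std::collections::BTreeMap;

/// The only primitives the writer needs from the coordination plane.
/// (Illustrative trait; a real backend must be linearizable.)
trait CoordinationStore {
    /// The protocol's AtomicIncrement: a linearizable sequence counter.
    fn atomic_increment(&mut self, key: &str) -> u64;
    /// Atomic compare-and-swap on a key-value entry.
    fn cas(&mut self, key: &str, expected: Option<&str>, new: &str) -> bool;
    fn get(&self, key: &str) -> Option<String>;
}

/// Toy in-memory stand-in so the sketch runs end to end.
struct MemStore {
    counters: BTreeMap<String, u64>,
    kv: BTreeMap<String, String>,
}

impl CoordinationStore for MemStore {
    fn atomic_increment(&mut self, key: &str) -> u64 {
        let c = self.counters.entry(key.to_string()).or_insert(0);
        *c += 1;
        *c
    }
    fn cas(&mut self, key: &str, expected: Option<&str>, new: &str) -> bool {
        if self.kv.get(key).map(String::as_str) == expected {
            self.kv.insert(key.to_string(), new.to_string());
            true
        } else {
            false
        }
    }
    fn get(&self, key: &str) -> Option<String> {
        self.kv.get(key).cloned()
    }
}

/// Stateless writer append: check the fence, claim an offset, publish the
/// index entry with CAS. No leader election, no local state across calls.
fn append(store: &mut impl CoordinationStore, payload: &str) -> Result<u64, &'static str> {
    if store.get("log/state").as_deref() == Some("FENCED") {
        return Err("log is fenced: stale writers may not append");
    }
    // (The WAL write to object storage would happen here, before the claim.)
    let offset = store.atomic_increment("log/seq");
    let key = format!("log/index/{offset}");
    if store.cas(&key, None, payload) {
        Ok(offset)
    } else {
        Err("index entry already present")
    }
}

fn main() {
    let mut store = MemStore { counters: BTreeMap::new(), kv: BTreeMap::new() };
    // Two "concurrent" stateless writers appending with no leader:
    let a = append(&mut store, "batch-from-writer-A").expect("append A");
    let b = append(&mut store, "batch-from-writer-B").expect("append B");
    assert!(b > a); // MonotonicOffsets: offsets only move forward
    println!("appended at offsets {a} and {b}");
}
```

Because the writer holds no state between calls, everything needed to order appends lives in the store, which is what lets any number of writers append concurrently without a leader.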
This repo formalizes a family of coordination-delegated protocols. The Leaderless Log Protocol is the headliner; additional protocols share the same pattern and coordination primitives.

| Layer | Document | Description |
|---|---|---|
| 0 | `0-coordination-delegated-pattern.md` | The foundational pattern: abstract coordination primitives, properties, limitations, and reference implementations |
| 1 | `1-leaderless-log-protocol.md` | Leaderless Log — distributed append-only log with concurrent writers, compaction, and readers |
| 2 | `2-coordination-delegated-task-claiming.md` | Task Claiming — distributed task claiming with ephemeral locks and crash recovery |

Protocols 1 and 2 are independent and composable. Each instantiates the Layer 0 pattern with a specific subset of coordination primitives (the sketch after this section shows one such instantiation: CAS built from S3-style conditional writes).

**From SPEC to implementation**

The real value of formally verified protocols: you can write a system SPEC that instantiates a protocol for a specific system, hand that SPEC to a coding agent, and get a correct implementation in one shot.

```
┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────┐   ┌──────────────────┐
│   Protocol SPEC     │   │    System SPEC      │   │   Coding    │   │     Working      │
│   (formally         │ → │ (language-agnostic, │ → │   Agent     │ → │  Implementation  │
│    verified)        │   │  self-contained)    │   │             │   │                  │
└─────────────────────┘   └─────────────────────┘   └─────────────┘   └──────────────────┘
```

The protocol SPEC guarantees correctness properties (safety, liveness) via model checking. The system SPEC translates those guarantees into a concrete design — domain model, state machines, coordination primitives, pseudocode, error handling, and a test matrix. Together, they give a coding agent everything it needs to produce a correct implementation without back-and-forth.

**Repository layout**

```
├── 0-coordination-delegated-pattern.md        -- Layer 0: The pattern
├── 1-leaderless-log-protocol.md               -- Layer 1: Canonical spec (Leaderless Log)
├── 2-coordination-delegated-task-claiming.md  -- Layer 2: Canonical spec (Task Claiming)
├── README.md                                  -- This file
├── tlaplus/
│   ├── LeaderlessLog.tla            -- TLA+ implementation of Protocol 1
│   ├── LeaderlessLog.cfg            -- Small TLC config (fast, ~10^5 states)
│   ├── LeaderlessLog-medium.cfg     -- Medium TLC config (thorough, ~10^6-7 states)
│   ├── TaskClaiming.tla             -- TLA+ implementation of Protocol 2
│   ├── TaskClaiming.cfg             -- Small TLC config
│   └── TaskClaiming-medium.cfg      -- Medium TLC config
├── fizzbee/
│   ├── LeaderlessLog.fizz           -- Fizzbee implementation of Protocol 1
│   └── TaskClaiming.fizz            -- Fizzbee implementation of Protocol 2
└── examples/
    └── s3-queue/                    -- Example: S3-Queue (spec + Rust impl)
```

The canonical SPEC docs (Markdown) define each protocol precisely. Both TLA+ and Fizzbee are independent implementations of the same spec. The spec doc is the source of truth:

- If TLA+ and Fizzbee disagree on a property verdict, the spec doc resolves which is wrong
- Every state variable, action, and property in the spec has a corresponding construct in both TLA+ and Fizzbee
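As promised above, here is a hedged sketch of how a CAS coordination primitive can be built from S3-style conditional writes, the primitive the S3-Queue example relies on. `ObjectStore`, `Etag`, and `Precondition` are illustrative stand-ins rather than the repo's code or a real SDK; the `If-None-Match: *` (create only if absent) and `If-Match: <etag>` (replace only if unchanged) semantics mirror what S3-compatible stores expose.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq)]
struct Etag(u64);

/// The two preconditions that turn a plain PUT into a CAS.
enum Precondition {
    IfNoneMatchAny, // succeed only if the key does not exist yet
    IfMatch(Etag),  // succeed only if the stored etag still matches
}

/// In-memory model of an S3-compatible bucket (illustrative only).
struct ObjectStore {
    objects: HashMap<String, (Etag, Vec<u8>)>,
    next_etag: u64,
}

impl ObjectStore {
    /// Conditional PUT: the store evaluates the precondition atomically,
    /// which is exactly the linearizable step a CAS needs.
    fn put(&mut self, key: &str, body: &[u8], pre: Precondition) -> Result<Etag, &'static str> {
        let ok = match (&pre, self.objects.get(key)) {
            (Precondition::IfNoneMatchAny, None) => true,
            (Precondition::IfMatch(e), Some((cur, _))) => e == cur,
            _ => false,
        };
        if !ok {
            return Err("412 Precondition Failed"); // the loser of the race
        }
        self.next_etag += 1;
        let etag = Etag(self.next_etag);
        self.objects.insert(key.to_string(), (etag.clone(), body.to_vec()));
        Ok(etag)
    }
}

fn main() {
    let mut s3 = ObjectStore { objects: HashMap::new(), next_etag: 0 };
    // Writer A creates the index object; writer B loses the create race.
    let etag = s3.put("queue/index", b"v1", Precondition::IfNoneMatchAny).unwrap();
    assert!(s3.put("queue/index", b"v1-dup", Precondition::IfNoneMatchAny).is_err());
    // CAS-style update: only valid against the current etag.
    assert!(s3.put("queue/index", b"v2", Precondition::IfMatch(etag)).is_ok());
}
```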
**Running the TLA+ models**

Prerequisites: Java 17+ and TLA+ tools (tla2tools.jar). Quick setup: `make tlaplus-install`

Leaderless Log Protocol:

```sh
# Small config — fast checking (~10^5 states)
cd tlaplus
java -jar tla2tools.jar -config LeaderlessLog.cfg LeaderlessLog.tla

# Medium config — thorough checking (~10^6-7 states)
java -jar tla2tools.jar -config LeaderlessLog-medium.cfg LeaderlessLog.tla
```

Task Claiming Protocol:

```sh
# Small config
cd tlaplus
java -jar tla2tools.jar -config TaskClaiming.cfg TaskClaiming.tla

# Medium config
java -jar tla2tools.jar -config TaskClaiming-medium.cfg TaskClaiming.tla
```

Leaderless Log model parameters:

| Parameter | Small | Medium | Description |
|---|---|---|---|
| Writers | {w1, w2} | {w1, w2, w3} | Writer processes |
| MaxOffset | 5 | 8 | Offset upper bound |
| CompactRangeStart | 1 | 1 | Compaction range start |
| CompactRangeEnd | 3 | 5 | Compaction range end |
| MaxBatch | 2 | 3 | Max records per write batch |

The MaxBatch parameter controls multi-record entry modeling:

- MaxBatch = 1: Dense index (every offset has an entry) — equivalent to the original 1:1 model
- MaxBatch > 1: Sparse index — writes non-deterministically choose batch sizes in 1..MaxBatch, producing entries that cover multiple offsets (see the covering-entry sketch after the expected-results table below)

Note: With MaxBatch > 1, some write sequences produce entries that don't align with CompactRangeEnd. For example, with MaxBatch=2 and CompactRangeEnd=3: writes of [1,2], [3,4] produce entries at offsets 2 and 4 — offset 3 has no entry boundary. The CompactStart guard naturally handles this by only firing when all entries are fully contained in the range.

Task Claiming model parameters:

| Parameter | Small | Medium | Description |
|---|---|---|---|
| Workers | {w1, w2, w3} | {w1, w2, w3, w4} | Worker processes |
| Tasks | {t1, t2} | {t1, t2, t3} | Tasks |
| MaxFailures | 2 | 3 | Max failures before DLQ |

**Running the Fizzbee models**

```sh
# Install Fizzbee CLI locally (one-time setup)
make fizzbee-install

# Run Fizzbee model checker on all specs
make fizzbee

# Or run individual specs directly
cd fizzbee && ~/.local/fizzbee/fizz LeaderlessLog.fizz
cd fizzbee && ~/.local/fizzbee/fizz TaskClaiming.fizz
```

You can also upload `.fizz` files to fizzbee.io for browser-based checking.

**Expected results — Protocol 1 (Leaderless Log)**

| Property | Type | TLA+ | Fizzbee | Notes |
|---|---|---|---|---|
| TypeOK | Invariant | PASS | — | TLA+ only |
| MonotonicOffsets | Safety | PASS | PASS | SequenceCounterPositive in TLA+ |
| FencedRejectsAppends | Safety | PASS | — | Uses ENABLED; omitted from Fizzbee |
| CompactionPreservesData | Safety | PASS | PASS | |
| NoPhantomEntries | Safety | PASS | PASS | |
| CursorConsistency | Safety | PASS | PASS | |
| NoOverlappingRanges | Safety | PASS | PASS | WAL-WAL only; WAL-COMPACTED overlap allowed |
| NoReaderError | Safety | PASS | PASS | |
| SequentialCompactionSafety | Safety | PASS | PASS | Compositionality: round 2 preserves round 1 |
| AppendProgress | Liveness | PASS | — | Requires fairness (TLA+ only) |
| CompactionCompletes | Liveness | PASS | — | Requires fairness (TLA+ only) |
| ReaderEventuallySucceeds | Liveness | PASS | — | Vacuously true; retained as regression guard |

All properties pass. The ReadEntry action correctly models that both WAL and COMPACTED entries are readable — compaction reorganizes data into compacted files but does not delete it. A reader implementation dispatches on entry file type to route reads to the appropriate storage backend.
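To make the sparse-index modeling concrete, here is a hedged Rust sketch in the spirit of the spec's HasCoveringEntry and CoveringEntryOffset helpers; the names and data layout are illustrative, not the repo's s3q code. Keying each entry by the last offset it covers turns the lookup into a single ordered-map range query.

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// Sparse log index: each entry covers a contiguous offset range and is
/// keyed by the *last* offset it covers (so batch [3,4] lives at key 4).
struct SparseIndex {
    entries: BTreeMap<u64, RangeInclusive<u64>>,
}

impl SparseIndex {
    /// Entry key covering `off`, in the spirit of CoveringEntryOffset:
    /// take the first entry keyed at or after `off`, then check that its
    /// range actually contains `off` (it may start later).
    fn covering_entry_offset(&self, off: u64) -> Option<u64> {
        self.entries
            .range(off..)
            .next()
            .filter(|(_, range)| range.contains(&off))
            .map(|(&key, _)| key)
    }

    /// In the spirit of the spec's HasCoveringEntry helper.
    fn has_covering_entry(&self, off: u64) -> bool {
        self.covering_entry_offset(off).is_some()
    }
}

fn main() {
    // With MaxBatch = 2, writes of [1,2] and [3,4] produce entries keyed
    // at offsets 2 and 4 — the example from the note above.
    let mut idx = SparseIndex { entries: BTreeMap::new() };
    idx.entries.insert(2, 1..=2);
    idx.entries.insert(4, 3..=4);
    assert_eq!(idx.covering_entry_offset(3), Some(4)); // 3 sits inside [3,4]
    assert!(idx.has_covering_entry(1));
    assert!(!idx.has_covering_entry(5)); // nothing written past offset 4
}
```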
**Expected results — Protocol 2 (Task Claiming)**

| Property | Type | Expected |
|---|---|---|
| MutualExclusion | Safety | PASS |
| NoDoubleExecution | Safety | PASS |
| DLQOnlyAfterMaxFailures | Safety | PASS |
| LockConsistency | Safety | PASS |
| NoOrphanExecution | Safety | PASS |
| TaskCompletion | Liveness | PASS |
| CrashRecovery | Liveness | PASS |
| NoStarvation | Liveness | PASS |

All properties should pass for Protocol 2 (a sketch of the task lifecycle follows after the mapping tables below).

**Spec-to-model mapping — Protocol 1 (Leaderless Log)**

| Spec Section | TLA+ Construct | Fizzbee Construct |
|---|---|---|
| State: logIndex | VARIABLE logIndex | logIndex = {} |
| State: sequenceCounter | VARIABLE sequenceCounter | sequenceCounter = 1 |
| State: logState | VARIABLE logState | logState = OPEN |
| State: compactionCursor | VARIABLE compactionCursor | compactionCursor = 1 |
| State: writerState | VARIABLE writerState | writerState = {} |
| State: writerBatchSize | VARIABLE writerBatchSize | writerBatchSize = {} |
| State: compactorState | VARIABLE compactorState | compactorState = "IDLE" |
| Action 1: StartAppend | StartAppend(w) | action StartAppend |
| Action 2: WALWrite | WALWriteSuccess(w) / WALWriteFail(w) | action WALWrite (either/or) |
| Action 3: AssignOffset | AssignOffset(w) / AssignOffsetFenced(w) | action AssignOffset |
| Action 4: AppendComplete | AppendComplete(w) | action AppendComplete |
| Actions 5-9: Compaction | CompactStart through CompactReset | action CompactStart through action CompactReset |
| Actions 10-11: Fencing | FenceLog / UnfenceLog | action FenceLog / action UnfenceLog |
| Action 12: ReadEntry | ReadEntry(w, off) | action ReadEntry |
| Helper: HasCoveringEntry | HasCoveringEntry(off) | hasCoveringEntry(off) |
| Helper: CoveringEntryOffset | CoveringEntryOffset(off) | coveringEntryOffset(off) |
| Safety S2 | MonotonicOffsets | always assertion MonotonicOffsets |
| Safety S3 | FencedRejectsAppends | always assertion FencedRejectsAppends |
| Safety S4 | CompactionPreservesData | always assertion CompactionPreservesData |
| Safety S7 | NoOverlappingRanges | always assertion NoOverlappingRanges |
| Safety S8 | NoReaderError | always assertion NoReaderError |
| Safety S9 | SequentialCompactionSafety | always assertion SequentialCompactionSafety |
| Liveness L3 | ReaderEventuallySucceeds | — (liveness requires fairness; TLA+ only) |

**Spec-to-model mapping — Protocol 2 (Task Claiming)**

| Spec Section | TLA+ Construct | Fizzbee Construct |
|---|---|---|
| State: taskStatus | VARIABLE taskStatus | taskStatus = {} |
| State: lockState | VARIABLE lockState | lockState = {} |
| State: workerAlive | VARIABLE workerAlive | workerAlive = {} |
| State: workerState | VARIABLE workerState | workerState = {} |
| Action 1: ScanTasks | ScanTasks(w, t) | action ScanTasks |
| Action 2: TryLockTask | TryLockSuccess(w) / TryLockFail(w) | action TryLockSuccess / action TryLockFail |
| Action 3: ExecuteTask | ExecuteTaskSuccess(w) / ExecuteTaskFailure(w) | action ExecuteTask (either/or) |
| Action 4: UnlockTask | UnlockTask(w) | action UnlockTask |
| Actions 5-7: Crash/Recovery | WorkerCrash / SessionExpiry / WorkerRecover | Same action names |
| Safety S1 | MutualExclusion | always assertion MutualExclusion |
| Safety S2 | NoDoubleExecution | always assertion NoDoubleExecution |
| Safety S4 | LockConsistency | always assertion LockConsistency |
| Liveness L1 | TaskCompletion | always eventually assertion TaskCompletion |
| Liveness L2 | CrashRecovery | always eventually assertion CrashRecovery |

**Cross-checking the two model checkers**

After running both TLA+ and Fizzbee on both protocols:

- Confirm all property verdicts agree between TLA+ and Fizzbee
- All safety properties must pass in both protocols and both tools
- All liveness properties must pass in both protocols and both tools

If results disagree, consult the canonical spec doc to determine which implementation has the bug.
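To tie the Protocol 2 mapping together, here is a hedged Rust sketch of the claim, execute, unlock lifecycle with the DLQOnlyAfterMaxFailures rule. The `Coordinator` type and its in-memory maps are illustrative assumptions standing in for ephemeral locks in a real coordination store, which would also release a crashed worker's locks on session expiry (WorkerCrash / SessionExpiry are not modeled here).

```rust
use std::collections::HashMap;

const MAX_FAILURES: u32 = 2; // matches the small TLC config's MaxFailures

#[derive(Debug, PartialEq)]
enum TaskStatus { Pending, Done, DeadLetter }

/// Stand-in for the coordination store's lock and status namespaces.
struct Coordinator {
    locks: HashMap<String, String>, // task -> worker currently holding it
    status: HashMap<String, TaskStatus>,
    failures: HashMap<String, u32>,
}

impl Coordinator {
    /// TryLockTask: atomic "insert if absent", so exactly one worker wins.
    fn try_lock(&mut self, task: &str, worker: &str) -> bool {
        if self.locks.contains_key(task) {
            return false; // TryLockFail: another worker holds the lock
        }
        self.locks.insert(task.to_string(), worker.to_string());
        true
    }

    /// ExecuteTask outcome + UnlockTask: release the lock, record the
    /// result, and dead-letter only once failures reach MAX_FAILURES.
    fn finish(&mut self, task: &str, succeeded: bool) {
        self.locks.remove(task);
        if succeeded {
            self.status.insert(task.to_string(), TaskStatus::Done);
            return;
        }
        let f = self.failures.entry(task.to_string()).or_insert(0);
        *f += 1;
        if *f >= MAX_FAILURES {
            self.status.insert(task.to_string(), TaskStatus::DeadLetter);
        }
    }
}

fn main() {
    let mut c = Coordinator {
        locks: HashMap::new(),
        status: HashMap::new(),
        failures: HashMap::new(),
    };
    c.status.insert("t1".into(), TaskStatus::Pending);

    assert!(c.try_lock("t1", "w1"));
    assert!(!c.try_lock("t1", "w2")); // MutualExclusion: one claimant at a time
    c.finish("t1", false);            // failure 1 of 2: task stays retryable
    assert!(c.try_lock("t1", "w2"));  // lock released, another worker may claim
    c.finish("t1", false);            // failure 2: DLQOnlyAfterMaxFailures fires
    assert_eq!(c.status["t1"], TaskStatus::DeadLetter);
}
```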
**License**

This project is licensed under the Apache License, Version 2.0 — see the LICENSE file for the full text. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
This analysis was written by the Genesis Park editorial team using AI. The original post is available via the source link above.