Show HN: Stratum – SQL that branches and beats DuckDB on 35/46 1T benchmarks
hackernews
🔬 Research
#duckdb
#review
#sql
#stratum
#data-engineering
#benchmark
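Source: hackernews · Summarized and analyzed by Genesis Park
Summary
Stratum is a next-generation columnar SQL engine that provides Git-like branching and time travel through immutable datasets with structural sharing. It combines SIMD execution via the Java Vector API with metadata-driven query optimization, and it runs as pure JVM code with no native compilation. In a single-threaded benchmark on 10 million rows, it outperformed DuckDB on 35 of 46 queries, by up to 8.5x, and it speaks the PostgreSQL wire protocol, so standard clients work with it.
Body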
Stratum: SQL that branches

A few years ago I hit a wall I suspect many data engineers know. I had a million-row analytical dataset and I wanted to run an experiment: modify a few pricing assumptions, re-run a set of aggregation queries, compare the results against the original. Simple enough - except in a mutable database, “compare against the original” means either keeping a copy of the data or hoping nothing changed. Neither scales.

Datahike solves this for entity-level data. Its storage is EAVT-indexed - like Datomic, tuned for entity traversal and point lookups. That’s the right structure for a system-of-record, but not for scanning 10M rows to compute a GROUP BY with SIMD. Stratum explores the columnar alternative: the same CoW branching semantics, but over column-oriented storage optimized for analytical scans. SQL is the natural interface for this access pattern - something Datahike doesn’t yet have. The longer-term plan is integration: Stratum’s columnar engine and SQL support as a query path within Datahike’s Datalog planner.

The core insight is that a columnar dataset is just a value. Make it immutable with structural sharing and you get git-like semantics for free: fork a dataset in O(1), modify branches independently, time-travel to any snapshot, persist named commits to storage. Then add SIMD execution via the Java Vector API, and it turns out you can beat DuckDB on most single-threaded analytical queries from pure JVM code - no native compilation, no JNI.

The SQL interface

Stratum speaks the PostgreSQL wire protocol.
The quickest entry point is the standalone server:

    java --add-modules jdk.incubator.vector \
         --enable-native-access=ALL-UNNAMED \
         -jar stratum-standalone.jar \
         --index orders:/data/orders.csv

Any PostgreSQL client connects immediately - psql, DBeaver, JDBC, psycopg2:

    psql -h localhost -p 5432 -U stratum

    -- Standard analytical SQL
    SELECT region,
           SUM(amount * discount) AS revenue,
           COUNT(*) AS orders
    FROM orders
    WHERE ship_date BETWEEN '2024-01-01' AND '2024-12-31'
    GROUP BY region
    ORDER BY revenue DESC;

    -- Query CSV and Parquet files inline - auto-indexed on first access
    SELECT payment_type,
           AVG(tip_amount),
           PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY tip_amount)
    FROM read_csv('/data/taxi.csv')
    GROUP BY payment_type;

Full DML: SELECT, INSERT, UPDATE, DELETE, UPSERT (INSERT ON CONFLICT). CTEs, correlated subqueries, window functions (ROW_NUMBER, RANK, LAG, LEAD, running aggregates), joins (INNER/LEFT/RIGHT/FULL with multi-column keys), set operations (UNION/INTERSECT/EXCEPT). Aggregates: SUM, COUNT, AVG, MIN, MAX, STDDEV, VARIANCE, CORR, MEDIAN, PERCENTILE_CONT, APPROX_QUANTILE, COUNT(DISTINCT). CASE WHEN, COALESCE, date functions, LIKE/ILIKE, FILTER clause.

How the engine works

Every column is split into fixed-size chunks. Each chunk carries pre-computed statistics: minimum, maximum, sum, count. This unlocks two significant optimizations.

Zone-map pruning. DuckDB stores only min and max per segment and uses them for predicate filter pushdown, skipping segments that can’t contain rows matching a WHERE clause. Both engines do this. What DuckDB doesn’t pre-compute is per-segment SUM or COUNT, so unfiltered aggregates like COUNT(*), SUM(price), or AVG(price) require a full data scan in DuckDB. In Stratum, these are answered by traversing the pre-computed metadata at tree nodes - no row data touched. SELECT AVG(price) FROM orders on 10M rows: Stratum 0.1ms, DuckDB 7.1ms.
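The per-chunk statistics idea can be illustrated with a minimal Python sketch. This is not Stratum's code: the names `Chunk`, `build_chunks`, `avg_from_metadata`, and `count_greater_than` are hypothetical, the chunks sit in a flat list rather than in tree nodes, and the "every row matches" shortcut is an assumption about how such statistics could be used, not something the post confirms.

```python
from dataclasses import dataclass
from typing import List, Tuple

CHUNK_SIZE = 4  # tiny for illustration; a real engine would use far larger chunks


@dataclass
class Chunk:
    values: List[float]
    lo: float      # minimum value in the chunk
    hi: float      # maximum value in the chunk
    total: float   # sum of the chunk's values
    count: int     # number of rows in the chunk


def build_chunks(column: List[float], chunk_size: int = CHUNK_SIZE) -> List[Chunk]:
    """Split a column into fixed-size chunks and pre-compute their statistics."""
    chunks = []
    for start in range(0, len(column), chunk_size):
        vals = column[start:start + chunk_size]
        chunks.append(Chunk(vals, min(vals), max(vals), sum(vals), len(vals)))
    return chunks


def avg_from_metadata(chunks: List[Chunk]) -> float:
    """Unfiltered AVG: combine per-chunk sum and count, never reading row values."""
    return sum(c.total for c in chunks) / sum(c.count for c in chunks)


def count_greater_than(chunks: List[Chunk], threshold: float) -> Tuple[int, int]:
    """COUNT(*) WHERE value > threshold; returns (matches, chunks actually scanned)."""
    matched = 0
    scanned = 0
    for c in chunks:
        if c.hi <= threshold:
            continue                 # zone-map pruning: no row in this chunk can match
        if c.lo > threshold:
            matched += c.count       # assumed shortcut: every row in this chunk matches
            continue
        scanned += 1                 # only mixed chunks need a row-level scan
        matched += sum(1 for v in c.values if v > threshold)
    return matched, scanned


prices = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 5.0, 20.0, 6.0, 7.0]
chunks = build_chunks(prices)
print(avg_from_metadata(chunks))
print(count_greater_than(chunks, 8.0))
```

With three chunks of four rows, the filtered count touches row data in only one chunk, and the unfiltered average touches none.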
Fused SIMD execution. Most columnar engines evaluate predicates in one pass, then apply the result mask during a separate aggregation pass. Stratum fuses these into a single loop: predicates and accumulation run simultaneously via Java Vector API VectorMask chains, processing four doubles or longs per SIMD cycle. No intermediate arrays, no second pass, no extra allocation.

The Vector API (JDK 21+) provides DoubleVector and LongVector operations backed by AVX-512 on x86 and SVE on ARM. The bet was that the JVM incubator API had matured enough to compete with native code on analytical workloads without the deployment complexity of a native library. The benchmarks suggest that bet paid off.

Performance

Single-threaded comparison vs DuckDB v1.4.4 (JDBC in-process) on 10M rows, Intel Core Ultra 7 258V, JVM 25. Median of 10 iterations, 5 warmup:

| Query | Stratum | DuckDB | Ratio |
|---|---|---|---|
| TPC-H Q6 (filter + sum-product) | 13ms | 28ms | 2.2x faster |
| Filtered COUNT (NEQ pred) | 3ms | 12ms | 4.0x faster |
| TPC-H Q1 (7 aggs, 4 groups) | 75ms | 93ms | 1.2x faster |
| H2O Q3 (100K string groups) | 71ms | 362ms | 5.1x faster |
| H2O Q10 (10M groups, 6 cols) | 832ms | 7056ms | 8.5x faster |
| LIKE '%search%' | 47ms | 240ms | 5.1x faster |
| AVG(LENGTH(URL)) | 38ms | 170ms | 4.5x faster |
| H2O Q6 (STDDEV group-by) | 30ms | 81ms | 2.7x faster |
| H2O Q9 (CORR) | 61ms | 134ms | 2.2x faster |
| MEDIAN(price) | 68ms | 158ms | 2.3x faster |
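
The difference between a two-pass and a fused filter-plus-aggregate can be shown with a small scalar sketch. It is plain Python with no SIMD and does not use the Java Vector API; the function names and the `quantity < max_qty` predicate are hypothetical, chosen only to resemble a filter followed by a sum-product.

```python
from typing import List


def two_pass_sum_product(price: List[float], discount: List[float],
                         quantity: List[int], max_qty: int) -> float:
    """Pass 1 materializes a boolean mask; pass 2 aggregates under that mask."""
    mask = [q < max_qty for q in quantity]           # intermediate array
    return sum(p * d for p, d, m in zip(price, discount, mask) if m)


def fused_sum_product(price: List[float], discount: List[float],
                      quantity: List[int], max_qty: int) -> float:
    """One loop: the predicate and the accumulation happen per row, no mask array."""
    total = 0.0
    for p, d, q in zip(price, discount, quantity):
        if q < max_qty:
            total += p * d
    return total


price = [10.0, 20.0, 30.0, 40.0]
discount = [0.5, 0.25, 0.5, 0.25]
quantity = [1, 30, 5, 50]

print(two_pass_sum_product(price, discount, quantity, 24))
print(fused_sum_product(price, discount, quantity, 24))
```

Both functions return the same result; the fused version simply never allocates the mask. In a SIMD engine the per-row `if` would instead be a lane mask applied to a vector accumulate.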
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.