HN 표시: ACORN-1을 사용하여 사전 필터링된 HNSW를 위한 DuckDB 커뮤니티 확장

hackernews | | 📦 오픈소스
#acorn-1 #duckdb #hnsw #review #vector search #vss
원문 출처: hackernews · Genesis Park에서 요약 및 분석

요약

이 확장 프로그램은 DuckDB의 HNSW 벡터 검색 기능을 개선하여 WHERE 절 필터 조건이 포함된 쿼리에서도 원하는 개수의 결과를 정확히 반환하도록 돕습니다. 기존에는 필터링이 인덱스 검색 이후에 수행되어 결과가 누락되는 문제가 있었으나, ACORN-1 알고리즘을 통해 필터 조건을 HNSW 그래프 탐색 과정에 통합했습니다. 또한 선택도(Selectivity)에 따라 검색 전략을 자동으로 전환하여 쿼리 성능과 정확도를 최적화하며, 특별한 문법 없이 기존 쿼리 방식 그대로 사용 가능합니다.

본문

A DuckDB extension for vector similarity search with filtered HNSW (ACORN-1) and RaBitQ binary quantization. Fork of duckdb/duckdb-vss. The upstream duckdb-vss extension has two limitations: - Filtered search is broken. WHERE clauses are applied after the HNSW index returns results, soSELECT ... WHERE category = 'X' ORDER BY distance LIMIT 10 often returns fewer than 10 rows. - No vector compression. Every vector is stored as full F32, so the index memory scales linearly with dimensions. This fork fixes both: - ACORN-1 filtered search pushes filter predicates into HNSW graph traversal, ensuring filtered queries return the correct number of results. - RaBitQ quantization compresses vectors to 1 bit per dimension (~21x memory reduction at 128 dims) with a rescore phase that preserves result quality. -- Create a table with vectors CREATE TABLE items AS SELECT i AS id, array_value(random(), random(), random())::FLOAT[3] AS vec, (i % 5) AS category FROM range(10000) t(i); -- Standard HNSW index CREATE INDEX idx ON items USING HNSW (vec); -- Nearest neighbor search SELECT * FROM items ORDER BY array_distance(vec, [0.5, 0.5, 0.5]::FLOAT[3]) LIMIT 10; -- Filtered search (returns exactly 10 results matching the filter) SELECT * FROM items WHERE category = 1 ORDER BY array_distance(vec, [0.5, 0.5, 0.5]::FLOAT[3]) LIMIT 10; -- Metadata join (vectors + metadata in separate tables) CREATE TABLE metadata AS SELECT i AS id, (i % 5) AS category, 'item_' || i AS name FROM range(10000) t(i); SELECT m.name FROM items JOIN metadata m ON items.id = m.id WHERE m.category = 2 ORDER BY array_distance(vec, [0.5, 0.5, 0.5]::FLOAT[3]) LIMIT 10; -- Per-group nearest neighbors SELECT category, min_by(id, array_distance(vec, [0.5, 0.5, 0.5]::FLOAT[3]), 3) FROM items GROUP BY category; No special syntax — the optimizer detects these patterns and uses filtered HNSW search automatically. For large-dimension vectors (128+), RaBitQ dramatically reduces index memory: CREATE INDEX idx ON items USING HNSW (vec) WITH (quantization = 'rabitq'); Everything else works identically — queries, filters, persistence. The index stores binary-quantized vectors and rescores candidates against the original F32 vectors for exact ranking. | Metric | 128 dims | 256 dims | 768 dims | |---|---|---|---| | F32 bytes/vec | 512 | 1024 | 3072 | | RaBitQ bytes/vec | 24 | 40 | 104 | | Compression | 21x | 26x | 30x | -- Oversample factor: search N*oversample candidates, rescore to top-N (default 3) SET hnsw_rabitq_oversample = 10; Higher oversample = better recall, slightly slower queries. | Method | Recall@10 | Vec Memory | Compression | |---|---|---|---| | HNSW | 66.7% | 5000 KB | — | | RaBitQ 3x | 66.7% | 234 KB | 21.3x | | RaBitQ 10x | 83.3% | 234 KB | 21.3x | RaBitQ 10x achieves higher recall than plain HNSW because the rescore phase reranks with exact distances. Run the benchmark yourself: ./build/release/duckdb 60% | Standard HNSW (post-filter) | | 1–60% | ACORN-1 (two-hop expansion) | | = 8.0 | ~5% | 0/10 | 10/10 | When your vectors and metadata live in separate tables, the optimizer rewrites a standard JOIN into a filtered HNSW search: SELECT m.title, m.genre FROM embeddings e JOIN metadata m ON e.id = m.id WHERE m.genre = 'sci-fi' ORDER BY array_distance(e.vec, [0.5, 0.5, 0.5]::FLOAT[3]) LIMIT 10; The optimizer pre-scans the metadata table to find matching join keys, builds an ACORN-1 filter bitset, and runs a single filtered HNSW search. The JOIN remains in the plan to reattach metadata columns. Zone map pruning skips irrelevant segments during the key lookup. Requirements: - Join key must be BIGINT - The HNSW index must be on the embeddings table (not the metadata table) - Query must have ORDER BY distance LIMIT k Per-group top-K search using standard SQL aggregation: SELECT category, min_by(id, array_distance(vec, [0.5, 0.5, 0.5]::FLOAT[3]), 5) FROM items GROUP BY category; For each distinct group value, the optimizer builds a per-group filter bitset and runs a separate ACORN-1 filtered search. This gives exact per-group recall — no oversampling heuristic, no post-filtering. Works with any group column type (integers, strings), single-column GROUP BY only. Multi-column GROUP BY falls back to sequential scan. | Metric | Option | Function | Operator | |---|---|---|---| | Euclidean (L2sq) | l2sq (default) | array_distance | | | Cosine | cosine | array_cosine_distance | | | Inner product | ip | array_negative_inner_product | — | CREATE INDEX my_idx ON items USING HNSW (vec) WITH (metric = 'cosine'); CREATE INDEX my_idx ON items USING HNSW (vec) WITH (metric = 'cosine', quantization = 'rabitq'); CREATE INDEX idx ON items USING HNSW (vec) WITH ( metric = 'l2sq', -- distance metric (l2sq, cosine, ip) quantization = 'rabitq', -- vector quantization (rabitq, none) ef_construction = 200, -- graph construction search width ef_search = 200, -- query-time search width M = 16, -- max connections per node M0 = 32 -- max connections at layer 0 ); Runtime settings: SET hnsw

Genesis Park 편집팀이 AI를 활용하여 작성한 분석입니다. 원문은 출처 링크를 통해 확인할 수 있습니다.

공유

관련 저널 읽기

전체 보기 →