hackernews
🔬 Research
#bm25
#review
#sift
#search
#local search
#vector search
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
'Sift', a high-performance hybrid search CLI tool for local codebases and documents, has been released as a single Rust binary. The tool runs without a daemon, supports query expansion and fusion of multiple retrieval algorithms, and uses a caching layer inspired by Zig's build system to eliminate redundant computation on unchanged files. In benchmarks it recorded an nDCG@10 of 0.826 on the SciFact dataset, and when latency matters, BM25 mode delivers very fast retrieval at around 5 ms.
Full text
Search · 7 min read

Sift: Local Hybrid Search Without the Infrastructure Tax

sift is a daemonless, single-binary local search tool that runs a full BM25 + vector + reranking pipeline on demand: no indexing service, no cloud dependency. It is built for agents and developers who need fast, repeatable retrieval over real corpora.

2026-03-10

sift is a local Rust CLI for document retrieval. Point it at a directory, ask a question, and it runs a full hybrid search pipeline (BM25, dense vector, fusion, optional reranking) and returns ranked results. No daemon. No background indexer. No cloud. One binary. It's built for agents and developers who need reliable, repeatable search over raw codebases, docs, and mixed-format corpora without spinning up infrastructure to get there. You can install it now on Mac, Windows & Linux. Source and build notes are at github.com/rupurt/sift.

The retrieval pipeline

Every query runs through four stages:

- Expansion: query variants are generated to broaden recall before retrieval begins.
- Retrieval: BM25 (keyword), phrase match, and dense vector retrieval run against the corpus. Each method captures different signal.
- Fusion: results are merged using Reciprocal Rank Fusion (RRF), balancing signal across retrieval methods without manual weight tuning (a minimal sketch follows below).
- Reranking: optional local LLM reranking via Qwen applies semantic disambiguation on the fused candidate set.

Each stage is independently tunable. If you only need BM25 speed, skip the vector pass and the reranker. If you want best-possible precision, run the full stack.
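RRF is simple enough to show inline: each retrieval method contributes 1/(k + rank) for every document it returns, and contributions are summed across methods. The sketch below is a minimal illustration of that formula, not sift's actual code; the function name is hypothetical and k = 60 is the constant commonly used for RRF in the literature.

```rust
use std::collections::HashMap;

/// Minimal Reciprocal Rank Fusion: merge best-first ranked lists from
/// several retrieval methods into a single fused ranking.
/// `k` damps the influence of lower-ranked hits; 60 is the usual choice.
fn reciprocal_rank_fusion(ranked_lists: &[Vec<String>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (rank, doc_id) in list.iter().enumerate() {
            // `rank` is 0-based, so `rank + 1` is the 1-based position.
            *scores.entry(doc_id.clone()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // Hypothetical top results from the BM25 and vector passes for one query.
    let bm25: Vec<String> = vec!["doc_a".into(), "doc_b".into(), "doc_c".into()];
    let vector: Vec<String> = vec!["doc_b".into(), "doc_a".into(), "doc_d".into()];
    for (doc, score) in reciprocal_rank_fusion(&[bm25, vector], 60.0) {
        println!("{doc}: {score:.5}");
    }
}
```

Because only ranks matter, RRF never has to reconcile BM25's unbounded scores with bounded vector similarities, which is why it works without manual weight tuning.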
Architecture

The implementation is split into domain and adapters: domain objects model search plans, candidates, and scoring outputs; adapters implement the concrete BM25, phrase, vector, and reranking backends. A shared search service executes the same strategy model for CLI, benchmark, and eval flows, so nothing changes between a dev run and a CI eval pass.

Performance is local-first by design:

- SIMD-accelerated dot product for vector scoring on CPU-heavy workloads (a minimal sketch follows below).
- Zig-inspired incremental cache: a two-layer design borrowed from Zig's build system. A manifest store tracks filesystem metadata (inode, mtime, size) mapped to BLAKE3 content hashes, so sift knows exactly which files have changed without re-reading them. A content-addressable blob store holds pre-extracted text, pre-computed BM25 term frequencies, and pre-embedded dense vectors, meaning repeat queries never touch the neural network at all. Identical files across different projects share a single blob entry. The result: search performance bounded by dot-product speed, not inference latency. (A sketch of the two-layer lookup also follows below.)
- Per-query embedding reuse across multi-stage pipelines.
- Memory-mapped I/O and tight tokenization hot loops to keep latency low on large corpora.

One concrete tradeoff during development: lowering embedding max_length from 48 to 40 recovered latency budget while keeping quality above the BM25 baseline, a good example of how evidence-driven tuning beats guesswork. The full internals are documented in ARCHITECTURE.
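The post does not show the SIMD kernel itself, so the following is only an illustrative stand-in, not sift's implementation: a dot product written with a fixed number of independent partial sums, the standard shape that removes the loop-carried dependency on a single scalar accumulator and lets the compiler auto-vectorize the hot loop.

```rust
/// Dot product over f32 slices, structured for auto-vectorization.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let chunks = a.len() / LANES;
    for i in 0..chunks {
        for l in 0..LANES {
            // Eight independent accumulators: no serial dependency chain,
            // so the compiler can map this onto SIMD lanes.
            acc[l] += a[i * LANES + l] * b[i * LANES + l];
        }
    }
    let mut sum: f32 = acc.iter().sum();
    for i in (chunks * LANES)..a.len() {
        sum += a[i] * b[i]; // scalar tail for leftover elements
    }
    sum
}
```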
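The two-layer cache can be sketched compactly as well. This is a minimal illustration of the manifest-plus-blob-store idea described above, not sift's code: the type names are hypothetical, it assumes the blake3 crate for hashing, and inode tracking is omitted to keep the sketch portable.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Manifest key: cheap filesystem metadata used to detect change without
/// re-reading contents. (sift also tracks the inode; omitted here.)
#[derive(Clone, PartialEq)]
struct FileStamp {
    mtime: SystemTime,
    size: u64,
}

/// Blob payload: everything expensive to recompute for one unique content.
struct Blob {
    text: String,                       // pre-extracted text
    term_freqs: HashMap<String, u32>,   // pre-computed BM25 term frequencies
    embedding: Vec<f32>,                // pre-embedded dense vector
}

#[derive(Default)]
struct Cache {
    manifest: HashMap<PathBuf, (FileStamp, String)>, // path -> (stamp, content hash)
    blobs: HashMap<String, Blob>,                    // content hash -> blob
}

impl Cache {
    fn get_or_compute(
        &mut self,
        path: &Path,
        compute: impl FnOnce(&[u8]) -> Blob,
    ) -> std::io::Result<&Blob> {
        let meta = fs::metadata(path)?;
        let stamp = FileStamp { mtime: meta.modified()?, size: meta.len() };

        // Layer 1 (manifest): unchanged stamp means the recorded content
        // hash is still valid, so the file is never re-read.
        if let Some((old, hash)) = self.manifest.get(path) {
            if *old == stamp {
                let hash = hash.clone();
                return Ok(&self.blobs[&hash]);
            }
        }

        // Layer 2 (blob store): content-addressed by BLAKE3 hash, so
        // identical files anywhere share one entry and extraction, term
        // counting, and embedding run at most once per unique content.
        let bytes = fs::read(path)?;
        let hash = blake3::hash(&bytes).to_hex().to_string();
        self.blobs.entry(hash.clone()).or_insert_with(|| compute(&bytes));
        self.manifest.insert(path.to_path_buf(), (stamp, hash.clone()));
        Ok(&self.blobs[&hash])
    }
}
```

This is what makes repeat queries cheap: the hot path is a metadata stat plus two hash-map lookups, and the neural network is only ever invoked for genuinely new content.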
Evaluation

The numbers below are from a comparative strategy run over 5,185 SciFact documents (~7.8 MB) on an AMD Ryzen Threadripper 3960X. Each strategy is measured on retrieval quality (nDCG@10, MRR@10, Recall@10) and latency (p50, in ms). A few things worth noting from this run:

- BM25 is fastest at 5.41 ms p50: a strong default for latency-constrained cases where keyword recall is sufficient.
- Vector achieves the best nDCG@10 (0.8262) and perfect recall at 25.94 ms: the most balanced strategy for most workloads.
- LLM reranking (page-index-llm, page-index-qwen) matches legacy-hybrid quality at comparable speed, which validates the local Qwen path as a practical alternative to heavier hybrid pipelines.
- page-index-hybrid is the only strategy that underperforms BM25 on nDCG, a useful reminder that adding complexity doesn't always improve quality.

Cache hit rates (100/0/100%) confirm the caching layer is working correctly across all strategies. Verbose output (-v, -vv) surfaces these rates, phase timings, and ranking metadata directly in the CLI: enough observability to diagnose retrieval issues without external tooling.
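For reference, nDCG@10 (the headline quality metric above) discounts each relevant hit by the log of its rank and normalizes by the best achievable ordering. A minimal sketch with binary relevance, written here for illustration rather than taken from sift's eval code:

```rust
use std::collections::HashSet;

/// nDCG@10 with binary relevance: a relevant hit at 0-based rank i
/// contributes 1/log2(i + 2); the ideal DCG packs all relevant docs
/// into the top ranks. Result is in [0, 1].
fn ndcg_at_10(ranked: &[&str], relevant: &HashSet<&str>) -> f64 {
    let discount = |i: usize| 1.0 / ((i as f64 + 2.0).log2());
    let dcg: f64 = ranked
        .iter()
        .take(10)
        .enumerate()
        .filter(|(_, id)| relevant.contains(**id))
        .map(|(i, _)| discount(i))
        .sum();
    let idcg: f64 = (0..relevant.len().min(10)).map(discount).sum();
    if idcg == 0.0 { 0.0 } else { dcg / idcg }
}

fn main() {
    // Hypothetical query: two relevant docs, one ranked first, one fourth.
    let relevant: HashSet<&str> = ["d2", "d5"].into_iter().collect();
    let ranked = ["d2", "d9", "d7", "d5", "d1"];
    println!("nDCG@10 = {:.4}", ndcg_at_10(&ranked, &relevant));
}
```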
Why this matters for agents

For agents, latency and reliability are requirements, not nice-to-haves. Tooling loops fail hard when search is slow, drops context, or depends on services that may be unavailable. sift removes that friction: retrieval is local, deterministic, and cheap to repeat. No daemon to health-check. No embedding service to rate-limit against. No cloud dependency to manage. The binary is self-contained and ships with release automation for Homebrew and a static Linux artifact path, so agents can rely on a pinned version without environment drift.

It is opinionated in the right direction: one binary, local-first by default, with explicit quality and evidence guardrails around every architectural change. That combination is rare in search tooling and unusually practical for day-to-day use.

This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.