Comparing Methods for Reducing Embedding Size for Vector Search
hackernews
🔬 Research
#review
#search quality
#vector search
#quantization
#embedding
#dimensionality reduction
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
To reduce storage cost and improve computational efficiency in an embedding-based retrieval system, three compression methods were tested: dimensionality reduction, lower-precision storage (fp16), and quantization. The experiments showed that cutting dimensions aggressively degrades retrieval quality significantly, whereas converting to fp16 halves storage while preserving essentially the same behavior as the original vectors. 4-bit quantization also achieved substantial compression while keeping retrieval quality at roughly the 96% level. In short, to cut costs with minimal quality loss, adjusting precision and applying quantization are more effective than naive dimensionality reduction.
Body
An experiment in reducing vector size with dimensionality reduction, lower-precision storage, and quantization while watching what it does to retrieval quality.

The Problem

Embedding-heavy systems hit a simple limit: vectors cost money to store, memory to move, and compute to score. A 512-dimensional embedding in float32 takes 2048 bytes. That is fine at small scale, but once the corpus grows, storage, cache pressure, and memory bandwidth start to matter.

The hard part is that the obvious savings can hurt search quality. If a smaller representation changes nearest neighbors, top-k ranking, or retrieval scores too much, the system gets cheaper and worse at the same time. The goal is to reduce size without meaningfully changing retrieval behavior.

The Main Compression Options

The main options we looked at were:
- Dimensionality reduction: replace the original vector with a smaller one, such as 1536 to 512 or 512 to 128
- Lower-precision storage: keep the same dimensions but store values in fp16 instead of float32
- Quantization: keep the same dimensions but store each value with fewer bits

These approaches make different tradeoffs. Dimensionality reduction removes information from the representation. Lower precision keeps the same shape but reduces storage cost. Quantization also keeps the same dimensionality, but stores a more approximate version of the vector.

Option 1: Dimensionality Reduction

Dimensionality reduction is the most direct way to shrink embeddings because it reduces the number of coordinates outright. You can get there by training a smaller embedding model or by applying a projection step after the fact. PCA is the standard example.

In our case, the results depended a lot on how far we pushed it. Going from 1536 dimensions to 512 held up fairly well. Going from 512 to 128 did not. A naive 512 to 128 reduction dropped nearest-centroid agreement to about 0.7520, which was bad enough to rule it out quickly.

That was the main lesson from this path: the question is not whether fewer dimensions can work at all. The question is how far you can reduce them before retrieval quality falls apart. A future dimensionality-reduction experiment would be to try a learned projection such as PCA instead of naive truncation.

Option 2: fp16

fp16 is the simplest lower-precision option. The dimensionality stays the same, but each value drops from 32 bits to 16 bits. That cuts storage roughly in half. Operationally, it is straightforward: there is no projection step and no change to the basic structure of the vectors. It is just a smaller numeric format.

For us, fp16 preserved behavior extremely well. Nearest-centroid agreement was about 0.9998, which made it an easy win.

Option 3: Quantization

Quantization goes further by storing each value with fewer bits. In our case, we rotate the embedding, quantize each rotated dimension to 4 bits using per-dimension offsets and scales, and then pack the result into bytes. At search time, those packed values are unpacked, dequantized, and rotated back into approximate float vectors before scoring. So the quality question is how much error that process introduces, and how much it changes retrieval behavior.

For us, 4-bit quantization held up reasonably well. Nearest-centroid agreement was about 0.9644. That was meaningfully worse than fp16, but still much better than naive 512 to 128 reduction.
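All three option results above are reported as nearest-centroid agreement. The write-up does not show how that number is computed; a minimal sketch of one way to measure it, assuming hypothetical `baseline` and `approx` arrays (the original float32 vectors and their compress/decompress round trip) and `centroids` fit on the baseline vectors, for example with k-means:

```python
import numpy as np

def nearest_centroid_agreement(baseline, approx, centroids):
    """Fraction of vectors whose nearest centroid stays the same after
    the compress/decompress round trip."""
    def nearest(vectors):
        # squared L2 distance from every vector to every centroid, then argmin
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    return float((nearest(baseline) == nearest(approx)).mean())
```

Read this way, 0.7520 would mean roughly a quarter of vectors land on a different centroid after naive 512 to 128 truncation, while fp16 and 4-bit quantization move almost none and a few percent respectively.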
For our rotated 4-bit quantization setup, fitting means choosing the rotation and estimating the per-dimension offsets and scales that map rotated values into 4-bit bins. Those parameters become the reusable artifact used at write time and search time.

Fitting the quantizer

```python
import numpy as np

train = np.asarray(sample_embeddings, dtype=np.float32)
# helper that builds a random orthogonal matrix (not defined in this excerpt)
rotation = random_orthogonal_matrix(train.shape[1], seed=seed)
rotated = train @ rotation
mins = rotated.min(axis=0)
scales = (rotated.max(axis=0) - mins) / 15.0  # 4 bits -> 16 levels, codes 0..15
```

The important distinction is between fitting and transforming. Fitting learns the reusable compression parameters. Transforming applies them to each embedding you store.

Compressing an embedding

```python
x = np.asarray(embedding, dtype=np.float32)
z = x @ rotation
codes = np.rint((z - mins) / scales).clip(0, 15).astype(np.uint8)
# pack two 4-bit codes per byte: even positions in the low nibble, odd in the high nibble
packed = codes[0::2] | (codes[1::2] << 4)
```

At search time the stored bytes are unpacked, dequantized, and rotated back into an approximate float vector:

```python
codes = np.empty(packed.size * 2, dtype=np.uint8)
codes[0::2] = packed & 0x0F   # low nibble
codes[1::2] = packed >> 4     # high nibble
z = mins + codes.astype(np.float32) * scales
approx = z @ rotation.T
```

Transforming Vectors at Write Time

Once the chosen representation is fit, indexing is straightforward:

```python
embedding_f32 = embed(document)
compressed = compressor.transform(embedding_f32)
store(compressed)
```

For dimensionality reduction, transform applies the projection. For fp16, it casts to half precision. For quantization, it rotates the embedding, quantizes each dimension into 4-bit bins, and packs the result into bytes.

What to Measure

The evaluation loop should not stop at reconstruction error. For retrieval systems, the metrics that mattered for us were the ones tied to actual search behavior:
- nearest-centroid agreement
- nearest-neighbor overlap with the baseline
- to
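Of those metrics, nearest-neighbor overlap with the baseline can be checked by brute force on a sample. A minimal sketch, assuming hypothetical `queries`, `baseline`, and `approx` matrices of normalized embeddings scored by dot product:

```python
import numpy as np

def topk_overlap(queries, baseline, approx, k=10):
    """Average intersection size between the top-k results retrieved against
    the original vectors and against their compressed approximations,
    as a fraction of k."""
    def topk(corpus):
        scores = queries @ corpus.T              # (num_queries, num_docs)
        return np.argsort(-scores, axis=1)[:, :k]

    base_ids, test_ids = topk(baseline), topk(approx)
    per_query = [len(set(b) & set(t)) / k for b, t in zip(base_ids, test_ids)]
    return float(np.mean(per_query))
```

An overlap near 1.0 means the compressed index returns essentially the same top-k documents as the uncompressed one, which is the retrieval behavior the experiment is trying to preserve.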
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.