RAG is broken, let's fix it

hackernews | 🔬 Research
#embedding drift #openai #rag #review #search quality #embedding
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Embedding drift happens when the same text produces different vectors over time because of model updates, preprocessing changes, or partial re-embedding, and it quietly degrades RAG retrieval quality without throwing errors. The article walks through the five causes that most often bite production systems, from partial re-embedding and preprocessing tweaks to model version bumps and index changes. It then lays out concrete detection checks, such as comparing cosine distances on known documents and tracking nearest-neighbor stability, and argues that prevention comes down to pinning the pipeline, never mixing embedding generations, and versioning vector data.

Article

**TL;DR** Embedding drift is when the same text produces different vectors over time because of model updates, preprocessing changes, or partial re-embedding. It degrades RAG retrieval quality without throwing errors. Detect it by comparing cosine distances on known documents and tracking nearest-neighbor stability. Prevent it by pinning your pipeline, never mixing embedding generations, and versioning your vector data.

Your RAG pipeline shipped three months ago. Evaluations looked great, stakeholders were happy, and you moved on to the next project. Then the answers started slipping. Not wrong, exactly, but less right. Users say the system feels “dumber” lately. You check the prompt, the model version, the retrieval config. Nothing changed.

Turns out the LLM is fine. The problem is further upstream: your embeddings have drifted. Drift doesn't throw errors. It doesn't trip alerts. It just slowly erodes retrieval quality until someone finally notices the answers have gotten worse.

**What Embedding Drift Actually Is**

Here's the core issue: semantically identical text starts producing structurally different vectors over time. The text hasn't changed meaning. But the embedding has changed shape.

Vector search works by geometric proximity. When you query, you're asking “which stored vectors are closest to this query vector?” That only works if the stored vectors and the query vector were produced under the same conditions. When they weren't, cosine similarity stops reflecting semantic similarity.

The frustrating part is that the system keeps returning results. It looks like it's working. But relevant chunks that used to show up at position 2 are now buried at position 15. Recall drops from 0.92 to 0.74, and there's nothing in the logs to explain why.

**The Five Causes That Actually Bite You**

Most discussions about drift focus on model updates, which is the obvious cause. But the things that actually break production systems tend to be less visible.

**1. Partial Re-embedding**

This is the most common cause we see in production. A team re-embeds 20% of their corpus, maybe some updated docs or a new data source backfill. Now the vector store holds embeddings from two different runs. Even if you're using the same model version, small differences in preprocessing or floating-point non-determinism can put vectors in slightly different regions of the space. A document that ranked #2 last week might now rank #8, not because it became less relevant, but because the geometry around it shifted.

**2. Preprocessing Pipeline Changes**

A developer fixes a bug in the HTML stripper. Another adds Unicode normalization. Someone changes the chunk window from 512 to 480 tokens. Each change is small and reasonable. Together, they mean the text being embedded today is structurally different from six months ago, even when the source document is identical. Because models use sub-word tokenization, changing a single space or punctuation mark can alter the entire token sequence for a sentence.

**3. Model Version Bumps**

Vectors from text-embedding-ada-002 and text-embedding-3-small are not in the same space. You cannot compare them with cosine similarity. The real danger is switching models for new documents while old documents stay on the previous version. A mixed-model vector store will produce unreliable neighbor rankings because a v3 query cannot find v2 documents.

**4. Chunk Boundary Drift**

Same text, same model, but the segmentation changed. A chunk that used to include the end of paragraph A and the start of paragraph B now only covers B. Different context window, different embedding, different neighbors. The sketch below makes this concrete.
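To illustrate chunk boundary drift, here is a minimal Python sketch, not from the original article: the same document is resliced after a small chunk-window change, so the text that lands in "chunk 2" shifts, and whatever embedding that chunk gets will shift with it. The chunker, window sizes, and toy document are all assumptions for illustration.

```python
def chunk(text: str, window: int, overlap: int = 8) -> list[str]:
    """Naive fixed-window chunker over whitespace tokens (illustrative only)."""
    tokens = text.split()
    step = window - overlap
    return [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), step)]

# Toy "document": 300 numbered tokens so chunk boundaries are easy to see.
doc = " ".join(f"tok{i}" for i in range(300))

chunks_v1 = chunk(doc, window=64)   # the chunking config you shipped with
chunks_v2 = chunk(doc, window=60)   # after a "small" chunking tweak

# The same chunk index now starts at a different position in the document,
# so the text sent to the embedding model -- and therefore the vector and
# its nearest neighbors -- changes even though the source never did.
print(chunks_v1[2].split()[0])  # tok112
print(chunks_v2[2].split()[0])  # tok104
```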
**5. Infrastructure and Index Changes**

HNSW parameters (like `ef_construction`) vary between index rebuilds. A database migration changes vector precision from float32 to bfloat16. These don't always change the raw vectors, but they alter the approximate nearest neighbor graph.

None of these show up in a code diff. All of them produce measurably different retrieval behavior.

**Detecting Drift**

The good news is that drift is straightforward to detect once you know what to look for. The bad news is that most teams aren't measuring any of this. (Code sketches of the first two checks follow at the end of the article.)

**Check 1: Cosine Distance on Identical Text**

Re-embed a sample document with your current pipeline. Compare against the stored vector. (Note: the thresholds below are heuristics based on OpenAI's models; exact values vary by provider.)

| Distance | Status |
|---|---|
| 0.05 | Severe, likely model or chunking change |

**Check 2: Nearest-Neighbor Stability**

Run the same benchmark queries a week apart. Record top-k results each time.

- Healthy: 85–95% overlap between runs
- Degrading: 70–85% overlap, drift is starting
- Broken: <70% overlap, active quality loss

**Check 3: Vector Count Divergence**

Compare vectors in your database vs. your source of truth. Count mismatches mean ingestion failed, duplicates crept in, or vectors were deleted externally. Zero tolerance for unexplained deltas.

**Check 4: Distribution Shift**

Track L2 norm distribution
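As a minimal sketch of Check 1 (not code from the article): re-embed documents whose stored vectors you trust and flag any whose cosine distance to the stored vector exceeds the heuristic 0.05 threshold from the table above. The `drift_report` helper and the toy data are illustrative; in practice the stored vectors come from your vector store and the fresh ones from re-running today's pipeline.

```python
import numpy as np

def cosine_distance(a, b) -> float:
    """1 - cosine similarity; 0.0 means the stored and fresh vectors point the same way."""
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_report(stored: dict[str, list[float]],
                 fresh: dict[str, list[float]],
                 threshold: float = 0.05) -> list[tuple[str, float]]:
    """Return (doc_id, distance) pairs whose re-embedded vector drifted past the threshold."""
    flagged = []
    for doc_id, old_vec in stored.items():
        dist = cosine_distance(old_vec, fresh[doc_id])
        if dist > threshold:
            flagged.append((doc_id, dist))
    return sorted(flagged, key=lambda pair: -pair[1])

# Toy usage with made-up vectors:
stored = {"doc-1": [0.1, 0.9, 0.4], "doc-2": [0.7, 0.2, 0.1]}
fresh  = {"doc-1": [0.1, 0.9, 0.4], "doc-2": [0.2, 0.8, 0.3]}
print(drift_report(stored, fresh))   # doc-2 is flagged, doc-1 is unchanged
```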
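And a similarly minimal sketch of Check 2, again an illustration rather than the article's code: treat each benchmark run as a mapping from query to its top-k result ids, average the overlap across queries, and bucket the result using the 85% / 70% bands quoted above.

```python
def topk_overlap(run_a: dict[str, list[str]], run_b: dict[str, list[str]]) -> float:
    """Average fraction of top-k result ids shared between two retrieval runs."""
    scores = []
    for query, ids_a in run_a.items():
        ids_b = run_b.get(query, [])
        k = max(len(ids_a), len(ids_b)) or 1
        scores.append(len(set(ids_a) & set(ids_b)) / k)
    return sum(scores) / max(len(scores), 1)

# Toy usage: the same benchmark query run a week apart, with hypothetical doc ids.
last_week = {"how do I reset my password": ["d12", "d48", "d03"]}
this_week = {"how do I reset my password": ["d12", "d03", "d77"]}

overlap = topk_overlap(last_week, this_week)
if overlap < 0.70:
    status = "broken: active quality loss"
elif overlap < 0.85:
    status = "degrading: drift is starting"
else:
    status = "healthy"
print(f"{overlap:.0%} overlap -- {status}")
```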
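Finally, on the prevention side ("pin your pipeline, never mix embedding generations, version your vector data"), one way to make mixed generations impossible to miss is to store provenance metadata next to every vector and fail loudly when more than one generation is present. The schema and field names below are assumptions for illustration, not a standard or the article's own design.

```python
from dataclasses import dataclass

@dataclass
class EmbeddingRecord:
    """One stored vector plus the provenance needed to detect mixed generations."""
    doc_id: str
    vector: list[float]
    embedding_model: str      # e.g. "text-embedding-3-small"
    pipeline_version: str     # hash or tag of the preprocessing + chunking code
    chunk_window: int         # chunking config used when this vector was produced

# Hypothetical "current" generation that the store is expected to match.
CURRENT_MODEL = "text-embedding-3-small"
CURRENT_PIPELINE = "2024-06-ingest-v3"

def assert_single_generation(records: list[EmbeddingRecord]) -> None:
    """Raise if the store mixes embedding models or pipeline versions."""
    generations = {(r.embedding_model, r.pipeline_version) for r in records}
    if generations != {(CURRENT_MODEL, CURRENT_PIPELINE)}:
        raise RuntimeError(f"Mixed embedding generations in store: {generations}")

records = [
    EmbeddingRecord("doc-1", [0.1, 0.2], "text-embedding-3-small", "2024-06-ingest-v3", 480),
    EmbeddingRecord("doc-2", [0.3, 0.4], "text-embedding-ada-002", "2023-11-ingest-v1", 512),
]
assert_single_generation(records)   # raises: the store mixes two generations
```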

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
