What Bits-over-Random Changed About How I Think About RAG and Agents

Towards Data Science | 🔬 Research
#99% success paradox #iclr2026 #rag #review #retrieval evaluation #agents
Original source: Towards Data Science · Summarized and analyzed by Genesis Park

Summary

A problem has recently surfaced in RAG and agent workflows: retrieval results that look perfect on paper can behave like noise in practice. To make sense of this paradox, the Bits-over-Random (BoR) metric has been proposed, and it is emerging as a new standard for evaluating how much information a retrieval system actually contributes. Going beyond raw retrieval accuracy, the metric makes it possible to judge intuitively how much useful information an agent really gains in real-world settings.

Body

Inspired by the ICLR 2026 blogpost "The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection". For a long time, evaluating retrieval came down to questions like:

- Did we retrieve at least one relevant chunk?
- Did recall go up?
- Did the ranker improve?
- Did downstream answer quality look acceptable on a benchmark?

Those are still useful questions. But after reading the recent work on Bits over Random (BoR), I think they are incomplete for the agentic systems many of us are now actually building.

The ICLR blogpost sharpened something I had felt for a while in production LLM systems: retrieval quality should account for both how much good content we find and how much irrelevant material we bring along with it. In other words, as we crank up recall, we also increase the risk of context pollution.

What makes BoR useful is that it gives us a language for this. BoR tells us whether retrieval is genuinely selective, or whether we are achieving success mostly by stuffing the context window with more material. When BoR falls, it is a sign that the retrieved bundle is becoming less discriminative relative to chance. In practice, that often correlates with the model being forced to read more junk, more overlap, or more weakly relevant material.

The important nuance is that BoR does not directly measure what the model "feels" when reading a prompt. It measures retrieval selectivity relative to random chance. But lower selectivity often goes hand in hand with more irrelevant context, more prompt pollution, more attention dilution, and worse downstream performance. Put simply, BoR helps tell us when retrieval is still selective and when it has started to degenerate into context stuffing. That idea matters much more for RAG and agents than it did for classic search.

Why retrieval dashboards can mislead agent teams

One of the easiest traps in RAG is to look at your retrieval dashboard, see healthy metrics, and conclude that the system is doing well. You might see:

- high Success@K,
- strong recall,
- a good ranking metric,
- and a larger K seeming to improve coverage.

On paper things may look better, but in reality the agent might actually behave worse. It may exhibit any number of maladies: diffuse answers to queries, unreliable tool use, or simply a rise in latency and token cost without any real user benefit.

This disconnect happens because most retrieval dashboards still reflect a human search worldview. They assume the consumer of the retrieved set can skim, filter, and ignore junk. Humans are surprisingly good at this. LLMs are not consistently good at it. An LLM does not "notice" ten retrieved items and casually focus on the best two the way a strong analyst would. It processes the full bundle as prompt context. That means the retrieval layer is surfacing evidence that actively shapes the model's working memory.

This is why I think agent teams should stop treating retrieval as a back-office ranking problem and start treating it as a reasoning-budget allocation problem. When building performant agentic systems, the key questions are:

- Did we retrieve something relevant?
- How much noise did we force the model to process in order to get that relevance?

That is the lens BoR pushes you toward, and I have found it to be a very useful one.
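To make that lens concrete, here is a minimal sketch in Python. It assumes one simple formulation of the idea: BoR at depth K as the log-ratio of the observed Success@K to the Success@K a uniformly random retriever would achieve. The function names and corpus sizes are illustrative, not the blogpost's exact definition.

```python
from math import comb, log2

def random_success_at_k(n_corpus: int, n_relevant: int, k: int) -> float:
    """Chance that K uniformly random picks (without replacement)
    include at least one of the n_relevant relevant items."""
    return 1.0 - comb(n_corpus - n_relevant, k) / comb(n_corpus, k)

def bits_over_random(observed_success_at_k: float,
                     n_corpus: int, n_relevant: int, k: int) -> float:
    """Bits of selectivity the retriever earns over blind chance.
    ~0 bits means the system is barely beating random selection."""
    p_random = random_success_at_k(n_corpus, n_relevant, k)
    return log2(observed_success_at_k / p_random)

# Classic RAG regime: 100k chunks, 5 relevant, top-10 retrieval.
# Random chance succeeds only ~0.05% of the time here, so a 99%
# observed Success@10 is genuinely selective: roughly 11 bits.
print(bits_over_random(0.99, n_corpus=100_000, n_relevant=5, k=10))
```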
Context engineering is becoming a first-class discipline

One reason this paper has resonated with me is that it fits a broader shift already happening in practice. Software engineers and ML practitioners working on LLM systems are gradually becoming something closer to context engineers. That means designing systems that decide:

- what should enter the prompt,
- when it should enter,
- in what form,
- with what granularity,
- and what should be excluded entirely.

In traditional software, we worry about memory, compute, and API boundaries. In LLM systems, we also need to worry about context purity. The context window is contested cognitive real estate. Every irrelevant passage, duplicated chunk, weakly related example, verbose tool definition, and poorly timed retrieval result competes with the thing the model most needs to focus on.

That is why I like the pollution metaphor. Irrelevant context contaminates the model's workspace. The BoR post gives this intuition a more rigorous shape: we should stop evaluating retrieval only by whether it succeeds, and also ask how much better the retrieval is than chance at the depth (top-K retrieved items) we are actually using. That is a very practitioner-friendly question.

Why tool overload breaks agents

This is where I think the BoR work becomes especially important for real-world agent systems. In classic RAG, the corpus is often large. You may be retrieving from tens of thousands or millions of chunks. In that regime, random chance remains weak for longer. Tool selection is very different. In an agent, the model may be choosing among 20, 50, or 100 tools. That sounds manageable until you realize that sev…
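A quick back-of-the-envelope calculation, under the same assumed log-ratio formulation as the sketch above, shows how strong the chance baseline already is in this regime. The tool counts below are hypothetical.

```python
from math import comb, log2

# Hypothetical agent: 50 registered tools, exactly 1 correct tool per
# query, and the planner shortlists K = 10 candidates into the prompt.
n_tools, n_correct, k = 50, 1, 10

# A blind random shortlist of 10 out of 50 tools already contains the
# correct one 20% of the time -- chance starts out strong.
p_random = 1 - comb(n_tools - n_correct, k) / comb(n_tools, k)
print(f"random Success@{k}: {p_random:.0%}")             # 20%

# So even a 99% observed Success@10 is only ~2.3 bits over random,
# versus roughly 11 bits for the same headline number on a
# 100k-chunk corpus: near-perfect retrieval, near-random selectivity.
print(f"bits over random: {log2(0.99 / p_random):.2f}")  # 2.31
```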

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
