RAG with Hybrid Search: How Does Keyword Search Work?
Towards Data Science
💼 Business
#rag
#tip
#machine-learning
#knowledge-base
#keyword-search
#hybrid-search
Original source: Towards Data Science · Summary and analysis by Genesis Park
Summary
This article takes a detailed look at how keyword search, the backbone of hybrid-search RAG systems, actually works. It explains TF-IDF, the main keyword-matching technique, which scores how important a term is to a document based on term frequency and inverse document frequency. It then introduces BM25, an algorithm that builds on TF-IDF and normalizes for document length to improve retrieval accuracy, and explains how the two differ.
Body
The traditional RAG methodology is so useful because it allows for searching for relevant parts of text in a large knowledge base based on the meaning of the text rather than the exact words. In this way, it allows us to utilize the power of AI on our custom documents. Ironically, as useful as this similarity search is, it sometimes fails to retrieve parts of text that are exact matches to the user's prompt. More specifically, when searching in a large knowledge base, specific keywords (such as technical terms or names) may get lost, and relevant chunks may not be retrieved even if the user's query contains the exact words. Happily, this issue can be easily tackled by utilising an older keyword-based searching technique, like BM25 (Best Matching 25). Then, by combining the results of the similarity search and the BM25 search, we can essentially get the best of both worlds and significantly improve the results of our RAG pipeline.

. . .

In information retrieval systems, BM25 is a ranking function used to evaluate how relevant a document is to a search query. Unlike similarity search, BM25 evaluates the document's relevance to the user's query not based on the semantic meaning of the document, but rather on the actual words it contains. More specifically, BM25 is a bag-of-words (BoW) model, meaning that it doesn't take into account the order of the words in a document (from which the semantic meaning emerges), but rather the frequency with which each word appears in the document.

The BM25 score for a given query q containing terms t and a document d can be (not so) easily calculated as follows:

\[
\mathrm{score}(d, q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
\]

where f(t, d) is the number of times term t appears in document d, |d| is the length of document d, avgdl is the average document length in the knowledge base, and k1 and b are free parameters. 😿

Since this expression can be a bit overwhelming, let's take a step back and look at it bit by bit.

. . .

Starting simple with TF-IDF

The basic underlying concept of BM25 is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF is a fundamental information retrieval concept that aims to measure how important a word is in a specific document of a knowledge base. In other words, it measures how many documents of the knowledge base a term appears in, which in turn expresses how specific and informative that term is about a specific document. The rarer a term is in the knowledge base, the more informative it is considered to be for a specific document.

In particular, for a document d in a knowledge base and a term t, the Term Frequency TF(t, d) can be defined as follows:

\[
\mathrm{TF}(t, d) = \frac{f(t, d)}{|d|}
\]

where f(t, d) is the number of times term t appears in document d and |d| is the total number of terms in the document. The Inverse Document Frequency IDF(t) can be defined as follows:

\[
\mathrm{IDF}(t) = \log\left(\frac{N}{\mathrm{DF}(t)}\right)
\]

where N is the total number of documents in the knowledge base and DF(t) is the number of documents that contain term t. Then, the TF-IDF score can be calculated as the product of TF and IDF as follows:

\[
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
\]

. . .

Let's do a quick example to get a better grip of TF-IDF.
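Before walking through that example, here is a minimal Python sketch of the three definitions above, written directly from the formulas. The helper names and the toy corpus in the usage lines are illustrative assumptions of this summary, not code from the article:

```python
import math

def tf(term, doc_tokens):
    # TF(t, d) = f(t, d) / |d|: raw count of the term, normalized by document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # IDF(t) = log(N / DF(t)): terms that appear in every document get weight 0
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus_tokens):
    # TF-IDF(t, d) = TF(t, d) * IDF(t)
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical usage with a tiny two-document corpus (tokens after stopword removal)
corpus = [
    ["sci-fi", "thriller", "time", "travel"],
    ["romantic", "drama", "time", "travel"],
]
print(tf_idf("thriller", corpus[0], corpus))  # > 0: 'thriller' appears only in document 1
print(tf_idf("travel", corpus[0], corpus))    # 0.0: 'travel' appears in every document
```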
Let's assume a tiny knowledge base containing three movies with the following descriptions:

- "A sci-fi thriller about time travel and a dangerous adventure across alternate realities."
- "A romantic drama about two strangers who fall in love during unexpected time travel."
- "A sci-fi adventure featuring an alien explorer forced to travel across galaxies."

After removing the stopwords, we can consider the following terms in each document:

- document 1: sci-fi, thriller, time, travel, dangerous, adventure, alternate, realities (size of document 1, |d1| = 8)
- document 2: romantic, drama, two, strangers, fall, love, unexpected, time, travel (size of document 2, |d2| = 9)
- document 3: sci-fi, adventure, featuring, alien, explorer, forced, travel, galaxies (size of document 3, |d3| = 8)
- total documents in the knowledge base, N = 3

We can then calculate f(t, d) for each term in each document. Next, for each document, we also calculate the Document Frequency and the Inverse Document Frequency, and finally the TF-IDF score of each term.

So, what do we get from this? Let's take a look, for example, at the TF-IDF scores of document 1. The word 'travel' is not informative at all, since it is included in all documents of the knowledge base. On the flip side, words like 'thriller' and 'dangerous' are very informative, specifically for document 1, since they are only included in it. In this way, the TF-IDF score provides a simple and straightforward way to identify and quantify the importance of the terms in each document of a knowledge base. To put it differently, the higher the total score of the terms in a document, the rarer the information in this document is in comparison to the information contained in all the other documents in the knowledge base.

. . .

Understanding BM25 score

In BM25, we utilise the TF-IDF concept in order to quantify how informative (how rare or important) each document in a knowledge base is with respect to a specific query. To do this, for the BM25 calculation, we only take into account the terms of each document that are contained in the user's query, and perform a calculation somewhat similar to TF-IDF. BM25 uses the TF-IDF concept, but with a few mathematical tweaks in order to address two main weaknesses of TF-IDF.

. . .

The first pain point of TF-IDF is that TF is linear in the number of times a term t appears in a document.
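The BM25 formula shown at the top of the article addresses this by letting the term-frequency component saturate instead of growing linearly. Here is a minimal Python sketch of that formula; the default values k1 = 1.5 and b = 0.75 are conventional choices (not given in the article), and the plain IDF defined in the TF-IDF section is used for simplicity, whereas real BM25 implementations typically use a smoothed IDF variant, so treat this as an assumption-laden sketch rather than the article's exact implementation:

```python
import math

def bm25_score(query_terms, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    # Score one document d against a query q, following the BM25 formula above.
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs    # average document length
    score = 0.0
    for term in query_terms:
        f = doc_tokens.count(term)                          # raw term frequency f(t, d)
        df = sum(1 for d in corpus_tokens if term in d)     # document frequency DF(t)
        if f == 0 or df == 0:
            continue
        # Plain IDF(t) = log(N / DF(t)); production BM25 usually uses a smoothed variant.
        idf = math.log(n_docs / df)
        # Saturating, length-normalized term frequency: grows sub-linearly in f.
        tf_part = (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        score += idf * tf_part
    return score
```

Because the denominator also contains f(t, d), repeating a query term many times in a document yields diminishing returns, and the |d| / avgdl factor penalizes documents that are long relative to the rest of the knowledge base.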
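Finally, the article's opening point was that combining BM25 results with similarity-search results gives the best of both worlds. One common way to merge the two ranked result lists is reciprocal rank fusion; the sketch below is purely illustrative, and the function name, the constant k = 60, and the example document ids are assumptions of this summary, not details from the article:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one hybrid ranking.

    rankings: list of lists, each ordered from most to least relevant,
              e.g. [similarity_search_ids, bm25_ids].
    k:        damping constant; 60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: document ids returned by the similarity search and by BM25
hybrid_ranking = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d3", "d2"]])
print(hybrid_ranking)
```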
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.