News Feed Curation SNS Dashboard Journal

Why should you care about Prompt Caching in LLMs?

Towards Data Science | 💼 Business
#llm #openai #tip #cost-saving #input-tokens #prompt-caching

Summary

This article covers the importance of Prompt Caching as a key technique for reducing the cost and latency incurred when using large language models (LLMs). It explains concrete ways to optimize the processing of repeated prompts, cutting API call costs and speeding up responses, and describes the resulting gains.

Why It Matters

Developer perspective

Under review.

Researcher perspective

Under review.

Business perspective

Under review.

Body

Fortunately, in reality, when making calls to an LLM, the same input tokens are usually repeated across multiple requests. Users ask some specific questions much more often than others, system prompts and instructions integrated into AI-powered applications are repeated in every user query, and even within a single prompt, models perform recursive calculations to generate an entire response (remember how LLMs produce text by predicting words one by one?). As in other applications, caching mechanisms can significantly help optimize LLM request costs and latency. For instance, according to OpenAI documentation, Prompt Caching can reduce latency by up to an impressive 80% and input token costs by up to 90%.

What About Caching?

In general, caching in computing is no new idea. At its core, a cache is a component that stores data temporarily so that future requests for the same data can be served faster. Accordingly, we can distinguish between two basic cache states, a cache hit and a cache miss:

- A cache hit occurs when the requested data is found in the cache, allowing for a quick and cheap retrieval.
- A cache miss occurs when the data is not in the cache, forcing the application to access the original source, which is more expensive and time-consuming.

One of the most typical implementations of a cache is in web browsers. When visiting a website for the first time, the browser checks for the URL in its cache memory but finds nothing (a cache miss). Since the data we are looking for isn't locally available, the browser has to perform a more expensive and time-consuming request to the web server across the internet, where the data originally lives. Once the page finally loads, the browser typically copies that data into its local cache. If we try to reload the same page 5 minutes later, the browser will look for it in its local storage. This time, it will find it (a cache hit) and load it from there, without reaching back to the server. This makes the browser work more quickly and consume fewer resources.

As you may imagine, caching is particularly useful in systems where the same data is requested multiple times. In most systems, data access is rarely uniform; it tends to follow a distribution where a small fraction of the data accounts for the vast majority of requests. A large portion of real-life applications follows the Pareto principle, meaning that about 80% of the requests concern about 20% of the data. If not for the Pareto principle, cache memory would need to be as large as the primary memory of the system, making it prohibitively expensive.

Prompt Caching and a Little Bit About LLM Inference

The caching concept, storing frequently used data somewhere and retrieving it from there instead of obtaining it again from its primary source, is applied in a similar manner to improve the efficiency of LLM calls, allowing for significantly reduced costs and latency. Caching can be applied to various elements of an AI application, the most important of which is Prompt Caching. It can also provide great benefits in other aspects of an AI app, such as caching in RAG retrieval or query-response caching. Nonetheless, this post focuses solely on Prompt Caching.

To understand how Prompt Caching works, we must first understand a little about how LLM inference (using a trained LLM to generate text) functions. LLM inference is not a single continuous process, but is rather divided into two distinct stages:

- Pre-fill, which refers to processing the entire prompt at once to produce the first token. This stage requires heavy computation and is thus compute-bound. We may picture a very simplified version of this stage as each token attending to all other tokens, or something like comparing every token with every previous token.
- Decoding, which appends the last generated token back into the sequence and generates the next one auto-regressively. This stage is memory-bound, as the system must load the entire context of previous tokens from memory to generate every single new token.

For example, imagine we have the following prompt:

What should I cook for dinner?

from which we may then get the first token:

Here

and the following decoding iterations:

Here
Here are
Here are 5
Here are 5 easy
Here are 5 easy dinner
Here are 5 easy dinner ideas

The issue is that in order to generate the complete response, the model would have to process the same previous tokens over and over again to produce each next word during the decoding stage, which, as you may imagine, is highly inefficient. In our example, the model would process the tokens 'What should I cook for dinner? Here are 5 easy' all over again just to produce the next output token, even though it has already processed those same tokens in every previous iteration.
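The redundant recomputation described above can be sketched with a toy token counter. This is a simplified illustration under an assumed cost model (one unit of work per token read), not the article's code; the function names are hypothetical:

```python
def generate_naive(prompt_tokens, completion, cost):
    """Without caching: every decoding step re-processes the full sequence."""
    seq = list(prompt_tokens)
    for tok in completion:
        cost["tokens_processed"] += len(seq)  # re-read all prior tokens
        seq.append(tok)
    return seq

def generate_cached(prompt_tokens, completion, cost):
    """With cached intermediate state: prior tokens are processed once
    (pre-fill); each decoding step only pays for the single new token."""
    cost["tokens_processed"] += len(prompt_tokens)  # pre-fill, paid once
    seq = list(prompt_tokens)
    for tok in completion:
        cost["tokens_processed"] += 1  # decode: only the newest token
        seq.append(tok)
    return seq

prompt = "What should I cook for dinner ?".split()       # 7 toy tokens
completion = "Here are 5 easy dinner ideas".split()      # 6 toy tokens

naive, cached = {"tokens_processed": 0}, {"tokens_processed": 0}
generate_naive(prompt, completion, naive)
generate_cached(prompt, completion, cached)
print(naive["tokens_processed"], cached["tokens_processed"])  # 57 vs 13
```

Even on this six-token completion, the cached variant does a fraction of the work, and the gap grows quadratically with response length.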
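The same hit/miss idea, applied across requests rather than within one, is what makes a repeated system prompt so cheap: once the shared prefix has been processed, later requests can reuse that work. A minimal sketch, in which the class, the stored "processed state", and the per-token cost model are all assumptions for illustration (real providers handle this server-side):

```python
def prefill_cost(tokens):
    """Stand-in for the compute cost of pre-filling a token sequence."""
    return len(tokens)

class PrefixCache:
    """Illustrative prompt-prefix cache: remembers that a shared prefix
    has already been processed, so repeated requests skip that work."""

    def __init__(self):
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def process_prompt(self, prefix, user_query):
        key = tuple(prefix)
        if key in self.cache:
            self.hits += 1                       # cache hit: prefix reused
            return prefill_cost(user_query)      # pay only for the new suffix
        self.misses += 1                         # cache miss: process it all
        self.cache[key] = "processed-state"      # store for future requests
        return prefill_cost(prefix) + prefill_cost(user_query)

# The same system prompt is repeated in every user query.
system_prompt = ("You are a helpful cooking assistant. " * 10).split()  # 60 toy tokens
cache = PrefixCache()
cost1 = cache.process_prompt(system_prompt, "What should I cook for dinner?".split())
cost2 = cache.process_prompt(system_prompt, "Suggest a vegetarian lunch.".split())
print(cost1, cost2)  # the second request only pays for its short suffix
```

Note the design choice this implies: to benefit from prefix caching, the stable part of the prompt (system instructions, few-shot examples) should come first, with the variable user query appended at the end.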
