Optimizing Token Generation in PyTorch Decoder Models

#cuda #pytorch #decoder-models #optimization #token-generation #hardware/semiconductors #semiconductors #hardware-optimization
Original source: rss · Summarized and analyzed by Genesis Park

Summary

This article introduces a PyTorch optimization technique that uses CUDA stream interleaving to improve token-generation efficiency in large language models (LLMs). Using a benchmark setup built around HuggingFace's GPT-2 model and an NVIDIA L40S GPU, it shows how resolving a simple but easily overlooked bottleneck can improve performance. The author makes clear that the code is intended only to demonstrate the performance effect, and recommends dedicated libraries such as vLLM or TensorRT-LLM for production environments.

Body

In this post, we demonstrate a technique for optimizing token generation in PyTorch using CUDA stream interleaving. While simple to implement, the method addresses a specific, often overlooked bottleneck and can lead to meaningful performance boosts. Although pipelining model execution using CUDA streams is common in AI systems engineering, we did not find any tutorial documenting the specific PyTorch-level application we describe here. If you find the technique useful, please be so kind as to reference this post.

To facilitate our discussion, we will use a simple GPT-2 PyTorch decoder model from HuggingFace's transformers (v5.1.0) library. We will run our experiments on an NVIDIA L40S GPU with PyTorch 2.10.0.

Disclaimer: The code we will share is intended for demonstrative purposes. Please do not rely on its accuracy or optimality. Please do not interpret our mentions of any library, platform, or service as an endorsement of its use. Importantly, the value of the CUDA stream-based method we will discuss can vary greatly based on the details of your model and runtime environment. Please be sure to run your own benchmarks before adopting it.

Our focus in this post is on PyTorch-native inference workloads, which remain extremely prevalent in development and test settings. However, it is important to note that for production environments, dedicated LLM inference libraries such as vLLM or NVIDIA TensorRT-LLM tend to deliver greater performance and should be used whenever relevant.

A Toy GPT-2 Model

To simplify our discussion, we will use a GPT-2 decoder model from the HuggingFace transformers library and have it run autoregressively on a batch of empty prompts. In the following code block, we initialize the model and define a naive token generation function that produces a batch of token sequences up to a given length.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Config

torch.set_float32_matmul_precision('high')
DEVICE = "cuda"

# define the decoder model
config = GPT2Config.from_pretrained("gpt2")
model = GPT2LMHeadModel(config).to(DEVICE).eval()


@torch.inference_mode()
def generate_sequence(model, max_seqlen, batch_size):
    # Initialize prompts with BOS token
    all_tokens = torch.full(
        (batch_size, 1),
        config.bos_token_id,
        device=DEVICE,
        dtype=torch.long
    )
    finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)

    for i in range(max_seqlen):
        outputs = model(all_tokens)

        # extract new token
        logits = outputs.logits[:, -1, :]
        new_tokens = torch.argmax(logits, dim=-1)

        # append new token to sequence
        all_tokens = torch.cat(
            [all_tokens, new_tokens.unsqueeze(-1)],
            dim=-1
        )

        finished |= (new_tokens == config.eos_token_id)
        stop_gpu = torch.all(finished)

        # checking stop condition
        if stop_gpu.item():
            print(f"All sequences finished at step {i+1}")
            break

    return all_tokens
```
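For completeness, here is a brief usage sketch that is not part of the original post: it shows one way the generated token IDs could be decoded back into text. The tokenizer checkpoint name is an assumption, and since the toy model above is randomly initialized, the decoded output will not be meaningful text.

```python
# Hypothetical usage sketch (not from the original post): decode the
# generated token IDs with the matching GPT-2 tokenizer.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # assumed checkpoint name

token_ids = generate_sequence(model, max_seqlen=32, batch_size=4)
for seq in token_ids:
    # skip_special_tokens drops the BOS/EOS markers used by the generation loop;
    # the randomly initialized toy model will produce gibberish tokens.
    print(tokenizer.decode(seq.tolist(), skip_special_tokens=True))
```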
Next, we define a simple benchmarking function, which we use to measure the runtime performance and memory utilization of our token generator in different scenarios.

```python
import time, statistics


def benchmark(func, num_runs=10):
    # Warmup
    func()
    torch.cuda.synchronize()

    runtimes = []
    for _ in range(num_runs):
        # reset memory stats before each run
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()

        start = time.perf_counter()
        _ = func()
        torch.cuda.synchronize()
        end = time.perf_counter()
        runtimes.append(end - start)

    # Get memory allocator stats from last run
    mem_stats = torch.cuda.memory_stats()
    allocated_peak = mem_stats.get('allocated_bytes.all.peak', 0)
    reserved_peak = mem_stats.get('reserved_bytes.all.peak', 0)
    f_peak = reserved_peak - allocated_peak
    f_pct = (
        100 * f_peak / reserved_peak
        if reserved_peak > 0 else 0
    )

    print(f"\n{'='*60}")
    print(f"Runtime Results:")
    print(f"  Mean: {statistics.mean(runtimes):.4f}s")
    print(f"  Std:  {statistics.stdev(runtimes):.4f}s")
    print(f"  Min:  {min(runtimes):.4f}s")
    print(f"  Max:  {max(runtimes):.4f}s")
    print(f"\nMemory Stats:")
    print(f"  Allocated bytes (peak): {allocated_peak / 1e9:.3f} GB")
    print(f"  Reserved bytes (peak):  {reserved_peak / 1e9:.3f} GB")
    print(f"  Fragmentation (peak):   {f_peak / 1e9:.3f} GB ({f_pct:.1f}%)")
    print(f"{'='*60}\n")


batch_size = 32
for max_seqlen in [100, 200, 400]:
    print(
        f"Benchmarking generation with batch size {batch_size} "
        f"and max sequence length {max_seqlen}..."
    )
    benchmark(
        lambda: generate_sequence(
            model,
            max_seqlen=max_seqlen,
            batch_size=batch_size
        )
    )
```

Capturing the results for a batch size of 32 and several different sequence lengths, we see that as the sequence length doubles, the runtime quadruples, appearing to follow a classic O(N²) scaling pattern. Additionally, high memory fragmentation points to severe strain on the CUDA memory allocator, which can result in frequent memory faults and degrade runtime performance. The fragmentation results from each step asking for slightly larger tensor allocations, a pattern which ends up leaving multiple pockets of unusable memory. Our first optimization, KV caching, addresses the ru
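As a rough sketch of the KV-caching idea named above (an illustration under assumptions, not the author's implementation), a HuggingFace decoder can reuse past_key_values so that each step feeds only the newest token through the model:

```python
@torch.inference_mode()
def generate_with_kv_cache(model, max_seqlen, batch_size):
    # Hypothetical KV-caching sketch (not the original post's code): reuse
    # past_key_values so each step processes one new token, not the full prefix.
    tokens = torch.full(
        (batch_size, 1), config.bos_token_id, device=DEVICE, dtype=torch.long
    )
    all_tokens = tokens
    past_key_values = None
    for _ in range(max_seqlen):
        outputs = model(tokens, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        # greedy selection of the next token for every sequence in the batch
        tokens = torch.argmax(outputs.logits[:, -1, :], dim=-1, keepdim=True)
        all_tokens = torch.cat([all_tokens, tokens], dim=-1)
    return all_tokens
```

With the cache in place, per-step compute no longer grows with the prefix length, and each step requests similarly sized allocations, which also relieves the allocator pressure described above.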
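For context on the post's headline technique, which this excerpt does not reach, the following is a generic illustration (assumed for illustration, not the author's optimization code) of how independent GPU work can be interleaved on separate CUDA streams in PyTorch:

```python
import torch

# Generic CUDA stream interleaving sketch (an assumption for illustration,
# not the post's actual code): enqueue independent work on two streams.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.cuda.stream(s1):
    out1 = a @ a  # work enqueued on stream s1
with torch.cuda.stream(s2):
    out2 = b @ b  # independent work on s2, free to overlap with s1

# Make the default stream wait for both side streams before using the results.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
torch.cuda.synchronize()
print(out1.shape, out2.shape)
```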

This analysis was prepared by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
