AI on Multiple GPUs: ZeRO and FSDP
Towards Data Science
📰 News
#ai
#fsdp
#zero
#multi-gpu
#distributed parallelism
#hardware/semiconductors
Original source: Towards Data Science · Summarized and analyzed by Genesis Park
Summary
This article covers the Zero Redundancy Optimizer (ZeRO) and PyTorch's Fully Sharded Data Parallel (FSDP), two approaches for training large AI models efficiently in multi-GPU environments. It explains in detail how ZeRO works and the technical mechanisms behind it, and offers a practical guide that helps readers implement it from scratch. It also shows, with concrete examples, how to apply the technique inside the PyTorch framework to optimize the training process.
Main Text
- Part 1: Understanding the Host and Device Paradigm
- Part 2: Point-to-Point and Collective Operations
- Part 3: How GPUs Communicate
- Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP)
- Part 5: ZeRO (this article)
- Part 6: Tensor Parallelism (coming soon)

Introduction

In the previous post, we saw how Distributed Data Parallelism (DDP) speeds up training by splitting batches across GPUs. DDP solves the throughput problem, but it introduces a new challenge: memory redundancy. In vanilla DDP, every GPU holds a complete copy of the model parameters, gradients, and optimizer states. For large models like GPT-3 (175B parameters), this redundancy wastes an enormous amount of precious VRAM.

ZeRO (Zero Redundancy Optimizer) solves this. There are three levels:

- ZeRO-1 partitions only the optimizer states
- ZeRO-2 partitions optimizer states + gradients
- ZeRO-3 partitions optimizer states + gradients + model parameters

Strictly speaking, ZeRO isn't a parallelism technique, because all GPUs still run the same forward and backward passes. It's a memory optimization strategy that eliminates redundancy across GPUs, letting you train larger models on the same hardware.

The Memory Problem in DDP

Let's break down what actually consumes memory during training. For a model with a given number of parameters:

- Model parameters: one value per parameter (the weights of your neural network)
- Gradients: one value per parameter
- Optimizer states (Adam): two values per parameter (a first moment and a second moment for each parameter)
- Activations: intermediate outputs stored during the forward pass for use in the backward pass

The first three scale with model size and are redundant across GPUs in DDP. Activations scale with batch size, sequence length, and the number of neurons, and are unique per GPU since each GPU processes different data. ZeRO doesn't touch activation memory.

Let's calculate the memory usage for a 7B-parameter model using Adam and FP32 (the short sketch at the end of this section works through the same arithmetic in code):

- Parameters: 7 billion * 4 bytes = 28 GB
- Gradients: 7 billion * 4 bytes = 28 GB
- Optimizer states: 7 billion * 2 * 4 bytes = 56 GB
- Memory per GPU in DDP: 112 GB

Activations add significant memory on top of this, but since they're unique per GPU, ZeRO can't partition them. Techniques like activation checkpointing can help: they discard some activations during the forward pass and recompute them as needed during the backward pass. But that's outside the scope of this article.
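The arithmetic above is easy to script. The helper below is a small illustrative sketch (not from the original article) that reproduces the 7B-parameter FP32 + Adam numbers; the function name and return format are assumptions made for this example.

```python
def ddp_memory_gb(num_params: int, bytes_per_value: int = 4) -> dict:
    """Per-GPU memory (in GB) that vanilla DDP replicates on every device.

    Assumes FP32 values (4 bytes) and Adam's two moments per parameter;
    activations are excluded because ZeRO does not partition them.
    """
    gb = 1e9
    params = num_params * bytes_per_value / gb
    grads = num_params * bytes_per_value / gb
    optimizer_states = num_params * 2 * bytes_per_value / gb
    return {
        "params_gb": params,
        "grads_gb": grads,
        "optimizer_gb": optimizer_states,
        "total_per_gpu_gb": params + grads + optimizer_states,
    }


print(ddp_memory_gb(7_000_000_000))
# {'params_gb': 28.0, 'grads_gb': 28.0, 'optimizer_gb': 56.0, 'total_per_gpu_gb': 112.0}
```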
Let's understand how ZeRO works by implementing it from the ground up, starting with ZeRO-1 and working our way to ZeRO-3.

ZeRO-1: Optimizer State Partitioning

In ZeRO-1, only the optimizer states are partitioned. Each GPU:

- still holds the full model parameters and gradients
- stores only 1/N of the optimizer states (N = number of GPUs)
- updates only the corresponding 1/N of the parameters

This is the sequence of actions taken during each training step:

- Forward pass: each GPU processes its own micro-batch
- Backward pass: compute gradients
- All-reduce gradients: every GPU gets the full, averaged gradients
- Optimizer step: each GPU updates its parameter partition
- All-gather parameters: sync the updated model across GPUs

Here's a simplified implementation (a hypothetical usage sketch follows the code):

```python
import torch
import torch.distributed as dist


class ZeRO_1:
    def __init__(self, model, optimizer_cls):
        self.model = model
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.param_shards = list()    # each rank holds only its shard of the optimizer states
        self.param_metadata = list()  # metadata to reconstruct shards

        for param in self.model.parameters():
            original_shape = param.data.shape
            flat = param.data.view(-1)
            numel = flat.numel()
            remainder = numel % self.world_size
            pad_size = (self.world_size - remainder) % self.world_size
            padded_numel = numel + pad_size
            shard_size = padded_numel // self.world_size
            shard_start = self.rank * shard_size
            shard_end = shard_start + shard_size

            self.param_metadata.append(
                {
                    "original_shape": original_shape,
                    "numel": numel,
                    "padded_numel": padded_numel,
                    "shard_size": shard_size,
                    "shard_start": shard_start,
                    "shard_end": shard_end,
                }
            )

            # pad so every rank gets an equal-sized shard
            if pad_size > 0:
                flat_padded = torch.cat([flat, flat.new_zeros(pad_size)])
            else:
                flat_padded = flat

            shard = flat_padded[shard_start:shard_end].clone()
            shard.requires_grad_(True)
            self.param_shards.append(shard)

        self.optimizer = optimizer_cls(self.param_shards)

    def training_step(self, inputs, targets, loss_fn):
        output = self.model(inputs)      # forward
        loss = loss_fn(output, targets)  # compute loss
        loss.backward()                  # backward
        self._sync_gradients()           # all-reduce gradients across GPUs
        self.optimizer.step()            # update local shard of parameters
        self._sync_params()              # all-gather model params

        # clear gradients for the next step
        for param in self.model.parameters():
            param.grad = None

    def _sync_gradients(self):
        for idx, param in enumerate(self.model.parameters()):
            meta = self.param_metadata[idx]
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= self.world_size
            grad_flat = param.grad.view(-1)
            # pad the gradient the same way the parameter was padded so the
            # slice always matches the shard size
            if meta["padded_numel"] > meta["numel"]:
                grad_flat = torch.cat(
                    [grad_flat, grad_flat.new_zeros(meta["padded_numel"] - meta["numel"])]
                )
            self.param_shards[idx].grad = grad_flat[meta["shard_start"]:meta["shard_end"]]

    def _sync_params(self):
        # sketch of the all-gather step described above: every rank contributes its
        # updated shard, and the full parameters are rebuilt from the gathered pieces
        for idx, param in enumerate(self.model.parameters()):
            meta = self.param_metadata[idx]
            gathered = [torch.empty_like(self.param_shards[idx].data) for _ in range(self.world_size)]
            dist.all_gather(gathered, self.param_shards[idx].data)
            full_flat = torch.cat(gathered)[: meta["numel"]]
            param.data.copy_(full_flat.view(meta["original_shape"]))
```
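To see how the wrapper above might be exercised end to end, here is a minimal, hypothetical driver (not from the original article): the toy model, batch shapes, learning rate, and the torchrun launch command are all illustrative assumptions.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn


# hypothetical driver, e.g. launched with: torchrun --nproc_per_node=4 train.py
def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # toy model and optimizer factory; ZeRO_1 builds the optimizer over its local shards
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
    wrapper = ZeRO_1(model, lambda shards: torch.optim.Adam(shards, lr=1e-3))
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # each rank draws its own micro-batch (random data as a stand-in)
        inputs = torch.randn(32, 1024, device="cuda")
        targets = torch.randint(0, 10, (32,), device="cuda")
        wrapper.training_step(inputs, targets, loss_fn)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```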
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.