How a Deep Learning Library Enables Learning

hackernews | 📰 News
#DeepLearning #Library #MachineLearning #MachineLearning/Research #Backpropagation #TrainingProcess
Source: hackernews · Summarized and analyzed by Genesis Park

Summary

This post walks through the inner workings of a deep learning library, in particular the mechanism that lets a model learn from data. Drawing on the author's experience building a library from scratch, it explains why a backward pass follows the forward pass, in which the model generates a prediction and the loss is computed. It then steps through the overall flow in which calling loss.backward() computes gradients and the optimizer uses them to update parameters and reduce the loss.

Full Text

Contents:
- 1. The Familiar Training Step
- 2. Why Call loss.backward() At All
- 3. Why An Update Actually Helps
- 4. Values Remember Where They Came From
- 5. What One Operation Does During Backward
- 6. How loss.backward() Walks the Whole Graph
- 7. So We Have All the Gradients in the Graph. What's Next?
- 8. When Tensors Become Models
- Appendix: Where To Go Next

Last year, I built a deep learning library from scratch to better understand the inner workings of machine learning. Over time, I realized the value of the project was not just the library itself, but the build process and the lessons that came from it. In an earlier post, I wrote about what I learned from building the library. This post looks at a different angle: the mechanics inside a deep learning library that make learning from data possible. I'll use my library as a concrete bridge to help build that intuition, especially for readers who haven't built one before.

In standard machine learning training, a training step looks something like this:

```python
optimizer.zero_grad()
prediction = model(x)
loss = loss_function(prediction, true_label)
loss.backward()
optimizer.step()
```

The core question is: what data do we need to store and compute so that loss.backward() can produce gradients, and an optimizer can use those gradients to update parameters in a way that reduces the loss over time (a.k.a. learning)? Let's go step by step.

1. The Familiar Training Step

Our from-scratch library keeps the user-facing API close to PyTorch on purpose, so the surface stays familiar and we can focus on the mechanics underneath. If you have used PyTorch or TensorFlow before, none of this should look strange. The point here is not the API surface; it's what the library has to do under it. Once you understand the core steps that enable the model to learn from data, the high-level APIs will naturally make more sense.
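To see what those five steps must accomplish, here is a pure-Python sketch of one training step for the post's running single-parameter model pred = x * w + b, with the gradients written out by hand instead of produced by an autograd engine. The variable names mirror the loop above; the learning rate and the by-hand gradient lines are my own illustration of what zero_grad(), backward(), and step() stand in for, not the library's actual code.

```python
# One training step for pred = x * w + b, loss = (pred - y_true) ** 2,
# with hand-derived gradients standing in for loss.backward().
x, y_true = 2.0, 8.0   # one training example: givens, not trainable
w, b = 3.0, 1.0        # trainable parameters
lr = 0.1               # learning rate used by the "optimizer"

# optimizer.zero_grad(): start with no leftover gradients
w_grad, b_grad = 0.0, 0.0

# forward pass: run the model, then reduce to one scalar loss
pred = x * w + b               # 7.0
loss = (pred - y_true) ** 2    # 1.0

# "backward pass", done by hand via the chain rule:
# dloss/dpred = 2 * (pred - y_true); dpred/dw = x; dpred/db = 1
pred_grad = 2 * (pred - y_true)  # -2.0
w_grad = pred_grad * x           # -4.0
b_grad = pred_grad * 1           # -2.0

# optimizer.step(): nudge each parameter against its gradient
w -= lr * w_grad
b -= lr * b_grad

new_loss = (x * w + b - y_true) ** 2
print(loss, new_loss)  # the update reduces the loss
```

Rerunning the forward pass with the updated w and b gives a smaller loss, which is the whole point of the loop: gradients turn "how wrong are we" into "which way to move each parameter."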
Different frameworks wrap these steps differently, but the underlying loop is the same:
- a model() call via Module.__call__(...)
- a loss.backward() call on a scalar loss
- an optimizer.step() such as SGD.step(...)

The library's basic value type is Tensor. For now, you can treat a tensor as "an array-like value that may also carry gradient-tracking metadata." Section 4 will unpack that in detail.

2. Why Call loss.backward() At All

A loss is just a single number that says, in some way, how wrong the model currently is. In the tiny example below, error ** 2 turns "prediction versus target" into one scalar error. We usually call backward on a scalar loss because that gives backpropagation one clear starting point: the gradient of the loss with respect to itself is 1. That scalar tells you how the model is doing, but it does not yet tell you how to improve w and b, the (trainable) weights of the model. To update the parameters, you need a more specific question answered: if I nudge each trainable parameter a little, how does the loss change? That quantity is the gradient.

It helps to separate the forward and backward passes:
- In the forward pass (e.g. model() or model.forward(), or even just a + b), you run the model on some inputs to produce pred, then keep going until that prediction is turned into a scalar loss. And yes, a simple math operation like a + b is considered a "forward pass" by the deep learning library.
- In the backward pass, you start at loss and walk backward through the same chain, asking how a small change in each earlier trainable parameter would change that loss.

More concretely, in the running example above, the forward values are prod = 6, pred = 7, error = -1, and loss = 1. In that forward pass (i.e. x * w + b), the model predicts 7 while the target (i.e. y_true) is 8, so the squared error is 1.

What does requires_grad mean? Not every tensor/parameter is trainable. w and b are trainable because they are the quantities we want to update.
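The "nudge each parameter a little" question can be checked numerically without any autograd at all: perturb one parameter by a tiny ε and watch how the scalar loss moves. This finite-difference sketch is my own sanity check of the running example (x = 2, w = 3, b = 1, y_true = 8), not code from the library:

```python
# Finite-difference check of the gradients in the running example:
# x = 2, w = 3, b = 1, y_true = 8  =>  prod = 6, pred = 7, error = -1, loss = 1
def loss_fn(w, b, x=2.0, y_true=8.0):
    pred = x * w + b
    return (pred - y_true) ** 2

eps = 1e-6
base = loss_fn(3.0, 1.0)                          # 1.0
w_grad = (loss_fn(3.0 + eps, 1.0) - base) / eps   # ~ -4.0
b_grad = (loss_fn(3.0, 1.0 + eps) - base) / eps   # ~ -2.0
print(round(w_grad, 3), round(b_grad, 3))         # -4.0 -2.0
```

The numbers agree with the analytic gradients quoted later (w.grad = -4, b.grad = -2). Real libraries avoid this approach because it costs one extra forward pass per parameter; backpropagation gets every gradient from a single backward walk.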
x and y_true are non-trainable in this example because they are givens of the current training example, not knobs the optimizer should tune, even though they still affect the loss.

For trainable leaf tensors, part of that bookkeeping is a .grad field. After loss.backward(), .grad stores the gradient of the loss with respect to that tensor. The optimizer later reads those stored gradients on trainable leaves like w and b. As a teaching choice, this library also keeps .grad on intermediate tensors with requires_grad=True, so pred.grad is available here even though PyTorch would usually require retain_grad() for a non-leaf tensor. In the running example above, loss.backward() gives pred.grad = -2, w.grad = -4, and b.grad = -2.

So trainable (requires_grad=True) versus non-trainable (requires_grad=False) is not just a labeling detail. requires_grad controls gradient bookkeeping, while optimizers usually update only the parameter set you hand them. In many frameworks, frozen parameters are practically protected because autograd leaves .grad=None, and optimizers skip parameters whose gradient is None. So loss.backward(
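The bookkeeping this section describes can be condensed into a tiny scalar autograd class in the micrograd style. This is my own minimal reconstruction under the post's assumptions, not the library's actual Tensor code: each Value remembers its parents and a local chain-rule step, backward() walks the graph in reverse topological order starting from grad = 1, and requires_grad gates whether .grad is accumulated.

```python
class Value:
    """A scalar that remembers how it was computed (micrograd-style sketch)."""
    def __init__(self, data, parents=(), requires_grad=False):
        self.data = data
        self.grad = 0.0
        self.requires_grad = requires_grad or any(p.requires_grad for p in parents)
        self._parents = parents
        self._backward = lambda: None  # local chain-rule step, set by each op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            if self.requires_grad:  self.grad += out.grad   # d(a+b)/da = 1
            if other.requires_grad: other.grad += out.grad  # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def _backward():
            if self.requires_grad:  self.grad += out.grad   # d(a-b)/da = 1
            if other.requires_grad: other.grad -= out.grad  # d(a-b)/db = -1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            if self.requires_grad:  self.grad += other.data * out.grad  # d(a*b)/da = b
            if other.requires_grad: other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # reverse topological order: each node fires after all its consumers
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0  # gradient of the loss with respect to itself
        for v in reversed(topo):
            v._backward()

# The running example: x = 2, w = 3, b = 1, y_true = 8
x, y_true = Value(2.0), Value(8.0)  # givens: requires_grad stays False
w = Value(3.0, requires_grad=True)
b = Value(1.0, requires_grad=True)
pred = x * w + b          # forward: prod = 6, pred = 7
error = pred - y_true     # -1
loss = error * error      # 1
loss.backward()
print(pred.grad, w.grad, b.grad)  # -2.0 -4.0 -2.0
```

Because x and y_true never set requires_grad, their _backward branches are skipped and their .grad stays zero, which is exactly the "frozen parameters are practically protected" behavior described above, shown here in miniature.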

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
