YOLOv3 Paper Walkthrough: Much Better, But Not by Much
Towards Data Science
💼 Business
#pytorch
#tip
#yolov3
#object-detection
#model-implementation
#deep-learning
Original source: Towards Data Science · Summarized and analyzed by Genesis Park
Summary
This article walks step by step through implementing the YOLOv3 architecture from scratch in PyTorch, based on the YOLOv3 paper. The author analyzes the model's improved performance and technical details, concluding that the gains over the previous version are too modest to set it far apart.
Article
In this article I am going to talk about the modifications the authors made to YOLOv2 to create YOLOv3, and how to implement the model architecture from scratch with PyTorch. I highly recommend reading my previous articles about YOLOv1 [2, 3] and YOLOv2 [4] before this one, unless you already have a strong foundation in how these two earlier versions of YOLO work.

What Makes YOLOv3 Better Than YOLOv2

The Vanilla Darknet-53

The modifications the authors made were mainly related to the architecture, in which they proposed a backbone model referred to as Darknet-53. See the detailed structure of this network in Figure 1. As the name suggests, this model is an improvement upon the Darknet-19 used in YOLOv2. If you count the number of layers in Darknet-53, you will find that this network consists of 52 convolution layers and a single fully-connected layer at the end. Keep in mind that later, when we use it in YOLOv3, we will feed it with images of size 416×416 rather than 256×256 as written in the figure.

If you're familiar with Darknet-19, you will remember that it performs spatial downsampling using maxpooling operations after every stack of several convolution layers. In Darknet-53, the authors replaced these pooling operations with convolutions of stride 2. This was done because a maxpooling layer completely discards the non-maximum values, causing us to lose the information contained in the lower-intensity pixels. We could use average pooling as an alternative, but in theory this approach is not optimal either, because all pixels within the small region are weighted equally. As a solution, the authors decided to use a convolution layer with a stride of 2, which allows the model to reduce the image resolution while still capturing spatial information with learned weightings. You can see the illustration of this in Figure 2 below.
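The difference between the two downsampling strategies can be sketched in a few lines of PyTorch. This is a minimal illustration, not code from the article; the channel counts here are arbitrary:

```python
import torch
import torch.nn as nn

# A dummy feature map: batch of 1, 64 channels, 416x416 resolution.
x = torch.randn(1, 64, 416, 416)

# Darknet-19 style: maxpooling keeps only the maximum in each 2x2 window,
# discarding the other three values entirely.
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)

# Darknet-53 style: a stride-2 convolution also halves the resolution,
# but computes a learned weighted combination of every pixel in its
# receptive field instead of throwing the non-maxima away.
downsample = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
convolved = downsample(x)

print(pooled.shape)     # torch.Size([1, 64, 208, 208])
print(convolved.shape)  # torch.Size([1, 128, 208, 208])
```

Both operations halve the spatial resolution, but only the convolution has trainable parameters, which is the point of the swap.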
Next, the backbone of this YOLO version is now equipped with residual blocks, an idea originating from ResNet. One thing I want to emphasize regarding our implementation is the activation function within the residual block. You can see in Figure 3 below that, according to the original ResNet paper, the second activation function is placed after the element-wise summation. However, based on the other tutorials I read [6, 7], I found that in the case of YOLOv3 the second activation function is placed right after the weight layer instead (before the summation). So later in the implementation, I decided to follow the guide in these tutorials, since the YOLOv3 paper does not give any explanation about it.

Darknet-53 With Detection Heads

Keep in mind that the architecture in Figure 1 is only meant for classification. Thus, we need to replace everything after the last residual block if we want to make it compatible with detection tasks. Again, the original YOLOv3 paper does not provide a detailed implementation guide either, hence I decided to search for one and eventually found it in the paper referenced as [9]. I redrew the illustration from that paper to make the architecture look clearer, as shown in Figure 4 below. There are actually lots of things to explain regarding this architecture. Let's start from the part I refer to as the detection heads. Different from the previous YOLO versions, which relied on a single head, here in YOLOv3 we have 2 additional heads. Thus, we will later have 3 prediction tensors for every single input image. These three detection heads have different specializations: the leftmost head (13×13) is responsible for detecting large objects, the middle head (26×26) detects medium-sized objects, and the one on the right (52×52) detects small objects.
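The residual block with the activation ordering described above can be sketched as follows. This is my own minimal PyTorch sketch following the tutorials cited in the text, not the authors' exact code; the 1×1-then-3×3 bottleneck and LeakyReLU slope of 0.1 match common Darknet-53 implementations:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Darknet-53-style residual block (a sketch, assumed structure).
    Unlike the original ResNet ordering, the second activation is applied
    right after the second weight layer, BEFORE the skip summation."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution halves the channels, 3x3 restores them.
        self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.act(self.bn2(self.conv2(out)))  # activation before summation
        return x + out  # no activation after the skip connection

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 52, 52))
print(y.shape)  # torch.Size([1, 64, 52, 52])
```

Because the input and output channel counts match, these blocks can be stacked freely, which is how Darknet-53 builds its deep stages.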
We can think of the 52×52 tensor as the feature map that contains the most detailed representation of the image, and it is hence suitable for detecting small objects. Conversely, the 13×13 prediction tensor is meant to detect large objects, because its lower spatial resolution is effective at capturing the general shape of an object. Staying with the detection heads, you can also see in Figure 4 that the three prediction tensors each have 255 channels. To understand where this number comes from, we first need to know that each detection head has 3 prior boxes. Following the rule given in YOLOv2, each of these prior boxes is configured such that it can predict its own object category independently. With this mechanism, the feature vector of each grid cell can be obtained by computing B×(5+C), where B is the number of prior boxes, C is the number of object classes, and 5 covers xywh and the bounding box confidence (a.k.a. objectness). In the case of YOLOv3, each detection head has 3 prior boxes and 80 classes, assuming we train it on the 80-class COCO dataset. By plugging these numbers into the formula, we obtain 3×(5+80)=255 prediction values for a single grid cell. In fact, using a multi-head mechanism like this allows the model to detect more objects as compared to the previous single-head versions.
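The B×(5+C) arithmetic and the resulting output shapes can be checked in a couple of lines. The shapes below assume a 416×416 input and batch size 1, as used throughout the article:

```python
# Channel count of each YOLOv3 prediction tensor: B * (5 + C)
B = 3    # prior (anchor) boxes per grid cell
C = 80   # object classes in the COCO dataset
channels = B * (5 + C)  # 5 = x, y, w, h, objectness
print(channels)  # 255

# The three detection heads then produce tensors of these shapes
# (large, medium, and small objects respectively):
shapes = [(1, channels, g, g) for g in (13, 26, 52)]
print(shapes)
# [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)]
```

Note that 13, 26, and 52 are exactly 416 divided by the strides 32, 16, and 8 of the three heads.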
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.