When we talk about object detection, one model that likely comes to mind first is YOLO — well, at least for me, because of its popularity in the field of computer vision.
The very first version of this model, known as YOLOv1, was released back in 2015 in the research paper titled “You Only Look Once: Unified, Real-Time Object Detection” [1]. Before YOLOv1 was invented, one of the state-of-the-art algorithms for object detection was R-CNN (Region-based Convolutional Neural Network), which uses a multi-stage mechanism to do the task. It initially employs a selective search algorithm to create region proposals, then uses a CNN-based model to extract features from all of these regions, and finally classifies the detected objects using an SVM [2]. You can clearly imagine how long this process takes just to perform object detection on a single image.
The motivation behind YOLO in the first place was to improve speed. In fact, the authors proved that their proposed deep learning model not only achieved low computational complexity but was also able to reach high accuracy. As this article is being written, YOLOv13 was published just a few days ago [3]. But let's just talk about its very first ancestor for now so that you can appreciate the beauty of this model from the time it first came out. This article discusses how YOLOv1 works and how to build the neural network architecture from scratch with PyTorch.
The Underlying Theory Behind YOLOv1
Before we get into the architecture, it would be better to understand the idea behind YOLOv1 first. Let's start with an example. Suppose we have a picture of a cat, and we are going to use it as a training sample for a YOLOv1 model. To do so, we need to create a ground truth for it. The original paper defines the parameter S, which denotes the number of grid cells we divide the image into along each spatial dimension. By default, this parameter is set to 7, so we will have 7×7 = 49 cells in total. Take a look at Figure 1 below to better understand this idea.
Next, we need to determine which cell corresponds to the midpoint of the object. In the above case, the cat is located almost exactly at the center of the image, hence the midpoint must lie in cell (3, 3). Later, in the inference phase, we will consider this cell as the one responsible for predicting the cat. Now, taking a closer look at the cell, we need to determine the exact position of the midpoint. Here you can see that along the vertical axis it is located exactly in the middle, but along the horizontal axis it is slightly shifted to the left of the center. So, if I were to approximate it, the coordinate would be (0.4, 0.5). This coordinate value is relative to the cell and is normalized to the range of 0 to 1. It is worth noting that the (x, y) coordinate of the midpoint should be neither lower than 0 nor greater than 1, since a value outside this range would mean the midpoint lies in another cell. Meanwhile, the width and the height of the bounding box are roughly 2.4 and 3.2, respectively. These numbers are relative to the cell size, meaning that if the object is larger than the cell, the value will be greater than 1. Later, when we create a ground truth for an image, we need to store all this x, y, w, and h information in the so-called target vector.
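To make this encoding more concrete, here is a minimal sketch (my own, not from the paper) of how a bounding box given in absolute pixel coordinates could be converted into the cell-relative values described above. The corner-based box format and the helper name encode_box are assumptions made purely for illustration.

# A hypothetical helper for illustration: converts an absolute (x_min, y_min,
# x_max, y_max) box into the cell-relative encoding described above.
def encode_box(x_min, y_min, x_max, y_max, img_w, img_h, S=7):
    cell_w, cell_h = img_w / S, img_h / S

    # midpoint of the box in absolute pixel coordinates
    mid_x = (x_min + x_max) / 2
    mid_y = (y_min + y_max) / 2

    # the cell responsible for this object
    col, row = int(mid_x // cell_w), int(mid_y // cell_h)

    # midpoint position relative to that cell, normalized to 0..1
    x = (mid_x - col * cell_w) / cell_w
    y = (mid_y - row * cell_h) / cell_h

    # width and height relative to the cell size (may exceed 1)
    w = (x_max - x_min) / cell_w
    h = (y_max - y_min) / cell_h
    return row, col, x, y, w, h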
Target Vector
The length of the target vector itself is 25 for every cell, in which the first 20 elements (indices 0 to 19) store the class of the object in the form of one-hot encoding. This is because YOLOv1 was originally trained on the PASCAL VOC dataset, which has exactly that number of classes. Next, index 20 is used to store the confidence of the bounding box prediction, which in the training phase is set to 1 whenever an object midpoint falls within the cell. Lastly, the (x, y) coordinates of the midpoint are placed at indices 21 and 22, whereas w and h are stored at indices 23 and 24. The illustration in Figure 2 below displays what the target vector for cell (3, 3) looks like.

Again, remember that the above target vector only corresponds to a single cell. To create the ground truth for the entire image, we need to concatenate a group of similar vectors, one for each cell, as shown in Figure 3. Note that the class probabilities as well as the bounding box confidences, locations, and sizes of all other cells are set to zero because there is no other object appearing in the image.
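Continuing the hypothetical encode_box sketch from earlier, the per-cell target vectors could then be assembled into the full 7×7×25 ground truth for an image that contains only the cat. The box coordinates and the class index used here are made up purely for illustration.

import torch

label = torch.zeros(7, 7, 25)          # one 25-element target vector per cell

# hypothetical box and class index, just to show where the values go
row, col, x, y, w, h = encode_box(100, 80, 420, 400, img_w=448, img_h=448)
class_idx = 7                          # assumed index of "cat", for illustration only

label[row, col, class_idx] = 1.0       # one-hot class probability
label[row, col, 20] = 1.0              # confidence: an object midpoint lies in this cell
label[row, col, 21:25] = torch.tensor([x, y, w, h])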

Prediction Vector
The prediction vector is a bit different. While the target vector consists of 25 elements, the prediction vector consists of 30. This is because, by default, YOLOv1 predicts two bounding boxes for the same object during inference. Thus, we need 5 additional elements to store the information about the second bounding box generated by the model. Despite predicting two bounding boxes, later we will only take the one with the greater confidence.
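As a small illustration of that last point, here is a sketch (my own, following the vector layout described above) of how the box with the greater confidence could be picked from a single 30-element prediction vector.

import torch

pred = torch.randn(30)                 # prediction vector for one cell (dummy values)

class_probs = pred[0:20]               # 20 class scores
conf1, box1 = pred[20], pred[21:25]    # first box: confidence and (x, y, w, h)
conf2, box2 = pred[25], pred[26:30]    # second box: confidence and (x, y, w, h)

best_box = box1 if conf1 >= conf2 else box2
best_conf = torch.maximum(conf1, conf2)
predicted_class = class_probs.argmax()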

These unusual target and prediction vector structures required the authors to rethink the loss function. For regression problems we typically use MAE, MSE, or RMSE, whereas for classification tasks we usually use cross entropy loss. But YOLOv1 is more than just a regression or classification problem, considering that we have both continuous (bounding box) and discrete (class) values in the vector representation. For this reason, the authors created a new loss function specialized for this model, as shown in Figure 5. This loss function is quite complex (you see, right?), so I decided to cover it in a separate article because there are many things to explain about it — stay tuned, I'll publish it very soon.
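For reference while that article is on the way, here is the loss from Figure 5 (i.e., the one in the original paper) written out, where $\mathbb{1}_{ij}^{obj}$ equals 1 if the $j$-th box predictor in cell $i$ is responsible for the object, $\mathbb{1}_i^{obj}$ equals 1 if any object appears in cell $i$, and $\lambda_{coord} = 5$, $\lambda_{noobj} = 0.5$:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$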

The YOLOv1 Architecture
Just like typical earlier computer vision models, YOLOv1 uses a CNN-based architecture as the backbone of the model. It comprises 24 convolution layers stacked according to the structure in Figure 6. If you take a closer look at the figure, you'll notice that the output layer produces a tensor of shape 30×7×7. This shape indicates that every single cell has its corresponding prediction vector of length 30 containing the class and bounding box information of the detected object, which matches exactly with our previous discussion.

Well, I think I've covered all the basics of YOLOv1, so now let's start implementing the architecture from scratch with PyTorch. Before doing anything else, we first need to import the required modules and initialize the parameters S, B, and C. See Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn
S = 7
B = 2
C = 20
The three parameters I initialized above are the default values given in the paper, where S represents the number of grid cells along the horizontal and vertical axes, B denotes the number of bounding boxes generated by each cell, and C is the number of classes available in the dataset. Since we use S=7 and B=2, our YOLOv1 will produce 7×7×2 = 98 bounding boxes in total for each image.
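To double-check those numbers, a couple of quick lines of arithmetic:

# Quick sanity check on the defaults above.
total_boxes = S * S * B              # 7 * 7 * 2 = 98 boxes per image
output_length = (C + B * 5) * S * S  # (20 + 10) * 7 * 7 = 1470 output values
print(total_boxes, output_length)    # 98 1470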
The Building Block
Next, we’re going to create the ConvBlock class, during which it accommodates a single convolution layer (line #(1)), a leaky ReLU activation function (#(2)), and an optional maxpooling layer (#(3)) as shown in Codeblock 2.
# Codeblock 2
class ConvBlock(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 padding,
                 maxpool_flag=False):
        super().__init__()

        self.maxpool_flag = maxpool_flag
        self.conv = nn.Conv2d(in_channels=in_channels,    #(1)
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              stride=stride,
                              padding=padding)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)    #(2)

        if self.maxpool_flag:
            self.maxpool = nn.MaxPool2d(kernel_size=2,    #(3)
                                        stride=2)

    def forward(self, x):
        print(f'original\t\t: {x.size()}')
        x = self.conv(x)
        print(f'after conv\t: {x.size()}')
        x = self.leaky_relu(x)
        print(f'after leaky relu: {x.size()}')
        if self.maxpool_flag:
            x = self.maxpool(x)
            print(f'after maxpool\t: {x.size()}')
        return x
In modern architectures, we normally place a batch normalization layer between the convolution and the activation, but at the time YOLOv1 was created, batch normalization was not quite popular yet, as it came out only several months before YOLOv1. So, I suppose this could be the reason the authors didn't utilize this normalization layer. Instead, the network only uses a stack of convolutions and leaky ReLUs throughout.
Just a quick refresher: leaky ReLU is an activation function similar to the standard ReLU, except that negative values are multiplied by a small number instead of being zeroed out. In the case of YOLOv1, we set the multiplier to 0.1 (#(2)) so that the activation can still preserve a small amount of the information contained in the negative inputs.
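Written as a formula, with the 0.1 slope used here:

$$
\text{LeakyReLU}(x) =
\begin{cases}
x, & x \geq 0 \\
0.1\,x, & x < 0
\end{cases}
$$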

Now that the ConvBlock class has been defined, I am going to test it just to check whether it works properly. In Codeblock 3 below, I implement the very first layer of the network and pass a dummy tensor through it. You can see in the codeblock that in_channels is set to 3 (#(1)) and out_channels is set to 64 (#(2)) because we want this initial layer to accept an RGB image as the input and return a 64-channel image. The size of the kernel is 7×7 (#(3)), hence we need to set the padding to 3 (#(5)). Normally, this configuration allows us to preserve the spatial dimension of the image, but since we use stride=2 (#(4)), this padding size ensures that the image is exactly halved. Next, if you go back to Figure 6, you'll notice that some conv layers are followed by a maxpooling layer and some others are not. Since the first convolution utilizes a maxpooling layer, we need to set the maxpool_flag parameter to True (#(6)).
# Codeblock 3
convblock = ConvBlock(in_channels=3,        #(1)
                      out_channels=64,      #(2)
                      kernel_size=7,        #(3)
                      stride=2,             #(4)
                      padding=3,            #(5)
                      maxpool_flag=True)    #(6)

x = torch.randn(1, 3, 448, 448)    #(7)
out = convblock(x)
Afterwards, we simply generate a tensor of random values with dimensions 1×3×448×448 (#(7)), which simulates a batch containing a single RGB image of size 448×448, and then pass it through the layer. You can see in the resulting output below that our convolution layer successfully increased the number of channels to 64 and halved the spatial dimension to 224×224. The halving was done once more, all the way down to 112×112, thanks to the maxpooling layer.
# Codeblock 3 Output
original : torch.Size([1, 3, 448, 448])
after conv : torch.Size([1, 64, 224, 224])
after leaky relu : torch.Size([1, 64, 224, 224])
after maxpool : torch.Size([1, 64, 112, 112])
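These spatial sizes can also be verified with the standard convolution/pooling output formula, where $H_{in}$ is the input size, $k$ the kernel size, $p$ the padding, and $s$ the stride:

$$
H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1
$$

Plugging in the numbers for the conv layer gives $\lfloor (448 + 2 \cdot 3 - 7)/2 \rfloor + 1 = 224$, and for the 2×2 maxpooling with stride 2 we get $\lfloor (224 - 2)/2 \rfloor + 1 = 112$, matching the output above.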
The Backbone
The next thing we are going to do is create a sequence of ConvBlocks to construct the entire backbone of the network. If you're not familiar with the term backbone, in this case it is essentially everything before the two fully-connected layers (refer to Figure 6). Now take a look at Codeblocks 4a and 4b below to see how I define the Backbone class.
# Codeblock 4a
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()

        # in_channels, out_channels, kernel_size, stride, padding, maxpool_flag
        self.stage0 = ConvBlock(3, 64, 7, 2, 3, maxpool_flag=True)      #(1)
        self.stage1 = ConvBlock(64, 192, 3, 1, 1, maxpool_flag=True)    #(2)

        self.stage2 = nn.ModuleList([
            ConvBlock(192, 128, 1, 1, 0),
            ConvBlock(128, 256, 3, 1, 1),
            ConvBlock(256, 256, 1, 1, 0),
            ConvBlock(256, 512, 3, 1, 1, maxpool_flag=True)             #(3)
        ])

        self.stage3 = nn.ModuleList([])
        for _ in range(4):
            self.stage3.append(ConvBlock(512, 256, 1, 1, 0))
            self.stage3.append(ConvBlock(256, 512, 3, 1, 1))
        self.stage3.append(ConvBlock(512, 512, 1, 1, 0))
        self.stage3.append(ConvBlock(512, 1024, 3, 1, 1, maxpool_flag=True))    #(4)

        self.stage4 = nn.ModuleList([])
        for _ in range(2):
            self.stage4.append(ConvBlock(1024, 512, 1, 1, 0))
            self.stage4.append(ConvBlock(512, 1024, 3, 1, 1))
        self.stage4.append(ConvBlock(1024, 1024, 3, 1, 1))
        self.stage4.append(ConvBlock(1024, 1024, 3, 2, 1))    #(5)

        self.stage5 = nn.ModuleList([])
        self.stage5.append(ConvBlock(1024, 1024, 3, 1, 1))
        self.stage5.append(ConvBlock(1024, 1024, 3, 1, 1))
What we do in the above codeblock is instantiate ConvBlock instances according to the architecture given in the paper. There are several things I want to emphasize here. First, the term stage I use in the code is not explicitly mentioned in the paper; I simply decided to use that word to describe the six groups of convolutional layers in Figure 6. Second, notice that we need to set maxpool_flag to True for the last ConvBlock in the first four groups to perform spatial downsampling (#(1–4)). For the fifth group, the downsampling is done by setting the stride of the last convolution layer to 2 (#(5)). Third, Figure 6 doesn't mention the padding size of the convolution layers, so we need to work them out manually. There is indeed a specific formula for finding the padding size based on the given kernel size, but I find it easier to memorize: if we use a kernel of size 7×7, we need to set the padding to 3 to preserve the spatial dimension, whereas for 5×5, 3×3 and 1×1 kernels, the padding should be set to 2, 1, and 0, respectively.
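For completeness, the formula behind that memorized rule is the "same padding" condition for stride-1 convolutions:

$$
p = \frac{k - 1}{2}
$$

which gives $p = 3, 2, 1, 0$ for $k = 7, 5, 3, 1$, respectively.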
Now that all layers in the backbone have been instantiated, we can connect them all using the forward() method below. I don't think I need to explain anything here, since it basically just passes the input tensor x through the layers sequentially.
# Codeblock 4b
    def forward(self, x):
        print(f'original\t: {x.size()}\n')

        x = self.stage0(x)
        print(f'after stage0\t: {x.size()}\n')

        x = self.stage1(x)
        print(f'after stage1\t: {x.size()}\n')

        for i in range(len(self.stage2)):
            x = self.stage2[i](x)
            print(f'after stage2 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage3)):
            x = self.stage3[i](x)
            print(f'after stage3 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage4)):
            x = self.stage4[i](x)
            print(f'after stage4 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage5)):
            x = self.stage5[i](x)
            print(f'after stage5 #{i}\t: {x.size()}')

        return x
Now let’s confirm if our implementation is correct by running the next testing code.
# Codeblock 5
backbone = Backbone()
x = torch.randn(1, 3, 448, 448)
out = backbone(x)
If you run the above codeblock, the following output should appear on your screen. Here you can see that the spatial dimension of the image correctly gets reduced after the last ConvBlock of every stage. This process continues all the way to the last stage until we eventually obtain a tensor of size 1024×7×7, which matches exactly with the illustration in Figure 6.
# Codeblock 5 Output
original : torch.Size([1, 3, 448, 448])
after stage0 : torch.Size([1, 64, 112, 112])
after stage1 : torch.Size([1, 192, 56, 56])
after stage2 #0 : torch.Size([1, 128, 56, 56])
after stage2 #1 : torch.Size([1, 256, 56, 56])
after stage2 #2 : torch.Size([1, 256, 56, 56])
after stage2 #3 : torch.Size([1, 512, 28, 28])
after stage3 #0 : torch.Size([1, 256, 28, 28])
after stage3 #1 : torch.Size([1, 512, 28, 28])
after stage3 #2 : torch.Size([1, 256, 28, 28])
after stage3 #3 : torch.Size([1, 512, 28, 28])
after stage3 #4 : torch.Size([1, 256, 28, 28])
after stage3 #5 : torch.Size([1, 512, 28, 28])
after stage3 #6 : torch.Size([1, 256, 28, 28])
after stage3 #7 : torch.Size([1, 512, 28, 28])
after stage3 #8 : torch.Size([1, 512, 28, 28])
after stage3 #9 : torch.Size([1, 1024, 14, 14])
after stage4 #0 : torch.Size([1, 512, 14, 14])
after stage4 #1 : torch.Size([1, 1024, 14, 14])
after stage4 #2 : torch.Size([1, 512, 14, 14])
after stage4 #3 : torch.Size([1, 1024, 14, 14])
after stage4 #4 : torch.Size([1, 1024, 14, 14])
after stage4 #5 : torch.Size([1, 1024, 7, 7])
after stage5 #0 : torch.Size([1, 1024, 7, 7])
after stage5 #1 : torch.Size([1, 1024, 7, 7])
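As an extra sanity check (my own addition, not part of the original flow), we can also count the Conv2d modules inside the backbone instance created in Codeblock 5 and confirm that they add up to the 24 convolution layers mentioned earlier.

# Count the convolution layers in the backbone: 1 + 1 + 4 + 10 + 6 + 2 = 24.
num_convs = sum(1 for m in backbone.modules() if isinstance(m, nn.Conv2d))
print(num_convs)    # 24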
The Fully-Connected Layers
With the backbone done, we can now move on to the fully-connected part, which I write in Codeblock 6 below. This part of the network is very simple, as it mainly consists of just two linear layers. Regarding the details, it's mentioned in the paper that the authors apply a dropout layer with a rate of 0.5 (#(3)) between the first (#(1)) and the second (#(4)) linear layers. It is important to note that the leaky ReLU activation function is still used (#(2)), but only after the first linear layer. This is because the second one acts as the output layer, hence it doesn't require any activation applied to it.
# Codeblock 6
class FullyConnected(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear0 = nn.Linear(in_features=1024*7*7, out_features=4096)       #(1)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)                      #(2)
        self.dropout = nn.Dropout(p=0.5)                                        #(3)
        self.linear1 = nn.Linear(in_features=4096, out_features=(C+B*5)*S*S)    #(4)

    def forward(self, x):
        print(f'original\t: {x.size()}')
        x = self.linear0(x)
        print(f'after linear0\t: {x.size()}')
        x = self.leaky_relu(x)
        x = self.dropout(x)
        x = self.linear1(x)
        print(f'after linear1\t: {x.size()}')
        return x
Run Codeblock 7 below to see how the tensor transforms as it is processed by the stack of linear layers.
# Codeblock 7
fc = FullyConnected()
x = torch.randn(1, 1024*7*7)
out = fc(x)
# Codeblock 7 Output
original : torch.Size([1, 50176])
after linear0 : torch.Size([1, 4096])
after linear1 : torch.Size([1, 1470])
We can see in the above output that the fc block takes an input of length 50176, which is essentially the flattened 1024×7×7 tensor. The linear0 layer maps this input into a 4096-dimensional vector, and the linear1 layer eventually maps it further to 1470, which corresponds to (C + B×5) × S × S = (20 + 10) × 7 × 7. Later, in the post-processing stage, we need to reshape it to 30×7×7 so that we can extract the bounding box and object classification results easily. Technically speaking, this reshaping can be done either inside or outside the model. For the sake of simplicity, I decided to leave the output flattened, meaning the reshaping will be handled externally.
Connecting the FC Part to the Backbone
At this point we already have both the backbone and the fully-connected layers done, so they are now ready to be assembled into the complete YOLOv1 architecture. There is not much I can explain regarding the following code, as all we do here is instantiate both parts and connect them in the forward() method. Just don't forget to flatten (#(1)) the output of the backbone to make it compatible with the input of the fc block.
# Codeblock 8
class YOLOv1(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()
        self.fc = FullyConnected()

    def forward(self, x):
        x = self.backbone(x)
        x = torch.flatten(x, start_dim=1)    #(1)
        x = self.fc(x)
        return x
In order to test our model, we simply instantiate the YOLOv1 model and pass a dummy tensor that simulates an RGB image of size 448×448 (#(1)). After feeding the tensor into the network (#(2)), I also simulate the post-processing step by reshaping the output tensor to 30×7×7, as shown at line #(3).
# Codeblock 9
yolov1 = YOLOv1()

x = torch.randn(1, 3, 448, 448)       #(1)
out = yolov1(x)                       #(2)
out = out.reshape(-1, C+B*5, S, S)    #(3)
print(out.shape)
And below is what the output looks like after running the code above. Here you can see that our input tensor successfully flows through all layers inside the entire network, indicating that our YOLOv1 model works properly and is thus ready to be trained.
# Codeblock 9 Output
original : torch.Size([1, 3, 448, 448])
after stage0 : torch.Size([1, 64, 112, 112])
after stage1 : torch.Size([1, 192, 56, 56])
after stage2 #0 : torch.Size([1, 128, 56, 56])
after stage2 #1 : torch.Size([1, 256, 56, 56])
after stage2 #2 : torch.Size([1, 256, 56, 56])
after stage2 #3 : torch.Size([1, 512, 28, 28])
after stage3 #0 : torch.Size([1, 256, 28, 28])
after stage3 #1 : torch.Size([1, 512, 28, 28])
after stage3 #2 : torch.Size([1, 256, 28, 28])
after stage3 #3 : torch.Size([1, 512, 28, 28])
after stage3 #4 : torch.Size([1, 256, 28, 28])
after stage3 #5 : torch.Size([1, 512, 28, 28])
after stage3 #6 : torch.Size([1, 256, 28, 28])
after stage3 #7 : torch.Size([1, 512, 28, 28])
after stage3 #8 : torch.Size([1, 512, 28, 28])
after stage3 #9 : torch.Size([1, 1024, 14, 14])
after stage4 #0 : torch.Size([1, 512, 14, 14])
after stage4 #1 : torch.Size([1, 1024, 14, 14])
after stage4 #2 : torch.Size([1, 512, 14, 14])
after stage4 #3 : torch.Size([1, 1024, 14, 14])
after stage4 #4 : torch.Size([1, 1024, 14, 14])
after stage4 #5 : torch.Size([1, 1024, 7, 7])
after stage5 #0 : torch.Size([1, 1024, 7, 7])
after stage5 #1 : torch.Size([1, 1024, 7, 7])
original : torch.Size([1, 50176])
after linear0 : torch.Size([1, 4096])
after linear1 : torch.Size([1, 1470])
torch.Size([1, 30, 7, 7])
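Just to close the loop on the earlier discussion of the prediction vector, below is a minimal post-processing sketch (my own, not from the paper) that reads, for every cell, the predicted class and the better of the two boxes from this reshaped tensor. In practice you would also apply a confidence threshold and non-maximum suppression, which are beyond the scope of this article.

# A rough post-processing sketch following the vector layout discussed earlier.
pred = out[0]                                      # (30, 7, 7) — single image

class_probs = pred[0:20]                           # (20, 7, 7) class scores
conf  = torch.stack([pred[20], pred[25]])          # (2, 7, 7) confidences of the two boxes
boxes = torch.stack([pred[21:25], pred[26:30]])    # (2, 4, 7, 7) box coordinates

best_idx   = conf.argmax(dim=0)                    # (7, 7) which box is better per cell
best_conf  = conf.max(dim=0).values                # (7, 7)
best_class = class_probs.argmax(dim=0)             # (7, 7)

# e.g., inspect cell (3, 3), the one responsible for the cat in our example
print(best_class[3, 3], best_conf[3, 3], boxes[best_idx[3, 3], :, 3, 3])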
Ending
It is worth noting that all the code I have shown throughout this article is for the base YOLOv1 architecture. The paper mentions that the authors also proposed a lite version of this model, which they refer to as Fast YOLO. This smaller version offers faster computation since it only consists of 9 convolution layers instead of 24. Unfortunately, the paper doesn't provide the implementation details, so I cannot show you how to implement that one.
Here I encourage you to play around with the above code. In theory, it is possible to replace the CNN-based backbone with other deep learning models, such as ResNet, ResNeXt, ViT, etc. All you need to do is match the output shape of the backbone with the input shape of the fully-connected part, as sketched after this paragraph. Not only that, I also want you to try training this model from scratch. But if you decide to do so, you will probably need to make the model smaller by reducing its depth (number of convolution layers) or width (number of kernels). This is because the authors mentioned that they needed around a week just to do the pretraining on the ImageNet dataset, not to mention the time for fine-tuning on the object detection task.
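As a rough illustration of the backbone swap (a sketch under my own assumptions, not something from the paper), the snippet below plugs a torchvision ResNet-50 in front of the same FullyConnected head. It assumes a reasonably recent torchvision; since ResNet-50 turns a 448×448 input into a 2048×14×14 feature map, a hypothetical extra strided convolution is added to reach the 1024×7×7 shape the head expects.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResNetYOLOv1(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = resnet50(weights=None)
        # keep everything except the average pooling and classification head
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # assumed adapter layer: 2048×14×14 -> 1024×7×7
        self.adapter = nn.Conv2d(2048, 1024, kernel_size=3, stride=2, padding=1)
        self.fc = FullyConnected()    # the same head defined in Codeblock 6

    def forward(self, x):
        x = self.backbone(x)                 # (N, 2048, 14, 14) for a 448×448 input
        x = self.adapter(x)                  # (N, 1024, 7, 7)
        x = torch.flatten(x, start_dim=1)
        return self.fc(x)

After that, the output can be reshaped to 30×7×7 exactly as before.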
And well, I think that's pretty much everything I can explain about how YOLOv1 works and its architecture. Please let me know if you spot any mistakes in this article. Thanks!
References
[1] Joseph Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. ArXiv. https://arxiv.org/pdf/1506.02640 [Accessed July 5, 2025].
[2] Ross Girshick et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. ArXiv. https://arxiv.org/pdf/1311.2524 [Accessed July 5, 2025].
[3] Mengqi Lei et al. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. ArXiv. https://arxiv.org/abs/2506.17733 [Accessed July 5, 2025].
[4] Image generated by the author with Gemini, edited by the author.
[5] Image originally created by the author.
[6] Bing Xu et al. Empirical Evaluation of Rectified Activations in Convolutional Network. ArXiv. https://arxiv.org/pdf/1505.00853 [Accessed July 5, 2025].
[7] MuhammadArdiPutra. The Day YOLO First Saw the World — YOLOv1. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Day%20YOLO%20First%20Saw%20the%20World%20-%20YOLOv1.ipynb [Accessed July 7, 2025].
