What Should Be Considered When Making a Custom Dataset for Working with YOLO?

Based on my experience and experiments

If you want to train your own model on a custom dataset, you may have questions about what to do, especially if you have just started working with YOLO. If you like to learn to swim by jumping into the water, I would like to share a few of my experiences that may open new doors for you.

Take it easy!

At this stage, it is very important to choose the right algorithm for our goal. If we pick the right algorithm, we can deepen our knowledge of the model, and at the end of the day we will have a wealth of information about it. Therefore, when starting a project, especially if you want to create your own model, it is useful to decide beforehand by examining which algorithms work better for which purpose. For example, I decided to work with the YOLO algorithm for my object detection studies.

We will also talk about hyper-parameters in the section on the factors affecting the success rate of the trained model, but for now it is useful to know what hyper-parameters are and why they are used.
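
To make the term concrete, here is a minimal sketch of a few hyper-parameters and one thing they control. The names epochs, batch, imgsz, and lr0 follow the training arguments of the Ultralytics YOLOv8 package; the values and the dataset size are assumptions chosen only for illustration, not recommendations.

```python
import math

# Hyper-parameters are chosen before training starts; they are not learned from the data.
# The values below are illustrative assumptions.
hyperparameters = {
    "epochs": 50,   # number of full passes over the training set
    "batch": 16,    # images processed together in one batch
    "imgsz": 640,   # input images are resized to this size
    "lr0": 0.01,    # initial learning rate
}


def total_batches(num_train_images: int, epochs: int, batch: int) -> int:
    """Return how many batches a whole training run processes."""
    batches_per_epoch = math.ceil(num_train_images / batch)
    return epochs * batches_per_epoch


# Hypothetical dataset of 1,000 training images.
batches = total_batches(1000, hyperparameters["epochs"], hyperparameters["batch"])
print(batches)
```

Doubling the batch size roughly halves the number of batches per epoch, which is one reason these settings affect both training time and memory use.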

If we have the answers to these questions, let's start the warm-up laps!

Getting Started

COCO is a large-scale object detection, segmentation, and captioning dataset.

It contains 330K images, of which more than 200K are labeled, and covers 80 object categories.

The pretrained YOLO weights were obtained by training on this dataset. If you want to train with your own dataset:
1. You can simply export your custom dataset and start from the weights trained on the COCO dataset.
2. You can create your own weights from scratch while training on your custom dataset.
3. When training on your custom dataset, you can get better results by starting from the weights of your previous training runs.

In this section, if you are not creating your own weights from scratch, you are continuing the learning by transferring previously learned weights; we call this Transfer Learning.
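
The three options above could be sketched as follows with the Ultralytics Python package. This is a sketch under assumptions, not the only way to do it: the model size (yolov8s), the output folder runs/detect/train, and the helper names starting_point and train are chosen for illustration.

```python
from pathlib import Path


def starting_point(option: str, previous_run: str = "runs/detect/train") -> str:
    """Return the model argument for each of the three options.

    "coco"     -> weights pretrained on COCO (transfer learning)
    "scratch"  -> architecture definition only, weights start out untrained
    "previous" -> best weights saved by an earlier training run of your own
    """
    if option == "coco":
        return "yolov8s.pt"
    if option == "scratch":
        return "yolov8s.yaml"
    if option == "previous":
        # Assumed output layout of an earlier run; adjust to your own folder.
        return (Path(previous_run) / "weights" / "best.pt").as_posix()
    raise ValueError(f"unknown option: {option}")


def train(option: str, data_yaml: str, epochs: int = 50):
    """Train on a custom dataset; requires `pip install ultralytics`."""
    from ultralytics import YOLO  # imported here so the helper above works without it

    model = YOLO(starting_point(option))
    return model.train(data=data_yaml, epochs=epochs, imgsz=640)
```

Calling train("coco", "custom.yaml") would then fine-tune the COCO weights on the dataset described by a (hypothetical) custom.yaml file.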

So when should we use our own weights and when should we use YOLO's weights? The answer depends on what we want. If the type of object we want to work with is among the 80 classes of COCO, we can achieve good results in a short time with YOLO's weights; for example, you can easily detect a toothbrush. You can find which objects are included in these 80 classes in data files such as coco.yaml or coco128.yaml (whichever your model is using). You can access these files from the Ultralytics GitHub account.
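
A quick way to check whether your object is already covered is to read the class names from the data file. The sketch below parses a short excerpt written in the layout of the names section of coco128.yaml (only 5 of the 80 entries are shown); the helper read_class_names is hypothetical, and in practice a real YAML parser would be more robust than this line-by-line reader.

```python
# Excerpt in the layout of the `names:` section of coco128.yaml (5 of 80 entries).
COCO_YAML_EXCERPT = """\
names:
  0: person
  1: bicycle
  2: car
  16: dog
  79: toothbrush
"""


def read_class_names(yaml_text: str) -> dict[int, str]:
    """Collect the `index: name` pairs that follow the `names:` key."""
    names = {}
    in_names = False
    for line in yaml_text.splitlines():
        if line.strip() == "names:":
            in_names = True
            continue
        if in_names:
            if not line.startswith(" ") or ":" not in line:
                break  # left the indented names block
            index, name = line.split(":", 1)
            names[int(index)] = name.strip()
    return names


names = read_class_names(COCO_YAML_EXCERPT)
print("toothbrush" in names.values())
print("drone" in names.values())
```

If your object is missing from the list, as "drone" is here, the pretrained weights alone cannot detect it and you need a custom dataset.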

Note: You can select the appropriate model according to the size of your custom dataset, such as YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), etc.

YOLOv8 Models

Another possibility is that the object you want to work on is not included in the COCO dataset, or you want to create your own model. In this case, we are faced with our main question: “What Should Be Considered When Making a Custom Dataset for Working with YOLO?”

Ready for Take-off 🚀

Reminder: The following points are mainly about creating a custom dataset. If your custom dataset is already prepared and you are getting good results, what you need is to fine-tune the optimizations.

As the size of the dataset increases, the number of images you reserve for training, validation, and testing will also increase, which is what we want. The most important point here is to increase the number of images in your dataset without sacrificing quality. One of the things I have experienced is that a smaller dataset can easily do the job of a large dataset if it is of higher quality!

Selecting data that suits the scenarios the model will encounter in real life will raise the quality of your dataset. If you are going to use the model's results in a specific real-life environment, you can take samples from that environment so that the model serves your purpose. The quality of the dataset saves on many costs of your work, such as time, memory, and money.

For any object to be meaningful to our model, we need to tell the model what the given data is. This is how the computer processes and makes sense of the data; we call this annotation, and we do it through the labeling process.

I would like to explain labeling with an example. When we want to detect a dog, we surround the dog in the image with a bounding box whose edges must touch the outermost pixels of the labeled object. Our label text file contains the coordinates of this bounding box. We pair our label text files with the images in the dataset, and so we make sense of the data by telling our model that the object in the bounding box is a dog. It is extremely important to annotate with the most appropriate labeling. To minimize errors, labeling can be done by hand, or it can be done quickly and more accurately with various programs. I used Plainsight and I recommend it; it is very practical to use.
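
To show what ends up in a label text file, here is a small sketch that converts a pixel-space bounding box into one line of the YOLO detection format: class x_center y_center width height, with the four box values normalized by the image size. The function name, the image size, and the class index 0 for "dog" are assumptions made for this example.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space box to one line of a YOLO label file.

    The format is `class x_center y_center width height`, with all four
    box values normalized to the range 0..1 by the image size.
    """
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: a dog (class 0 in our custom dataset) in a 640x480 image.
line = to_yolo_line(0, xmin=160, ymin=120, xmax=480, ymax=360, img_w=640, img_h=480)
print(line)
```

Each object in an image gets one such line, and the label file carries the same name as its image with a .txt extension.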

(a) Should be as tight as possible (b) Must include all visible parts (c) Correct bounding

Preferring polygons to rectangular bounding boxes can enable us to achieve better results, because a polygon follows the outline of the object more closely.
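
As a rough illustration of the difference, the sketch below writes a polygon as a label line and also computes the tight rectangle around it. It assumes the normalized class x1 y1 x2 y2 ... layout used for Ultralytics segmentation labels; the function names and the triangle are made up for the example.

```python
def polygon_to_yolo_line(class_id, points, img_w, img_h):
    """Write a polygon as `class x1 y1 x2 y2 ...` with normalized coordinates."""
    coords = []
    for x, y in points:
        coords += [f"{x / img_w:.4f}", f"{y / img_h:.4f}"]
    return " ".join([str(class_id)] + coords)


def enclosing_box(points):
    """Return the tight rectangle (xmin, ymin, xmax, ymax) around a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical triangle outline in a 100x100 image.
triangle = [(10, 90), (50, 10), (90, 90)]
print(polygon_to_yolo_line(0, triangle, 100, 100))
print(enclosing_box(triangle))
```

For this triangle the enclosing rectangle covers twice the area of the object itself, so half of the box is background that a polygon label would leave out.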

If your labels are ready and the labeling tool you use has a model trial feature, you can quickly evaluate your custom dataset. You should export it in a YOLO-compatible format so that the file directory is suitable for use with your model.

Export of custom dataset in YOLO format. The split percentages within the dataset are given by default.

The folder structure of the custom dataset exported from the tool I used is shown in the image below. All of the images in the dataset are in a folder named img. Images and labels are in the same folder.
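
Since images and labels share one folder, a quick sanity check is to look for images that have no label file next to them. The sketch below assumes each label carries the same name as its image with a .txt extension; the helper name, the list of image extensions, and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def images_without_labels(img_dir):
    """Return names of images in `img_dir` that have no same-named .txt label file."""
    img_dir = Path(img_dir)
    missing = []
    for image in sorted(img_dir.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES and not image.with_suffix(".txt").exists():
            missing.append(image.name)
    return missing


# Demo on a temporary folder standing in for the img folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "dog_001.jpg").touch()
    (folder / "dog_001.txt").write_text("0 0.5 0.5 0.5 0.5\n")
    (folder / "dog_002.jpg").touch()  # label deliberately missing
    missing = images_without_labels(folder)

print(missing)
```

Running such a check before training helps catch images that were exported without annotations.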

So how will the model know which data is training data and which is reserved for validation? The train.txt, validation.txt, and test.txt files contain the paths of the images assigned to each split.
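
If your tool does not create these files for you, they can be produced with a few lines of code. The following is a minimal sketch: the 70/20/10 split, the fixed random seed, and the function names are assumptions, and the image paths are invented for the demo.

```python
import random
from pathlib import Path


def split_dataset(image_paths, train_pct=70, val_pct=20, seed=0):
    """Shuffle image paths and split them into train / validation / test lists."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)  # fixed seed keeps the split reproducible
    n_train = len(paths) * train_pct // 100
    n_val = len(paths) * val_pct // 100
    return {
        "train": paths[:n_train],
        "validation": paths[n_train:n_train + n_val],
        "test": paths[n_train + n_val:],  # whatever remains goes to the test split
    }


def write_split_files(splits, out_dir):
    """Write one image path per line to train.txt, validation.txt, and test.txt."""
    out_dir = Path(out_dir)
    for name, paths in splits.items():
        (out_dir / f"{name}.txt").write_text("\n".join(paths) + "\n")


# Hypothetical dataset of 10 images stored in the img folder.
images = [f"img/dog_{i:03d}.jpg" for i in range(10)]
splits = split_dataset(images)
print({name: len(paths) for name, paths in splits.items()})
```

Every image lands in exactly one of the three lists, so the model is never validated or tested on images it was trained on.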

The custom dataset folder structure.

Usage with Model

The paths of the data prepared for training, validation, and testing in the custom dataset should be written into the “data” file.

If you prefer to use Google Colab for your work, you can upload your custom dataset to your Drive.

I mentioned that the train.txt, validation.txt, and test.txt files contain the paths of the images in each split. The YOLO model accesses the custom dataset according to the paths in the data file.

Giving the relevant paths for train, validation, and test in the data file.

If you prefer transfer learning using one of YOLO's models, you should write the paths of the train, validation, and test images in your coco128.yaml file. If you are going to create your own model, it is enough to create a YAML file in the same format and add the necessary paths.
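
A data file in this format could be generated as in the sketch below. It assumes the path, train, val, test, and names keys of the Ultralytics dataset YAML layout; note that the key for the validation split is val, even though the file it points to here is validation.txt. The Drive location, the single class "dog", and the function name are hypothetical.

```python
def build_data_yaml(dataset_root, class_names):
    """Return the text of a dataset YAML in the layout of coco128.yaml."""
    lines = [
        f"path: {dataset_root}  # dataset root folder",
        "train: train.txt  # list of training image paths, relative to path",
        "val: validation.txt  # list of validation image paths",
        "test: test.txt  # optional list of test image paths",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


# Hypothetical Drive location and a single custom class.
yaml_text = build_data_yaml("/content/drive/MyDrive/custom_dataset", ["dog"])
print(yaml_text)
```

The class indices under names must match the class numbers at the start of each line in your label files.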

Congratulations, you now have your own custom dataset and can train with it! 🎉

