Fine-Tune a Semantic Segmentation Model with a Custom Dataset

By Tobias Cornille and Niels Rogge

This guide shows how you can fine-tune SegFormer, a state-of-the-art semantic segmentation model. Our goal is to build a model for a pizza delivery robot, so it can see where to drive and recognize obstacles 🍕🤖. We’ll first use an existing segmentation dataset from the 🤗 hub. Then we’ll fine-tune a pre-trained SegFormer model using 🤗 transformers, an open-source library that offers easy-to-use implementations of state-of-the-art models. Along the way, you’ll learn how to work with the Hugging Face Hub, the largest open-source catalog of models and datasets.

Semantic segmentation is the task of classifying each pixel in an image. You can see it as a more precise way of classifying an image. It has a wide range of use cases in fields such as medical imaging and autonomous driving. For example, for our pizza delivery robot, it is important to know exactly where the sidewalk is in an image, not just whether there is a sidewalk or not.
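
As a toy illustration (the class ids below are made up), a segmentation map is simply an array with one class id per pixel, with the same spatial size as the image:

import numpy as np

# hypothetical class ids: 0 = background, 1 = sidewalk, 2 = obstacle
segmentation_map = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 2],
    [1, 1, 2, 2],
])
print(segmentation_map.shape)  # (4, 4): one label per pixel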

Because semantic segmentation is a type of classification, the network architectures used for image classification and semantic segmentation are very similar. In 2014, a seminal paper by Long et al. used convolutional neural networks for semantic segmentation. More recently, Transformers have been used for image classification (e.g. ViT), and now they are also being used for semantic segmentation, pushing the state-of-the-art further.

SegFormer is a model for semantic segmentation introduced by Xie et al. in 2021. It has a hierarchical Transformer encoder that doesn’t use positional encodings (in contrast to ViT) and a simple multi-layer perceptron decoder. SegFormer achieves state-of-the-art performance on multiple common datasets. Let’s see how well it does on sidewalk images for our pizza delivery robot.

Let’s start by installing the needed dependencies. Because we’ll push our dataset and model to the Hugging Face Hub, we need to install Git LFS and log in to Hugging Face.

The installation of git-lfs might differ on your system. Note that Google Colab has Git LFS pre-installed.

pip install -q transformers datasets evaluate segments-ai
apt-get install git-lfs
git lfs install
huggingface-cli login
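
If you’re working in a notebook, you can log in from Python instead of using the CLI (a small alternative that relies on huggingface_hub, which is installed together with transformers):

from huggingface_hub import notebook_login

notebook_login()  # prompts for your Hugging Face access token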



1. Create/select a dataset

The first step in any ML project is assembling a good dataset. In order to train a semantic segmentation model, we need a dataset with semantic segmentation labels. We can either use an existing dataset from the Hugging Face Hub, such as ADE20k, or create our own dataset by annotating images with corresponding segmentation maps.

For our pizza delivery robot, we could use an existing autonomous driving dataset such as CityScapes or BDD100K. However, these datasets were captured by cars driving on the road. Since our delivery robot will be driving on the sidewalk, there will be a mismatch between the images in these datasets and the data our robot will see in the real world.

We don’t want our delivery robot to get confused, so we created our own semantic segmentation dataset using images captured on sidewalks; the labeling can be done with annotation platforms like CVAT or Segments.ai. The resulting dataset is available at segments/sidewalk-semantic.



Use a dataset from the Hub

We’ll load the full labeled sidewalk dataset here. Note that you can also explore the examples directly in your browser.

hf_dataset_identifier = "segments/sidewalk-semantic"



2. Load and prepare the Hugging Face dataset for training

Now that we have created a new dataset and pushed it to the Hugging Face Hub, we can load the dataset in a single line.

from datasets import load_dataset

ds = load_dataset(hf_dataset_identifier)

Let’s shuffle the dataset and split it into a train and test set.

ds = ds.shuffle(seed=1)
ds = ds["train"].train_test_split(test_size=0.2)
train_ds = ds["train"]
test_ds = ds["test"]
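
Before preprocessing, it can help to peek at a single example. A quick sketch (in this dataset, the pixel_values and label columns hold PIL images of the same size):

example = train_ds[0]
print(train_ds)                      # number of examples and column names
print(example["pixel_values"].size)  # input image size as (width, height)
print(example["label"].size)         # the segmentation map has the same size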

We’ll extract the number of labels and the mapping from label ids to human-readable class names, so we can configure the segmentation model correctly later on.

import json
from huggingface_hub import hf_hub_download

filename = "id2label.json"
id2label = json.load(open(hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset"), "r"))
id2label = {int(k): v for k, v in id2label.items()}
label2id = {v: k for k, v in id2label.items()}

num_labels = len(id2label)



Image processor & data augmentation

A SegFormer model expects the input to be of a certain shape. To transform our training data to match the expected shape, we can use SegformerImageProcessor. We could use the ds.map function to apply the image processor to the whole training dataset in advance, but this can take up a lot of disk space. Instead, we’ll use a transform, which only prepares a batch of data when that data is actually used (on-the-fly). This way, we can start training without waiting for further data preprocessing.

In our transform, we’ll also define some data augmentations to make our model more resilient to different lighting conditions. We’ll use the ColorJitter function from torchvision to randomly change the brightness, contrast, saturation, and hue of the images in the batch.

from torchvision.transforms import ColorJitter
from transformers import SegformerImageProcessor

processor = SegformerImageProcessor()
jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1) 

def train_transforms(example_batch):
    images = [jitter(x) for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = processor(images, labels)
    return inputs


def val_transforms(example_batch):
    images = [x for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = processor(images, labels)
    return inputs



train_ds.set_transform(train_transforms)
test_ds.set_transform(val_transforms)



3. Fine-tune a SegFormer model



Load the model to fine-tune

The SegFormer authors define 5 models with increasing sizes: B0 to B5. The following chart (taken from the original paper) shows the performance of these different models on the ADE20K dataset, compared to other models.

(Chart source: the SegFormer paper by Xie et al., 2021.)

Here, we’ll load the smallest SegFormer model (B0), pre-trained on ImageNet-1k. It’s only about 14MB in size!
Using a small model will make sure that our model can run smoothly on our pizza delivery robot.

from transformers import SegformerForSemanticSegmentation

pretrained_model_name = "nvidia/mit-b0" 
model = SegformerForSemanticSegmentation.from_pretrained(
    pretrained_model_name,
    id2label=id2label,
    label2id=label2id
)



Set up the Trainer

To fine-tune the model on our data, we’ll use Hugging Face’s Trainer API. We need to set up the training configuration and an evaluation metric to use a Trainer.

First, we’ll set up the TrainingArguments. This defines all training hyperparameters, such as the learning rate, the number of epochs, and how frequently to save the model. We also specify to push the model to the hub after training (push_to_hub=True) and specify a model name (hub_model_id).

from transformers import TrainingArguments

epochs = 50
lr = 0.00006
batch_size = 2

hub_model_id = "segformer-b0-finetuned-segments-sidewalk-2"

training_args = TrainingArguments(
    "segformer-b0-finetuned-segments-sidewalk-outputs",
    learning_rate=lr,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=20,
    eval_steps=20,
    logging_steps=1,
    eval_accumulation_steps=5,
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
)

Next, we’ll define a function that computes the evaluation metric we want to work with. Because we’re doing semantic segmentation, we’ll use the mean Intersection over Union (mIoU), directly accessible in the evaluate library. IoU represents the overlap of segmentation masks, and mean IoU is the average of the IoU over all semantic classes. Take a look at this blogpost for an overview of evaluation metrics for image segmentation.
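
To make the metric concrete, here is a tiny hand-computed IoU for a single binary class (a toy sketch, not the implementation used by evaluate):

import numpy as np

pred = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])
gt = np.array([[1, 0, 0],
               [1, 1, 0],
               [0, 0, 0]])

intersection = np.logical_and(pred == 1, gt == 1).sum()  # 2 pixels predicted and labeled as class 1
union = np.logical_or(pred == 1, gt == 1).sum()          # 4 pixels predicted or labeled as class 1
print(intersection / union)  # IoU for this class = 0.5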

Because our model outputs logits at 1/4 of the input height and width, we have to upscale them before we can compute the mIoU.

import torch
from torch import nn
import evaluate

metric = evaluate.load("mean_iou")

def compute_metrics(eval_pred):
  with torch.no_grad():
    logits, labels = eval_pred
    logits_tensor = torch.from_numpy(logits)
    
    # scale the logits to the size of the labels before taking the argmax
    logits_tensor = nn.functional.interpolate(
        logits_tensor,
        size=labels.shape[-2:],
        mode="bilinear",
        align_corners=False,
    ).argmax(dim=1)

    pred_labels = logits_tensor.detach().cpu().numpy()
    metrics = metric.compute(
        predictions=pred_labels,
        references=labels,
        num_labels=len(id2label),
        ignore_index=0,
        reduce_labels=processor.do_reduce_labels,
    )
    
    
    # add per-category metrics as individual key-value pairs
    per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
    per_category_iou = metrics.pop("per_category_iou").tolist()

    metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
    metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
    
    return metrics

Finally, we can instantiate a Trainer object.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

Now that our trainer is set up, training is as simple as calling the train function. We don’t need to worry about managing our GPU(s); the trainer will take care of that.

trainer.train()

Once we’re done with training, we can push our fine-tuned model and the image processor to the Hub.

This will also automatically create a model card with our results. We’ll supply some extra information in kwargs to make the model card more complete.

kwargs = {
    "tags": ["vision", "image-segmentation"],
    "finetuned_from": pretrained_model_name,
    "dataset": hf_dataset_identifier,
}

processor.push_to_hub(hub_model_id)
trainer.push_to_hub(**kwargs)



4. Inference

Now comes the exciting part: using our fine-tuned model! In this section, we’ll show how you can load your model from the hub and use it for inference.

However, you can also try out your model directly on the Hugging Face Hub, thanks to the cool widgets powered by the hosted inference API. If you pushed your model to the Hub in the previous step, you should see an inference widget on your model page. You can add default examples to the widget by defining example image URLs in your model card. See this model card as an example.
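
If you prefer a quick local test before diving into manual pre- and post-processing, the high-level image-segmentation pipeline is a convenient option. A minimal sketch (the repository name and image URL are placeholders):

from transformers import pipeline

segmenter = pipeline("image-segmentation", model="your-username/segformer-b0-finetuned-segments-sidewalk-2")
results = segmenter("https://example.com/sidewalk.jpg")  # placeholder URL; a local path or PIL image also works
for result in results:
    print(result["label"])  # each entry also contains a binary PIL "mask" for that class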




Use the model from the Hub

We’ll first load the model from the Hub using SegformerForSemanticSegmentation.from_pretrained().

from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

hf_username = "your-username"  # placeholder: replace with your own Hugging Face username

processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
model = SegformerForSemanticSegmentation.from_pretrained(f"{hf_username}/{hub_model_id}")

Next, we’ll load an image from our test dataset.

image = test_ds[0]['pixel_values']
gt_seg = test_ds[0]['label']
image

To segment this test image, we first need to prepare the image using the image processor. Then we forward it through the model.

We also need to remember to upscale the output logits to the original image size. To get the actual category predictions, we just have to apply an argmax to the logits.

from torch import nn

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits  # shape (batch_size, num_labels, height/4, width/4)


upsampled_logits = nn.functional.interpolate(
    logits,
    size=image.size[::-1], 
    mode='bilinear',
    align_corners=False
)


pred_seg = upsampled_logits.argmax(dim=1)[0]

Now it’s time to display the result. We’ll show the prediction next to the ground-truth mask.
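
Here is one way to plot the prediction next to the ground-truth mask (a minimal matplotlib sketch that shows the raw class indices; for nicer results you could map each class id to a color from a palette):

import numpy as np
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 3, figsize=(18, 6))
axs[0].imshow(image)
axs[0].set_title("Image")
axs[1].imshow(np.array(gt_seg))
axs[1].set_title("Ground truth")
axs[2].imshow(pred_seg.numpy())
axs[2].set_title("Prediction")
for ax in axs:
    ax.axis("off")
plt.show()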

What do you think? Would you send our pizza delivery robot on the road with this segmentation information?

The result may not be perfect yet, but we can always expand our dataset to make the model more robust. We can now also go and train a larger SegFormer model and see how it stacks up.
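
Trying a larger variant only requires swapping the pre-trained checkpoint when loading the model, for example (a sketch; nvidia/mit-b1 through nvidia/mit-b5 are published alongside mit-b0, and the larger ones need more memory and train more slowly):

model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b5",
    id2label=id2label,
    label2id=label2id,
)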



5. Conclusion

That’s it! You now know how to create your own image segmentation dataset and how to use it to fine-tune a semantic segmentation model.

We introduced you to some useful tools along the way, such as annotation platforms like Segments.ai and CVAT for labeling your data, 🤗 datasets for loading and sharing datasets, 🤗 transformers and the Trainer API for fine-tuning a state-of-the-art segmentation model, 🤗 evaluate for computing the mIoU metric, and the Hugging Face Hub for sharing your dataset and model and trying the model out with an inference widget.

We hope you enjoyed this post and learned something. Feel free to share your own model with us on Twitter (@TobiasCornille, @NielsRogge, and @huggingface).




