An Interactive Guide to 4 Fundamental Computer Vision Tasks Using Transformers

What Is a Transformer-Based Vision Model?

Computer Vision is a subdomain of artificial intelligence with a broad range of applications focused on image processing and understanding. Traditionally addressed with Convolutional Neural Networks (CNNs), the field has been revolutionized by the emergence of the transformer architecture. While transformers are best known for their applications in language processing, they can be effectively adapted to form the backbone of many vision models. In this article, we will explore state-of-the-art vision and multimodal models, such as ViT, DETR, Mask2Former, BLIP, and ViLT, that target various computer vision tasks including image classification, segmentation, image-to-text conversion, and visual question answering. These tasks have a wide range of real-world applications, from annotating images at scale and detecting abnormalities in medical images, to extracting text from documents and generating text responses based on visual data.

Comparisons with CNNs

Before the wide adoption of foundation models, CNNs were the dominant solution for most computer vision tasks. In a nutshell, a CNN is a hierarchical deep learning architecture built from convolutional layers (producing feature maps), pooling layers, and fully connected layers. In contrast, vision transformers leverage the self-attention mechanism, which allows image patches to attend to one another. They also carry less inductive bias, meaning they are less constrained by built-in model assumptions than CNNs, but consequently require significantly more training data to achieve strong performance on generalized tasks.

Comparisons with LLMs

Transformer-based vision models adapt the architecture used by LLMs (Large Language Models), adding extra layers that convert image data into numerical embeddings. In an NLP task, text sequences go through tokenization and embedding before they are consumed by the transformer encoder. Similarly, image data goes through patching, position encoding, and image embedding before being fed into the vision transformer encoder. Throughout this article, we will further explore how the Vision Transformer and its variants build upon the transformer backbone and extend its capabilities from language processing to image understanding and image generation.

Extensions to Multimodal Models

Advancements in vision models have driven interest in developing multimodal models capable of processing image and text data simultaneously. While vision models focus on the one-way transformation of image data into numerical representations and typically produce score-based outputs for classification or object detection (i.e. the image-classification and image-segmentation tasks), multimodal models require bidirectional processing and integration between different data types. For instance, an image-text multimodal model can generate coherent text sequences from image input for image captioning and visual question answering tasks.

4 Fundamental Computer Vision Tasks

0. Project Overview

We will explore the details of these 4 fundamental computer vision tasks and the corresponding transformer models specialized for each task. These models differ primarily in their encoder and decoder architectures, which give them distinct capabilities for interpreting, processing, and translating across textual and visual modalities.

To make this guide more interactive, I have designed a Streamlit web app to illustrate and compare the outputs of these computer vision tasks and models. We will walk through the end-to-end app development at the end of this article.

Below is a sneak peek of the output for an uploaded image, generated by running the default models from Hugging Face pipelines.

Streamlit Web App for Computer Vision Tasks

1. Image Classification

Image Classification

Firstly, let's introduce image classification, a basic computer vision task that assigns images to a predefined set of labels. It can be achieved with a basic Vision Transformer.

ViT (Vision Transformer)

ViT model architecture

The Vision Transformer (ViT) serves as the cornerstone for many of the computer vision models introduced later in this article. It consistently outperforms CNNs on image classification tasks through its encoder-only transformer architecture. It processes image inputs and outputs probability scores for candidate labels. Since image classification is purely an image understanding task with no generation requirements, ViT's encoder-only architecture is well suited for this purpose.

A ViT architecture consists of the following components:

  • Patching: break down input images into small, fixed-size patches of pixels (typically 16×16 pixels per patch) so that local features are preserved for downstream processing.
  • Embedding: convert image patches into numerical representations, also known as vector embeddings, so that images with similar features are projected as embeddings with closer proximity in the vector space.
  • Classification Token (CLS): extract and aggregate information from all image patches into one numeric representation, making it particularly effective for classification.
  • Position Encoding: preserve the relative positions of the original image patches. The CLS token is always at position 0.
  • Transformer Encoder: process the embeddings through layers of multi-headed attention and feed-forward networks.

The mechanism behind ViT results in its efficiency at capturing global dependencies, whereas CNNs primarily rely on local processing through convolutional kernels. However, ViT has the disadvantage of requiring a large amount of training data (often millions of images) to iteratively adjust the model parameters in its attention layers and achieve strong performance.
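To make the patching and CLS mechanics concrete, below is a minimal sketch that runs an image through the base ViT encoder and inspects the resulting embeddings. It assumes a local image file ("coffee.jpg" is a placeholder path) and uses the same "google/vit-base-patch16-224" checkpoint as the pipeline in the next section.

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("coffee.jpg").convert("RGB")  # placeholder local image path
inputs = processor(images=image, return_tensors="pt")  # resize to 224x224 and normalize

with torch.no_grad():
    outputs = model(**inputs)

# 224/16 = 14 patches per side -> 196 patch embeddings + 1 CLS token = 197 positions
print(outputs.last_hidden_state.shape)        # torch.Size([1, 197, 768])
# the CLS embedding at position 0 aggregates the whole image for classification
print(outputs.last_hidden_state[:, 0].shape)  # torch.Size([1, 768])

A 224×224 input split into 16×16 patches yields 196 patch embeddings, plus the CLS token at position 0, which a classification head reads to produce the label scores.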

Implementation

The Hugging Face pipeline significantly simplifies the implementation of the image classification task by abstracting away the low-level image processing steps.

from transformers import pipeline
from PIL import Image

image = Image.open(image_path)  # local path to the image file
pipe = pipeline(task="image-classification", model=model_id)
output = pipe(image)
  • input parameters:
    • model: you can select your own model or use the default model (i.e. "google/vit-base-patch16-224") when the model parameter is not specified.
    • task: provide a task name (e.g. "image-classification", "image-segmentation")
    • image: provide an image object, a URL, or an image file path.
  • output: the model generates scores for the candidate labels (see the top_k example below).
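By default, the image-classification pipeline returns the 5 highest-scoring labels, which is what the outputs below show. As a small illustration, the top_k argument can be adjusted at call time (pipe and image as defined above):

# return only the 3 highest-scoring labels instead of the default 5
predictions = pipe(image, top_k=3)
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")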

We compared the results of the default image classification model "google/vit-base-patch16-224" by providing two similar images with different compositions. As we can see, this baseline model is easily confused, producing significantly different outputs ("espresso" vs. "microwave"), despite both images containing the same main object.

[
  { "label": "espresso", "score": 0.40687331557273865 },
  { "label": "cup", "score": 0.2804579734802246 },
  { "label": "coffee mug", "score": 0.17347976565361023 },
  { "label": "desk", "score": 0.01198530849069357 },
  { "label": "eggnog", "score": 0.00782513152807951 }
]

[
  { "label": "microwave, microwave oven", "score": 0.20218633115291595 },
  { "label": "dining table, board", "score": 0.14855517446994781 },
  { "label": "stove", "score": 0.1345038264989853 },
  { "label": "sliding door", "score": 0.10262308269739151 },
  { "label": "shoji", "score": 0.07306522130966187 }
]

Try a different model yourself using our Streamlit web app and see if it generates better results.

2. Image Segmentation

Image Segmentation

Image segmentation is another common computer vision task that requires a vision-only model. The objective is similar to object detection but requires higher precision at the pixel level, producing masks for object boundaries instead of drawing bounding boxes as in object detection.

There are three main types of image segmentation:

  • Semantic segmentation: predict a mask for each object class.
  • Instance segmentation: predict a mask for each instance of an object class.
  • Panoptic segmentation: combine instance and semantic segmentation by assigning each pixel both an object class and an instance of that class.

DETR (Detection Transformer)

DETR model architecture

Although DETR is widely used for object detection, it can be extended to perform the panoptic segmentation task by adding a segmentation mask head. As shown in the diagram, it uses an encoder-decoder transformer architecture with a CNN backbone for feature map extraction. The DETR model learns a set of object queries and is trained to predict bounding boxes for these queries, followed by a mask prediction head that performs precise pixel-level segmentation.

Mask2Former

Mask2Former is another common choice for image segmentation tasks. Developed by Facebook AI Research, Mask2Former generally outperforms DETR models with higher precision and better computational efficiency. This is achieved by applying a masked attention mechanism instead of global cross-attention, focusing specifically on foreground information and the important objects in an image.
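If you need more control than the pipeline offers, for example access to the raw panoptic map, the dedicated Mask2Former classes can be called directly. Below is a minimal sketch assuming a local image file ("coffee.jpg" is a placeholder path); it uses the same "facebook/mask2former-swin-base-coco-panoptic" checkpoint compared later in this section.

import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

checkpoint = "facebook/mask2former-swin-base-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.open("coffee.jpg").convert("RGB")  # placeholder local image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# merge the predicted class and mask logits into a panoptic segmentation map
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]  # (height, width)
)[0]

for segment in result["segments_info"]:
    print(model.config.id2label[segment["label_id"]], round(segment["score"], 3))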

Implementation

We use the same pipeline implementation as for image classification, simply swapping the task parameter to "image-segmentation". To process the output, we extract the object labels and masks, then display the masked images using st.image().

from transformers import pipeline
from PIL import Image
import streamlit as st

image = Image.open(image_path)
pipe = pipeline(task="image-segmentation", model=model_id)
output = pipe(image)

# each result contains a label, a confidence score, and a PIL mask image
output_labels = [i['label'] for i in output]
output_masks = [i['mask'] for i in output]

for m in output_masks:
    st.image(m)

We compared the performance of DETR ("facebook/detr-resnet-50-panoptic") and Mask2Former ("facebook/mask2former-swin-base-coco-panoptic"), which are both fine-tuned for panoptic segmentation. As displayed in the segmentation outputs, both DETR and Mask2Former successfully identify and extract the "cup" and the "dining table". Mask2Former runs inference faster (2.47s compared to 6.3s for DETR) and also manages to identify "window-other" in the background.

[
	{
		'score': 0.994395, 
		'label': 'dining table', 
		'mask': 
	}, 
	{
		'score': 0.999692, 
		'label': 'cup', 
		'mask': 
	}
]

[
	{
		'score': 0.999554, 
		'label': 'cup', 
		'mask': 
	}, 
	{
		'score': 0.971946, 
		'label': 'dining table', 
		'mask': 
	}, 
	{
		'score': 0.983782, 
		'label': 'window-other', 
		'mask': 
	}
]

3. Image Captioning

Image captioning, also known as image-to-text, translates images into text sequences that describe the image contents. This task requires both image understanding and text generation capabilities, and is therefore well suited to a multimodal model that can process image and text data simultaneously.

Visual Encoder-Decoder

The visual encoder-decoder is a multimodal architecture that combines a vision model for image understanding with a pretrained language model for text generation. A common example is ViT-GPT2, which chains together the Vision Transformer (introduced in section 1. Image Classification) as the visual encoder and the GPT-2 model as the decoder for autoregressive text generation.
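As a concrete illustration of this chaining, the sketch below loads a ViT-GPT2 visual encoder-decoder through the VisionEncoderDecoderModel class. The "nlpconnect/vit-gpt2-image-captioning" checkpoint and the local image path are assumptions made for the example.

from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

model_id = "nlpconnect/vit-gpt2-image-captioning"  # assumed ViT-GPT2 checkpoint
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("coffee.jpg").convert("RGB")  # placeholder local image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# the ViT encoder embeds the image; the GPT-2 decoder generates the caption token by token
output_ids = model.generate(pixel_values, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))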

BLIP (Bootstrapping Language-Image Pretraining)

BLIP, developed by Salesforce Research, leverages four core modules: an image encoder and a text encoder, followed by an image-grounded text encoder that fuses visual and textual features via attention mechanisms, as well as an image-grounded text decoder for text sequence generation. The pretraining process minimizes an image-text contrastive loss, an image-text matching loss, and a language modeling loss, with the objective of aligning the semantic relationship between visual information and text sequences. BLIP offers greater flexibility in applications and can also be applied to VQA (visual question answering), but it introduces more complexity in the architectural design.
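For comparison, here is a minimal captioning sketch using the BLIP-specific classes rather than the generic pipeline; the "Salesforce/blip-image-captioning-base" checkpoint and the local image path are assumptions made for the example.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"  # assumed captioning checkpoint
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("coffee.jpg").convert("RGB")  # placeholder local image path
inputs = processor(images=image, return_tensors="pt")

# the image-grounded text decoder generates the caption autoregressively
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))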

Implementation

We use the code snippet below to generate output from an image captioning pipeline.

from transformers import pipeline
from PIL import Image

image = Image.open(image_path)
pipe = pipeline(task="image-to-text", model=model_id)
output = pipe(image)

We tried three different models below and they all generate reasonably accurate image descriptions, with the larger models performing better than the base one.

[{'generated_text': 'a cup of coffee sitting on a wooden table'}]

[{'generated_text': 'a cup of coffee on a table'}]

[{'generated_text': 'there is a cup of coffee on a saucer on a table'}]

4. Visual Question Answering

Visual Question Answering (VQA) has gained increasing popularity because it enables users to ask questions about an image and receive coherent text responses. It also requires a multimodal model that can extract key information from visual data while being capable of generating text responses. What differentiates it from image captioning is that it accepts a user prompt as input in addition to an image, therefore requiring an encoder that interprets both modalities at the same time.

ViLT (Vision Language Transformer)

ViLT model architecture

ViLT is a computationally efficient model architecture for the VQA task. ViLT feeds image patch embeddings and text embeddings into a unified transformer encoder, which is pre-trained with 3 objectives:

  • image-text matching: learn the semantic relationship between image-text pairs
  • masked language modeling: learn to predict the masked word/token from the vocabulary based on the text and image input
  • word patch alignment: learn the associations between words and image patches

ViLT adopts an encoder-only architecture with task-specific heads (e.g. a classification head, a VQA head). This minimal design achieves roughly ten times faster speed than VLP (Vision-and-Language Pretraining) models that rely on region supervision for object detection and convolutional architectures for feature extraction. However, the simplified architecture results in suboptimal performance on complex tasks and relies on massive training data to achieve generalized functionality. As demonstrated later, one drawback is that the ViLT model produces token-based outputs for VQA rather than coherent sentences, much like an image classification task with a large set of candidate labels.
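The token-based behaviour is easy to see when calling ViLT directly. Below is a minimal sketch assuming the "dandelin/vilt-b32-finetuned-vqa" checkpoint and a local image file ("coffee.jpg" is a placeholder path).

import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

model_id = "dandelin/vilt-b32-finetuned-vqa"  # assumed VQA-finetuned ViLT checkpoint
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForQuestionAnswering.from_pretrained(model_id)

image = Image.open("coffee.jpg").convert("RGB")  # placeholder local image path
question = "describe this image"

# text and image patches are embedded and passed jointly through the unified encoder
inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# the VQA head scores a fixed vocabulary of candidate answers, much like a classifier
top = logits.softmax(dim=-1)[0].topk(5)
for score, idx in zip(top.values, top.indices):
    print(model.config.id2label[idx.item()], round(score.item(), 4))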

BLIP

As introduced in section 3. Image Captioning, BLIP is a more extensive model that can also be fine-tuned for the visual question answering task. As a result of its encoder-decoder architecture, it generates complete text sequences instead of single tokens.
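A minimal VQA sketch with the BLIP-specific classes is shown below; the "Salesforce/blip-vqa-base" checkpoint and the local image path are assumptions made for the example.

from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

model_id = "Salesforce/blip-vqa-base"  # assumed VQA-finetuned BLIP checkpoint
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForQuestionAnswering.from_pretrained(model_id)

image = Image.open("coffee.jpg").convert("RGB")  # placeholder local image path
question = "describe this image"

# the encoder fuses image and question; the decoder generates a free-form answer
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))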

Implementation

VQA is implemented using the code snippet below, taking both an image and a text question as the model inputs.

from transformers import pipeline
from PIL import Image

image = Image.open(image_path)
question = "describe this image"
pipe = pipeline(task="visual-question-answering", model=model_id)
output = pipe(image=image, question=question)

When comparing the ViLT and BLIP models on the question "describe this image", the outputs differ significantly due to their distinct model architectures. ViLT predicts the highest-scoring tokens from its existing vocabulary, while BLIP generates more coherent and sensible results.

[
  { "score": 0.044245753437280655, "answer": "kitchen" },
  { "score": 0.03294338658452034, "answer": "tea" },
  { "score": 0.030773703008890152, "answer": "table" },
  { "score": 0.024886665865778923, "answer": "office" },
  { "score": 0.019653357565402985, "answer": "cup" }
]

[{'answer': 'coffee cup on saucer'}]

End-to-End Computer Vision App Development

Let's break down the web app development into 6 steps you can easily follow to build your own interactive Streamlit app or customize it to your needs. Check out our GitHub repository for the end-to-end implementation.

1. Initialize the web app and configure the page layout.

def initialize_page():
    """Initialize the Streamlit page configuration and layout"""
    st.set_page_config(
        page_title="Computer Vision",
        page_icon="🤖",
        layout="centered"
    )
    st.title("Computer Vision Tasks")
    content_block = st.columns(1)[0]

    return content_block

2. Prompt the user to upload an image.

def get_uploaded_image():

    uploaded_file = st.file_uploader(
        "Upload your individual image", 
        accept_multiple_files=False,
        type=["jpg", "jpeg", "png"]
    )
    if uploaded_file:
        image = Image.open(uploaded_file)
        st.image(image, caption='Preview', use_container_width=False)

    else:
        image = None

    return image

3. Select one or more computer vision tasks using a multi-select dropdown list (which also accepts user-entered options, e.g. "document-question-answering"). The app will prompt the user to enter a question if 'visual-question-answering' or 'document-question-answering' is selected, because these two tasks require "question" as an additional input parameter.

def get_selected_task():
    options = st.multiselect(
        "Which tasks would you like to perform?",
        [
            "visual-question-answering",
            "image-to-text",
            "image-classification",
            "image-segmentation",
        ],
        max_selections=4,
        accept_new_options=True,
    )

    # prompt for question input if the task is 'VQA' or 'DocVQA' - parameter "question"
    if 'visual-question-answering' in options or 'document-question-answering' in options:
        question = st.text_input(
            "Please enter your question:"
        )

    elif "Other (specify task name)" in options:
        task = st.text_input(
            "Please enter the task name:"
        )
        options = [task]
        question = ""

    else:
        question = ""

    return options, question

4. Prompt the user to choose between the default model built into the Hugging Face pipeline or enter their own model.

def get_selected_model():
    options = ["Use the default model", "Use your selected HuggingFace model"]
    selected_option = st.selectbox("Select an option:", options)
    if selected_option == "Use your chosen HuggingFace model":
        model = st.text_input(
            "Please enter your chosen HuggingFace model id:"
        )
    else:
        model = None

    return model

5. Create task pipelines based on the user-entered parameters, then collect the model outputs and processing times. The results are displayed in a table format using st.dataframe() to compare the different models and tasks. For image segmentation tasks, the segmentation masks are also displayed using st.image().

def display_results(image, task_list, user_question, model):

    results = []
    output_masks = []
    for task in task_list:
        # VQA and DocVQA require the question as an extra parameter
        if task in ['visual-question-answering', 'document-question-answering']:
            params = {'question': user_question}
        else:
            params = {}

        row = {
            'task': task,
        }

        # use the user-specified model if provided, otherwise fall back to the task default
        try:
            pipe = pipeline(task, model=model) if model else pipeline(task)
        except Exception:
            pipe = pipeline(task)
        row['model'] = pipe.model.name_or_path

        start_time = time.time()
        output = pipe(
            image,
            **params
        )
        execution_time = time.time() - start_time

        row['model_type'] = pipe.model.config.model_type
        row['time'] = execution_time

        # keep image segmentation masks for visual output
        if task == 'image-segmentation':
            output_masks = [i['mask'] for i in output]

        row['output'] = str(output)

        results.append(row)

    results_df = pd.DataFrame(results)

    st.write('Model Responses')
    st.dataframe(results_df)

    if 'image-segmentation' in task_list:
        st.write('Segmentation Mask Output')

        for m in output_masks:
            st.image(m)

    return results_df

6. Lastly, chain these functions together in the main function. Use a "Generate Response" button to trigger these functions and display the results in the app.

def main():
    initialize_page()
    image = get_uploaded_image()
    task_list, user_question = get_selected_task()
    model = get_selected_model()

    # generate responses when the button is clicked
    if st.button("Generate Response", key="generate_button"):
        display_results(image, task_list, user_question, model)

# run the app
if __name__ == "__main__":
    main()

Takeaway Message

We introduced the evolution from traditional CNN-based approaches to transformer architectures, comparing vision models with language models and multimodal models. We also explored 4 fundamental computer vision tasks and their corresponding techniques, providing a practical Streamlit implementation guide for building your own computer vision web applications for further exploration.

The fundamental computer vision tasks and models include:

  • Image Classification: analyze images and assign them to one or more predefined categories or classes, utilizing model architectures like ViT (Vision Transformer).
  • Image Segmentation: classify image pixels into specific categories, creating detailed masks that outline object boundaries, using architectures such as DETR and Mask2Former.
  • Image Captioning: generate descriptive text for images, demonstrated with models like the visual encoder-decoder and BLIP that combine visual encoding with language generation capabilities.
  • Visual Question Answering (VQA): process both an image and a text question to answer open-ended questions based on image content, comparing architectures like ViLT (Vision Language Transformer) with its token-based outputs and BLIP with its more coherent responses.