Introducing TextImage Augmentation for Document Images




In this blog post, we provide a tutorial on how to use a new data augmentation technique for document images, developed in collaboration with Albumentations AI.



Motivation

Vision Language Models (VLMs) have an immense range of applications, but they often need to be fine-tuned for specific use cases, particularly for datasets containing document images, i.e., images with high textual content. In these cases, it is crucial for text and image to interact with each other at all stages of model training, and applying augmentation to both modalities ensures this interaction. Essentially, we want a model to learn to read properly, which is challenging in the most common case where data is scarce.

Hence, the need for effective data augmentation techniques for document images became evident when addressing the challenges of fine-tuning models with limited datasets. A common concern is that typical image transformations, such as resizing, blurring, or changing background colors, can negatively impact text extraction accuracy.


We recognized the need for data augmentation techniques that preserve the integrity of the text while augmenting the dataset. Such augmentation can facilitate the generation of new documents or the modification of existing ones, while preserving their text quality.



Introduction

To address this need, we introduce a new data augmentation pipeline developed in collaboration with Albumentations AI. This pipeline handles both the images and the text within them, providing a comprehensive solution for document images. This class of data augmentation is multimodal, as it modifies both the image content and the text annotations simultaneously.

As discussed in a previous blog post, our goal is to test the hypothesis that integrating augmentations on both text and images during the pretraining of VLMs is effective. Detailed parameters and use case illustrations can be found in the Albumentations AI Documentation. Albumentations AI enables the dynamic design of these augmentations and their integration with other types of augmentations.



Method

To augment document images, we start by randomly selecting lines within the document. A hyperparameter fraction_range controls the fraction of bounding boxes to be modified.

Next, we apply one of several text augmentation methods, commonly used in text generation tasks, to the corresponding lines of text. These methods include Random Insertion, Deletion, and Swap, and Stopword Replacement.
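For intuition, these operations amount to simple word-level edits. The following is a minimal illustrative sketch, not the library's internal implementation, assuming whitespace-tokenized text:

import random

def augment_words(words: list, stopwords: list, method: str) -> list:
    # Illustrative word-level edits: deletion, swap, or stop word insertion.
    words = words.copy()
    if method == "deletion" and len(words) > 1:
        # Drop one randomly chosen word.
        words.pop(random.randrange(len(words)))
    elif method == "swap" and len(words) > 1:
        # Exchange two randomly chosen positions.
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    elif method == "insertion" and stopwords:
        # Insert a random stop word at a random position.
        words.insert(random.randrange(len(words) + 1), random.choice(stopwords))
    return words

For example, augment_words("General Principles".split(), ["the"], "insertion") could yield ["General", "the", "Principles"].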

After modifying the text, we black out the parts of the image where the text is inserted and inpaint them, using the original bounding box size as a proxy for the new text's font size. The font size can be specified with the parameter font_size_fraction_range, which determines the range for selecting the font size as a fraction of the bounding box height. Note that the modified text and the corresponding bounding box can be retrieved and used for training. This process results in a dataset with semantically similar textual content and visually distorted images.
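To illustrate the rendering step, here is a rough Pillow-based sketch assuming pixel-space boxes; the actual implementation lives inside Albumentations and differs in detail:

import random
from PIL import Image, ImageDraw, ImageFont

def render_text_in_bbox(image: Image.Image, bbox_px: tuple, text: str,
                        font_path: str, font_size_fraction_range=(0.8, 0.9)):
    # Black out the original line, then draw the augmented text in its place,
    # with the font size chosen as a random fraction of the box height.
    x_min, y_min, x_max, y_max = bbox_px
    draw = ImageDraw.Draw(image)
    draw.rectangle(bbox_px, fill="black")
    fraction = random.uniform(*font_size_fraction_range)
    font = ImageFont.truetype(font_path, max(1, int((y_max - y_min) * fraction)))
    draw.text((x_min, y_min), text, font=font, fill="red")
    return image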



Main Features of the TextImage Augmentation

The library can be used for two main purposes:

  1. Inserting any text on the image: This feature allows you to overlay text on document images, effectively generating synthetic data. By using any random image as a background and rendering completely new text, you can create diverse training samples. A similar technique, called SynthDOG, was introduced in the OCR-free document understanding transformer paper.

  2. Inserting augmented text on the image: This includes the following text augmentations:

    • Random deletion: Randomly removes words from the text.
    • Random swapping: Swaps words within the text.
    • Stop word insertion: Inserts common stop words into the text.

Combining these augmentations with other image transformations from Albumentations allows for the simultaneous modification of images and text. You can retrieve the augmented text as well.

Note: The initial version of the data augmentation pipeline, presented in this repo, included synonym replacement. It was removed in this version because it caused a significant time overhead.



Installation

!pip install -U pillow
!pip install albumentations
!pip install nltk

import json

import albumentations as A
import cv2
import nltk
from matplotlib import pyplot as plt

# Download the stop word list used by the insertion augmentation.
nltk.download('stopwords')
from nltk.corpus import stopwords



Visualization

def visualize(image):
    # Display an image without axes, large enough to read document text.
    plt.figure(figsize=(20, 15))
    plt.axis('off')
    plt.imshow(image)



Load data

Note that for this type of augmentation you can use the IDL and PDFA datasets, as they provide the bounding boxes of the lines that you want to modify.
For this tutorial, we will focus on a sample from the IDL dataset.

# OpenCV reads images in BGR channel order; convert to RGB for display.
bgr_image = cv2.imread("examples/original/fkhy0236.tif")
image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

# Line-level annotations (text and bounding boxes) for the sample document.
with open("examples/original/fkhy0236.json") as f:
    labels = json.load(f)

font_path = "/usr/share/fonts/truetype/liberation/LiberationSerif-Regular.ttf"

visualize(image)


We need to preprocess the data correctly, since the transform expects bounding boxes in normalized Pascal VOC format, i.e., [x_min, y_min, x_max, y_max] with coordinates scaled to [0, 1]. Hence, we construct the metadata as follows:

page = labels['pages'][0]

def prepare_metadata(page: dict) -> list:
    # Convert IDL-style normalized [left, top, width, height] boxes into
    # normalized Pascal VOC [x_min, y_min, x_max, y_max] boxes.
    metadata = []

    for text, box in zip(page['text'], page['bbox']):
        left, top, width_norm, height_norm = box

        metadata.append({
            "bbox": [left, top, left + width_norm, top + height_norm],
            "text": text
        })

    return metadata

metadata = prepare_metadata(page)
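A quick sanity check confirms the structure of the prepared metadata:

# Each entry pairs a normalized [x_min, y_min, x_max, y_max] box with its text.
print(len(metadata), "lines")
print(metadata[0])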



Random Swap

transform = A.Compose([
    A.TextImage(
        font_path=font_path,
        p=1,
        augmentations=["swap"],
        clear_bg=True,
        font_color="red",
        fraction_range=(0.5, 0.8),
        font_size_fraction_range=(0.8, 0.9),
    )
])
transformed = transform(image=image, textimage_metadata=metadata)
visualize(transformed["image"])




Random Deletion

transform = A.Compose([
    A.TextImage(
        font_path=font_path,
        p=1,
        augmentations=["deletion"],
        clear_bg=True,
        font_color="red",
        fraction_range=(0.5, 0.8),
        font_size_fraction_range=(0.8, 0.9),
    )
])
transformed = transform(image=image, textimage_metadata=metadata)
visualize(transformed['image'])




Random Insertion

In random insertion, we insert random words or phrases into the text. In this case, we use stop words: common words in a language that are often ignored or filtered out during natural language processing (NLP) tasks because they carry less meaningful information than other words. Examples of stop words include "is," "the," "in," "and," and "of."

stops = stopwords.words('english')
transform = A.Compose([
    A.TextImage(
        font_path=font_path,
        p=1,
        augmentations=["insertion"],
        stopwords=stops,
        clear_bg=True,
        font_color="red",
        fraction_range=(0.5, 0.8),
        font_size_fraction_range=(0.8, 0.9),
    )
])
transformed = transform(image=image, textimage_metadata=metadata)
visualize(transformed['image'])




Can we combine this with other transformations?

Let's define a complex transformation pipeline using A.Compose, which includes text insertion with specified font properties and stopwords, Planckian jitter, and affine transformations. First, with A.TextImage we insert text into the image using the specified font properties, with the background cleared and a red font color. The fraction of lines to modify and the font size range are also specified.
Then, with A.PlanckianJitter, we alter the color balance of the image. Finally, with A.Affine, we apply affine transformations, which can include scaling, rotating, and translating the image.

transform_complex = A.Compose([
    A.TextImage(
        font_path=font_path,
        p=1,
        augmentations=["insertion"],
        stopwords=stops,
        clear_bg=True,
        font_color="red",
        fraction_range=(0.5, 0.8),
        font_size_fraction_range=(0.8, 0.9),
    ),
    A.PlanckianJitter(p=1),
    A.Affine(p=1),
])
transformed = transform_complex(image=image, textimage_metadata=metadata)
visualize(transformed["image"])


To extract information about the bounding box indices where the text was altered, along with the corresponding transformed text data,
run the following cell. This data can be used to train models to recognize and process text changes in images.

transformed['overlay_data']
[{'bbox_coords': (375, 1149, 2174, 1196),
  'text': "Lionberger, Ph.D., (Title: if Introduction to won i FDA's yourselves Draft Guidance once of the wasn't General Principles",
  'original_text': "Lionberger, Ph.D., (Title: Introduction to FDA's Draft Guidance of the General Principles",
  'bbox_index': 12,
  'font_color': 'red'},
 {'bbox_coords': (373, 1677, 2174, 1724),
  'text': "After off needn't were a brief break, ADC member mustn Jeffrey that Dayno, MD, Chief Medical Officer for at their Egalet",
  'original_text': 'After a brief break, ADC member Jeffrey Dayno, MD, Chief Medical Officer at Egalet',
  'bbox_index': 19,
  'font_color': 'red'},
 {'bbox_coords': (525, 2109, 2172, 2156),
  'text': 'll Brands recognize the has importance and of a generics ADF guidance to ensure which after',
  'original_text': 'Brands recognize the importance of a generics ADF guidance to ensure',
  'bbox_index': 23,
  'font_color': 'red'}]
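As a minimal sketch of how this output could feed a fine-tuning pipeline, each altered region can be cropped and paired with its augmented transcription; the key names below match the overlay_data shown above, and the training step itself is left as a placeholder:

augmented_image = transformed["image"]

for entry in transformed["overlay_data"]:
    # bbox_coords are pixel coordinates: (x_min, y_min, x_max, y_max).
    x_min, y_min, x_max, y_max = entry["bbox_coords"]
    crop = augmented_image[y_min:y_max, x_min:x_max]
    target_text = entry["text"]            # augmented transcription
    source_text = entry["original_text"]   # transcription before augmentation
    # Feed (crop, target_text) or (source_text, target_text) to your pipeline.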
  



Synthetic Data Generation

This augmentation method can be extended to the generation of synthetic data, as it enables the rendering of text on any background or template.

template = cv2.imread('template.png')
image_template = cv2.cvtColor(template, cv2.COLOR_BGR2RGB)
transform = A.Compose([
    A.TextImage(
        font_path=font_path,
        p=1,
        clear_bg=True,
        font_color="red",
        font_size_fraction_range=(0.5, 0.7),
    )
])

metadata = [{
    "bbox": [0.1, 0.4, 0.5, 0.48],
    "text": "Some smart text goes here.",
}, {
    "bbox": [0.1, 0.5, 0.5, 0.58],
    "text": "Hope you find it helpful.",
}]

transformed = transform(image=image_template, textimage_metadata=metadata)
visualize(transformed['image'])
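Because the font size is re-sampled from font_size_fraction_range on every call, re-applying the transform yields slightly different renderings. For instance, to generate several variants of the same template:

# Re-apply the randomized transform to obtain multiple synthetic variants
# of the same template (the font size is re-sampled each time).
variants = [
    transform(image=image_template, textimage_metadata=metadata)["image"]
    for _ in range(4)
]
visualize(variants[0])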




Conclusion

In collaboration with Albumentations AI, we introduced TextImage Augmentation, a multimodal technique that modifies document images along with the text within them. By combining text augmentations such as Random Insertion, Deletion, Swap, and Stopword Replacement with image modifications, this pipeline allows for the generation of diverse training samples.

For detailed parameters and use case illustrations, refer to the Albumentations AI Documentation. We hope you find these augmentations useful for enhancing your document image processing workflows.



References

@inproceedings{kim2022ocr,
  title={OCR-free Document Understanding Transformer},
  author={Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun},
  booktitle={European Conference on Computer Vision},
  pages={498--517},
  year={2022},
  organization={Springer}
}


