The recent publication of the DocLayNet dataset by IBM Research, and of Document Understanding models (which detect layout and text) on Hugging Face (LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT), makes it possible to build Document AI applications.
Many firms and individuals are waiting for such models: the ability to search for information in documents, classify them, and interact with them via NLP models such as QA, NER, and even chatbots (hmm… who's talking about ChatGPT here?) is in high demand.
Furthermore, to encourage AI professionals to train this kind of model, IBM Research has just launched a competition: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.
In this context, and to help as many people as possible explore and better understand the DocLayNet dataset, the following resources are available:
- a resource to facilitate using DocLayNet with annotated text (and not only with bounding boxes) (to read: "Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)");
- an APP to visualize the annotated bounding boxes of the lines and paragraphs of the documents in the dataset (to read: "Document AI | DocLayNet image viewer APP");
- a model finetuned on the DocLayNet base dataset with overlapping chunks of 384 tokens, which uses the XLM-RoBERTa base model, along with its inference app and production code;
- a model finetuned on the DocLayNet base dataset with overlapping chunks of 512 tokens, which uses the XLM-RoBERTa base model, along with its inference app and production code;
- a model finetuned on the DocLayNet base dataset with overlapping chunks of 384 tokens, which uses the XLM-RoBERTa base tokenizer, along with its inference app and production code;
- a model finetuned on the DocLayNet base dataset with overlapping chunks of 512 tokens, along with its inference app and production code.
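The models above were finetuned with overlapping chunks of 384 or 512 tokens. As a minimal sketch of what such chunking means (the chunk sizes come from the list above, but the `overlap` value and the function below are illustrative assumptions, not the actual training code):

```python
def chunk_with_overlap(tokens, chunk_size=384, overlap=128):
    """Split a token sequence into chunks of up to `chunk_size`,
    each sharing `overlap` tokens with the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already covers the end of the sequence
    return chunks

# Stand-in for the token ids of a long document page
tokens = list(range(1000))
chunks = chunk_with_overlap(tokens, chunk_size=384, overlap=128)
print(len(chunks), len(chunks[0]))  # → 4 384
```

Overlap ensures that tokens near a chunk boundary are also seen with context from the neighboring chunk, which matters for sequence-labeling models like these.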
APP
To compare these 2 models, there is now an APP 🙂
Notebook with Gradio APP
Here is the APP notebook 🙂
This notebook runs a Gradio App that processes the first page of any uploaded PDF. Like our other Document Understanding APPs, this APP displays not only the paragraph-labeled image of the first page for each of the 2 models but also the DataFrame of labeled texts.
This notebook can be run on Google Colab. It is hosted on GitHub.
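Producing one label per text block from a token-classification model implies aggregating token-level predictions. A minimal sketch of such an aggregation step (the majority-vote strategy and all names below are illustrative assumptions, not the notebook's actual code):

```python
from collections import Counter

def aggregate_block_labels(token_labels, token_blocks):
    """Assign each text block the majority label among its tokens.

    token_labels: predicted label per token, e.g. ["Text", "Title", ...]
    token_blocks: block id per token (same length as token_labels)
    """
    votes = {}
    for label, block in zip(token_labels, token_blocks):
        votes.setdefault(block, Counter())[label] += 1
    # most_common(1) returns [(label, count)] for the winning label
    return {block: counter.most_common(1)[0][0]
            for block, counter in votes.items()}

labels = ["Title", "Title", "Text", "Text", "Text", "Section-header"]
blocks = [0, 0, 1, 1, 1, 2]
print(aggregate_block_labels(labels, blocks))
# → {0: 'Title', 1: 'Text', 2: 'Section-header'}
```

A block-level table like this is what ends up in the DataFrame of labeled texts shown by the APP.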
Let’s have a look at a report from the European Commission.
Page 1
Our Gradio app renders the first page of this PDF.
We can see from the paragraph-labeled images that there are differences: our Document Understanding LiLT base model seems to work better:
- it labeled the Page Header text well,
- it does a better job of labeling text blocks.
However, neither of the 2 models labeled the title of the page.
Page 2
This time, we can see from the paragraph-labeled images that there are again differences, BUT it is our Document Understanding LayoutXLM base model that seems to work better:
- it correctly detects the Sub-Header.
About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.