
Generative AI on Research Papers Using Nougat Model




Recent advances in large language models (LLMs) like GPT-4 have shown impressive capabilities in generating coherent text. Nonetheless, accurately parsing and understanding research papers remains a particularly difficult task for AI. Research papers contain complex formatting, math equations, tables, figures, and domain-specific language. Their information density is very high, and important semantics are encoded in the formatting itself.

In this article, I’ll demonstrate how a new model from Meta called Nougat can help parse research papers accurately. We then combine it with an LLM pipeline that extracts and summarizes all of the tables in the paper.

The potential here is immense. There is a vast amount of data and information locked up in research papers and books that has never been parsed accurately. Accurate parsing enables its use in many different applications, including LLM retraining.

Nougat is a visual transformer model developed by researchers at Meta AI that can convert images of document pages into structured text [1]. It takes a rasterized image of a document page as input and outputs text in a lightweight markup language.

The key advantage of Nougat is that it relies solely on the document image and doesn’t need any OCR text. This allows it to properly recover semantic structure such as math equations. It is trained on millions of academic papers from arXiv and PubMed to learn the patterns of research paper formatting and language.
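If you want to try this step yourself, here is a minimal sketch that runs Nougat on a single page image through the Hugging Face transformers library (the facebookresearch/nougat repository also ships a command-line tool). The checkpoint name facebook/nougat-base is the published base model; the file name sample_page.png and the generation length are illustrative assumptions on my part.

```python
# Minimal sketch: convert one rasterized page image to Nougat markup.
# Assumes transformers >= 4.34 (which includes the Nougat classes) and a
# local page image "sample_page.png" — both are illustrative assumptions.
import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

# Nougat takes the page image directly — no OCR text is involved.
image = Image.open("sample_page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The decoder generates markup tokens autoregressively from the image.
with torch.no_grad():
    outputs = model.generate(pixel_values, max_new_tokens=1024)

markup = processor.batch_decode(outputs, skip_special_tokens=True)[0]
markup = processor.post_process_generation(markup, fix_markdown=True)
print(markup)  # lightweight markup: headings, LaTeX math, tables
```

For a full paper you would rasterize each PDF page first (e.g., with a library such as pypdfium2) and run the pages through the model in batches.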

The figure below from [1] shows how math equations in a PDF are reproduced in LaTeX and rendered accurately.

Source: Fig. 5 from the Nougat paper — https://arxiv.org/pdf/2308.13418.pdf
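To make concrete what “reproduced in LaTeX” means, here is a generic illustration (not the equation from the figure): a formula that exists in the PDF only as rendered glyphs comes back from Nougat as editable LaTeX source.

```latex
% Illustrative only — not taken from the paper's Figure 5.
% Rendered form: the Gaussian density f(x).
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```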

Nougat uses a visual transformer encoder-decoder architecture. The encoder uses a Swin Transformer to encode the document image into latent embeddings, processing the image hierarchically using shifted windows. The decoder then generates the output text tokens autoregressively, using cross-attention over the encoder…
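To connect this back to the table pipeline mentioned earlier: Nougat’s markup typically preserves tables as LaTeX table environments, so a regular expression can pull them out and an LLM can summarize each one. The sketch below assumes the markup string produced in the earlier example, the openai Python client, and a gpt-4 model name; none of these specifics come from the original article.

```python
# Sketch of the second pipeline stage: find every LaTeX table in the
# Nougat markup and ask an LLM to summarize it. The OpenAI client and
# model name are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_tables(markup: str) -> list[str]:
    """Return every \\begin{table}...\\end{table} block in the markup."""
    return re.findall(r"\\begin\{table\}.*?\\end\{table\}", markup, re.DOTALL)

def summarize_table(table_latex: str) -> str:
    """Ask the LLM for a short plain-English summary of one table."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You summarize LaTeX tables from research papers."},
            {"role": "user",
             "content": f"Summarize this table in 2-3 sentences:\n{table_latex}"},
        ],
    )
    return response.choices[0].message.content

# "markup" is the Nougat output from the previous sketch.
for i, table in enumerate(extract_tables(markup), start=1):
    print(f"Table {i}: {summarize_table(table)}")
```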
