Training frontier large multimodal models (LMMs) requires large-scale datasets of free-form, interleaved sequences of images and text. Although open-source LMMs have evolved rapidly, there is still a significant shortage of large-scale multimodal interleaved datasets that are openly available. The importance of these datasets can’t be overstated: they form the foundation for building advanced AI systems capable of understanding and generating content across different modalities. Without a sufficient supply of comprehensive interleaved datasets, the development of more sophisticated and capable LMMs is significantly hindered. Such datasets allow models to learn from a diverse range of inputs, making them more versatile and effective across applications. Their scarcity also poses a challenge to the open-source community, which relies on shared resources to drive innovation and collaboration.
Open-source LMMs have made significant strides in recent years, but their growth is hampered by the limited availability of large-scale interleaved datasets. Overcoming this obstacle requires concerted efforts to curate, annotate, and release more comprehensive datasets that can support the continued development and refinement of multimodal models. Creating and disseminating these datasets also involves several technical and logistical hurdles: data collection must be extensive and representative of the diverse contexts in which LMMs will be deployed; annotation requires care to ensure that the interleaved sequences of images and text are aligned in a way that enhances the model’s learning; and releasing the data as open source means addressing legal and ethical considerations around privacy and usage rights. Expanding the supply of high-quality, large-scale multimodal interleaved datasets is essential for the future of AI research. By addressing the current scarcity, the AI community can foster greater innovation and collaboration, leading to more powerful and versatile LMMs capable of tackling complex, real-world problems.
Building on that motivation, this article covers MINT-1T, the largest and most diverse open-source multimodal interleaved dataset to date. MINT-1T offers roughly 10x the scale of existing open-source datasets, with one trillion text tokens and 3.4 billion images, and it introduces previously untapped sources such as PDF files and ArXiv papers. Because multimodal interleaved datasets do not scale easily, it is important that MINT-1T documents and shares its data curation process so others can run experiments on similarly information-rich data. The experiments show that LMMs trained on MINT-1T are competitive with, and in many settings exceed, models trained on the previous state-of-the-art dataset, OBELICS.
MINT-1T: A Multimodal Dataset with One Trillion Tokens
Large open-source pre-training datasets have been pivotal for the research community in exploring data engineering and training transparent, open-source models. In the text domain, early works such as C4 and The Pile played crucial roles in enabling the community to train the first generation of open-source large language models like GPT-J, GPT-Neo, and others. These foundational efforts also paved the way for subsequent improvements in data filtering methods and scaling. Similarly, in the image-text space, large-scale open-source datasets have spurred innovations in better data curation methods, such as data filtering networks and T-MARS. There is a noticeable shift among frontier labs towards training large multimodal models (LMMs) that require extensive multimodal interleaved datasets comprising free-form sequences of images and text. As the capabilities of frontier models advance rapidly, a significant gap is emerging in the multimodal training data available to closed- versus open-source models. Current open-source multimodal interleaved datasets are smaller and less diverse than their text-only counterparts, being sourced primarily from HTML documents, which limits the breadth and variety of the data. This limitation impedes the development of robust open-source LMMs and widens the disparity between the capabilities of open- and closed-source models.
To address this gap, MINT-1T was created as the largest and most diverse open-source multimodal interleaved dataset to date. MINT-1T comprises a total of one trillion text tokens and 3.4 billion images, sourced from diverse origins such as HTML, PDFs, and ArXiv. Before MINT-1T, the largest open-source dataset in this area was OBELICS, which included 115 billion text tokens and 353 million images, all sourced from HTML.
The contributions of MINT-1T are as follows:
- Data Engineering: Scaling multimodal interleaved data presents more of an engineering challenge than building either text-only or image-text pair datasets. Document sizes are much larger, and the original ordering of images and text must be preserved.
- Diversity: MINT-1T is the first dataset in the multimodal interleaved space to gather high-quality multimodal documents at large scale from sources like CommonCrawl PDFs and ArXiv.
- Model Experiments: Experiments show that LMMs trained on MINT-1T not only match but potentially surpass the performance of models trained on the best existing open-source dataset, OBELICS, while offering a tenfold increase in scale.
MINT-1T: Constructing the Dataset
MINT-1T curates a large-scale open-source dataset that draws on more diverse sources of interleaved documents, such as PDFs and ArXiv papers. This section details MINT-1T’s methods for sourcing multimodal documents, filtering low-quality content, deduplicating data, and removing not-safe-for-work (NSFW) and otherwise undesirable material. The final dataset comprises 922 billion (B) HTML tokens, 106B PDF tokens, and 9B ArXiv tokens.
Sourcing Large Quantities of Multimodal Documents
HTML Pipeline
MINT-1T follows OBELICS’s method for extracting interleaved multimodal documents from CommonCrawl WARC files by parsing each WARC entry’s DOM tree. While OBELICS only processed documents from CommonCrawl dumps spanning February 2020 to February 2023, MINT-1T expands the document pool to include HTML documents from May 2017 to April 2024 (with full dumps from October 2018 to April 2024 and partial dumps from earlier years). Similar to OBELICS, MINT-1T filters out documents containing no images, more than thirty images, or any images with URLs that include inappropriate substrings such as logo, avatar, porn, and xxx.
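To make the document-level filter concrete, here is a minimal sketch in Python, assuming each parsed document is represented as a dict with a list of extracted image URLs (the representation and helper name are assumptions, not the authors’ code):

```python
# Hypothetical document-level filter mirroring the rules described above.
BANNED_SUBSTRINGS = ("logo", "avatar", "porn", "xxx")

def keep_html_document(doc: dict) -> bool:
    """Keep a document only if it has 1-30 images and no banned URL substrings."""
    image_urls = doc.get("images", [])
    if len(image_urls) == 0 or len(image_urls) > 30:
        return False
    return not any(s in url.lower() for url in image_urls for s in BANNED_SUBSTRINGS)
```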
PDF Pipeline
MINT-1T sources PDF documents from CommonCrawl WAT files from the February 2023 to April 2024 dumps. Initially, all PDF links are extracted from these dumps. MINT-1T then attempts to download and read the PDFs using PyMuPDF, discarding PDFs over 50MB (likely containing large images) and those over 50 pages long. Pages without text are excluded, and a reading order is established for the remaining pages. Reading order is determined by finding the bounding box of all text blocks on a page, clustering the blocks into columns, and ordering them from top left to bottom right. Images are integrated into the sequence based on their proximity to text blocks on the same page.
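Below is a rough sketch of such a page-level reading-order heuristic using PyMuPDF; the greedy column clustering and the gap threshold are simplifying assumptions, not the authors’ exact implementation:

```python
# A simplified reading-order heuristic: group text blocks into columns by their
# left edge, then read columns left to right and blocks top to bottom.
import fitz  # PyMuPDF

def page_text_in_reading_order(page: "fitz.Page", column_gap: float = 50.0) -> str:
    # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type); type 0 = text.
    blocks = [b for b in page.get_text("blocks") if b[6] == 0 and b[4].strip()]
    columns: list[list[tuple]] = []
    for block in sorted(blocks, key=lambda b: b[0]):
        for col in columns:
            if abs(col[0][0] - block[0]) < column_gap:  # close enough to an existing column
                col.append(block)
                break
        else:
            columns.append([block])
    ordered = []
    for col in sorted(columns, key=lambda c: min(b[0] for b in c)):
        ordered.extend(sorted(col, key=lambda b: b[1]))  # top-to-bottom within a column
    return "\n".join(b[4] for b in ordered)
```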
ArXiv Pipeline
MINT-1T builds ArXiv interleaved documents from LaTeX source code, using TexSoup to find figure tags and interleave the figures with the paper text. For multi-file papers, MINT-1T identifies the main TeX file and replaces input tags with the contents of the referenced files. The LaTeX code is cleaned up by removing imports, the bibliography, tables, and citation tags. Since ArXiv is already a highly curated data source, no additional filtering and deduplication are performed.
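A hedged sketch of this kind of LaTeX cleanup with TexSoup is shown below; the list of removed commands and the way image paths are pulled out of includegraphics nodes are assumptions rather than the exact MINT-1T recipe:

```python
# Hypothetical LaTeX cleanup: strip noisy commands/environments and collect figure paths.
from TexSoup import TexSoup

def clean_and_extract(latex_source: str):
    soup = TexSoup(latex_source)
    # Remove commands and environments that add noise to the interleaved text.
    for name in ("usepackage", "cite", "citep", "citet", "bibliography", "table", "tabular"):
        for node in soup.find_all(name):
            node.delete()
    # Collect the image paths referenced by figures.
    image_paths = [str(node.args[-1]).strip("{}") for node in soup.find_all("includegraphics")]
    return str(soup), image_paths
```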
Text Quality Filtering
MINT-1T avoids using model-based heuristics for text filtering, following practices established by RefinedWeb, Dolma, and FineWeb. First, non-English documents are eliminated using fastText’s language identification model (with a confidence threshold of 0.65). Documents with URLs containing NSFW substrings are also removed to exclude pornographic and otherwise undesirable content. Text filtering methods from RefinedWeb are then applied, specifically removing documents with excessive duplicate n-grams or those identified as low quality by the MassiveText rules.
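The language-identification step, for example, can be approximated as follows (the threshold matches the description above, but the wrapper itself is an assumption):

```python
# Hypothetical English-only filter using fastText's lid.176.bin language-ID model.
import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # downloaded separately from fastText

def is_english(text: str, threshold: float = 0.65) -> bool:
    # fastText predicts on a single line, so newlines are stripped first.
    labels, probs = lang_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold
```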
Image Filtering
After curating PDFs and HTML files, MINT-1T attempts to download all image URLs in the HTML dataset, discarding non-retrievable links and removing documents with no valid image links. Images smaller than 150 pixels are discarded to avoid noisy images such as logos and icons, and images larger than 20,000 pixels are also removed, as they often correspond to off-topic images. For HTML documents, images with an aspect ratio greater than two are removed to filter out low-quality images such as advertisement banners. For PDFs, the threshold is relaxed to three to preserve scientific figures and tables.
The figure above illustrates how MINT-1T uniquely includes data from PDFs and ArXiv documents beyond HTML sources.
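A minimal sketch of these size and aspect-ratio filters is given below; interpreting “smaller than 150 pixels” as the shortest side and “larger than 20,000 pixels” as the longest side is an assumption:

```python
# Hypothetical image filter applying the dimension and aspect-ratio thresholds above.
from PIL import Image

def keep_image(path: str, source: str = "html") -> bool:
    with Image.open(path) as img:
        width, height = img.size
    if min(width, height) < 150 or max(width, height) > 20_000:
        return False
    aspect_ratio = max(width, height) / min(width, height)
    max_ratio = 2.0 if source == "html" else 3.0  # looser threshold for PDF figures/tables
    return aspect_ratio <= max_ratio
```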
Safety Filtering
- NSFW Image Filtering: MINT-1T applies an NSFW image detector to all images in the dataset. If a document contains even a single NSFW image, the entire document is discarded.
- Personally Identifiable Information Removal: To mitigate the risk of personal data leakage, email addresses and IP addresses in the text data are anonymized. Emails are replaced with templates such as “[email protected]”, and IPs with randomly generated non-functional addresses (a regex sketch of this step follows the list).
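The anonymization step can be approximated with simple regular expressions; the replacement strings below are placeholders, not necessarily the exact templates used in MINT-1T:

```python
# Hypothetical PII anonymization: swap emails for a placeholder and IPs for
# randomly generated addresses in a non-routable documentation range.
import random
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text: str) -> str:
    text = EMAIL_RE.sub("anonymous@example.com", text)
    text = IPV4_RE.sub(lambda _: "192.0.2." + str(random.randint(1, 254)), text)
    return text
```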
Deduplication
MINT-1T performs paragraph- and document-level text deduplication within each CommonCrawl snapshot, as well as image deduplication to remove repetitive, uninformative images such as icons and logos. All deduplication steps are conducted separately for each data source.
Paragraph and Document Deduplication
Following Dolma’s methodology, MINT-1T uses a Bloom filter for efficient text deduplication, setting the false positive rate to 0.01 and deduplicating 13-gram paragraphs (delimited by double newlines) in each document. If more than 80% of a document’s paragraphs are duplicates, the entire document is discarded.
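A hedged sketch of paragraph-level Bloom-filter deduplication is shown below (using pybloom_live for illustration; Dolma’s actual implementation and the 13-gram handling differ):

```python
# Hypothetical paragraph deduplication with a Bloom filter at a 0.01 false-positive rate.
from pybloom_live import BloomFilter

bloom = BloomFilter(capacity=100_000_000, error_rate=0.01)

def deduplicate_document(text: str, drop_ratio: float = 0.8):
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    kept, duplicates = [], 0
    for paragraph in paragraphs:
        if paragraph in bloom:
            duplicates += 1
        else:
            bloom.add(paragraph)
            kept.append(paragraph)
    if paragraphs and duplicates / len(paragraphs) > drop_ratio:
        return None  # discard the whole document
    return "\n\n".join(kept)
```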
Removing Common Boilerplate Text
After paragraph deduplication, MINT-1T removes short, common boilerplate sentences in HTML documents, such as “Skip to content” or “Blog Archive.” This is done by running exact paragraph deduplication on 2% of each CommonCrawl snapshot, in line with CCNet practices, which mostly removes common boilerplate text.
The figure above illustrates the filtering process for MINT-1T and shows how tokens are removed throughout the data pipeline for HTML, PDFs, and ArXiv papers.
Image Deduplication
Within each CommonCrawl snapshot, MINT-1T removes frequently occurring images based on SHA256 hashes. Rather than strict deduplication, only images that appear more than ten times within a snapshot are removed, following Multimodal-C4 practices. In line with OBELICS, repeated images within a single document are removed, keeping only the first occurrence.
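A minimal sketch of this frequency-based image deduplication might look as follows (the function names and in-memory counting are illustrative assumptions):

```python
# Hypothetical snapshot-level image deduplication: hash every image and flag any
# hash that occurs more than ten times for removal.
import hashlib
from collections import Counter

def sha256_of(image_bytes: bytes) -> str:
    return hashlib.sha256(image_bytes).hexdigest()

def frequent_hashes(all_image_bytes, max_count: int = 10) -> set:
    counts = Counter(sha256_of(b) for b in all_image_bytes)
    return {h for h, c in counts.items() if c > max_count}
```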
Infrastructure
Throughout data processing, MINT-1T had access to an average of 2,350 CPU cores from a mixture of 190-processor and 90-processor nodes. In total, roughly 4.2 million CPU hours were used to build this dataset.
Comparing Document Composition in MINT-1T with OBELICS
In evaluating the composition of interleaved datasets, two key characteristics are examined: the distribution of text tokens per document and the number of images per document. For this evaluation, 50,000 documents were randomly sampled from OBELICS and from each data source in MINT-1T. GPT-2’s tokenizer was used to calculate the number of text tokens. Outliers were removed by excluding documents that fell outside 1.5 times the interquartile range for the number of text tokens and images. As shown in the following figure, the HTML subset of MINT-1T aligns closely with the token distribution seen in OBELICS. However, documents sourced from PDFs and ArXiv tend to be longer than HTML documents on average, highlighting the benefit of sourcing data from diverse origins. Figure 5 examines the image density across all documents, revealing that PDF and ArXiv documents contain more images compared to HTML documents, with ArXiv samples being the most image-dense.
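This kind of composition analysis is straightforward to reproduce; the sketch below (using Hugging Face’s GPT-2 tokenizer and NumPy, with the IQR rule as described) is an illustrative assumption rather than the authors’ script:

```python
# Count GPT-2 tokens per document and drop outliers outside 1.5x the interquartile range.
import numpy as np
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def token_counts_without_outliers(documents):
    counts = np.array([len(tokenizer.encode(doc)) for doc in documents])
    q1, q3 = np.percentile(counts, [25, 75])
    iqr = q3 - q1
    mask = (counts >= q1 - 1.5 * iqr) & (counts <= q3 + 1.5 * iqr)
    return counts[mask]
```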
How Do Different Data Sources Improve Document Diversity?
An important motivation for expanding the pool of multimodal documents beyond HTML is improved domain coverage. To quantify the diversity and depth of this coverage, a Latent Dirichlet Allocation (LDA) model was trained on 100,000 documents sampled from the OBELICS dataset, the HTML subset of MINT-1T, and the PDF subset (excluding ArXiv) of MINT-1T to obtain 200 topics. GPT-4 was then used to classify each topic’s set of words into a dominant domain, such as Health & Medicine, Science, Business, Humanities, and History, based on the MMMU domains (a sketch of this topic-modeling setup follows the list below). The analysis reveals distinct trends in domain distribution:
- OBELICS: This dataset shows a pronounced concentration in “Humanities and Social Sciences”. This may be attributed to its data construction process, which filters out documents that do not resemble Wikipedia articles, skewing the distribution towards general-knowledge and humanities-focused content.
- MINT-1T’s HTML Subset: In contrast to OBELICS, the HTML subset of MINT-1T is not strongly biased towards any specific domain, suggesting a broader and more balanced domain representation.
- MINT-1T’s PDF Subset: There is a higher proportion of “Science and Technology” documents within the PDF subset of MINT-1T. This trend is likely due to the nature of scientific communication, where PDFs are the preferred format for sharing detailed research papers and technical reports.
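For reference, the topic-modeling step could be set up roughly as follows (scikit-learn here; the exact tooling, vocabulary size, and preprocessing are assumptions):

```python
# Fit a 200-topic LDA model and return the top words per topic, which can then be
# mapped to MMMU-style domains (e.g., with GPT-4, as described above).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_topic_words(documents, n_topics: int = 200, n_words: int = 10):
    vectorizer = CountVectorizer(max_features=50_000, stop_words="english")
    counts = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_words:][::-1]] for topic in lda.components_]
```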
MINT-1T: Results and Experiments
For all experiments, MINT-1T trains the model on 50% image-text captioning batches and 50% multimodal interleaved batches. A maximum of 2048 multimodal tokens is sampled from each interleaved document and 340 tokens from each image-text sample. Similar to Flamingo, an “end” token is added to indicate the end of an adjacent image-text sequence. During training, 50% of single-image interleaved documents are randomly dropped to upsample multi-image documents. The image-text data consists of a mixture of internally curated caption datasets. The model’s ability to reason about multimodal interleaved sequences is assessed through its in-context learning abilities and multi-image reasoning performance.
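Two of these sampling rules are easy to illustrate; the sketch below (function names and the alternating batch scheme are assumptions) shows how single-image documents might be downsampled and how batches could alternate between the two data types:

```python
# Hypothetical sampling helpers: drop half of the single-image interleaved documents
# and alternate 50/50 between interleaved and image-text captioning batches.
import random

def keep_interleaved_document(num_images: int, drop_prob: float = 0.5) -> bool:
    if num_images == 1:
        return random.random() >= drop_prob  # randomly drop 50% of single-image docs
    return True

def next_batch(interleaved_iter, caption_iter, step: int):
    return next(interleaved_iter) if step % 2 == 0 else next(caption_iter)
```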
The above figure illustrates the proportion of documents from each domain in MMMU for OBELICS and subsets of MINT-1T.
In-Context Learning: The models are evaluated on four-shot and eight-shot in-context learning performance on captioning benchmarks (COCO (Karpathy test) and TextCaps (validation)) and visual question answering datasets (VQAv2 (validation), OK-VQA (validation), TextVQA (validation), and VizWiz (validation)). Demonstrations are randomly sampled from the training set. Scores are averaged over multiple evaluation runs with randomized demonstrations to account for sensitivity to the chosen prompts. Different prompts are ablated for each task to select the best-performing ones.
Multi-Image Reasoning: Models are evaluated on MMMU (containing both single- and multi-image questions) and Mantis-Eval (all multi-image questions) to probe multi-image reasoning abilities beyond in-context learning evaluations.
Training on HTML Documents
First, the HTML portion of MINT-1T is compared with OBELICS, since OBELICS is the previous leading interleaved dataset and is also curated from HTML documents. Two models are trained on the HTML portions of MINT-1T and OBELICS for a total of 10B multimodal tokens, and their in-context learning performance is assessed. The following table presents the 4-shot and 8-shot performance on common benchmarks; the model trained on MINT-1T HTML documents performs better than OBELICS on VQA tasks but worse on captioning benchmarks. On average, OBELICS performs slightly better than MINT-1T (HTML).
Adding PDF and ArXiv Documents
Subsequently, training is conducted on MINT-1T’s full data sources, with a mixture of HTML, PDF, and ArXiv documents. The interleaved documents are sampled 50% from HTML, 45% from PDFs, and 5% from ArXiv, and the model is trained for a total of 10B multimodal tokens. As seen in the table above, the model trained on the full MINT-1T data mixture outperforms OBELICS and MINT-1T (HTML) on most in-context learning benchmarks. On more complex multimodal reasoning benchmarks, the MINT-1T model outperforms OBELICS on MMMU but performs worse on Mantis-Eval.
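A trivial sketch of sampling documents with this source mixture is shown below (random.choices stands in for whatever weighted sampler the actual training pipeline uses):

```python
# Hypothetical weighted source sampling: 50% HTML, 45% PDF, 5% ArXiv.
import random

SOURCE_WEIGHTS = {"html": 0.50, "pdf": 0.45, "arxiv": 0.05}

def sample_source() -> str:
    sources, weights = zip(*SOURCE_WEIGHTS.items())
    return random.choices(sources, weights=weights, k=1)[0]
```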
Fine-Grained Trends
How Does In-Context Learning Performance Scale with Demonstrations?
The in-context learning performance is evaluated when prompting with one to eight demonstrations. A single trial per shot count is run for each evaluation benchmark. As seen in the following figure, the model trained on the full MINT-1T mixture outperforms the models trained on the HTML subset of MINT-1T and on OBELICS across all shot counts. The MINT-1T (HTML) model performs slightly worse than OBELICS.
Performance on Captioning and Visual Question Answering Tasks
The following figure presents the average in-context learning performance on captioning and visual question answering (VQA) benchmarks. OBELICS outperforms all MINT-1T variants on four-shot captioning benchmarks and performs slightly worse than MINT-1T on eight-shot captioning. However, MINT-1T significantly outperforms both baselines on VQA benchmarks, and MINT-1T (HTML) also outperforms OBELICS on VQA tasks.
Performance on Different Domains
Including diverse domains in MINT-1T is aimed at improving model generalization. The earlier figure breaks down performance on MMMU by domain. Aside from the Business domain, MINT-1T outperforms OBELICS and MINT-1T (HTML). The performance gain in the Science and Technology domains for MINT-1T is attributed to the prevalence of these domains in ArXiv and PDF documents.
Final Thoughts
In this article, we have discussed MINT-1T, the largest and most diverse open-source multimodal interleaved dataset to date. MINT-1T offers roughly 10x the scale of existing open-source datasets, with one trillion text tokens and 3.4 billion images, and it introduces previously untapped sources such as PDF files and ArXiv papers. Because multimodal interleaved datasets do not scale easily, it is important that MINT-1T shares its data curation process so others can run experiments on similarly information-rich data. The experiments demonstrate that the approach pays off: LMMs trained on MINT-1T are competitive with, and in many settings exceed, models trained on the previous state-of-the-art dataset, OBELICS.