Share your open ML datasets on Hugging Face Hub!

If you're working on data-intensive research or machine learning projects, you need a reliable way to share and host your datasets. Public datasets such as Common Crawl, ImageNet, Common Voice and more are critical to the open ML ecosystem, yet they can be difficult to host and share.

The Hugging Face Hub makes it seamless to host and share datasets, and it is trusted by many leading research institutions, corporations, and government agencies, including Nvidia, Google, Stanford, NASA, THUDM and Barcelona Supercomputing Center.

By hosting a dataset on the Hugging Face Hub, you get instant access to features that can maximize your work's impact:

Generous Limits

Support for big datasets

The Hub can host terabyte-scale datasets, with high per-file and per-repository limits. If you have data to share, the Hugging Face datasets team can help suggest the best format for uploading your data for community usage.
The 🤗 Datasets library makes it easy to upload and download your files, and even create a dataset from scratch. 🤗 Datasets also enables dataset streaming, making it possible to work with large datasets without downloading the whole thing. This can be invaluable for researchers with fewer computational resources who want to work with your datasets, or for selecting small portions of a huge dataset for testing, development or prototyping.
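
As a minimal sketch of streaming, the snippet below previews the first few rows of a Hub dataset without downloading it in full. It assumes the 🤗 Datasets library is installed; `take` and `stream_preview` are hypothetical helper names, and the `train` split default is an assumption that may not hold for every dataset.

```python
from itertools import islice


def take(stream, n):
    """Return the first n items of any iterable, reading no more than n."""
    return list(islice(stream, n))


def stream_preview(repo_id, split="train", n=5):
    """Fetch the first n rows of a Hub dataset in streaming mode."""
    # Imported here so that `take` stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches rows on demand
    # instead of downloading every file up front.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return take(stream, n)
```

For example, `stream_preview("neuralwork/arxiver", n=3)` would return three rows as dictionaries keyed by column name, assuming that repository exposes a `train` split.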

Screenshot of the file size information for a dataset
The Hugging Face Hub can host the big datasets often created for machine learning research.

Note: The Xet team is currently working on a backend update that will increase per-file limits from the current 50 GB to 500 GB while also improving storage and transfer efficiency.

Dataset Viewer

Beyond just hosting your data, the Hub provides powerful tools for exploration. With the Datasets Viewer, users can explore and interact with datasets hosted on the Hub directly in their browser. This gives others an easy way to view and explore your data without downloading it first.

Hugging Face datasets supports many different modalities (audio, images, video, etc.), file formats (CSV, JSON, Parquet, etc.), and compression formats (Gzip, Zip, etc.). See the Datasets File Formats page for more details.

Screenshot of the Datasets Viewer
The Dataset Viewer for the Infinity-Instruct dataset.

The Datasets Viewer also includes a few features that make it easier to explore a dataset.

Full Text Search

Built-in Full Text Search is one of the most powerful features of the Datasets Viewer. Any text columns in a dataset immediately become searchable.

The Arxiver dataset contains 63.4k rows of arXiv research papers converted to Markdown. Using Full Text Search, it is easy to find the papers containing a specific author, such as Ilya Sutskever below.
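
To make the idea concrete, here is a rough local approximation of searching every text column for a term. It is only an illustration in plain Python over a list of row dictionaries; it is not how the Datasets Viewer implements search, `search_text_columns` is a hypothetical name, and the sample rows are illustrative rather than taken from the Arxiver dataset.

```python
def search_text_columns(rows, query):
    """Return rows in which any string-valued column contains the query.

    Matching is case-insensitive substring matching, a simplification of
    real full text search.
    """
    needle = query.casefold()
    return [
        row
        for row in rows
        if any(isinstance(v, str) and needle in v.casefold() for v in row.values())
    ]


# Illustrative rows mixing text and non-text columns.
sample_rows = [
    {
        "title": "Sequence to Sequence Learning with Neural Networks",
        "authors": "Ilya Sutskever, Oriol Vinyals, Quoc V. Le",
        "year": 2014,
    },
    {"title": "Placeholder title", "authors": "Placeholder Author", "year": 2020},
]
matches = search_text_columns(sample_rows, "sutskever")
```

Only string columns are inspected, so searching for `"2014"` here returns nothing even though one row has that year as an integer.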

Sorting

The Datasets Viewer allows you to sort the dataset by clicking on the column headers. This makes it easy to find the most relevant examples in a dataset.

Below is an example of a dataset sorted by the helpfulness column in descending order for the HelpSteer2 dataset.

Third Party Library Support

Hugging Face is fortunate to have third party integrations with the leading open source data tools. Hosting a dataset on the Hub instantly makes it compatible with the tools your users are most familiar with.

Here are some of the libraries Hugging Face supports out of the box:

Library	Description	Monthly PyPI Downloads (2024)
Pandas	Python data analysis toolkit.	258M
Spark	Real-time, large-scale data processing tool in a distributed environment.	29M
Datasets	🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP).	17M
Dask	Parallel and distributed computing library that scales the existing Python and PyData ecosystem.	12M
Polars	A DataFrame library on top of an OLAP query engine.	8.5M
DuckDB	In-process SQL OLAP database management system.	6M
WebDataset	Library to write I/O pipelines for large datasets.	871K
Argilla	Collaboration tool for AI engineers and domain experts that value high quality data.	400K

Most of these libraries let you load or stream a dataset in a single line of code.

Here are some examples with Pandas, Polars and DuckDB:


# Pandas (reading hf:// paths requires the huggingface_hub package)
import pandas as pd
df = pd.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")

# Polars
import polars as pl
df = pl.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")

# DuckDB
import duckdb
duckdb.sql("SELECT * FROM 'hf://datasets/neuralwork/arxiver/data/train.parquet' LIMIT 10")
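
The `hf://` paths used above follow a regular pattern, so a small helper can build them. The sketch below assumes the layout `hf://datasets/<repo_id>[@<revision>]/<path>` with a simple branch or tag name as the revision; `hf_dataset_url` and `preview_lazily` are hypothetical helper names, and the lazy scan assumes Polars is installed.

```python
def hf_dataset_url(repo_id, path, revision=None):
    """Build an hf:// URL for a file inside a dataset repository."""
    repo = f"{repo_id}@{revision}" if revision else repo_id
    return f"hf://datasets/{repo}/{path.lstrip('/')}"


def preview_lazily(url, n=10):
    """Return the first n rows of a remote Parquet file using a lazy scan."""
    # Imported here so the URL helper works without Polars installed.
    import polars as pl

    # scan_parquet builds a lazy query; nothing is read until collect().
    return pl.scan_parquet(url).head(n).collect()


url = hf_dataset_url("neuralwork/arxiver", "data/train.parquet")
```

A lazy scan is worth considering for large files, since the engine can limit how much data it fetches to what the query needs.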

You can find more information about integrated libraries in the Datasets documentation. In addition to the libraries listed above, there are many more community-supported tools that support the Hugging Face Hub, such as Lilac and Spotlight.

SQL Console

The SQL Console provides an interactive SQL editor that runs entirely in your browser, enabling instant data exploration without any setup. Key features include:

One-Click: Open a SQL Console to query a dataset with a single click
Shareable and Embeddable Results: Share and embed interesting query results
Full DuckDB Syntax: Use full SQL syntax with built-in functions for regex, lists, JSON, embeddings, and more

On every public dataset you should see a new SQL Console badge. With just one click you can open a SQL Console to query that dataset.

Querying the Magpie-Ultra dataset for excellent, high quality reasoning instructions.
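
Because the console accepts DuckDB syntax, a query written there can usually be reproduced locally. The sketch below builds a regex-filtered query against a Parquet file on the Hub. The column names `title` and `authors` are assumptions about the Arxiver schema, `build_author_query` and `run_query` are hypothetical helpers, and running the query assumes the `duckdb` package is installed.

```python
def build_author_query(parquet_url, author_pattern, limit=10):
    """Build a DuckDB query selecting rows whose authors match a regex."""
    # Double any single quotes so the values are safe inside SQL string literals.
    url = parquet_url.replace("'", "''")
    pattern = author_pattern.replace("'", "''")
    return (
        "SELECT title, authors "
        f"FROM '{url}' "
        f"WHERE regexp_matches(authors, '{pattern}') "
        f"LIMIT {int(limit)}"
    )


def run_query(sql):
    """Execute the query with DuckDB and return the rows as a list of tuples."""
    # Imported here so the query builder works without DuckDB installed.
    import duckdb

    return duckdb.sql(sql).fetchall()


query = build_author_query(
    "hf://datasets/neuralwork/arxiver/data/train.parquet", "Sutskever", limit=5
)
```

Passing `query` to `run_query` would execute it against the remote file, if the assumed columns exist.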

Security

While making datasets accessible is important, protecting sensitive data is equally crucial. The Hugging Face Hub provides robust security features to help you maintain control over your data while sharing it with the right audiences.

Access Controls

The Hugging Face Hub supports several access control options that determine who can access a dataset.

Public: Anyone can access the dataset.
Private: Only you and people in your organization can access the dataset.
Gated: Control access to your dataset through two options:
- Automatic Approval: Users must provide required information (like name and email) and agree to terms before gaining access
- Manual Approval: You review and manually approve/reject each access request

For more details about gated datasets, see the gated datasets documentation. For more fine-grained controls, there are Enterprise plan features where organizations can create resource security groups, use SSO, and more.
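
These access levels can also be set programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed in a release that provides `HfApi.update_repo_settings`, and that you are logged in with a token that has write access. `resolve_gated_mode` and `create_dataset_repo` are hypothetical helper names.

```python
# Map the two gating options described above to the values the API expects.
GATED_MODES = {"automatic": "auto", "manual": "manual"}


def resolve_gated_mode(approval):
    """Translate an approval style into a gating value, or None for ungated."""
    if approval is None:
        return None
    if approval not in GATED_MODES:
        raise ValueError(f"approval must be one of {sorted(GATED_MODES)} or None")
    return GATED_MODES[approval]


def create_dataset_repo(repo_id, private=False, approval=None):
    """Create a dataset repository and optionally gate access to it."""
    gated = resolve_gated_mode(approval)  # validate before any network call

    # Imported here so the validation helper works without the library installed.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)
    if gated is not None:
        api.update_repo_settings(repo_id, repo_type="dataset", gated=gated)
```

For example, `create_dataset_repo("your-username/your-dataset", approval="manual")` would create a public repository whose access requests you review by hand; the repository name is a placeholder.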

Built-in Security Scanning

In addition to access controls, the Hugging Face Hub offers several security scanners:

Feature	Description
Malware Scanning	Scans files for malware and suspicious content at each commit and visit
Secrets Scanning	Blocks datasets with hardcoded secrets and environment variables
Pickle Scanning	Scans pickle files and shows vetted imports for PyTorch weights
ProtectAI	Uses Guardian tech to block datasets with pickle, Keras and other exploits

Security scanner status banner showing various security checks.

To learn more about these scanners, see the security scanners documentation.
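
To give a feel for what inspecting a pickle's imports involves, the sketch below lists the modules and names a pickle references without ever unpickling it. It uses only the Python standard library and is an illustration of the general idea, not the Hub's scanner; `list_pickle_imports` is a hypothetical name, and the handling of `STACK_GLOBAL` is a simple heuristic.

```python
import pickle
import pickletools


def list_pickle_imports(data):
    """List the (module, name) pairs a pickle references, without loading it."""
    imports = []
    recent_strings = []  # string constants seen so far, used for STACK_GLOBAL
    for opcode, arg, _position in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols store "module name" as a single argument.
            module, name = arg.split(" ", 1)
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+ pushes module and name as the two preceding strings.
            # Heuristic only: strings fetched back from the memo are missed.
            if len(recent_strings) >= 2:
                imports.append((recent_strings[-2], recent_strings[-1]))
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return imports


# A plain list references nothing; a pickled builtin references its module.
no_imports = list_pickle_imports(pickle.dumps([1, 2, 3]))
with_import = list_pickle_imports(pickle.dumps(len, protocol=4))
```

A reviewer could compare the returned pairs against an allowlist before deciding whether a file is safe to load.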

Reach and Visibility

Having a secure platform with powerful features is valuable, but the true impact of research comes from reaching the right audience. Reach and visibility are crucial for researchers sharing datasets: they help maximize research impact, enable reproducibility, facilitate collaboration, and ensure valuable data can benefit the broader scientific community.

With over 5M builders actively using the platform, the Hugging Face Hub provides researchers with powerful tools for community engagement and visibility. Here's what you can expect:

Better Community Engagement

Built-in discussion tabs for every dataset for community engagement
Organizations as a centralized place for grouping and collaborating on multiple datasets
Metrics for dataset usage and impact

Wider Reach

Access to a large, active community of researchers, developers, and practitioners
SEO-optimized URLs making your dataset easily discoverable
Integration with the broader ecosystem of models, datasets, and libraries
Clear links between your dataset and related models, papers, and demos

Improved Documentation

Customizable README files for comprehensive documentation
Support for detailed dataset descriptions and proper academic citations
Links to related research papers and publications

Screenshot of a discussion for a dataset on the Hub.
The Hub makes it easy to ask questions and discuss datasets.

How can I host my dataset on the Hugging Face Hub?

Now that you understand the benefits of hosting your dataset on the Hub, you might be wondering how to get started. Here are some comprehensive resources to guide you through the process:

The following pages will be useful if you want to share large datasets:

If you need any further help uploading a dataset to the Hub, or want to upload a particularly large dataset, please contact datasets@huggingface.co.
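
As a starting point, the sketch below shows one way to turn in-memory records into a dataset and upload it. It assumes the 🤗 Datasets library is installed and that you are logged in to the Hub; `rows_to_columns` and `push_rows` are hypothetical helpers, and the sample records are made up for illustration.

```python
def rows_to_columns(rows):
    """Convert a list of row dicts into a dict of column lists.

    Assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def push_rows(rows, repo_id, private=True):
    """Create a dataset from rows and upload it to the Hub."""
    # Imported here so rows_to_columns works without the library installed.
    from datasets import Dataset

    dataset = Dataset.from_dict(rows_to_columns(rows))
    # private=True keeps the repository hidden until you choose to publish it.
    dataset.push_to_hub(repo_id, private=private)


columns = rows_to_columns(
    [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
)
```

Calling `push_rows(rows, "your-username/your-dataset")` would then upload the data, where the repository name is a placeholder.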
