If you're working on data-intensive research or machine learning projects, you need a reliable way to share and host your datasets. Public datasets such as Common Crawl, ImageNet, Common Voice and more are critical to the open ML ecosystem, yet they can be difficult to host and share.
Hugging Face Hub makes it seamless to host and share datasets, trusted by many leading research institutions, corporations, and government agencies, including Nvidia, Google, Stanford, NASA, THUDM and Barcelona Supercomputing Center.
By hosting a dataset on the Hugging Face Hub, you get immediate access to features that can maximize your work's impact:
Generous Limits
Support for big datasets
The Hub can host terabyte-scale datasets, with high per-file and per-repository limits. If you have data to share, the Hugging Face datasets team can help suggest the best format for uploading your data for community usage.
The 🤗 Datasets library makes it easy to upload and download your files, and even create a dataset from scratch. 🤗 Datasets also enables dataset streaming, making it possible to work with large datasets without downloading the whole thing. This can be invaluable for researchers with fewer computational resources who want to work with your datasets, or for selecting small portions of a huge dataset for testing, development or prototyping.
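As a rough illustration of streaming, the sketch below reads only the first few rows of a dataset instead of downloading it. It assumes the `datasets` package is installed and uses `load_dataset(..., streaming=True)`. The helper names `take` and `stream_split` are made up for this example, and the `train` split name is an assumption.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def take(rows: Iterable[Any], n: int) -> list[Any]:
    """Return the first n items; later items are never fetched."""
    return list(islice(rows, n))


def stream_split(repo_id: str, split: str = "train") -> Iterator[dict]:
    """Lazily iterate over one split of a Hub dataset (hypothetical helper)."""
    # Imported lazily so `take` stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches rows on demand.
    return iter(load_dataset(repo_id, split=split, streaming=True))


# Example (needs network access and `pip install datasets`):
# first_rows = take(stream_split("neuralwork/arxiver"), 5)
```

Because `take` stops after `n` rows, only the data needed for those rows should be transferred, which is what makes this pattern practical for prototyping on very large datasets.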

The Hugging Face Hub can host the big datasets often created for machine learning research.
Note: The Xet team is currently working on a backend update that will increase per-file limits from the current 50 GB to 500 GB while also improving storage and transfer efficiency.
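Before uploading, it can help to check whether any local file exceeds the per-file limit. The following is a minimal sketch using only the standard library. It assumes the 50 GB figure above means decimal gigabytes, and `oversized_files` is a hypothetical helper, not part of any Hugging Face library.

```python
from pathlib import Path

# Assumption: the 50 GB limit from the note, read as decimal gigabytes.
PER_FILE_LIMIT_BYTES = 50 * 10**9


def oversized_files(folder: str, limit_bytes: int = PER_FILE_LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for each file larger than limit_bytes."""
    root = Path(folder)
    found = []
    # Walk the folder recursively in a stable order.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                found.append((path.relative_to(root).as_posix(), size))
    return found
```

Any file this reports would need to be split into smaller shards before upload, for example into several Parquet files.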
Dataset Viewer
Beyond just hosting your data, the Hub provides powerful tools for exploration. With the Datasets Viewer, users can explore and interact with datasets hosted on the Hub directly in their browser. This provides an easy way for others to view and explore your data without downloading it first.
Hugging Face datasets supports many different modalities (audio, images, video, etc.), file formats (CSV, JSON, Parquet, etc.), and compression formats (Gzip, Zip, etc.). Check out the Datasets File Formats page for more details.

The Dataset Viewer for the Infinity-Instruct dataset.
The Datasets Viewer also includes a couple of features which make it easier to explore a dataset.
Full Text Search
Built-in Full Text Search is one of the most powerful features of the Datasets Viewer. Any text columns in a dataset immediately become searchable.
The Arxiver dataset contains 63.4k rows of arXiv research papers converted to Markdown. By using Full Text Search, it's easy to find the papers containing a specific author such as Ilya Sutskever below.
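The same kind of search can also be scripted against the dataset viewer API's `/search` endpoint. The sketch below only builds the request URL and leaves the network call to a separate function. The `default` config name and `train` split are assumptions about the Arxiver dataset, and `build_search_url` and `search` are illustrative names.

```python
from urllib.parse import urlencode

# Public endpoint of the dataset viewer API.
SEARCH_ENDPOINT = "https://datasets-server.huggingface.co/search"


def build_search_url(dataset, query, config="default", split="train", offset=0, length=10):
    """Build a full-text search URL for one split of a Hub dataset."""
    params = {
        "dataset": dataset,
        "config": config,  # assumption: the dataset uses the "default" config
        "split": split,    # assumption: the rows live in a "train" split
        "query": query,
        "offset": offset,
        "length": length,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def search(dataset, query, **kwargs):
    """Run the search and return the decoded JSON response (needs network access)."""
    import json
    from urllib.request import urlopen

    with urlopen(build_search_url(dataset, query, **kwargs)) as response:
        return json.load(response)


# Example: search("neuralwork/arxiver", "Ilya Sutskever")
```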
Sorting
The Datasets Viewer lets you sort the dataset by clicking on the column headers. This makes it easy to find the most relevant examples in a dataset.
Below is an example of a dataset sorted by the helpfulness column in descending order for the HelpSteer2 dataset.
Third Party Library Support
Hugging Face is fortunate to have third-party integrations with the leading open source data tools. Hosting a dataset on the Hub immediately makes it compatible with the tools users are most familiar with.
Here are some of the libraries Hugging Face supports out of the box:
| Library | Description | Monthly PyPI Downloads (2024) |
|---|---|---|
| Pandas | Python data analysis toolkit. | 258M |
| Spark | Real-time, large-scale data processing tool in a distributed environment. | 29M |
| Datasets | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | 17M |
| Dask | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | 12M |
| Polars | A DataFrame library on top of an OLAP query engine. | 8.5M |
| DuckDB | In-process SQL OLAP database management system. | 6M |
| WebDataset | Library to write I/O pipelines for large datasets. | 871K |
| Argilla | Collaboration tool for AI engineers and domain experts that value high quality data. | 400K |
Most of these libraries enable you to load or stream a dataset in a single line of code.
Here are some examples with Pandas, Polars and DuckDB:
```python
import pandas as pd

df = pd.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")
```

```python
import polars as pl

df = pl.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")
```

```python
import duckdb

duckdb.sql("SELECT * FROM 'hf://datasets/neuralwork/arxiver/data/train.parquet' LIMIT 10")
```
You can find more information about integrated libraries in the Datasets documentation. In addition to the libraries listed above, there are many more community-supported tools that support the Hugging Face Hub, such as Lilac and Spotlight.
SQL Console
The SQL Console provides an interactive SQL editor that runs entirely in your browser, enabling instant data exploration without any setup. Key features include:
- One-Click: Open a SQL Console to query a dataset with a single click
- Shareable and Embeddable Results: Share and embed interesting query results
- Full DuckDB Syntax: Use full SQL syntax with built-in functions for regex, lists, JSON, embeddings, and more
On every public dataset you should see a new SQL Console badge. With just one click you can open a SQL Console to query that dataset.
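To get a feel for the DuckDB regex functions outside the browser, a query like the one below can be run locally with the `duckdb` Python package against an `hf://` path. This is a sketch rather than what the SQL Console does internally. The `authors` column used in the example is a hypothetical name, and the helper functions are illustrative.

```python
def hf_parquet_url(repo_id: str, path_in_repo: str) -> str:
    """Build an hf:// URL for a Parquet file stored in a Hub dataset repository."""
    return f"hf://datasets/{repo_id}/{path_in_repo}"


def regex_count_query(parquet_url: str, column: str, pattern: str) -> str:
    """Build a DuckDB query counting the rows whose column matches a regex."""
    # Single quotes are doubled so the pattern stays a valid SQL string literal.
    escaped = pattern.replace("'", "''")
    return (
        f"SELECT count(*) AS matches FROM '{parquet_url}' "
        f"WHERE regexp_matches(\"{column}\", '{escaped}')"
    )


def run_query(sql: str):
    """Execute the query with DuckDB (needs `pip install duckdb` and network access)."""
    import duckdb

    return duckdb.sql(sql).fetchall()


# Example, assuming the dataset has an "authors" column:
# url = hf_parquet_url("neuralwork/arxiver", "data/train.parquet")
# run_query(regex_count_query(url, "authors", "Sutskever"))
```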
Security
While making datasets accessible is important, protecting sensitive data is equally crucial. The Hugging Face Hub provides robust security features to help you maintain control over your data while sharing it with the right audiences.
Access Controls
The Hugging Face Hub offers several access control options that determine who can access a dataset:
- Public: Anyone can access the dataset.
- Private: Only you and people in your organization can access the dataset.
- Gated: Control access to your dataset through two options:
  - Automatic Approval: Users must provide required information (like name and email) and agree to terms before gaining access
  - Manual Approval: You review and manually approve or reject each access request
For more details about gated datasets, see the gated datasets documentation. For more fine-grained controls, there are Enterprise plan features where organizations can create resource security groups, use SSO, and more.
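These access levels can also be set programmatically. The sketch below maps each option to keyword arguments for `HfApi.update_repo_settings` from the `huggingface_hub` package. It assumes a recent version of that package in which the method accepts both `private` and `gated`, and a logged-in user with write access to the repository. The level names such as `gated-auto` are labels invented for this example.

```python
from typing import Any

# Hypothetical labels mapped to update_repo_settings keyword arguments.
ACCESS_LEVELS = {
    "public": {"private": False, "gated": False},
    "private": {"private": True},
    "gated-auto": {"private": False, "gated": "auto"},
    "gated-manual": {"private": False, "gated": "manual"},
}


def settings_for(level: str) -> dict[str, Any]:
    """Return a copy of the settings for an access level, or raise ValueError."""
    try:
        return dict(ACCESS_LEVELS[level])
    except KeyError:
        raise ValueError(f"unknown access level: {level!r}") from None


def apply_access_level(repo_id: str, level: str) -> None:
    """Apply the access level to a dataset repository (needs authentication)."""
    from huggingface_hub import HfApi

    HfApi().update_repo_settings(repo_id=repo_id, repo_type="dataset", **settings_for(level))


# Example: apply_access_level("your-username/your-dataset", "gated-manual")
```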
Built-in Security Scanning
In addition to access controls, the Hugging Face Hub offers several security scanners:
| Feature | Description |
|---|---|
| Malware Scanning | Scans files for malware and suspicious content at each commit and visit |
| Secrets Scanning | Blocks datasets with hardcoded secrets and environment variables |
| Pickle Scanning | Scans pickle files and shows vetted imports for PyTorch weights |
| ProtectAI | Uses Guardian tech to block datasets with pickle, Keras and other exploits |

Reach and Visibility
Having a secure platform with powerful features is valuable, but the true impact of research comes from reaching the right audience. Reach and visibility are crucial for researchers sharing datasets: they help maximize research impact, enable reproducibility, facilitate collaboration, and ensure valuable data can benefit the broader scientific community.
With over 5M builders actively using the platform, the Hugging Face Hub provides researchers with powerful tools for community engagement and visibility. Here's what you can expect:
Better Community Engagement
- Built-in discussion tabs for every dataset for community engagement
- Organizations as a centralized place for grouping and collaborating on multiple datasets
- Metrics for dataset usage and impact
Wider Reach
- Access to a large, active community of researchers, developers, and practitioners
- SEO-optimized URLs making your dataset easily discoverable
- Integration with the broader ecosystem of models, datasets, and libraries
- Clear links between your dataset and related models, papers, and demos
Improved Documentation
- Customizable README files for comprehensive documentation
- Support for detailed dataset descriptions and proper academic citations
- Links to related research papers and publications

The Hub makes it easy to ask questions and discuss datasets.
How can I host my dataset on the Hugging Face Hub?
Now that you understand the benefits of hosting your dataset on the Hub, you might be wondering how to get started. Here are some comprehensive resources to guide you through the process:
The following pages will be useful if you want to share large datasets:
If you need any further help uploading a dataset to the Hub or want to upload a particularly large dataset, please contact datasets@huggingface.co.
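For a first upload, one possible starting point is the `huggingface_hub` package. The sketch below creates a dataset repository if needed and uploads a local folder to it using `HfApi.create_repo` and `HfApi.upload_folder`. It assumes you are already authenticated (for example via `huggingface-cli login`). The function name and the repository id in the example are placeholders.

```python
from pathlib import Path


def upload_dataset_folder(folder: str, repo_id: str, private: bool = True) -> None:
    """Create a dataset repository if needed and upload a local folder to it."""
    root = Path(folder)
    # Fail early, before any network call, if the folder is missing.
    if not root.is_dir():
        raise ValueError(f"not a directory: {folder}")

    # Imported lazily so the check above works without the package installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # exist_ok=True makes the call a no-op when the repository already exists.
    api.create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)
    api.upload_folder(folder_path=str(root), repo_id=repo_id, repo_type="dataset")


# Example: upload_dataset_folder("./my_dataset", "your-username/your-dataset")
```

The repository starts private here so that the data can be checked in the Dataset Viewer before it is made public. For very large folders, recent versions of `huggingface_hub` also provide `upload_large_folder`, which may be a better fit.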
