Announcing New Dataset Search Features

The AI and ML community has shared more than 180,000 public datasets on the Hugging Face Dataset Hub.
Researchers and engineers are using these datasets for various tasks, from training LLMs to talk with users to evaluating automatic speech recognition or computer vision systems.
Dataset discoverability and visualization are key challenges to letting AI builders find, explore, and transform datasets to suit their use cases.

At Hugging Face, we’re building the Dataset Hub as the place for the community to collaborate on open datasets.
So we built tools like Dataset Search and the Dataset Viewer, as well as a rich open source ecosystem of tools.
Today we’re announcing four new features that will take Dataset Search on the Hub to the next level.

Search by Modality

The modality of a dataset corresponds to the type of data contained in the dataset. For example, the most common types of data on Hugging Face are text, image, audio, and tabular data.

We released a set of filters that lets you find datasets that have one or several modalities from this list:

Text
Image
Audio
Tabular
Time-Series
3D
Video
Geospatial

For example, it is possible to search for datasets that contain both text and image data.

The modalities of each dataset are automatically detected based on file contents and extensions.
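
As a rough illustration of the extension-based part of that detection, the sketch below maps file extensions to modalities. The extension table and the `detect_modalities` helper are assumptions made for this example; they do not reflect the Hub's actual detection logic, which also inspects file contents.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-modality table, for illustration only.
EXTENSION_TO_MODALITY = {
    ".txt": "text",
    ".png": "image",
    ".jpg": "image",
    ".wav": "audio",
    ".mp3": "audio",
    ".csv": "tabular",
    ".mp4": "video",
}

def detect_modalities(file_paths):
    """Return the sorted list of modalities suggested by file extensions."""
    modalities = set()
    for path in file_paths:
        suffix = PurePosixPath(path).suffix.lower()
        modality = EXTENSION_TO_MODALITY.get(suffix)
        if modality is not None:
            modalities.add(modality)
    return sorted(modalities)

files = ["train/0001.jpg", "train/0001.txt", "README.md"]
print(detect_modalities(files))  # ['image', 'text']
```

Files with an unknown extension, such as `README.md` here, are simply ignored by this sketch.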

Search by Size

We recently released a new feature in the interface that shows the number of rows of each dataset.

Following this, it’s now possible to search datasets by number of rows by specifying a minimum and maximum number of rows.
This lets you look for anything from small datasets to the biggest datasets that exist (for example, those used to pretrain LLMs).

The number of rows is available for all the datasets in supported formats.
Even for the biggest datasets, for which the number of rows is not included in the metadata, the total number of rows is estimated accurately based on the content of the first 5GB.
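
One simple way to produce such an estimate is linear extrapolation: count the rows in the sampled prefix and scale by the ratio of total size to sampled size. The sketch below assumes that approach. The function name and the example numbers are hypothetical, and the Hub's actual estimator may work differently.

```python
def estimate_total_rows(rows_in_sample, sample_bytes, total_bytes):
    """Extrapolate a row count from a prefix of the dataset.

    Assumes rows in the sampled prefix have the same average size
    as rows in the rest of the dataset.
    """
    if sample_bytes <= 0:
        raise ValueError("sample_bytes must be positive")
    if total_bytes <= sample_bytes:
        # The sample already covers the whole dataset.
        return rows_in_sample
    return round(rows_in_sample * total_bytes / sample_bytes)

GB = 10**9
# Hypothetical numbers: 20 million rows in the first 5 GB of a 100 GB dataset.
print(estimate_total_rows(20_000_000, 5 * GB, 100 * GB))  # 400000000
```

The estimate is only as good as the assumption that the first 5GB is representative of the rest of the data.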

For example, if you want the datasets with the highest number of rows on Hugging Face, you can look for datasets with more than 10B (10¹⁰) rows.

Search by Format

The same dataset can be stored in many different formats.
For example, text datasets are often in Parquet or JSON Lines, but they could be in text files, and image datasets are often a single directory of images, but they could be in WebDataset format (a format based on TAR archives).

Each format has its pros and cons.
For example, Parquet offers nested data support (unlike CSV), efficient filtering and analytics, and a good compression ratio, but accessing one specific row requires decoding a full row group.
Another example is WebDataset, which offers the highest data streaming speed but lacks some metadata, such as the number of rows per file, which is often needed to efficiently distribute data in multi-node training setups.
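
To see why per-file row counts matter, consider assigning shards to training nodes so that each node processes a similar number of rows. The sketch below uses a greedy heuristic (largest shard first, placed on the least-loaded node). The shard names and row counts are made up, and this is not a description of how any particular training framework distributes data.

```python
import heapq

def assign_shards(shard_rows, num_nodes):
    """Greedily balance shards across nodes by row count.

    shard_rows: dict mapping shard name -> number of rows.
    Returns a list with one (total_rows, [shard names]) tuple per node.
    """
    # Min-heap of (rows assigned so far, node index).
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_nodes)]
    totals = [0] * num_nodes
    # Place the largest shards first, each on the least-loaded node.
    for name, rows in sorted(shard_rows.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[node].append(name)
        totals[node] = load + rows
        heapq.heappush(heap, (totals[node], node))
    return list(zip(totals, assignment))

# Hypothetical shards with uneven row counts.
shards = {"shard-0.tar": 900, "shard-1.tar": 500, "shard-2.tar": 400, "shard-3.tar": 100}
for total, names in assign_shards(shards, num_nodes=2):
    print(total, names)
```

Without row counts, shards can only be split by file count or byte size, which can leave some nodes with far more rows than others.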

The dataset format, therefore, indicates which use cases are favoured and whether you will need to reformat the data to fit your needs.

For example, you can filter the Hub to see only the datasets in WebDataset format.

Search by Library

There are many good libraries and tools to load datasets and prepare them for training, like Pandas, Dask, or the 🤗 Datasets library.
The Hub lets you use your favorite tools and filter datasets compatible with any library. For example, you can look for datasets compatible with Pandas.

Dataset compatibility is based on the dataset format and size (e.g., Dask can load big JSON Lines datasets, unlike Pandas, which requires loading the full dataset in memory).
In addition to this, we also provide the code snippet to load any dataset in your favorite tool.

If you would like your library to appear in the list of supported libraries, feel free to open a discussion on huggingface.js!

Combine filters

These four new Dataset Search tools can be used together and with the other existing filters like Language, Tasks, and Licenses.
By combining those filters with the text search bar, you can find the specific dataset you are looking for.
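
The same kind of combination can be sketched programmatically. The helper below builds a list of tags from the chosen modalities, formats, and libraries. The `build_filter_tags` helper and the `prefix:value` tag naming (for example `modality:text`) are assumptions of this sketch, so check the tags shown on a dataset page before relying on them.

```python
def build_filter_tags(modalities=(), formats=(), libraries=()):
    """Build a list of Hub-style tags such as "modality:text".

    The "prefix:value" tag naming is an assumption of this sketch.
    """
    tags = [f"modality:{m.lower()}" for m in modalities]
    tags += [f"format:{f.lower()}" for f in formats]
    tags += [f"library:{lib.lower()}" for lib in libraries]
    return tags

tags = build_filter_tags(
    modalities=["Text", "Image"],
    formats=["Parquet"],
    libraries=["Pandas"],
)
print(tags)

# With the huggingface_hub client installed, the tags could then be passed
# to the Hub API together with a free-text query (requires network access):
#
#   from huggingface_hub import list_datasets
#   for ds in list_datasets(filter=tags, search="captions", limit=5):
#       print(ds.id)
```

The search term `"captions"` is only a placeholder for whatever you would type in the text search bar.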
