Enabling communities to collectively build better datasets together using Argilla and Hugging Face Spaces

Daniel van Strien, Daniel Vila


Recently, Argilla and Hugging Face launched Data is Better Together, an experiment to collectively build a preference dataset of prompt rankings. In just a few days, we had:

  • 350 community contributors labeling data
  • Over 11,000 prompt rankings

See the progress dashboard for the latest stats!

This resulted in the release of 10k_prompts_ranked, a dataset consisting of 10,000 prompts with user rankings for the quality of the prompt. We want to enable many more projects like this!

In this post, we’ll discuss why we think it’s essential for the community to collaborate on building datasets and share an invitation to join the first cohort of communities Argilla and Hugging Face will support to develop better datasets together!
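To make the shape of such a dataset concrete, here is a minimal sketch of how several community ratings for the same prompt might be combined into one quality score. The record layout, the 1 to 5 scale, the sample prompts, and the `aggregate_ratings` helper are all assumptions for illustration; they are not taken from the released dataset or from Argilla’s API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotation records: (prompt, annotator, rating on an assumed 1-5 scale).
annotations = [
    ("Explain recursion to a child.", "user_a", 4),
    ("Explain recursion to a child.", "user_b", 5),
    ("write stuff", "user_a", 1),
    ("write stuff", "user_c", 2),
    ("Summarise this article in three bullets.", "user_b", 4),
]


def aggregate_ratings(records):
    """Group ratings by prompt and return the mean rating and the number of ratings."""
    by_prompt = defaultdict(list)
    for prompt, _annotator, rating in records:
        by_prompt[prompt].append(rating)
    return {
        prompt: {"avg_rating": mean(ratings), "num_ratings": len(ratings)}
        for prompt, ratings in by_prompt.items()
    }


summary = aggregate_ratings(annotations)

# List prompts from highest to lowest average rating.
for prompt, stats in sorted(
    summary.items(), key=lambda item: item[1]["avg_rating"], reverse=True
):
    print(f'{stats["avg_rating"]:.1f} ({stats["num_ratings"]} ratings)  {prompt}')
```

Averaging is only one possible choice. A real project might prefer a median, or might keep only prompts that received a minimum number of ratings.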



Data remains essential for better models

Data continues to be essential for better models: we see continued evidence from published research, open-source experiments, and the open-source community that better data can lead to better models.

Screenshot of datasets in the Hugging Face Hub
The question.

Screenshot of datasets in the Hugging Face Hub
A frequent answer.



Why build datasets collectively?

Data is vital for machine learning, but many languages, domains, and tasks still lack high-quality datasets for training, evaluating, and benchmarking. The community already shares thousands of models, datasets, and demos daily via the Hugging Face Hub. As a result of collaboration, the open-access AI community has created amazing things. Enabling the community to build datasets collectively will unlock unique opportunities for building the next generation of datasets to build the next generation of models.

Empowering the community to build and improve datasets collectively will allow people to:

  • Contribute to the development of open-source ML with no ML or programming skills required.
  • Create chat datasets for a particular language.
  • Develop benchmark datasets for a specific domain.
  • Create preference datasets from a diverse range of participants.
  • Build datasets for a particular task.
  • Build completely new types of datasets collectively as a community.

Importantly, we believe that building datasets collectively will allow the community to build better datasets and allow people who don’t know how to code to contribute to the development of AI.



Making it easy for people to contribute

One of the challenges for many previous efforts to build AI datasets collectively was setting up an efficient annotation task. Argilla is an open-source tool that can help create datasets for LLMs and smaller specialised task-specific models. Hugging Face Spaces is a platform for building and hosting machine learning demos and applications. Recently, Argilla added support for authentication via a Hugging Face account for Argilla instances hosted on Spaces. This means it now takes seconds for users to start contributing to an annotation task.

Now that we have stress-tested this new workflow when creating the 10k_prompts_ranked dataset, we want to support the community in launching new collective dataset efforts.



Join our first cohort of communities who want to build better datasets together!

We’re very excited about the possibilities unlocked by this new, simple flow for hosting annotation tasks. To support the community in building better datasets, Hugging Face and Argilla invite interested people and communities to join our initial cohort of community dataset builders.

People joining this cohort will:

  • Be supported in creating an Argilla Space with Hugging Face authentication. Hugging Face will grant free persistent storage and improved CPU Spaces for participants.
  • Have their comms and promotion of the initiative amplified by Argilla and Hugging Face.
  • Be invited to join a cohort community channel.

Our goal is to support the community in building better datasets together. We’re open to many ideas and want to help as far as we can.



What types of projects are we looking for?

We’re open to supporting many types of projects, especially those of existing open-source communities. We’re particularly interested in projects focusing on building datasets for languages, domains, and tasks that are currently underrepresented in the open-source community. Our only current limitation is that we’re primarily focused on text-based datasets. If you have a very cool idea for multimodal datasets, we’d love to hear from you, but we may not be able to support you in this first cohort.

Tasks can either be fully open or open to members of a specific Hugging Face Hub organization.

If you want to be part of the first cohort, please join us in the #data-is-better-together channel in the Hugging Face Discord and let us know what you want to build together!

We’re looking forward to building better datasets together with you!


