A Look Back and Forward




For the past few months, we have been working on the Data Is Better Together initiative. Through this collaboration between Hugging Face and Argilla, and with the support of the open-source ML community, our goal has been to empower the open-source community to create impactful datasets collectively.

Now, we have decided to move forward with the same goal. To give an overview of our achievements and of the projects where everyone can contribute, we have organized them into two sections: community efforts and cookbook efforts.


Community efforts

Our first steps in this initiative focused on the prompt ranking project. Our goal was to create a dataset of 10K prompts, both synthetic and human-generated, ranked by quality. The community’s response was immediate!

  • In a few days, over 385 people joined.
  • We released the DIBT/10k_prompts_ranked dataset, intended for prompt ranking tasks or synthetic data generation (see the loading sketch after this list).
  • The dataset was used to build new models, such as SPIN.
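If you want to explore the dataset yourself, here is a minimal sketch using the `datasets` library. It assumes the dataset exposes a single `train` split; check the dataset card on the Hub for the exact schema.

```python
# Minimal sketch: load the community-built dataset from the Hugging Face Hub.
from datasets import load_dataset

# Assumes a single "train" split; see the dataset card to confirm.
ds = load_dataset("DIBT/10k_prompts_ranked", split="train")

print(ds)     # number of rows and column names
print(ds[0])  # inspect one ranked prompt
```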

Seeing the worldwide support from the community, we recognized that English-centric data alone is insufficient and that there are not enough language-specific benchmarks for open LLMs. So, we created the Multilingual Prompt Evaluation Project (MPEP) with the aim of developing a leaderboard for multiple languages. For that, a subset of 500 high-quality prompts from DIBT/10k_prompts_ranked was chosen to be translated into different languages (a selection sketch follows the list below).

  • More than 18 language leaders created spaces for the translations.
  • Translations were completed for Dutch, Russian, and Spanish, with many more efforts working towards complete translations of the prompts.
  • A community of dataset builders was created on Discord.
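As a rough illustration of how such a subset could be picked, here is a sketch that filters for highly rated prompts. The column names (`avg_rating`, `num_responses`) and thresholds are assumptions based on the dataset card, not the definitive recipe used to build the MPEP subset.

```python
# Sketch: select a pool of high-quality prompts for translation.
# Column names ("avg_rating", "num_responses") are assumed; adjust to the actual schema.
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")

# Keep prompts rated highly by more than one annotator, then take 500 of them.
high_quality = ds.filter(
    lambda row: row["avg_rating"] >= 4 and row["num_responses"] > 1
)
subset = high_quality.select(range(500))
print(len(subset))
```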

Going forward, we’ll continue to support community efforts focused on building datasets through tools and documentation.



Cookbook efforts

As part of DIBT, we also created guides and tools that help the community build valuable datasets on their own.

  • Domain Specific dataset: To bootstrap the creation of more domain-specific datasets for training models, bringing together engineers and domain experts.
  • DPO/ORPO dataset: To help foster a community of people building more DPO-style datasets for different languages, domains, and tasks.
  • KTO dataset: To help the community create their own KTO datasets (see the record-shape sketch after this list).
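For readers new to these formats, here is a minimal sketch of the record shapes these preference-tuning methods typically expect, following the common conventions used by libraries such as TRL; exact field names can vary by trainer version, and the example strings are purely illustrative.

```python
# DPO/ORPO: paired preferences, i.e. one chosen and one rejected completion per prompt.
dpo_example = {
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "chosen": "A list is mutable, so items can be added or removed; a tuple is immutable.",
    "rejected": "They are the same thing.",
}

# KTO: unpaired feedback, i.e. a single completion with a binary desirability label.
kto_example = {
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "completion": "A list is mutable, so items can be added or removed; a tuple is immutable.",
    "label": True,  # True = desirable, False = undesirable
}
```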



What have we learnt?

  • The community is eager to take part in these efforts, and there is excitement about collectively working on datasets.
  • There are existing inequalities that need to be overcome to ensure comprehensive and inclusive benchmarks. Datasets for certain languages, domains, and tasks are currently underrepresented in the open-source community.
  • We now have many of the tools needed for the community to effectively collaborate on building valuable datasets.



How can you get involved?

You can still contribute to the cookbook efforts by following the instructions in the README of the project you are interested in, sharing your datasets and results with the community, or providing new guides and tools for everyone. Your contributions are invaluable in helping us build a robust and comprehensive resource for all.

If you want to be part of it, please join us in the #data-is-better-together channel on the Hugging Face Discord and let us know what you want to build together!

We’re looking forward to building better datasets together with you!


