XetHub is joining Hugging Face!


We’re super excited to officially announce that Hugging Face acquired XetHub 🔥

XetHub is a Seattle-based company founded by Yucheng Low, Ajit Banerjee and Rajat Arya, who previously worked at Apple where they built and scaled Apple’s internal ML infrastructure. XetHub’s mission is to enable software engineering best practices for AI development. XetHub has developed technologies to enable Git to scale to TB repositories and to help teams explore, understand and work together on large, evolving datasets and models. The founders were soon joined by a talented team of 12. It’s best to give them a follow at their new org page: hf.co/xet-team



Our common goal at HF

The XetHub team will help us unlock the next 5 years of growth of HF datasets and models by switching to our own, better version of LFS as the storage backend for the Hub’s repos.

– Julien Chaumond, HF CTO

Back in 2020, after we built the first version of the HF Hub, we decided to build it on top of Git LFS since it was decently well-known and was a reasonable choice to bootstrap the Hub’s usage.

We knew back then, however, that we would want to switch to our own, more optimized storage and versioning backend at some point. Git LFS – even though it stands for Large File Storage – was just never meant for the type of large files we handle in AI, which are not just large, but very very large 😃



Example future use cases 🔥 – what this will enable on the Hub

Let’s say you have a 10GB Parquet file. You add a single row. Today you would need to re-upload the full 10GB. With the chunked files and deduplication from XetHub, you will only need to re-upload the few chunks containing the new row.

Another example for GGUF model files: let’s say @bartowski wants to update a single metadata value in the GGUF header of a Llama 3.1 405B repo. In the future, bartowski will only need to re-upload a single chunk of a few kilobytes, making the process much more efficient 🔥
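The idea behind both examples can be sketched with content-defined chunking: chunk boundaries are derived from the bytes themselves rather than from fixed offsets, so a small edit only changes the chunks around it, and a store keyed by chunk hash uploads only what it has not seen before. This is a minimal illustrative sketch using a simple Gear-style rolling hash – not XetHub’s actual implementation:

```python
import hashlib
import random

# Random per-byte values for a Gear-style rolling hash
# (fixed seed so boundaries are reproducible across runs).
_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def chunk(data: bytes, mask: int = 0x1FFF) -> list[bytes]:
    """Split data at content-defined boundaries: a boundary falls where
    the rolling hash's low 13 bits are all zero (avg chunk ~8 KiB), so
    boundaries move with the content, not with byte offsets."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if h & mask == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

def upload(data: bytes, store: dict) -> int:
    """Store only unseen chunks (keyed by SHA-256); return bytes sent."""
    sent = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        if key not in store:
            store[key] = c
            sent += len(c)
    return sent

store: dict[str, bytes] = {}
v1 = random.Random(0).randbytes(1_000_000)          # a ~1 MB "file"
first = upload(v1, store)                           # everything is new
v2 = v1[:500_000] + b"one new row" + v1[500_000:]   # small edit mid-file
second = upload(v2, store)                          # only chunks near the edit
```

Because the hash only looks at a short sliding window of bytes, the edit perturbs only the chunk it lands in; every chunk before and after it hashes identically, so the second upload is a few kilobytes rather than a megabyte.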

As the field moves to trillion-parameter models in the coming months (thanks Maxime Labonne for the new BigLlama-3.1-1T 🤯), our hope is that this new tech will unlock new scale both locally and within enterprise companies.

Finally, with large datasets and large models come challenges with collaboration. How do teams work together on large data, models and code? How do users understand how their data and models are evolving? We will be working to find better solutions to answer these questions.



Fun current stats on Hub repos 🤯🤯

  • number of repos: 1.3M models, 450k datasets, 680k spaces
  • total cumulative size: 12PB stored in LFS (280M files) / 7.3TB stored in git (non-LFS)
  • Hub’s daily number of requests: 1B
  • daily CloudFront bandwidth: 6PB 🤯



A private word from @ylow

I have been part of the AI/ML world for over 15 years, and have seen how deep learning has slowly taken over vision, speech, text and increasingly every data domain.

What I have severely underestimated is the power of data. What seemed like impossible tasks just a few years ago (like image generation) turned out to be possible with orders of magnitude more data, and a model with the capacity to absorb it. In hindsight, this is an ML history lesson that has repeated itself again and again.

I have been working in the data domain ever since my PhD. First in a startup (GraphLab/Dato/Turi) where I made structured data and ML algorithms scale on a single machine. Then, after it was acquired by Apple, I worked to scale AI data management to >100PB, supporting dozens of internal teams who shipped hundreds of features annually. In 2021, together with my co-founders, and supported by Madrona and other angel investors, I started XetHub to bring our learnings about achieving collaboration at scale to the world.

XetHub’s goal is to enable ML teams to operate like software teams, by scaling Git file storage to TBs, seamlessly enabling experimentation and reproducibility, and providing the visualization capabilities to understand how datasets and models evolve.

I, together with the entire XetHub team, am very excited to join Hugging Face and continue this mission to make AI collaboration and development easier – by integrating XetHub technology into the Hub – and to release these features to the largest ML community in the world!



Finally, our Infrastructure team is hiring 👯

If you like these topics and you want to build and scale the collaboration platform for the open source AI movement, get in touch!


