Rearchitecting Hugging Face Uploads and Downloads

As part of the Xet team's work to improve the Hugging Face Hub's storage backend, we analyzed a 24-hour window of upload requests to the Hub to better understand access patterns. On October 11, 2024, we saw:

  • Uploads from 88 countries
  • 8.2 million upload requests
  • 130.8 TB of data transferred

The map below visualizes this activity, with countries colored by bytes uploaded per hour.

Animated view of uploads

Currently, uploads are stored in an S3 bucket in us-east-1 and optimized using S3 Transfer Acceleration, while downloads are cached and served through AWS CloudFront as a CDN. CloudFront's 400+ edge locations provide global coverage and low-latency data transfers. However, like most CDNs, it is optimized for web content and imposes a file size limit of 50 GB.

While this size restriction is reasonable for typical web file transfers, the ever-growing size of files in model and dataset repositories presents a challenge. For example, the weights of meta-llama/Meta-Llama-3-70B total 131 GB and are split across 30 files to satisfy the Hub's recommendation of chunking weights into 20 GB segments. Moreover, enabling advanced deduplication or compression techniques for both uploads and downloads requires reimagining how we handle file transfers.



A Custom Protocol for Uploads and Downloads

To push Hugging Face infrastructure beyond its current limits, we're redesigning the Hub's upload and download architecture. We plan to insert a content-addressed store (CAS) as the first stop for content distribution. This allows us to implement a custom protocol built on a guiding philosophy of dumb reads and smart writes. Unlike Git LFS, which treats files as opaque blobs, our approach analyzes files at the byte level, uncovering opportunities to improve transfer speeds for the massive files found in model and dataset repositories.

The read path prioritizes simplicity and speed to ensure high throughput with minimal latency. Requests for a file are routed to a CAS server, which provides reconstruction information. The data itself remains backed by an S3 bucket in us-east-1, with AWS CloudFront continuing to serve as the CDN for downloads.
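To make the read path concrete, here is a minimal sketch of what a client-side download could look like under this design. The endpoint path, the shape of the reconstruction response, and the field names are illustrative assumptions, not the actual huggingface_hub or CAS API.

```python
import requests

def download_file(cas_url: str, file_hash: str, out_path: str) -> None:
    """Illustrative read path: ask CAS how to reconstruct a file, then fetch
    the referenced byte ranges from the CDN and reassemble them locally."""
    # 1. Ask the CAS server for reconstruction info (hypothetical endpoint).
    reconstruction = requests.get(f"{cas_url}/reconstruction/{file_hash}").json()

    # 2. Fetch each term (a byte range of a stored blob) from CloudFront.
    with open(out_path, "wb") as out:
        for term in reconstruction["terms"]:
            resp = requests.get(
                term["cdn_url"],
                headers={"Range": f"bytes={term['start']}-{term['end']}"},
            )
            resp.raise_for_status()
            out.write(resp.content)

if __name__ == "__main__":
    # Hypothetical usage; the URL and hash are placeholders.
    download_file("https://cas.example", "abc123", "model.safetensors")
```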

The write path is more complex, in order to optimize upload speeds and provide additional security guarantees. Like reads, upload requests are routed to a CAS server, but instead of querying at the file level, we operate on chunks. As matches are found, the CAS server instructs the client (e.g., huggingface_hub) to transfer only the necessary (new) chunks. The chunks are validated by CAS before being uploaded to S3.
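As a rough sketch of what "smart writes" means in practice, the client could chunk the file locally, ask the CAS server which chunk hashes it has never seen, and upload only those. The fixed-size chunking and endpoint paths below are simplifications for illustration; a production protocol would lean on content-defined chunking and its own wire format.

```python
import hashlib
import requests

CHUNK_SIZE = 64 * 1024  # illustrative fixed-size chunks; real chunking is content-defined

def upload_file(cas_url: str, path: str) -> None:
    # 1. Split the file into chunks and hash each one.
    chunks = []
    with open(path, "rb") as f:
        while data := f.read(CHUNK_SIZE):
            chunks.append((hashlib.sha256(data).hexdigest(), data))

    # 2. Ask the CAS server which chunks it has never seen (hypothetical endpoint).
    missing = requests.post(
        f"{cas_url}/chunks/missing", json=[h for h, _ in chunks]
    ).json()["missing"]

    # 3. Upload only the new chunks; the server validates each hash
    #    before persisting it to S3.
    for chunk_hash, data in chunks:
        if chunk_hash in missing:
            requests.put(f"{cas_url}/chunks/{chunk_hash}", data=data).raise_for_status()
```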

There are many implementation details to address, such as network constraints and storage overhead, which we'll cover in future posts. For now, let's look at the current state of things. The first diagram below shows the read and write paths as they look today:

Old read and write sequence diagram
Reads are represented on the left; writes are on the right. Note that writes go directly to S3 without any intermediary.

Meanwhile, in the new design, reads will take the following path:

New read path in proposed architecture
New read path with a content-addressed store (CAS) providing reconstruction information. CloudFront continues to act as the CDN.

and finally, here is the updated write path:

New write path in proposed architecture
New write path with CAS speeding up and validating uploads. S3 continues to provide backing storage.

By managing files at the byte level, we can tailor optimizations to different file formats. For example, we have explored improving the dedupe-ability of Parquet files, and are now investigating compression for tensor files (e.g., Safetensors), which has the potential to shave 10-25% off upload times. As new formats emerge, we're uniquely positioned to develop further enhancements that improve the development experience on the Hub.
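To give a feel for the kind of investigation involved, here is one rough way to probe how compressible a tensor file is, using the generic zstandard library as a stand-in. This is not the compression scheme we are building, and actual savings depend heavily on the dtype and model; it's simply a quick measurement sketch.

```python
import zstandard as zstd  # pip install zstandard

def compression_ratio(path: str, level: int = 3) -> float:
    """Rough probe of how compressible a tensor file (e.g., a .safetensors
    shard) is with zstandard; results vary widely by dtype and model."""
    with open(path, "rb") as f:
        raw = f.read()
    compressed = zstd.ZstdCompressor(level=level).compress(raw)
    return 1 - len(compressed) / len(raw)

# Example (hypothetical file name):
# print(f"{compression_ratio('model-00001-of-00030.safetensors'):.1%} smaller")
```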

This protocol also introduces significant improvements for enterprise customers and power users. Inserting a control plane for file transfers provides added guarantees that malicious or invalid data cannot be uploaded. Operationally, uploads are no longer a black box: enhanced telemetry provides audit trails and detailed logging, enabling the Hub infrastructure team to identify and resolve issues quickly and efficiently.



Designing for Global Access

To support this custom protocol, we need to determine the optimal geographic distribution for the CAS service. We initially considered AWS Lambda@Edge for its extensive global coverage, which would help minimize round-trip time. However, its reliance on CloudFront triggers made it incompatible with our updated upload path. Instead, we opted to deploy CAS nodes in a select few of AWS's 34 regions.

Taking a closer look at our 24-hour window of S3 PUT requests, we identified global traffic patterns that reveal the distribution of data uploads to the Hub. As expected, the majority of activity comes from North America and Europe, with continuous, high-volume uploads throughout the day. The data also highlights a strong and growing presence in Asia. By focusing on these core regions, we can place our CAS points of presence to balance storage and network resources while minimizing latency.

Pareto chart of uploads

While AWS offers 34 regions, our goal is to keep infrastructure costs reasonable while maintaining a high-quality user experience. Of the 88 countries represented in this snapshot, the Pareto chart above shows that the top 7 countries account for 80% of uploaded bytes, while the top 20 countries contribute 95% of total upload volume and requests.
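For the curious, the Pareto breakdown is straightforward to reproduce from per-country upload totals. The sketch below assumes a plain (country, bytes_uploaded) table rather than our actual request logs:

```python
import pandas as pd

def pareto(df: pd.DataFrame) -> pd.Series:
    """Cumulative share of uploaded bytes by country, largest first."""
    by_country = (
        df.groupby("country")["bytes_uploaded"].sum().sort_values(ascending=False)
    )
    return (by_country.cumsum() / by_country.sum()).rename("cumulative_share")

# e.g. pareto(uploads_df).head(7) -> cumulative share of the top 7 countries
# (roughly 0.80 in the 24-hour window described above)
```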

The United States emerges as the primary source of upload traffic, necessitating a PoP in this region. In Europe, most activity is concentrated in central and western countries (e.g., Luxembourg, the UK, and Germany), though there is some additional activity to account for in Africa (specifically Algeria, Egypt, and South Africa). Asia's upload traffic is primarily driven by Singapore, Hong Kong, Japan, and South Korea.

If we use a simple heuristic to distribute traffic, we can divide our CAS coverage into three major regions (see the sketch after this list):

  • us-east-1: Serving North and South America
  • eu-west-3: Serving Europe, the Middle East, and Africa
  • ap-southeast-1: Serving Asia and Oceania
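
A minimal version of that heuristic, expressed as a lookup from a request's continent to its CAS region; the continent labels and the default fallback are illustrative assumptions, not a latency-measured assignment:

```python
# Simple continent-to-PoP routing heuristic (illustrative only).
CAS_REGION_BY_CONTINENT = {
    "North America": "us-east-1",
    "South America": "us-east-1",
    "Europe": "eu-west-3",
    "Middle East": "eu-west-3",
    "Africa": "eu-west-3",
    "Asia": "ap-southeast-1",
    "Oceania": "ap-southeast-1",
}

def cas_region_for(continent: str) -> str:
    """Route a request to a CAS PoP by continent; default to us-east-1."""
    return CAS_REGION_BY_CONTINENT.get(continent, "us-east-1")
```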

This ends up being quite effective. The US and Europe account for 78.4% of uploaded bytes, while Asia accounts for 21.6%.

New AWS mapping

This regional breakdown results in a well-balanced load across our three CAS PoPs, with additional capacity for growth in ap-southeast-1 and the flexibility to scale up in us-east-1 and eu-west-3 as needed.

Based on expected traffic, we plan to allocate resources as follows:

  • us-east-1: 4 nodes
  • eu-west-3: 4 nodes
  • ap-southeast-1: 2 nodes



Validating and Vetting

Even though we're increasing the first-hop distance for some users, the overall impact on bandwidth across the Hub will be limited. Our estimates predict that while the cumulative bandwidth for all uploads will decrease from 48.5 Mbps to 42.5 Mbps (a 12% reduction), the performance hit will be more than offset by other system optimizations.

We're currently working toward moving our infrastructure into production by the end of 2024, starting with a single CAS in us-east-1. From there, we'll begin duplicating internal repositories to our new storage system to benchmark transfer performance, and then replicate our CAS to the additional PoPs mentioned above for more benchmarking. Based on those results, we'll continue to optimize our approach to ensure that everything works smoothly when our storage backend is fully in place next year.



Beyond the Bytes

As we continue this analysis, new opportunities for deeper insights are emerging. Hugging Face hosts one of the largest collections of data from the open-source machine learning community, providing a unique vantage point to explore the modalities and trends driving AI development around the world.

For example, future analyses could classify models uploaded to the Hub by use case (such as NLP, computer vision, robotics, or large language models) and examine geographic trends in ML activity. This data not only informs our infrastructure decisions but also provides a lens into the evolving landscape of machine learning.

We invite you to explore our current findings in more detail! Visit our interactive Space to see the upload distribution in your region, and follow our team to hear more about what we're building.


