Accelerate 1.0.0

3.5 years ago, Accelerate was a simple framework aimed at making training on multi-GPU and TPU systems easier
by having a low-level abstraction that simplified a raw PyTorch training loop.

Since then, Accelerate has expanded into a multi-faceted library aimed at tackling many common problems with
large-scale training and large models in an age where 405 billion parameters (Llama) are the new language model size. This involves:

A flexible low-level training API, allowing for training on six different hardware accelerators (CPU, GPU, TPU, XPU, NPU, MLU) while maintaining 99% of your original training loop
An easy-to-use command-line interface aimed at configuring and running scripts across different hardware configurations
The birthplace of Big Model Inference or device_map="auto", allowing users to not only perform inference on LLMs across multiple devices but now also aiding in training LLMs on small compute through techniques like parameter-efficient fine-tuning (PEFT)

These three facets have allowed Accelerate to become the foundation of nearly every package at Hugging Face, including transformers, diffusers, peft, trl, and more!

As the package has been stable for nearly a year, we’re excited to announce that, as of today, we have published the first release candidates for Accelerate 1.0.0!

This blog will detail:

Why did we decide to do 1.0?
What is the future for Accelerate, and where do we see PyTorch as a whole going?
What are the breaking changes and deprecations that occurred, and how can you migrate over easily?

Why 1.0?

The plans to release 1.0.0 have been in the works for over a year. The API has been roughly at the point where we wanted it,
centering on the Accelerator side, simplifying much of the configuration and making it more extensible. However, we knew
there were a few missing pieces before we could call the “base” of Accelerate “feature complete”.

With the changes made for 1.0, Accelerate is prepared to tackle new tech integrations while keeping the user-facing API stable.

The future of Accelerate

Now that 1.0 is almost done, we can focus on new techniques coming out throughout the community and find integration paths into Accelerate, as we foresee some radical changes in the PyTorch ecosystem very soon:

As part of the multiple-model DeepSpeed support, we found that while DeepSpeed as it currently stands can generally work, some heavy changes to the overall API may eventually be needed as we work to support simple wrappings to prepare models for any multiple-model training scenario.
With torchao and torchtitan picking up steam, they hint at the future of PyTorch as a whole. Aiming at more native support for FP8 training, a new distributed sharding API, and support for a new version of FSDP, FSDPv2, we predict that much of the internals and general usage API of Accelerate will need to change (hopefully not too drastically) to meet these needs as the frameworks slowly become more stable.
Riding on torchao/FP8, many new frameworks are bringing in different ideas and implementations on how to make FP8 training work and be stable (transformer_engine, torchao, MS-AMP, nanotron, to name a few). Our aim with Accelerate is to house each of these implementations in one place with easy configurations to let users explore and test out each one as they please, intending to find the ones that wind up being the most stable and flexible. It’s a rapidly accelerating (no pun intended) field of research, especially with NVIDIA’s FP4 training support on the way, and we want to make sure that not only can we support each of these methods but also provide solid benchmarks for each to show their tendencies out-of-the-box (with minimal tweaking) compared to native BF16 training.

We’re incredibly excited about the future of distributed training in the PyTorch ecosystem, and we want to make sure that Accelerate is there every step of the way, providing a lower barrier to entry for these new techniques. By doing so, we hope the community will continue experimenting and learning together as we find the best methods for training and scaling larger models on more complex computing systems.

How to try it out

To try the first release candidate for Accelerate today, please use one of the following methods:

pip install --pre accelerate

docker pull huggingface/accelerate:gpu-release-1.0.0rc1

Valid release tags are:

gpu-release-1.0.0rc1
cpu-release-1.0.0rc1
gpu-fp8-transformerengine-release-1.0.0rc1
gpu-deepspeed-release-1.0.0rc1
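After a pip install, it can be useful to confirm that the pre-release was actually picked up rather than the latest stable version. The sketch below is one way to check; the helper names `installed_version` and `is_release_candidate` are hypothetical and not part of Accelerate, and the check relies only on the PEP 440 convention that release candidates carry an `rc` segment.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "accelerate") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_release_candidate(version: str) -> bool:
    """Release candidates under PEP 440 look like '1.0.0rc1'."""
    return "rc" in version


version = installed_version()
if version is None:
    print("accelerate is not installed in this environment")
elif is_release_candidate(version):
    print(f"Running release candidate {version}")
else:
    print(f"Running {version}; re-run pip with --pre to get the release candidate")
```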

Migration assistance

Below are the full details for all deprecations that are being enacted as part of this release:

Passing in dispatch_batches, split_batches, even_batches, and use_seedable_sampler to the Accelerator() should now be handled by creating an accelerate.utils.DataLoaderConfiguration() and passing this to the Accelerator() instead (Accelerator(dataloader_config=DataLoaderConfiguration(...)))
Accelerator().use_fp16 and AcceleratorState().use_fp16 have been removed; this should be replaced by checking accelerator.mixed_precision == "fp16"
Accelerator().autocast() no longer accepts a cache_enabled argument. Instead, an AutocastKwargs() instance should be used which handles this flag (among others), passing it to the Accelerator (Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)]))
accelerate.utils.is_tpu_available should be replaced with accelerate.utils.is_torch_xla_available
accelerate.utils.modeling.shard_checkpoint should be replaced with split_torch_state_dict_into_shards from the huggingface_hub library
accelerate.tqdm.tqdm() no longer accepts True/False as the first argument, and instead, main_process_only should be passed in as a named argument
ACCELERATE_DISABLE_RICH is no longer a valid environment variable, and instead, one should manually enable rich traceback by setting ACCELERATE_ENABLE_RICH=1
The FSDP setting fsdp_backward_prefetch_policy has been replaced with fsdp_backward_prefetch

Closing thoughts

Thank you so much for using Accelerate; it has been amazing watching a small idea turn into over 100 million downloads and nearly 300,000 daily downloads over the last few years.

With this release candidate, we hope to give the community an opportunity to try it out and migrate to 1.0 before the official release.

Please stay tuned for more information by keeping an eye on the GitHub and on socials!
