Transformers.js v4 Preview: Now Available on NPM!

Joshua, Nico Martin


Overview

We’re excited to announce that Transformers.js v4 (preview) is now available on NPM! After nearly a year of development (we began in March 2025 🤯), we’re finally ready for you to check it out. Previously, users had to install v4 directly from source via GitHub, but now it’s as simple as running a single command!

npm i @huggingface/transformers@next

We’ll continue publishing v4 releases under the next tag on NPM until the full release, so expect regular updates!
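
A quick way to verify that the preview installed correctly (assuming env.version is still exposed the way it was in previous releases) is to print the library version:

import { env } from "@huggingface/transformers";

// Should print a 4.x preview version if the @next tag was installed.
console.log(env.version);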



Performance & Runtime Improvements

The most important change is undoubtedly the adoption of a new WebGPU runtime, completely rewritten in C++. We have worked closely with the ONNX Runtime team to thoroughly test this runtime across our ~200 supported model architectures, as well as many new v4-exclusive architectures.

In addition to better operator support (for performance, accuracy, and coverage), this new WebGPU runtime allows the same Transformers.js code to be used across a wide range of JavaScript environments, including browsers, server-side runtimes, and desktop applications. That’s right: you can now run WebGPU-accelerated models directly in Node, Bun, and Deno!
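
As a minimal sketch of what that looks like in a server-side runtime (the model ID below is only an illustration; any WebGPU-compatible model from the Hub should work):

// run.mjs — the same code runs in browsers, Node, Bun, and Deno
import { pipeline } from "@huggingface/transformers";

// Request the WebGPU backend explicitly via the device option.
const classifier = await pipeline(
  "text-classification",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
  { device: "webgpu" },
);

console.log(await classifier("WebGPU in Node is finally here!"));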

WebGPU Overview

We have proven that it’s possible to run state-of-the-art AI models 100% locally in the browser, and now we’re focused on performance: making these models run as fast as possible, even in resource-constrained environments. This required completely rethinking our export strategy, especially for large language models. We achieve this by re-implementing new models operation by operation, leveraging specialized ONNX Runtime Contrib Operators like com.microsoft.GroupQueryAttention, com.microsoft.MatMulNBits, or com.microsoft.QMoE to maximize performance.

For example, by adopting the com.microsoft.MultiHeadAttention operator, we were able to achieve a ~4x speedup for BERT-based embedding models.
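
For reference, embedding models like these are typically used through the feature-extraction pipeline; here is a minimal sketch (the model ID is just an example):

import { pipeline } from "@huggingface/transformers";

// Load a small BERT-style embedding model (example ID; other sentence-embedding models work the same way).
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

// Compute mean-pooled, normalized sentence embeddings.
const embeddings = await extractor(
  ["Transformers.js v4 is fast.", "WebGPU makes it faster."],
  { pooling: "mean", normalize: true },
);

console.log(embeddings.dims); // e.g. [2, 384]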

Optimized ONNX Exports

This update also enables full offline support by caching WASM files locally in the browser, allowing users to run Transformers.js applications without an internet connection after the initial download.
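
Keeping model weights local after the first download works the same way it did in earlier versions; a rough sketch using the env flags from v3 (assumed to carry over to v4):

import { env } from "@huggingface/transformers";

// After an initial online run has populated the cache, subsequent runs can stay fully offline.
env.allowRemoteModels = false; // don't reach out to the Hugging Face Hub
env.useBrowserCache = true;    // reuse files already cached by the browser (the default)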



Repository Restructuring

Developing a new major version gave us the opportunity to invest in the codebase and tackle long-overdue refactoring efforts.



PNPM Workspaces

Until now, the GitHub repository served as our npm package. This worked well as long as the repository only exposed a single library. However, looking to the future, we saw the need for various sub-packages that depend heavily on the Transformers.js core while addressing different use cases, like library-specific implementations, or smaller utilities that most users don’t need but are essential for some.

That’s why we converted the repository to a monorepo using pnpm workspaces. This allows us to ship smaller packages that depend on @huggingface/transformers without the overhead of maintaining separate repositories.



Modular Class Structure

Another major refactoring effort targeted the ever-growing models.js file. In v3, all available models were defined in a single file spanning over 8,000 lines, which was becoming increasingly difficult to maintain. For v4, we split this into smaller, focused modules with a clear distinction between utility functions, core logic, and model-specific implementations. This new structure improves readability and makes it much easier to add new models. Developers can now focus on model-specific logic without navigating through thousands of lines of unrelated code.



Examples Repository

In v3, many Transformers.js example projects lived directly in the main repository. For v4, we have moved them to a dedicated repository, allowing us to maintain a cleaner codebase focused on the core library. This also makes it easier for users to find and contribute to examples without sifting through the main repository.



Prettier

We updated the Prettier configuration and reformatted all files in the repository. This ensures consistent formatting throughout the codebase, with all future PRs automatically following the same style. No more debates about formatting… Prettier handles it all, keeping the code clean and readable for everybody.



New Models and Architectures

Thanks to our new export strategy and ONNX Runtime’s expanding support for custom operators, we have been able to add many new models and architectures to Transformers.js v4. These include popular models like GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Youtu-LLM. Many of these required us to implement support for advanced architectural patterns, including Mamba (state-space models), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE). Perhaps most significantly, these models are all compatible with WebGPU, allowing users to run them directly in the browser or in server-side JavaScript environments with hardware acceleration. Stay tuned for some exciting demos showcasing these new models in action!



New Build System

We have migrated our build system from Webpack to esbuild, and the results have been incredible. Build times dropped from 2 seconds to just 200 milliseconds, a 10x improvement that makes development iteration significantly faster. Speed isn’t the only benefit, though: bundle sizes also decreased by an average of 10% across all builds. The most notable improvement is in transformers.web.js, our default export, which is now 53% smaller, meaning faster downloads and quicker startup times for users.



Standalone Tokenizers.js Library

A frequent request from users was to extract the tokenization logic into a separate library, and with v4, that’s exactly what we have done. @huggingface/tokenizers is a complete refactor of the tokenization logic, designed to work seamlessly across browsers and server-side runtimes. At just 8.8kB (gzipped) with zero dependencies, it’s incredibly lightweight while remaining fully type-safe.

See example code
import { Tokenizer } from "@huggingface/tokenizers";

const modelId = "HuggingFaceTB/SmolLM3-3B";

// Fetch the tokenizer definition and configuration from the Hugging Face Hub.
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`
).then(res => res.json());

const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`
).then(res => res.json());

// Create the tokenizer from the downloaded files.
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Split text into tokens (strings).
const tokens = tokenizer.tokenize("Hello World");

// Convert text into token IDs.
const encoded = tokenizer.encode("Hello World");

This separation keeps the core of Transformers.js focused and lean while offering a flexible, standalone tool that any WebML project can use independently.



Miscellaneous Improvements

We have made several quality-of-life improvements across the library. The type system has been enhanced with dynamic pipeline types that adapt based on inputs, providing a better developer experience and stronger type safety.

Type Improvements

Logging has been improved to give users more control and clearer feedback during model execution. Additionally, we have added support for larger models exceeding 8B parameters. In our tests, we were able to run GPT-OSS 20B (q4f16) at ~60 tokens per second on an M4 Pro Max.
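
As a rough sketch of what running such a model could look like (the model ID below is a placeholder we haven’t verified; the device, dtype, and streaming options follow the existing v3 API):

import { pipeline, TextStreamer } from "@huggingface/transformers";

// Placeholder model ID for illustration only; check the Hub for the actual ONNX export.
const generator = await pipeline("text-generation", "onnx-community/gpt-oss-20b-ONNX", {
  device: "webgpu",
  dtype: "q4f16", // 4-bit weights with fp16 activations
});

// Stream generated tokens to the console as they arrive.
const streamer = new TextStreamer(generator.tokenizer, { skip_prompt: true });

const messages = [{ role: "user", content: "Explain WebGPU in one paragraph." }];
await generator(messages, { max_new_tokens: 256, streamer });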



Acknowledgements

We would like to extend our heartfelt thanks to everyone who contributed to this major release, especially the ONNX Runtime team for their incredible work on the new WebGPU runtime and their support throughout development, as well as all external contributors and early testers.


