Don't Repeat Yourself*: Designing open-source libraries for modern machine learning
“Don’t repeat yourself”, or DRY, is a well-known principle of software development. The principle originates from “The Pragmatic Programmer”, one of the most widely read books on code design.
The principle’s simple message makes obvious sense: don’t rewrite logic that already exists elsewhere. This keeps the code in sync, making it easier to maintain and more robust. Any change to this logical pattern will uniformly affect all of its dependencies.
At first glance, the design of Hugging Face’s Transformers library couldn’t be more contrary to the DRY principle. Code for the attention mechanism is more or less copied over 50 times into different model files. Sometimes code of the whole BERT model is copied into other model files. We often force new model contributions that are identical to existing models – apart from a small logical tweak – to copy all of the existing code. Why do we do this? Are we just too lazy or too overwhelmed to centralize all logical pieces into one place?
No, we are not lazy – it’s a very conscious decision not to apply the DRY design principle to the Transformers library. Instead, we decided to adopt a different design principle which we like to call the single model file policy. The single model file policy states that all code necessary for the forward pass of a model is in one and only one file – called the model file. If a reader wants to understand how BERT works for inference, she should only have to look into BERT’s modeling_bert.py file.
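To make this concrete, here is a small sketch (assuming a recent transformers installation; the printed path depends on where the package is installed, and class availability can change between versions) that checks that both BERT's attention layer and the full BERT model are defined in that one file:

```python
import inspect

# Both the attention layer and the full model are defined in the very same
# model file – there is no centralized attention module they import from.
from transformers.models.bert.modeling_bert import BertModel, BertSelfAttention

print(inspect.getsourcefile(BertSelfAttention))
print(inspect.getsourcefile(BertModel))
# Both lines print the same path, ending in .../models/bert/modeling_bert.py
```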
We usually reject any attempt to abstract identical sub-components of different models into a new centralized place. We don’t want to have an attention_layer.py that includes all possible attention mechanisms. Again, why do we do this?
In short, the reasons are:
- 1. Transformers is built by and for the open-source community.
- 2. Our product is models and our customers are users reading or tweaking model code.
- 3. The field of machine learning evolves extremely fast.
- 4. Machine Learning models are static.
1. Built by and for the open-source community
Transformers is built to actively incentivize external contributions. A contribution is often either a bug fix or a new model contribution. If a bug is found in one of the model files, we want to make it as easy as possible for the finder to fix it. There is little that is more demotivating than fixing a bug only to see that it caused 100 failures of other models.
Because model code is independent of all other models, it’s fairly easy for someone who only understands the one model she is working with to fix it. Similarly, it’s easier to add new modeling code and review the corresponding PR if only a single new model file is added. The contributor does not have to figure out how to add new functionality to a centralized attention mechanism without breaking existing models. The reviewer can easily verify that none of the existing models are broken.
2. Modeling code is our product
We assume that a significant amount of users of the Transformers library not only read the documentation, but also look into the actual modeling code and potentially modify it. This hypothesis is backed by the Transformers library being forked over 10,000 times and the Transformers paper being cited over a thousand times.
Therefore it is of utmost importance that someone reading Transformers modeling code for the first time can easily understand and potentially adapt it. Providing all of the necessary logical components in order in a single modeling file helps a lot to achieve improved readability and adaptability. Additionally, we care a great deal about sensible variable/method naming and prefer expressive/readable code over character-efficient code.
3. Machine Learning is evolving at breakneck speed
Research in the field of machine learning, and especially neural networks, evolves extremely fast. A model that was state-of-the-art a year ago might be outdated today. We don’t know which attention mechanism, position embedding, or architecture will be the best in a year. Therefore, we cannot define standard logical patterns that apply to all models.
For example, two years ago, one might have defined BERT’s self-attention layer as the standard attention layer used by all Transformers models. Logically, a “standard” attention function could then have been moved into a central attention.py file. But then came attention layers that added relative positional embeddings in every attention layer (T5), multiple different forms of chunked attention (Reformer, Longformer, BigBird), a separate attention mechanism for position and word embeddings (DeBERTa), etc… Every time we would have had to ask ourselves whether the “standard” attention function should be adapted or whether it would have been better to add a new attention function to attention.py. But then how do we name it? attention_with_positional_embd, reformer_attention, deberta_attention?
It’s dangerous to give logical components of machine learning models general names because the perception of what that component stands for might change or become outdated very quickly. E.g., does chunked attention correspond to GPTNeo’s, Reformer’s, or BigBird’s chunked attention? Is the attention layer a self-attention layer, a cross-attention layer, or does it include both? However, if we name attention layers by their model’s name, we should directly put the attention function in the corresponding modeling file.
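That is the convention the library follows: rather than a general attention.py, each variant lives under a model-specific name in its own model file. A small sketch (exact class names may vary between transformers versions):

```python
# Each model defines and names its own attention in its own modeling file.
# Class names below were checked against a recent transformers release and
# may differ in older or newer versions.
from transformers.models.bert.modeling_bert import BertSelfAttention            # "vanilla" self-attention
from transformers.models.t5.modeling_t5 import T5Attention                      # relative position biases
from transformers.models.longformer.modeling_longformer import LongformerSelfAttention  # windowed/chunked attention

for cls in (BertSelfAttention, T5Attention, LongformerSelfAttention):
    print(f"{cls.__name__} is defined in {cls.__module__}")
```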
4. Machine Learning models are static
The Transformers library is a unified and polished collection of machine learning models that different research teams have created. Every machine learning model is usually accompanied by a paper and its official GitHub repository. Once a machine learning model is published, it is rarely adapted or modified afterward.
Instead, research teams tend to publish a new model built upon previous models but rarely make significant changes to already published code. This is an important realization when deciding on the design principles of the Transformers library.
It means that once a model architecture has been added to Transformers, the fundamental components of the model don’t change anymore. Bugs are often found and fixed, methods and variables might be renamed, and the output or input format of the model might be slightly changed, but the model’s core components don’t change anymore. Consequently, the need to apply global changes to all models in Transformers is significantly reduced, making it less important that every logical pattern only exists once, since it’s rarely changed.
A second realization is that models do not depend on each other in a bidirectional way. More recently published models might depend on existing models, but it’s quite obvious that an existing model cannot logically depend on its successor. E.g., T5 is partly built upon BERT and therefore T5’s modeling code might logically depend on BERT’s modeling code, but BERT cannot logically depend in any way on T5. Thus, it would not be logically sound to refactor BERT’s attention function to also work with T5’s attention function – someone reading through BERT’s attention layer should not have to know anything about T5. Again, this advocates against centralizing components such as the attention layer into modules that all models can access.
On the other hand, the modeling code of a successor model can very well logically depend on its predecessor model. E.g., DeBERTa-v2’s modeling code does logically depend to some extent on DeBERTa’s modeling code. Maintainability is significantly improved by ensuring that the modeling code of DeBERTa-v2 stays in sync with DeBERTa’s. Fixing a bug in DeBERTa should ideally also fix the same bug in DeBERTa-v2. How can we maintain the single model file policy while ensuring that successor models stay in sync with their predecessor model?
Now, we can explain why we put the asterisk after “Repeat Yourself” in the title. We don’t blindly copy-paste all existing modeling code, even if it looks like we do. One of Transformers’ core maintainers, Sylvain Gugger, found a great mechanism that respects both the single file policy and keeps maintenance costs within bounds. This mechanism, loosely called “the copying mechanism”, allows us to mark logical components, such as an attention layer function, with a # Copied from statement, which enforces the marked code to be identical to the code of the component it was copied from. E.g., a # Copied from statement on top of a DeBERTa-v2 class enforces the whole class to be identical to the corresponding DeBERTa class except for the DebertaV2 prefix.
This way, the copying mechanism keeps modeling code very easy to understand while significantly reducing maintenance cost. If some code is changed in a function of a predecessor model that is referred to by a function of its successor model, there are tools in place that automatically correct the successor model’s function.
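For illustration, a successor model file might look roughly like the sketch below. The class body is simplified and the exact source path in the comment is illustrative, but the # Copied from pattern itself is what the library's consistency tooling (utils/check_copies.py, invoked via make fix-copies at the time of writing) parses:

```python
import torch.nn as nn


# Copied from transformers.models.deberta.modeling_deberta.DebertaSelfOutput with Deberta->DebertaV2
class DebertaV2SelfOutput(nn.Module):
    """Simplified sketch – the real class in modeling_deberta_v2.py contains more logic."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
```

If the marked code changes in DeBERTa's modeling file, running the consistency tool re-copies the body here (applying the Deberta->DebertaV2 renaming), so a bug fixed in the predecessor is automatically propagated to the successor.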
Drawbacks
Clearly, there are also drawbacks to the single file policy, two of which we quickly want to mention here.
A major goal of Transformers is to provide a unified API for both inference and training for all models, so that a user can quickly switch between different models in her setup. However, ensuring a unified API across models is much more difficult if modeling files are not allowed to use abstracted logical patterns. We solve this problem by running a lot of tests (ca. 20,000 tests are run daily at the time of writing this blog post) to make sure that models follow a consistent API. In this case, the single file policy requires us to be very rigorous when reviewing model and test additions.
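As a small example of what that unified API buys the user, switching between two different architectures only requires changing the checkpoint name (the checkpoints below are public Hub checkpoints):

```python
from transformers import AutoModel, AutoTokenizer

for checkpoint in ("bert-base-uncased", "roberta-base"):
    # Same calls, same output format, regardless of the underlying architecture.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer("The single file policy keeps models readable.", return_tensors="pt")
    outputs = model(**inputs)
    print(checkpoint, outputs.last_hidden_state.shape)
```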
Second, there is a lot of research on just a single component of a machine learning model. E.g., research teams investigate new forms of an attention mechanism that would apply to all existing pre-trained models, as has been done in the Rethinking Attention with Performers paper. How should we incorporate such research into the Transformers library? It is indeed problematic. Should we change all existing models? This would go against points 3. and 4. as written above. Should we add 100+ new modeling files each prefixed with Performer...? This seems absurd. In such a case there is sadly no good solution, and we opted against integrating the paper into Transformers. If the paper had gotten much more traction and included strong pre-trained checkpoints, we would probably have added new modeling files for the most important models, such as modeling_performer_bert.py.
Conclusion
All in all, at 🤗 Hugging Face we are convinced that the single file policy is the right coding philosophy for Transformers.
What do you think? If you have read this far, we would be more than interested in hearing your opinion!
If you would like to leave a comment, please visit the corresponding forum post here.
