As machine learning continues to penetrate every facet of industry, neural networks have never been so hyped. For instance, models like GPT-3 have been all over social media in the past few weeks and continue to make headlines outside of tech news outlets, with fear-mongering titles.
At the same time, deep learning frameworks, tools, and specialized libraries democratize machine learning research by making state-of-the-art methods easier to use than ever. It is quite common to see these almost-magical/plug-and-play 5 lines of code that promise (near) state-of-the-art results. Working at Hugging Face 🤗, I admit that I’m partially guilty of that. 😅 It can give an inexperienced user the misleading impression that neural networks are now a mature technology, while in reality the field is in constant development.
In reality, building and training neural networks can often be an extremely frustrating experience:
- It is sometimes hard to tell whether your performance comes from a bug in your model/code or is simply limited by your model’s expressiveness.
- You can make tons of tiny mistakes at every step of the process without realizing it at first, and your model will still train and give a decent performance.
In this post, I’ll try to highlight a few steps of my mental process when it comes to building and debugging neural networks. By “debugging”, I mean making sure that what you have built and what you have in mind are aligned. I will also point out things you can look at when you are unsure what the next step should be, by listing the typical questions I ask myself.
A lot of these thoughts stem from my experience doing research in natural language processing, but most of these principles can be applied to other fields of machine learning.
1. 🙈 Start by putting machine learning aside
It might sound counter-intuitive, but the very first step of building a neural network is to put machine learning aside and simply focus on your data. Look at the examples, their labels, the diversity of the vocabulary if you are working with text, their length distribution, etc. You should dive into the data to get a first sense of the raw product you are working with and focus on extracting general patterns that a model might be able to catch. Hopefully, by looking at a few hundred examples, you’ll be able to identify high-level patterns. A few standard questions you can ask yourself:
- Are the labels balanced?
- Are there gold labels that you don’t agree with?
- How was the data obtained? What are the possible sources of noise in this process?
- Are there any preprocessing steps that seem natural (tokenization, URL or hashtag removal, etc.)?
- How diverse are the examples?
- What rule-based algorithm would perform decently on this problem?
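A few of the quantitative questions above can be answered with a handful of lines. Below is a minimal sketch, using a tiny hypothetical set of (text, label) pairs as a stand-in for a real dataset:

```python
from collections import Counter

# Toy stand-in for a real dataset: (text, label) pairs.
examples = [
    ("the movie was great", "pos"),
    ("terrible acting and a dull plot", "neg"),
    ("i loved every minute of it", "pos"),
    ("not worth the ticket price", "neg"),
    ("an instant classic", "pos"),
]

# Label balance: a skewed distribution changes which baselines and metrics make sense.
label_counts = Counter(label for _, label in examples)

# Length distribution (in whitespace tokens): informs truncation/padding choices later.
lengths = sorted(len(text.split()) for text, _ in examples)

# Vocabulary size: a rough proxy for lexical diversity.
vocab = set(word for text, _ in examples for word in text.split())

print(label_counts)                # Counter({'pos': 3, 'neg': 2})
print(lengths[len(lengths) // 2])  # median token length
print(len(vocab))                  # number of distinct words
```

On a real dataset you would also plot the length histogram and skim a few hundred raw examples by hand, but even these three numbers already constrain your modeling choices.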
It’s important to get a high-level (qualitative) feeling of your dataset along with a fine-grained (quantitative) analysis. If you are working with a public dataset, someone else might have already dived into the data and reported their analysis (it is quite common in Kaggle competitions for instance), so you should absolutely have a look at these!
2. 📚 Proceed as if you just started machine learning
Once you have a deep and broad understanding of your data, I always recommend putting yourself in the shoes of your old self, back when you just started machine learning and were watching introduction classes from Andrew Ng on Coursera. Start as simply as possible to get a sense of the difficulty of your task and how well standard baselines would perform. For instance, if you work with text, standard baselines for binary text classification can include a logistic regression trained on top of word2vec or fastText embeddings. With the current tools, running these baselines is as easy (if not easier) as running BERT, which can arguably be considered one of the standard tools for many natural language processing problems. If other baselines are available, run (or implement) some of them. It will help you get even more familiar with the data.
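As a sketch of such a baseline, here is a linear classifier over simple text features with scikit-learn. TF-IDF features stand in for word2vec/fastText embeddings so the example stays self-contained, and the tiny dataset is purely hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset; in practice you would load your real train/dev splits.
train_texts = ["great movie", "awful film", "loved it", "hated it",
               "brilliant acting", "boring and slow", "a joy to watch", "a waste of time"]
train_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# A linear classifier over simple text features: a few lines, and already a
# meaningful point of comparison for anything fancier.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)

print(baseline.predict(["great acting"]))  # likely [1], given the training data
```

The point is not this particular feature choice: any cheap model that you can train in seconds gives you a floor that a neural network must clearly beat to justify its complexity.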
As developers, it is easy to feel good when building something fancy, but it is sometimes hard to rationally justify it if it beats simple baselines by only a few points, so it’s essential to make sure you have reasonable points of comparison:
- How would a random predictor perform (especially in classification problems)? The dataset might be unbalanced…
- What would the loss look like for a random predictor?
- What is (are) the best metric(s) to measure progress on my task?
- What are the limits of this metric? If it’s perfect, what can I conclude? What can’t I conclude?
- What is missing in “simple approaches” to reach a perfect score?
- Are there architectures in my neural network toolbox that would be well suited to model the inductive bias of the data?
3. 🦸♀️ Don’t be afraid to look under the hood of these 5-liner templates
Next, you can start building your model based on the insights and understanding you previously acquired. As mentioned earlier, implementing neural networks can quickly become quite tricky: there are many moving parts that work together (the optimizer, the model, the input processing pipeline, etc.), and many small things can go wrong when implementing these parts and connecting them to each other. The challenge lies in the fact that you can make these mistakes, train a model without it ever crashing, and still get a decent performance…
Yet, it’s a good habit, once you think you have finished implementing, to overfit a small batch of examples (16 for instance). If your implementation is (nearly) correct, your model will be able to overfit and memorize these examples, displaying a 0-loss (make sure you remove any kind of regularization such as weight decay). If not, it is highly likely that you did something wrong in your implementation. In some rare cases, it means that your model is not expressive enough or lacks capacity. Again, start with a small-scale model (fewer layers for instance): you want to debug your model, so you need a fast feedback loop, not high performance.
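The overfitting check can be sketched in a few lines of PyTorch. The model and data here are toy placeholders; the pattern transfers directly to your real model:

```python
import torch
from torch import nn

torch.manual_seed(0)

# 16 random examples with random binary labels: if the implementation is
# correct, a small unregularized model should memorize them (loss -> ~0).
x = torch.randn(16, 10)
y = torch.randint(0, 2, (16,))

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # note: no weight decay
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(loss.item())  # should be close to 0; if not, suspect a bug
```

If the loss plateaus well above 0 on 16 memorizable examples, something in the pipeline (loss arguments, label alignment, gradient flow) deserves suspicion before anything else.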
Pro-tip: in my experience working with pre-trained language models, freezing the embedding modules at their pre-trained values doesn’t affect the fine-tuning task performance much, while considerably speeding up the training.
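In PyTorch, freezing amounts to disabling gradients on the embedding weights and keeping them out of the optimizer. A minimal sketch with a toy embedding layer standing in for pre-trained weights:

```python
import torch
from torch import nn

# Toy setup: in practice the embedding matrix is loaded from a pre-trained model.
embedding = nn.Embedding(1000, 64)
embedding.weight.requires_grad = False  # freeze at current (pre-trained) values

# Only hand trainable parameters to the optimizer.
classifier = nn.Linear(64, 2)
trainable = [p for p in list(embedding.parameters()) + list(classifier.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable)

print(len(trainable))  # only the classifier's weight and bias remain
```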
Some common errors include:
- Incorrect indexing… (these are really the worst 😅). Make sure you are gathering tensors along the right dimensions, for instance…
- You forgot to call model.eval() in evaluation mode (in PyTorch) or model.zero_grad() to clean the gradients
- Something went wrong in the pre-processing of the inputs
- The loss got the wrong arguments (for instance, passing probabilities when it expects logits)
- Initialization doesn’t break the symmetry (usually happens when you initialize a whole matrix with a single constant value)
- Some parameters are never used during the forward pass (and thus receive no gradients)
- The learning rate is taking funky values like 0 all the time
- Your inputs are being truncated in a suboptimal way
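Some of these errors can be caught mechanically. For instance, parameters that never receive a gradient (the "never used during the forward pass" bug) can be listed after a backward pass. A small sketch with a deliberately buggy toy model:

```python
import torch
from torch import nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(8, 2)
        self.unused = nn.Linear(8, 2)  # deliberately never called in forward

    def forward(self, x):
        return self.used(x)

model = Model()
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Parameters that received no gradient are silently never trained.
dead = [name for name, p in model.named_parameters() if p.grad is None]
print(dead)  # ['unused.weight', 'unused.bias']
```

Running a check like this once after your first backward pass costs nothing and catches a class of bugs that would otherwise never crash anything.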
Pro-tip: when you work with language, take a serious look at the outputs of the tokenizers. I can’t count the number of hours I lost trying to reproduce results (sometimes my own old results) because something went wrong with the tokenization. 🤦♂️
Another useful tool is deep-diving into the training dynamics and plotting (in TensorBoard for instance) the evolution of multiple scalars through training. At the bare minimum, you should look at the dynamics of your loss(es), the parameters, and their gradients.
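A sketch of what tracking these scalars can look like, on a toy linear regression; with TensorBoard you would send the same values to a SummaryWriter instead of a plain dictionary:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

history = {"loss": [], "grad_norm": []}
for step in range(20):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Global gradient norm: vanishing or exploding values show up here first.
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
    history["loss"].append(loss.item())
    history["grad_norm"].append(grad_norm.item())
    # With TensorBoard you would instead call, e.g.:
    # writer.add_scalar("loss", loss.item(), step)
    optimizer.step()

print(history["loss"][0] > history["loss"][-1])  # True: the loss is decreasing
```

A loss curve alone can look healthy while gradients are quietly exploding or dying; logging both is cheap insurance.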
As the loss decreases, you also want to look at the model’s predictions: either by evaluating on your development set or, my personal favorite, by printing a few model outputs. For instance, if you are training a machine translation model, it is quite satisfying to see the generations become more and more convincing over the course of training. You want to be especially careful about overfitting: your training loss keeps decreasing while your evaluation loss shoots for the stars. 💫
4. 👀 Tune but don’t tune blindly
Once you have everything up and running, you need to tune your hyperparameters to find the best configuration for your setup. I generally follow a random search, as it turns out to be fairly effective in practice.
Some people report successes using fancy hyperparameter tuning methods such as Bayesian optimization, but in my experience, random search over a reasonably manually defined grid remains a hard-to-beat baseline.
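Random search over a manual grid takes only a few lines. The grid values below are illustrative placeholders, not recommendations:

```python
import random

random.seed(0)

# Manually defined grid: the ranges come from understanding the problem,
# not from exhaustive sweeping.
grid = {
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
    "batch_size": [16, 32, 64],
    "dropout": [0.0, 0.1, 0.3],
}

def sample_config(grid):
    # One random draw per hyperparameter, independently.
    return {name: random.choice(values) for name, values in grid.items()}

# A handful of random draws is usually enough to see which knobs matter.
configs = [sample_config(grid) for _ in range(5)]
for config in configs:
    print(config)  # each run would train/evaluate with one of these
```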
Most importantly, there is no point in launching 1,000 runs with different hyperparameters (or architecture tweaks like activation functions): compare a couple of runs with different hyperparameters to get an idea of which hyperparameters have the highest impact, but in general, it’s delusional to expect your biggest performance jumps from simply tuning a few values. For instance, if your best performing model is trained with a learning rate of 4e2, there is probably something more fundamental happening inside your neural network, and you want to identify and understand this behavior so that you can re-use this knowledge outside of your current specific context.
On average, experts use fewer resources to find better solutions.
To conclude, a piece of general advice that has helped me become better at building neural networks is to favor (as much as possible) a deep understanding of each component of your neural network instead of blindly (not to say magically) tweaking the architecture. Keep it simple and avoid small tweaks that you can’t reasonably justify even after trying really hard. Obviously, there is a right balance to find between a “trial-and-error” approach and an “analysis” approach, but a lot of these intuitions feel more natural as you accumulate practical experience. You too are training your internal model. 🤯
A few related pointers to complete your reading: