
Super Charge Your ML Systems In 4 Easy Steps


Welcome to the rollercoaster of ML optimization! This post will take you through my process for optimizing any ML system for lightning-fast training and inference in 4 easy steps.

Imagine this: you finally get placed on a cool new ML project where you are training a model to count how many hot dogs are in a photo, the success of which could make your company tens of dollars!

You grab the latest hotshot object detection model implemented in your favourite framework with plenty of GitHub stars, run some toy examples, and after an hour or so it’s picking out hot dogs like a broke student in their third repeat year of college. Life is good.

The next steps are obvious: we want to scale it up to some harder problems, which means more data, a bigger model and, of course, longer training time. Now you’re looking at days of training instead of hours. That’s fine though, you’ve been ignoring the rest of your team for three weeks now and should probably spend a day getting through the backlog of code reviews and passive-aggressive emails that have built up.

You come back a day later, feeling good about the insightful and absolutely vital nitpicks you left on your colleagues’ MRs, only to find that performance has tanked and the run crashed after a 15-hour training stint (karma works fast).

The following days morph into a whirlwind of trials, tests and experiments, with each potential idea taking more than a day to run. These quickly start racking up hundreds of dollars in compute costs, all leading to the big question: how do we make this faster and cheaper?

Welcome to the emotional rollercoaster of ML optimization! Here’s a simple 4-step process to turn the tides in your favour:

  1. Benchmark
  2. Simplify
  3. Optimize
  4. Repeat

This is an iterative process, and there will be many times when you repeat some steps before moving on to the next, so it’s less of a 4-step system and more of a toolbox, but 4 steps sounds better.

Benchmark

“Measure twice, cut once” — Someone smart.

The first (and probably second) thing you should always do is profile your system. This can be something as simple as timing how long it takes to run a particular block of code, or as involved as a full profile trace. What matters is that you have enough information to identify the bottlenecks in your system. I perform multiple benchmarks depending on where we are in the process and typically break it down into two types: high-level and low-level benchmarking.
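At the simple end of that spectrum, a minimal timing sketch might look like the following; the train_step and batch names are hypothetical placeholders for whatever block you want to measure.

```python
import time

def benchmark(fn, *args, n_iters=100):
    """Time a callable over several iterations and report the mean."""
    # Warm-up call so one-off setup costs (JIT, cache fills) don't skew the numbers
    fn(*args)
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {elapsed / n_iters * 1000:.2f} ms per call")

# Usage (train_step and batch are placeholders for your own code):
# benchmark(train_step, batch)
```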

High Level

This is the kind of stuff you would show your boss at the weekly “How f**ked are we?” meeting, and you would want these metrics as part of every run. They give you a high-level sense of how well your system is performing.

Batches Per Second — how quickly are we getting through each of our batches? This should be as high as possible.

Steps Per Second — (RL specific) how quickly are we stepping through the environment to generate our data? This should be as high as possible. There are some complicated interplays between step time and training batches that I won’t get into here.

GPU Util — how much of your GPU is being utilised during training? This should be consistently as close to 100% as possible; if not, you have idle time that can be optimized away.

CPU Util — how much of your CPUs are being utilised during training? Again, this should be as close to 100% as possible.

FLOPS — floating point operations per second. This gives you a view of how effectively you are using your hardware overall.
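As a rough sketch of how a couple of these high-level numbers could be logged during training, here is a loop that reports batches per second and GPU utilisation; it assumes an NVIDIA GPU, the pynvml package, and a hypothetical train_step function.

```python
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def log_throughput(train_step, batches, log_every=50):
    """Print batches/sec and GPU utilisation every `log_every` batches."""
    start = time.perf_counter()
    for i, batch in enumerate(batches, 1):
        train_step(batch)  # placeholder for your own training step
        if i % log_every == 0:
            elapsed = time.perf_counter() - start
            util = pynvml.nvmlDeviceGetUtilizationRates(gpu).gpu
            print(f"batches/sec: {log_every / elapsed:.1f} | GPU util: {util}%")
            start = time.perf_counter()
```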

Low Level

Using the metrics above you can then start to dig into where your bottleneck might be. Once you have them, you want to start looking at more fine-grained metrics and profiling.

Time Profiling — This is the simplest, and often the most useful, experiment to run. Profiling tools like cProfile can be used to get a bird’s-eye view of the timing of your system as a whole, or to look at the timing of specific components.
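For example, here is a minimal sketch using the standard library’s cProfile and pstats, with a stand-in function in place of a real training loop.

```python
import cProfile
import pstats

def train_one_epoch():
    # Stand-in for your real training loop
    return sum(i ** 0.5 for i in range(1_000_000))

with cProfile.Profile() as profiler:
    train_one_epoch()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)

# From the shell, the equivalent is: python -m cProfile -s cumulative your_script.py
```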

Memory Profiling — Another staple of the optimization toolbox. Big systems require a lot of memory, so we have to make sure we are not wasting any of it! Tools like memory-profiler will help you narrow down where your system is eating up your RAM.
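A minimal sketch of line-by-line memory profiling with memory-profiler; the decorated function is a made-up example.

```python
from memory_profiler import profile

@profile
def build_replay_buffer(n=1_000_000):
    # Each line's incremental memory usage is reported when the function runs
    buffer = [0.0] * n
    squares = [x * x for x in range(n)]
    return buffer, squares

if __name__ == "__main__":
    build_replay_buffer()
```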

Model Profiling — Tools like TensorBoard include excellent profilers for looking at what is eating up your performance inside your model.
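If you happen to be in PyTorch land, for example, torch.profiler can write traces that the TensorBoard profiler plugin can display; the toy model and log directory below are just placeholders.

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

# Toy model and batch just to have something to profile; swap in your own.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
inputs = torch.randn(64, 512)

with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on a GPU box
    on_trace_ready=tensorboard_trace_handler("./tb_logs/profile"),
) as prof:
    for _ in range(5):
        model(inputs)

# Launch TensorBoard pointed at ./tb_logs to browse the trace.
```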

Network Profiling — Network load is a common culprit for bottlenecking your system. There are tools like Wireshark to help you profile this, but to be honest I never use them. Instead, I prefer to do time profiling on my components, measure the total time taken inside the component, and then isolate how much of that is coming from the network I/O itself.
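In practice that usually looks like nothing fancier than two timers, one around the network call and one around the downstream processing; requests and the URL here are just stand-ins.

```python
import time
import requests

def fetch_and_process(url):
    t0 = time.perf_counter()
    payload = requests.get(url, timeout=10).json()   # network I/O
    t1 = time.perf_counter()
    result = sum(len(str(v)) for v in payload)       # stand-in for real processing
    t2 = time.perf_counter()
    print(f"network: {t1 - t0:.3f}s | processing: {t2 - t1:.3f}s")
    return result
```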

Make sure to check out this great article on profiling in Python from RealPython for more information!

Simplify

Once your profiling has identified an area that needs to be optimized, simplify it. Cut out everything else except that part. Keep reducing the system down to smaller parts until you reach the bottleneck. Don’t be afraid to profile as you simplify; this will make sure you are moving in the right direction as you iterate. Keep repeating this until you find your bottleneck.

Suggestions

  • Replace other components with stubs and mock functions that just provide expected data (see the sketch after this list).
  • Simulate heavy functions with sleep functions or dummy calculations.
  • Use dummy data to remove the overhead of the data generation and processing.
  • Start with local, single-process versions of your system before moving to distributed.
  • Simulate multiple nodes and actors on a single machine to remove the network overhead.
  • Find the theoretical max performance of each part of the system: if all the other bottlenecks in the system were gone apart from this component, what is our expected performance?
  • Profile again! Every time you simplify the system, re-run your profiling.
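As an illustration of the stub and dummy-data ideas above, here is a tiny sketch where a hypothetical environment and model update are replaced with stand-ins, so any remaining slowness must live in the loop itself.

```python
import time
import numpy as np

def fake_environment_step():
    """Stub for the real environment: returns dummy observations instantly."""
    return np.zeros((84, 84, 3), dtype=np.uint8)

def fake_model_update(batch):
    """Stand-in for the real model update: simulate its cost with a sleep."""
    time.sleep(0.01)

def training_loop(n_steps=100, batch_size=32):
    start = time.perf_counter()
    for _ in range(n_steps):
        batch = [fake_environment_step() for _ in range(batch_size)]
        fake_model_update(batch)
    print(f"{n_steps / (time.perf_counter() - start):.1f} batches/sec")

training_loop()
```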

Questions

Once we have zeroed in on the bottleneck, there are some key questions we want to answer:

What is the theoretical max performance of this component?

If we have sufficiently isolated the bottlenecked component, then we should be able to answer this.

How far away are we from the max?

This optimality gap tells us how optimized our system is. Now, it may be the case that there are other hard constraints once we introduce the component back into the system, and that’s fine, but it is crucial to at least be aware of what the gap is.

Is there a deeper bottleneck?

Always ask yourself this; maybe the problem is deeper than you initially thought, in which case we repeat the process of benchmarking and simplifying.

Optimize

Okay, so let’s say we have identified the biggest bottleneck; now we get to the fun part: how do we improve things? There are usually three areas we should look at for possible improvements:

  1. Compute
  2. Communication
  3. Memory

Compute

To reduce computation bottlenecks we need to be as efficient as possible with the data and algorithms we are working with. This is obviously project-specific and there is a huge amount that can be done, but let’s look at some good rules of thumb.

Parallelising — make sure you perform as much work as possible in parallel. This is the first big win in designing your system and can massively impact performance. Look at methods like vectorisation, batching, multi-threading and multi-processing.
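As a small illustration of the vectorisation point, here is the same normalisation done with a pure Python loop and with NumPy operating on the whole array at once; the array and timings are a toy example.

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

def normalise_loop(arr):
    # One element at a time in pure Python
    mean = sum(arr) / len(arr)
    return [v - mean for v in arr]

def normalise_vectorised(arr):
    # Whole array in one NumPy call
    return arr - arr.mean()

for fn in (normalise_loop, normalise_vectorised):
    start = time.perf_counter()
    fn(x)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```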

Caching — pre-compute and reuse calculations where you can. Many algorithms can take advantage of reusing pre-computed values and save critical compute on each of your training steps.
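A minimal example of the caching idea using functools.lru_cache; the expensive function here is made up, but in practice it could be any deterministic computation you hit repeatedly.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_feature(obs_id: int) -> float:
    # Pretend this is a slow, deterministic computation we hit every step
    return sum(i * i for i in range(10_000)) * obs_id

# First call computes; repeat calls with the same argument are near-free
expensive_feature(7)
expensive_feature(7)
print(expensive_feature.cache_info())
```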

Offloading — we all know that Python is not known for its speed. Luckily, we can offload critical computations to lower-level languages like C/C++.
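Cython, C extensions and friends all work here; one lightweight option is Numba, which JIT-compiles numeric Python to machine code. A rough sketch:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def pairwise_min_distance(points):
    # Brute-force nearest-pair distance; painfully slow in pure Python loops
    best = np.inf
    for i in range(points.shape[0]):
        for j in range(i + 1, points.shape[0]):
            d = 0.0
            for k in range(points.shape[1]):
                diff = points[i, k] - points[j, k]
                d += diff * diff
            if d < best:
                best = d
    return best ** 0.5

print(pairwise_min_distance(np.random.rand(500, 3)))
```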

Hardware Scaling — This is kind of a cop-out, but when all else fails, we can always just throw more computers at the problem!

Communication

Any seasoned engineer will tell you that communication is key to delivering a successful project, and by that we of course mean communication within our system (God forbid we ever have to talk to our colleagues). Some good rules of thumb are:

No Idle Time — all of your available hardware should be utilised at all times, otherwise you are leaving performance gains on the table. Idle time is usually due to the complications and overhead of communication across your system.

Stay Local — keep everything on a single machine for as long as possible before moving to a distributed setup. This keeps your system simple and avoids the communication overhead of a distributed system.

Async > Sync — identify anything that can be done asynchronously; this helps hide the cost of communication by keeping work moving while data is in flight.

Avoid Moving Data — moving data from CPU to GPU, or from one process to another, is expensive! Do as little of it as possible, or reduce its impact by carrying it out asynchronously.
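In PyTorch, for example, one common pattern is to pin host memory and overlap the CPU-to-GPU copy with other work using a non_blocking transfer; the sketch below assumes a CUDA device and uses a hypothetical prepare_next_batch function.

```python
import torch

# Sketch: overlap the host-to-device copy with CPU work. Assumes a CUDA device.
batch = torch.randn(256, 3, 224, 224).pin_memory()  # page-locked host memory
gpu_batch = batch.to("cuda", non_blocking=True)      # copy runs asynchronously

next_batch = prepare_next_batch()  # hypothetical CPU work that overlaps with the copy

torch.cuda.synchronize()           # wait for the copy before using gpu_batch
```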

Memory

Last but not least is memory. Many of the techniques mentioned above can help relieve your bottleneck, but that won’t be possible if you have no memory available! Let’s look at some things to consider.

Data Types — keep these as small as possible. This helps reduce the cost of communication and memory, and with modern accelerators it can also reduce computation.
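For example, halving or quartering the dtype of a large array shrinks its footprint proportionally; whether the reduced precision is acceptable depends on your model. A quick check with NumPy:

```python
import numpy as np

obs = np.random.rand(1_000_000, 84)   # float64 by default
print(obs.nbytes / 1e6, "MB")         # ~672 MB

obs32 = obs.astype(np.float32)
print(obs32.nbytes / 1e6, "MB")       # ~336 MB

obs16 = obs.astype(np.float16)
print(obs16.nbytes / 1e6, "MB")       # ~168 MB
```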

Caching — similar to reducing computation, smart caching can help save memory. However, make sure your cached data is being used frequently enough to justify the caching.

Pre-Allocate — not something we are used to in Python, but being strict about pre-allocating memory means you know exactly how much memory you need, reduces the risk of fragmentation, and if you are able to write to shared memory, you will reduce communication between your processes!
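A small sketch of the pre-allocation idea with NumPy: allocate the buffer once up front and write into it, rather than growing a list as results arrive. The sizes and the fake environment step are made up.

```python
import numpy as np

n_steps, obs_dim = 10_000, 128

# Allocate the full buffer once; we know exactly how much memory it needs.
buffer = np.empty((n_steps, obs_dim), dtype=np.float32)

for t in range(n_steps):
    buffer[t] = np.random.rand(obs_dim)  # stand-in for a real environment step
```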

Garbage Collection — luckily Python handles most of this for us, but it is important to make sure you are not keeping large values in scope without needing them, or worse, holding a circular reference that can cause a memory leak.

Be Lazy — evaluate expressions only when necessary. In Python, you can use generator expressions instead of list comprehensions for operations that can be lazily evaluated.
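For example, summing over a generator expression produces and consumes one item at a time instead of materialising a 10-million-element list first:

```python
# Materialises a 10-million-element list before summing
total = sum([x * x for x in range(10_000_000)])

# Generator expression: items are produced and consumed one at a time
total = sum(x * x for x in range(10_000_000))
```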

Repeat

So, when are we finished? Well, that really depends on your project, what the requirements are, and how long it takes before your dwindling sanity finally breaks!

As you remove bottlenecks, you will get diminishing returns on the time and effort you put into optimizing your system. As you go through the process, you need to decide when good is good enough. Remember, speed is a means to an end; don’t get caught in the trap of optimizing for the sake of it. If it is not going to have an impact on users, then it is probably time to move on.

Building large-scale ML systems is HARD. It’s like playing a twisted game of “Where’s Waldo” crossed with Dark Souls. If you do manage to find the problem, it takes multiple attempts to beat it, and you end up spending most of your time getting your ass kicked, asking yourself “Why am I spending my Friday night doing this?”. Having a simple and principled approach can help you get past that final boss battle and taste those sweet, sweet theoretical max FLOPs.
