As we detailed in our previous blog post, Intel Xeon CPUs provide a set of features especially designed for AI workloads, such as AVX-512 and VNNI (Vector Neural Network Instructions), for efficient inference with integer-quantized neural networks, together with additional system tools to ensure the work is done in the most efficient way.
In this blog post, we will focus on software optimizations and give you a sense of the performance of Intel's new Ice Lake generation of Xeon CPUs. Our goal is to give you a full picture of what's available on the software side to make the most out of your Intel hardware.
As in the previous blog post, we show the performance with benchmark results and charts, together with new tools that make all these knobs and features easy to use.
Back in April, Intel launched its latest generation of Xeon processors, codenamed Ice Lake, targeting more efficient and performant AI workloads.
More precisely, Ice Lake Xeon CPUs can achieve up to 75% faster inference on a variety of NLP tasks compared with the previous generation of Cascade Lake Xeon processors.
This is achieved by a combination of hardware and software improvements, such as new instructions and PCIe 4.0, featured on the new Sunny Cove architecture to support Machine Learning and Deep Learning workloads.
Last but not least, Intel worked on dedicated optimizations for various frameworks, which now include Intel flavors such as
Intel's Extension for Scikit-learn,
Intel TensorFlow and
the Intel Extension for PyTorch.
All these features are very low level in the stack of what Data Scientists and Machine Learning Engineers use in their day-to-day toolset.
In the vast majority of situations, it is more common to rely on higher-level frameworks and libraries such as
PyTorch and TensorFlow to handle multi-dimensional array manipulation, and to use highly tuned mathematical operators such as BLAS (Basic Linear Algebra Subroutines) for the computational part.
In this area, Intel plays an essential role by providing software components under the oneAPI umbrella, which make it very easy to use highly efficient linear algebra routines through
Intel oneMKL (Math Kernel Library), and
higher-level parallelization frameworks with Intel OpenMP or the Threading Building Blocks (oneTBB).
Also, oneAPI provides some domain-specific libraries such as Intel oneDNN for deep neural network primitives (ReLU, fully-connected, etc.) or
oneCCL for collective communication, which is especially useful in distributed setups that need efficient all-reduce operations over multiple hosts.
Some of these libraries, especially MKL and oneDNN, are natively included in frameworks such as PyTorch and TensorFlow (since 2.5.0) to bring all the performance improvements to the end user out of the box.
When one would like to target very specific hardware features, Intel provides custom versions of the most common software, especially optimized for the Intel platform.
This is for instance the case with TensorFlow, for which Intel provides custom, highly tuned and optimized versions of the framework,
or with the Intel PyTorch Extension (IPEX), which can be seen as a feature laboratory before upstreaming to PyTorch.
Deep Dive: Leveraging advanced Intel features to improve AI performance
Performance tuning knobs
As highlighted above, we are going to cover a new set of tunable items that improve the performance of our AI application. From a high-level point of view, every machine learning and deep learning framework is made of the same ingredients:
- A structural way of representing data in memory (vector, matrices, etc.)
- Implementation of mathematical operators
- Efficient parallelization of the computations on the target hardware
In addition to the points listed above, deep learning frameworks provide ways to represent data flow and dependencies to compute gradients.
This falls outside the scope of this blog post, and it leverages the same components as the ones listed above!
1. Memory allocation and management libraries
This blog post will deliberately skip the first point concerning data representation, as it is rather framework-specific.
For reference, PyTorch uses its very own implementation, called ATen,
while TensorFlow relies on the open-source library Eigen for this purpose.
While it is very complex to apply generic optimizations to different object structures and layouts, there is one area where we can have an impact: memory allocation.
As a short reminder, memory allocation here refers to the process of programmatically asking the operating system for a dynamic (unknown beforehand) area of memory where items can be stored, as with malloc and friends in C or the new operator in C++.
Memory efficiency, both in terms of speed and fragmentation, is a vast scientific and engineering subject with multiple solutions depending on the task and underlying hardware.
Over the past years we have seen more and more work in this area, notably around allocators such as tcmalloc, jemalloc and mimalloc, each pushing forward a different approach to improving memory allocation and management in various software.
2. Efficient parallelization of computations
Now that we have an efficient way to represent our data, we need a way to get the most out of the computational hardware at our disposal.
Interestingly, when it comes to inference, CPUs have a potential advantage over GPUs in the sense that they are everywhere, and they do not require specific application components or administration staff to operate them.
Modern CPUs come with many cores and complex mechanisms to increase the overall performance of software.
Yet, as we highlighted in the first blog post, they also have features that can be tweaked depending on the kind of workload (CPU- or I/O-bound) you target, to further improve performance for your application.
Still, implementing parallel algorithms may not be as simple as throwing more cores at the work.
Many factors, such as the data structures used, concurrent data access, and CPU cache invalidation, can prevent your algorithm from being effectively faster.
As a reference, we recommend Scott Meyers' talk CPU Caches and Why You Care if you are interested in diving deeper into the subject.
Thankfully, there are libraries that make the development of such parallel algorithms easier and less error-prone.
Among the most common parallel libraries we can mention OpenMP and TBB (Threading Building Blocks), which work at various levels, from a programming API in C/C++ to environment-variable tuning and dynamic scheduling.
On Intel hardware, it is advised to use the Intel implementation of the OpenMP specification, often referred to as "IOMP", available as part of the Intel oneAPI toolkit.
3. Optimized mathematical operators
Now that we have covered the necessary building blocks for designing efficient data structures and parallel algorithms, the last remaining piece is the one running the computation:
the one implementing the variety of mathematical operators and neural network layers to do what we love most, designing neural networks! 😊
In every programmer's toolkit, there are multiple levels that can provide mathematical operation support, which can then be optimized differently depending on various factors such as the data storage layout
being used (contiguous memory, chunked, packed, etc.), the data format representing each scalar element (Float32, Integer, Long, Bfloat16, etc.) and, of course, the various instructions supported by your processor.
Nowadays, almost all processors support basic mathematical operations on scalar items (one single item at a time) or in vectorized mode (meaning they operate on multiple items within the same CPU instruction, referred to as SIMD, "Single Instruction Multiple Data").
Famous SIMD instruction sets are SSE2, AVX, AVX2 and AVX-512, the latter present on the latest generations of Intel CPUs and able to operate on 512 bits of data (for instance, sixteen float32 values) in a single instruction.
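As a quick back-of-the-envelope sketch (the register widths below are public facts about each instruction set; the helper itself is purely illustrative), the number of elements processed per instruction is simply the register width divided by the element width:

```python
# Illustrative only: how many elements one SIMD instruction can process,
# given the register width (in bits) of each instruction set.
REGISTER_WIDTH_BITS = {"SSE2": 128, "AVX": 256, "AVX2": 256, "AVX-512": 512}

def lanes(isa: str, element_bits: int = 32) -> int:
    """Number of elements (e.g. float32 = 32 bits) per SIMD instruction."""
    return REGISTER_WIDTH_BITS[isa] // element_bits

print(lanes("AVX-512"))  # 16 float32 values per instruction
print(lanes("SSE2"))     # 4 float32 values per instruction
```

This is why, all else being equal, an AVX-512 kernel can move four times as much data per instruction as an SSE2 one.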
Most of the time, one doesn't have to worry too much about the actual assembly generated to execute a simple element-wise addition between two vectors, but if you do,
there are again libraries that let you go one level higher than writing code calling CPU-specific intrinsics to implement efficient mathematical kernels.
This is for instance what Intel's MKL (Math Kernel Library) provides, along with the famous BLAS (Basic Linear Algebra Subroutines) interface, to implement all the basic operations for linear algebra.
Finally, on top of this, one can find domain-specific libraries such as Intel's oneDNN, which provides the most common and essential building blocks required to implement neural network layers.
Intel MKL and oneDNN are natively integrated within the PyTorch framework, where they can enable performance speedups for certain operations such as Linear + ReLU or Convolution.
On the TensorFlow side, oneDNN can be enabled by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1 (TensorFlow >= 2.5.0) to achieve similar machinery under the hood.
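A minimal sketch of how one might enable this from Python: TensorFlow reads the variable at import time, so it must be set before the import (commented out here so the sketch stands on its own):

```python
import os

# oneDNN optimizations are toggled through an environment variable that
# TensorFlow reads when it is imported: set it first.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

# import tensorflow as tf  # must come *after* the variable is set

print(os.environ["TF_ENABLE_ONEDNN_OPTS"])  # 1
```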
More Efficient AI Processing on the latest Intel Ice Lake CPUs
In order to report the performance of the Ice Lake product lineup, we closely follow the methodology we used for the first blog post of this series. As a reminder, we adopt the exact same schema to benchmark the various setups highlighted throughout this second blog post. More precisely, the results presented in the following sections are based on:
- PyTorch: 1.9.0
- TensorFlow: 2.5.0
- Batch Sizes: 1, 4, 8, 16, 32, 128
- Sequence Lengths: 8, 16, 32, 64, 128, 384, 512
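The benchmark grid above can be generated with a simple Cartesian product; a minimal sketch (each pair is then benchmarked independently per framework):

```python
from itertools import product

batch_sizes = [1, 4, 8, 16, 32, 128]
sequence_lengths = [8, 16, 32, 64, 128, 384, 512]

# Every (batch size, sequence length) pair is a separate benchmark setup.
configurations = list(product(batch_sizes, sequence_lengths))

print(len(configurations))  # 42 setups per framework
```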
We will present the results through metrics accepted by the field to establish the performance of the proposed optimizations:
- Latency: the time it takes to execute a single inference request (i.e., a "forward call") through the model, expressed in milliseconds.
- Throughput: the number of inference requests (i.e., "forward calls") the system can sustain within a defined period, expressed in calls/sec.
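Both metrics can be derived from a list of per-call wall-clock timings; a minimal sketch (the timings in the usage example are made up for illustration):

```python
def latency_ms(call_durations_s):
    """Mean time per forward call, in milliseconds."""
    return 1000.0 * sum(call_durations_s) / len(call_durations_s)

def throughput(call_durations_s):
    """Forward calls sustained per second of wall-clock time."""
    return len(call_durations_s) / sum(call_durations_s)

# Hypothetical timings: four forward calls of 0.25 s each.
timings = [0.25, 0.25, 0.25, 0.25]
print(latency_ms(timings))  # 250.0
print(throughput(timings))  # 4.0
```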
We will also provide an initial baseline showing out-of-the-box results, and a second baseline applying all the optimizations we highlighted in the first blog post.
Everything was run on an Intel-provided cloud instance featuring the Ice Lake Xeon Platinum 8380 CPU, operating on Ubuntu 20.04.2 LTS.
You can find the same processors at various cloud providers:
Establishing the baseline
As mentioned previously, the baselines will be composed of two different setups:
– Out-of-the-box: we run the workloads as-is, without any tuning
– Optimized: we apply the various knobs covered in Blog #1
Also, based on the comments we received about the previous blog post, we wanted to change the way we present the frameworks in the resulting benchmarks.
As such, through the remainder of this second blog post, we will split framework benchmarking results as follows:
- Frameworks using "eager" mode for computations (PyTorch, TensorFlow)
- Frameworks using "graph" mode for computations (TorchScript, TensorFlow Graph, Intel TensorFlow)
Baseline: Eager frameworks latencies
Frameworks operating in eager mode usually discover the actual graph while executing it.
More precisely, the actual computation graph is not known beforehand, and you gradually (eagerly) execute one operator,
whose output becomes the input of the next one, and so on until you reach the leaf nodes (outputs).
These frameworks usually provide more flexibility in the algorithms you implement, at the cost of increased runtime overhead
and slightly higher memory usage to keep track of all the elements required for the backward pass.
Last but not least, it is usually harder for these frameworks to enable graph optimizations such as operator fusion.
For instance, many deep learning libraries such as oneDNN have optimized kernels for Convolution + ReLU, but you need
to know before executing the graph that this pattern will occur within the sequence of operations, which is, by design, not
possible within eager frameworks.
The global trend highlights the positive impact of the number of cores on the observed latencies.
In most cases, increasing the number of cores reduces the computation time across the different workload sizes.
Still, putting more cores to work does not result in monotonic latency reductions; there is always a trade-off between the workload's size and the number of resources you allocate to execute the job.
As you can see on the charts above, one very common pattern arises when using all the cores available on systems with more than one CPU (more than one socket).
The inter-socket communication introduces a significant latency overhead and results in very little improvement, or even an increased latency, overall.
Also, this inter-socket communication overhead tends to become less and less noticeable as the workload grows larger,
meaning that for large workloads, using all the available cores does benefit the overall latency.
In this domain, PyTorch (Figure 1.) and Intel TensorFlow (Figure 4.) seem to have slightly better parallelism support,
as shown for sequence lengths 384 and 512, for which using all the cores still reduces the observed latency.
Baseline: Graph frameworks latencies
This time we compare performance when using frameworks in "graph" mode, where the graph is fully known beforehand
and all the allocations and optimizations, such as graph pruning and operator fusion, can be made.
This is often referred to as "tracing" the graph and, as you can see here, the results are not that different between TorchScript (PyTorch's graph execution mode) and the TensorFlow variants.
All TensorFlow implementations seem to perform better than TorchScript when parallelization is limited (a low number of cores involved in the intra-op computations), but this does not seem to scale efficiently
as we increase the computational resources, whereas TorchScript seems better able to leverage the power of modern CPUs.
Still, the margin between all these frameworks is generally very limited.
Tuning the Memory Allocator: Can this impact the latencies observed?
One crucial component that every program dynamically allocating memory relies on is the memory allocator.
If you are familiar with C/C++ programming, this component provides the low-level bits of malloc/free or new/delete.
Most of the time you don't have to worry too much about it, and the default allocators (for instance glibc's on most Linux distributions) provide great performance out of the box.
Still, in some situations they might not provide the most efficient performance, as these default allocators are mostly designed to be "good" most of the time,
rather than fine-tuned for specific workloads or levels of parallelism.
So, what are the alternatives, and when are they more suitable than the defaults? Well, again, it depends on the context around your software.
Relevant factors include a heavy number of allocations/de-allocations causing fragmentation over time,
the specific hardware and/or architecture you're executing your software on, and finally the level of parallelism of your application.
Do you see where this is going? Deep learning, and by extension all applications doing heavy computations, are heavily multi-threaded;
that is also the case for software libraries such as PyTorch, TensorFlow and any other framework targeting Machine Learning workloads.
The default memory allocator strategies often rely on global memory pools, which require synchronization primitives to operate,
increasing the overall pressure on the system and reducing the performance of your application.
Some recent work by companies such as Google, Facebook and Microsoft provided alternative memory allocation strategies implemented in custom memory allocator libraries
that one can easily integrate directly within one's software components, or swap in via dynamic shared library preloading to change the library used for allocation/de-allocation.
Among these libraries, we can cite a few such as tcmalloc, jemalloc and mimalloc.
Throughout this blog post we will only focus on benchmarking tcmalloc and jemalloc as potential drop-in memory allocator candidates.
To be fully transparent, for the scope of the results below, we used tcmalloc as part of the gperftools package (version 2.9) available on Ubuntu distributions, and jemalloc 5.1.0-1.
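A common way to swap the allocator without recompiling anything is the shared-library preloading mentioned above; here is a sketch of how one might launch a benchmark with tcmalloc preloaded (the library path below is where Ubuntu's gperftools package typically installs it — treat it as an assumption and adjust for your system):

```python
import os
import subprocess

# Hypothetical path to tcmalloc from Ubuntu's gperftools package;
# adjust to the actual location on your machine.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

def preload_env(library_path: str) -> dict:
    """Build a child-process environment with the allocator preloaded."""
    env = dict(os.environ)
    env["LD_PRELOAD"] = library_path  # the dynamic loader resolves malloc/free here first
    return env

env = preload_env(TCMALLOC)
# Hypothetical benchmark script, launched with tcmalloc swapped in:
# subprocess.run(["python", "benchmark.py"], env=env)
print(env["LD_PRELOAD"])
```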
Memory allocator benchmarks
Again, we first compare performance for frameworks executing in an eager fashion.
This is potentially the use case where the allocator can play the biggest role: since the graph is unknown before its execution, each framework has to manage the memory required for each operation when it reaches the actual execution of the node; no planning ahead is possible.
In this context, the allocator is a major component due to all the system calls to allocate and reclaim memory.
As per the graph above, you can notice that the standard library allocator (glibc) is often behind performance-wise, but still provides reasonable performance.
The jemalloc allocator is sometimes the fastest, but only in very specific situations where the concurrency is not that high; this can be explained by the underlying structure jemalloc uses
internally, which is out of the scope of this blog, but you can read the Facebook Engineering blog if you want to know more about it.
Finally, tcmalloc seems to be the one providing the best performance overall across all the workloads benchmarked here.
Tcmalloc takes a different approach than jemalloc in the way it allocates resources; in particular, tcmalloc maintains a pool of memory segments locally for each thread, which reduces the need for global, exclusive, critical paths.
Again, for more details, I invite you to read the full blog post by the Google Abseil team.
Now, back to graph mode, where we benchmark frameworks that have an omniscient representation of the overall computation graph.
This time, knowing the underlying structure of the operator flows and the matrix shapes involved, the framework can plan and reserve the required resources beforehand.
In this context, as shown in the chart above, the difference between frameworks is very small and there is no clear winner between jemalloc and tcmalloc.
Of course, glibc is still slightly behind as a general-purpose memory allocator, but the margin is less significant than in the eager setup.
To sum it up, tuning the memory allocator can be an interesting way to grab the last milliseconds of improvement at the end of the optimization process, especially if you are already using traced computation graphs.
OpenMP
In the previous section we talked about memory management within machine learning software involving mostly CPU-bound workloads.
Such software often relies on intermediary frameworks such as PyTorch or TensorFlow for Deep Learning, which commonly abstract away all the underlying, highly parallelized, operator implementations.
Writing such highly parallel and optimized algorithms is a real engineering challenge, and it requires a very low-level understanding of all the elements actually at play
in the CPU (synchronization, memory caches, cache validity, etc.).
In this context, being able to leverage primitives to implement such powerful algorithms is very important, reducing both delivery time and computation time by a large margin
compared with implementing everything from scratch.
There are many libraries available that provide such higher-level features to accelerate the development of algorithms.
Among the most common, one can look at OpenMP, Threading Building Blocks, and C++ itself when targeting a recent version of the standard.
In the following part of this blog post, we will restrict ourselves to OpenMP, and especially to comparing the GNU implementation (open source and community-based) with the Intel OpenMP one.
The latter especially targets Intel CPUs and is optimized to provide best-in-class performance when used as a drop-in replacement for the GNU OpenMP one.
OpenMP exposes many environment variables to automatically configure the underlying resources involved in the computations,
such as the number of threads to dispatch computation to (intra-op threads), the way the system scheduler should bind each of those threads with respect to the CPU resources (threads, cores, sockets),
and some other variables that give the user further control.
Intel OpenMP exposes more of these environment variables, providing the user even more flexibility to adjust the performance of their software.
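A sketch of the kind of variables involved — the values below are purely illustrative starting points, not recommendations, and the KMP_* variables are specific to the Intel OpenMP runtime:

```python
import os

# Illustrative OpenMP / Intel OpenMP knobs; values are examples only.
omp_knobs = {
    "OMP_NUM_THREADS": "16",  # intra-op thread count
    "KMP_AFFINITY": "granularity=fine,compact,1,0",  # thread pinning (Intel OpenMP)
    "KMP_BLOCKTIME": "1",  # ms a thread spins after a parallel region before sleeping
}

# These must be in the environment before the OpenMP runtime initializes,
# i.e. before importing torch.
os.environ.update(omp_knobs)
print(os.environ["OMP_NUM_THREADS"])  # 16
```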
As stated above, OpenMP is something you can start to tweak once you have tried all the other, system-related, tuning knobs.
It can bring a final speed-up to your model with just a single environment variable to set.
Also, it is important to note that tuning the OpenMP library will only work within software that uses the OpenMP API internally.
More specifically, currently only PyTorch and TorchScript really make use of OpenMP and thus benefit from OpenMP backend tuning.
This also explains why we reported latencies only for these two frameworks.
Automatic Performance Tuning: Bayesian Optimization with Intel SigOpt
As mentioned above, many knobs can be tweaked to improve latency and throughput on Intel CPUs, but because there are so many, tuning all of them to get optimal performance can be cumbersome.
For instance, in our experiments, the following knobs were tuned:
- The number of cores: although using as many cores as you have is often a good idea, it does not always provide the best performance, as it also means more communication between the different threads. On top of that, achieving better performance with fewer cores can be very useful, as it allows running multiple instances at the same time, resulting in both better latency and throughput.
- The memory allocator: which memory allocator, out of the default malloc, Google's tcmalloc and Facebook's jemalloc, provides the best performance?
- The parallelism library: which parallelism library, out of GNU OpenMP and Intel OpenMP, provides the best performance?
- Transparent Huge Pages: does enabling Transparent Huge Pages (THP) on the system provide better performance?
- KMP block time parameter: sets the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping.
Of course, the brute force approach of trying out all the possibilities will find the best knob values for optimal performance, but
with a search space of size N x 3 x 2 x 2 x 2 = 24N, it can take a lot of time: on a machine with 80 physical cores, this means trying out at most 24 x 80 = 1920 different setups! 😱
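The arithmetic above can be checked in a couple of lines:

```python
# Knob cardinalities: cores (N options), allocator (3), parallelism
# library (2), transparent huge pages (2), KMP block time (2).
def search_space_size(num_physical_cores: int) -> int:
    return num_physical_cores * 3 * 2 * 2 * 2

print(search_space_size(80))  # 1920 setups on an 80-core machine
```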
Fortunately, Intel's SigOpt, through Bayesian optimization, allows us to make these tuning experiments both faster and more convenient to analyse, while providing performance similar to the brute force approach.
When we analyse the relative difference between the absolute best latency and what SigOpt provides, we observe that although it is often not as good as brute force (except for sequence length = 512 in this particular case),
it gives very close performance, with 8.6% being the biggest gap in this figure.
SigOpt is also very useful for analysis: it provides a lot of figures and valuable information.
First, it gives the best value it was able to find, the corresponding knobs, and the history of trials and how it improved as the trials went on, for instance with sequence length = 20:
In this specific setup, 16 cores along with the other knobs were able to provide the best results. This is very important to know, because, as mentioned before,
it means that multiple instances of the model can be run in parallel while still having the best latency for each.
It also shows that it had converged at roughly 20 trials, meaning that perhaps 25 trials instead of 40 would have been enough.
A wide range of other valuable information is available, such as Parameter Importance.
As expected, the number of cores is by far the most important parameter, but the others play a part too, and this is very experiment-dependent.
For instance, for the sequence length = 512 experiment, this was the Parameter Importance:
Here, not only was the impact of using GNU OpenMP vs Intel OpenMP larger than the impact of the allocator, but the relative importance of each knob is also more balanced than in the sequence length = 20 experiment.
Many more figures, often interactive, are available on SigOpt, such as:
- 2D experiment history, allowing you to compare knobs vs knobs or knobs vs objectives
- 3D experiment history, allowing you to do the same as the 2D experiment history with one more knob/objective
Conclusion – Accelerating Transformers for Production
In this post, we showed how the new Intel Ice Lake Xeon CPUs are suitable for running AI workloads at scale, along with the software elements you can swap and tune in order to exploit the full potential of the hardware.
All of these items are to be considered after setting up the various lower-level knobs detailed in the previous blog post, to maximize the usage of all the cores and resources.
At Hugging Face, we are on a mission to democratize state-of-the-art Machine Learning, and a critical part of our work is to make these state-of-the-art models as efficient as possible, to use less energy and memory at scale, and to be more affordable to run for companies of all sizes.
Our collaboration with Intel through the 🤗 Hardware Partner Program enables us to make advanced efficiency and optimization techniques easily available to the community, through our new 🤗 Optimum open-source library dedicated to production performance.
For companies looking to accelerate their Transformer model inference, our new 🤗 Infinity product offers a plug-and-play containerized solution, achieving down to 1 ms latency on GPU and 2 ms on Intel Xeon Ice Lake CPUs.
If you found this post interesting or useful to your work, please consider giving Optimum a star. And if this post was music to your ears, consider joining our Machine Learning Optimization team!
