SIMD (Single Instruction, Multiple Data) operations have been a feature of Intel/AMD and ARM CPUs since the early 2000s. These operations enable you to, for example, add an array of eight i32 values to another array of eight i32 values with a single CPU operation on a single core. Using SIMD operations greatly speeds up certain tasks. If you're not using SIMD, you may not be fully using your CPU's capabilities.
Is this "Yet Another Rust and SIMD" article? Yes and no. Yes, I did apply SIMD to a programming problem and then felt compelled to write an article about it. No, I hope that this article also goes into enough depth to guide you through your own project. It explains the newly available SIMD capabilities and settings in Rust nightly. It includes a Rust SIMD cheatsheet. It shows how to make your SIMD code generic without leaving safe Rust. It gets you started with tools such as Godbolt and Criterion. Finally, it introduces new cargo commands that make the process easier.
The range-set-blaze crate uses its RangeSetBlaze::from_iter method to ingest potentially long sequences of integers. When the integers are "clumpy", it can do this 30 times faster than Rust's standard HashSet::from_iter. Can we do even better if we use SIMD operations? Yes!
On clumpy integers, RangeSetBlaze::from_slice — a new method based on SIMD operations — is 7 times faster than RangeSetBlaze::from_iter. That makes it more than 200 times faster than HashSet::from_iter. (When the integers are not clumpy, it is still 2 to 3 times slower than HashSet.)
Over the course of implementing this speedup, I learned nine rules that can help you speed up your projects with SIMD operations.
The rules are:
- Use nightly Rust and core::simd, Rust's experimental standard SIMD module.
- CCC: Check, Control, and Choose your computer's SIMD capabilities.
- Learn core::simd, but selectively.
- Brainstorm candidate algorithms.
- Use Godbolt and AI to understand your code's assembly, even if you don't know assembly language.
- Generalize to all types and LANES with in-lined generics, (and when that doesn't work) macros, and (when that doesn't work) traits.
See Part 2 for the remaining three rules.
Rule 1: Use nightly Rust and core::simd, Rust's experimental standard SIMD module.
Rust can access SIMD operations either via the stable core::arch module or via nightly's core::simd module. Let's compare them:
core::arch
- Stable.
- Machine-specific, and most intrinsics require unsafe code.
core::simd
- Nightly.
- Delightfully easy and portable.
- Limits downstream users to nightly.
I decided to go with "easy". If you decide to take the harder road, starting first with the easier path may still be worthwhile.
In either case, before we try to use SIMD operations in a larger project, let's make sure we can get them working at all. Here are the steps:
First, create a project called simd_hello:
cargo new simd_hello
cd simd_hello
Edit src/main.rs to contain (Rust Playground):
// Tell nightly Rust to enable 'portable_simd'
#![feature(portable_simd)]
use core::simd::prelude::*;

// constant Simd structs
const LANES: usize = 32;
const THIRTEENS: Simd<u8, LANES> = Simd::from_array([13; LANES]);
const TWENTYSIXS: Simd<u8, LANES> = Simd::from_array([26; LANES]);
const ZEES: Simd<u8, LANES> = Simd::from_array([b'Z'; LANES]);

fn main() {
    // create a Simd struct from a slice of LANES bytes
    let mut data = Simd::<u8, LANES>::from_slice(b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY");

    data += THIRTEENS; // add 13 to each byte

    // compare each byte to 'Z'; where the byte is greater than 'Z', subtract 26
    let mask = data.simd_gt(ZEES);
    data = mask.select(data - TWENTYSIXS, data);

    let output = String::from_utf8_lossy(data.as_array());
    assert_eq!(output, "HELLOWORLDIDOHOPEITSALLGOINGWELL");
    println!("{}", output);
}
Next — full SIMD capabilities require the nightly version of Rust. Assuming you have Rust installed, install nightly (rustup install nightly). Make sure you have the latest nightly version (rustup update nightly). Finally, set this project to use nightly (rustup override set nightly).
You can now run the program with cargo run. The program applies ROT13 decryption to 32 bytes of upper-case letters. With SIMD, the program can decrypt all 32 bytes simultaneously.
Let's look at each section of the program to see how it works. It starts with:
#![feature(portable_simd)]
use core::simd::prelude::*;
Rust nightly offers its extra capabilities (or "features") only on request. The #![feature(portable_simd)] statement requests that Rust nightly make available the new experimental core::simd module. The use statement then imports the module's most important types and traits.
In the code's next section, we define useful constants:
const LANES: usize = 32;
const THIRTEENS: Simd<u8, LANES> = Simd::from_array([13; LANES]);
const TWENTYSIXS: Simd<u8, LANES> = Simd::from_array([26; LANES]);
const ZEES: Simd<u8, LANES> = Simd::from_array([b'Z'; LANES]);
The Simd struct is a special kind of Rust array. (It is, for example, always memory-aligned.) The constant LANES tells the length of the Simd array. The from_array constructor copies a regular Rust array to create a Simd. In this case, because we want const Simd's, the arrays we construct from must also be const.
The next two lines copy our encrypted text into data and then add 13 to each letter.
let mut data = Simd::<u8, LANES>::from_slice(b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY");
data += THIRTEENS;
What if you make an error and your encrypted text isn't exactly length LANES (32)? Sadly, the compiler won't tell you. Instead, when you run the program, from_slice will panic. What if the encrypted text contains non-upper-case letters? In this example program, we'll ignore that possibility.
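If you want to catch that mistake yourself, a hypothetical guard (not part of the original program) could check the length before constructing the Simd:
// Hypothetical guard: verify the slice length before calling from_slice,
// which panics at runtime when the slice is too short.
let bytes: &[u8] = b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY";
assert_eq!(bytes.len(), LANES);
let data = Simd::<u8, LANES>::from_slice(bytes);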
The += operator does element-wise addition between the Simd data and the Simd THIRTEENS. It puts the result in data. Recall that debug builds of regular Rust addition check for overflows. Not so with SIMD. Rust defines SIMD arithmetic operators to always wrap. Values of type u8 wrap after 255.
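Here is a tiny sketch (separate from the ROT13 program) that demonstrates the wrapping behavior; it assumes the same nightly portable_simd feature and prelude import:
// SIMD addition wraps silently, even in debug builds:
let a = Simd::<u8, 4>::splat(250);
let b = Simd::<u8, 4>::splat(10);
assert_eq!((a + b).to_array(), [4; 4]); // 250 + 10 wraps past 255 to 4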
Coincidentally, ROT13 decryption also requires wrapping, but after 'Z' rather than after 255. Here is one approach to coding the needed ROT13 wrapping. It subtracts 26 from any values beyond 'Z'.
let mask = data.simd_gt(ZEES);
data = mask.select(data - TWENTYSIXS, data);
This says to find the element-wise places beyond 'Z'. Then, subtract 26 from all values. At the places of interest, use the subtracted values. At the other places, use the original values. Does subtracting from all values and then using only some seem wasteful? With SIMD, this takes no extra computer time and avoids jumps. This strategy is, thus, efficient and common.
The program ends like so:
let output = String::from_utf8_lossy(data.as_array());
assert_eq!(output, "HELLOWORLDIDOHOPEITSALLGOINGWELL");
println!("{}", output);
Notice the .as_array() method. It safely transmutes a Simd struct into a regular Rust array reference without copying.
Surprisingly to me, this program runs fine on computers without SIMD extensions. Rust nightly compiles the code to regular (non-SIMD) instructions. But we don't just want to run "fine", we want to run fast. That requires us to turn on our computer's SIMD power.
Rule 2: CCC: Check, Control, and Choose your computer's SIMD capabilities.
To make SIMD programs run faster on your machine, you must first discover which SIMD extensions your machine supports. If you have an Intel/AMD machine, you can use my simd-detect cargo command.
Run with:
rustup override set nightly
cargo install cargo-simd-detect --force
cargo simd-detect
On my machine, it outputs:
extension width available enabled
sse2 128-bit/16-bytes true true
avx2 256-bit/32-bytes true false
avx512f 512-bit/64-bytes true false
This says that my machine supports the sse2, avx2, and avx512f SIMD extensions. Of these, by default, Rust enables only the ubiquitous twenty-year-old sse2 extension.
The SIMD extensions form a hierarchy, with avx512f above avx2 above sse2. Enabling a higher-level extension also enables the lower-level extensions.
Most Intel/AMD computers also support the ten-year-old avx2 extension. You enable it by setting an environment variable:
# For Windows Command Prompt
set RUSTFLAGS=-C target-feature=+avx2
# For Unix-like shells (like Bash)
export RUSTFLAGS="-C target-feature=+avx2"
"Force install" and run simd-detect again and you should see that avx2 is enabled.
# Force install each time to see changes to 'enabled'
cargo install cargo-simd-detect --force
cargo simd-detect
extension width available enabled
sse2 128-bit/16-bytes true true
avx2 256-bit/32-bytes true true
avx512f 512-bit/64-bytes true false
Alternatively, you can turn on every SIMD extension that your machine supports:
# For Windows Command Prompt
set RUSTFLAGS=-C target-cpu=native
# For Unix-like shells (like Bash)
export RUSTFLAGS="-C target-cpu=native"
On my machine this enables avx512f, a newer SIMD extension supported by some Intel computers and a few AMD computers.
You can set SIMD extensions back to their default (sse2 on Intel/AMD) with:
# For Windows Command Prompt
set RUSTFLAGS=
# For Unix-like shells (like Bash)
unset RUSTFLAGS
You may wonder why target-cpu=native isn't Rust's default. The problem is that binaries created using avx2 or avx512f won't run on computers missing those SIMD extensions. So, if you are compiling only for your own use, use target-cpu=native. If, however, you are compiling for others, choose your SIMD extensions thoughtfully and let people know which SIMD extension level you are assuming.
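One small, hedged way to confirm what a particular build assumes is to print the compile-time target features. This sketch uses only the standard cfg! macro, so it works on stable and nightly alike:
// Shows which SIMD target features this binary was compiled with.
// The values reflect RUSTFLAGS such as -C target-feature=+avx2 or -C target-cpu=native.
fn main() {
    println!("sse2:    {}", cfg!(target_feature = "sse2"));
    println!("avx2:    {}", cfg!(target_feature = "avx2"));
    println!("avx512f: {}", cfg!(target_feature = "avx512f"));
}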
Happily, whatever level of SIMD extension you choose, Rust's SIMD support is so flexible that you can easily change your decision later. Let's next learn the details of programming with SIMD in Rust.
Rule 3: Learn core::simd, but selectively.
To build with Rust's new core::simd module, you should learn selected building blocks. Here's a cheatsheet with the structs, methods, etc., that I've found most useful.
Structs
- Simd – a special, aligned, fixed-length array of SimdElement. We refer to a position in the array and the element stored at that position as a "lane". By default, we copy Simd structs rather than reference them.
- Mask – a special Boolean array showing inclusion/exclusion on a per-lane basis.
SimdElement
- SimdElement – the marker trait for the types that can be the elements of a Simd (the primitive integer and floating-point types).
Simd constructors
- Simd::from_array – creates a Simd struct by copying a fixed-length array.
- Simd::from_slice – creates a Simd struct by copying the first LANES elements of a slice.
- Simd::splat – replicates a single value across all lanes of a Simd struct.
- slice::as_simd – without copying, safely transmutes a regular slice into an aligned slice of Simd (plus unaligned leftovers). See the short sketch right after this list.
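Here is a minimal sketch of slice::as_simd (it assumes the nightly portable_simd feature): the middle part is an aligned slice of Simd chunks, while the prefix and suffix hold the unaligned leftovers.
// Split a slice into (unaligned prefix, aligned Simd chunks, unaligned suffix).
let v: Vec<u32> = (0..100).collect();
let (prefix, middle, suffix) = v.as_simd::<16>();
assert_eq!(prefix.len() + middle.len() * 16 + suffix.len(), v.len());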
Simd conversion
- Simd::as_array – without copying, safely transmutes a Simd struct into a regular array reference.
Simd methods and operators
- simd[i] – extracts the value from a lane of a Simd.
- simd + simd – performs element-wise addition of two Simd structs. Also supported: -, *, /, % (remainder), bitwise-and, -or, -xor, -not, and shifts.
- simd += simd – adds another Simd struct to the current one, in place. Other operators are supported, too.
- Simd::simd_gt – compares two Simd structs, returning a Mask indicating which elements of the first are greater than those of the second. Also supported: simd_lt, simd_le, simd_ge, simd_eq, simd_ne.
- Simd::rotate_elements_left – rotates the elements of a Simd struct to the left by a specified amount. Also, rotate_elements_right.
- simd_swizzle!(simd, indexes) – rearranges the elements of a Simd struct based on the specified const indexes.
- simd == simd – checks for equality between two Simd structs, returning a regular bool result.
- Simd::reduce_and – performs a bitwise AND reduction across all lanes of a Simd struct. Also supported: reduce_or, reduce_xor, reduce_max, reduce_min, reduce_sum (but no reduce_eq).
Mask methods and operators (a short example follows this list)
- Mask::select – selects elements from two Simd structs based on a mask.
- Mask::all – tells if the mask is all true.
- Mask::any – tells if the mask contains any true.
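Here is the promised short example (a sketch assuming use core::simd::prelude::* under nightly): it builds a Mask with simd_gt, queries it, and blends two Simd structs with select.
// Compare lanes, query the mask, and blend two Simd structs with select.
let data = Simd::<u8, 4>::from_array([b'X', b'Y', b'Z', b'A']);
let mask = data.simd_gt(Simd::splat(b'Y'));   // [false, false, true, false]
assert!(mask.any());
assert!(!mask.all());
let clipped = mask.select(Simd::splat(b'Y'), data); // lanes above 'Y' become 'Y'
assert_eq!(clipped.to_array(), [b'X', b'Y', b'Y', b'A']);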
All about lanes
- Simd::LANES – a constant indicating the number of elements (lanes) in a Simd struct.
- SupportedLaneCount – tells the allowed values of LANES. Used by generics.
- simd.lanes – const method that tells a Simd struct's number of lanes.
Low-level alignment, offsets, etc.
More, perhaps of interest
With these building blocks in hand, it's time to build something.
Rule 4: Brainstorm candidate algorithms.
What do you want to speed up? You won't know ahead of time which SIMD approach (if any) will work best. You should, therefore, create many candidate algorithms that you can then analyze (Rule 5) and benchmark (Rule 7).
I wanted to speed up range-set-blaze, a crate for manipulating sets of "clumpy" integers. I hoped that creating is_consecutive, a function to detect blocks of consecutive integers, would be useful.
Background: "Clumpy", here, means that the number of ranges needed to represent the data is small compared to the number of input integers. For example, these 1002 input integers
100, 101, …, 498, 499, 501, 502, …, 998, 999, 999, 100, 0
ultimately become three Rust ranges:
0..=0, 100..=499, 501..=999.
(Internally, the struct represents a set of integers as a sorted list of disjoint ranges stored in a cache-efficient BTreeMap.)
Although the input integers are allowed to be unsorted and redundant, we expect them to often be "nice". RangeSetBlaze's from_iter constructor already exploits this expectation by grouping adjacent integers as it goes. For example, from_iter first turns the 1002 input integers into four ranges with minimal, constant memory usage, independent of input size. It then sorts and merges these reduced ranges.
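As a small illustration of from_iter on clumpy input (the printed form below assumes the crate's usual range Display format and may differ slightly by version):
use range_set_blaze::RangeSetBlaze;

// Six "clumpy" integers collapse into two disjoint, sorted ranges.
let set = RangeSetBlaze::from_iter([100u32, 101, 102, 0, 1, 2]);
println!("{set}"); // expected: 0..=2, 100..=102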
I wondered if a new from_slice method could speed up construction from array-like inputs by quickly finding (some) consecutive integers. For example, could it — with minimal, constant memory — turn the 1002 input integers into four ranges even faster than from_iter does?
Let’s start by writing is_consecutive
with regular Rust:
pub const LANES: usize = 16;

pub fn is_consecutive_regular(chunk: &[u32; LANES]) -> bool {
    for i in 1..LANES {
        if chunk[i - 1].checked_add(1) != Some(chunk[i]) {
            return false;
        }
    }
    true
}
The algorithm just loops through the array sequentially, checking that each value is one more than its predecessor. It also avoids overflow.
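A quick sanity check of the regular version might look like this (a test sketch, not from the original article):
// 100, 101, ..., 115 is consecutive; sixteen 5's are not.
let consecutive: [u32; LANES] = core::array::from_fn(|i| 100 + i as u32);
let not_consecutive = [5u32; LANES];
assert!(is_consecutive_regular(&consecutive));
assert!(!is_consecutive_regular(&not_consecutive));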
Looping over the items seemed so easy that I wasn't sure SIMD could do any better. Here was my first attempt:
Splat0
use std::simd::prelude::*;
const COMPARISON_VALUE_SPLAT0: Simd<u32, LANES> =
    Simd::from_array([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]);

pub fn is_consecutive_splat0(chunk: Simd<u32, LANES>) -> bool {
    if chunk[0].overflowing_add(LANES as u32 - 1) != (chunk[LANES - 1], false) {
        return false;
    }
    let added = chunk + COMPARISON_VALUE_SPLAT0;
    Simd::splat(added[0]) == added
}
Here is an outline of its calculations:
It first (needlessly) checks that the first and last items are 15 apart. It then creates added by adding 15 to the 0th item, 14 to the next, and so on. Finally, to see if all items in added are the same, it creates a new Simd based on added's 0th item and then compares. Recall that splat creates a Simd struct from one value.
Splat1 & Splat2
When I mentioned the is_consecutive problem to Ben Lichtman, he independently came up with this, Splat1:
const COMPARISON_VALUE_SPLAT1: Simd<u32, LANES> =
    Simd::from_array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);

pub fn is_consecutive_splat1(chunk: Simd<u32, LANES>) -> bool {
    let subtracted = chunk - COMPARISON_VALUE_SPLAT1;
    Simd::splat(chunk[0]) == subtracted
}
Splat1 subtracts the comparison value from chunk and checks whether the result is the same as the first element of chunk, splatted.

He also came up with a variation called Splat2 that splats the first element of subtracted rather than of chunk. That would seemingly avoid one memory access.
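The article doesn't list Splat2's code; based on that description, it presumably looks something like this sketch:
// A sketch of Splat2, inferred from the description: splat the first element
// of `subtracted` rather than of `chunk`.
pub fn is_consecutive_splat2(chunk: Simd<u32, LANES>) -> bool {
    let subtracted = chunk - COMPARISON_VALUE_SPLAT1;
    Simd::splat(subtracted[0]) == subtracted
}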
I'm sure you are wondering which of these is best, but before we discuss that, let's look at two more candidates.
Swizzle
Swizzle is like Splat2 but uses simd_swizzle! instead of splat. The simd_swizzle! macro creates a new Simd by rearranging the lanes of an old Simd according to an array of indexes.
pub fn is_consecutive_sizzle(chunk: Simd<u32, LANES>) -> bool {
    let subtracted = chunk - COMPARISON_VALUE_SPLAT1;
    simd_swizzle!(subtracted, [0; LANES]) == subtracted
}
Rotate
This one is different. I had high hopes for it.
const COMPARISON_VALUE_ROTATE: Simd<u32, LANES> =
    Simd::from_array([4294967281, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]);

pub fn is_consecutive_rotate(chunk: Simd<u32, LANES>) -> bool {
    let rotated = chunk.rotate_elements_right::<1>();
    chunk - rotated == COMPARISON_VALUE_ROTATE
}
The idea is to rotate all the elements one to the right. We then subtract rotated from the original chunk. If the input is consecutive, the result should be "-15" followed by all 1's. (Using wrapped subtraction, -15 is 4294967281u32.)
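You can convince yourself of that constant with a one-line check: with input 100, 101, …, 115, the first lane of the difference is 100 minus 115.
// Wrapping subtraction: 100u32 - 115 wraps around to 4294967281 (i.e., -15).
assert_eq!(100u32.wrapping_sub(115), 4_294_967_281);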

Now that we have candidates, let's start to evaluate them.
Rule 5: Use Godbolt and AI to understand your code's assembly, even if you don't know assembly language.
We'll evaluate the candidates in two ways. First, in this rule, we'll look at the assembly language generated from our code. Second, in Rule 7, we'll benchmark the code's speed.
The easiest way to see the generated assembly language is with the Compiler Explorer, AKA Godbolt. It works best on short bits of code that don't use outside crates. It looks like this:

Referring to the numbers in the figure above, follow these steps to use Godbolt:
- Open godbolt.org with your web browser.
- Add a new source editor.
- Select Rust as your language.
- Paste in the code of interest. Make the functions of interest public (pub fn). Don't include a main function or unneeded functions. The tool doesn't support external crates.
- Add a new compiler.
- Set the compiler version to nightly.
- Set options (for now) to -C opt-level=3 -C target-feature=+avx512f.
- If there are errors, look at the output.
- If you want to share or save the state of the tool, click "Share".
From the image above, you can see that Splat2 and Sizzle are exactly the same, so we can remove Sizzle from consideration. If you open up a copy of my Godbolt session, you'll also see that most of the functions compile to about the same number of assembly operations. The exceptions are Regular — which is much longer — and Splat0 — which includes the early check.
In the assembly, 512-bit registers start with ZMM. 256-bit registers start with YMM. 128-bit registers start with XMM. If you want to better understand the generated assembly, use AI tools to generate annotations. For example, here I ask Bing Chat about Splat2:

Try different compiler settings, including -C target-feature=+avx2 and then leaving target-feature off completely.
Fewer assembly operations don't necessarily mean faster speed. Looking at the assembly does, however, give us a sanity check that the compiler is at least trying to use SIMD operations, inlining const references, etc. Also, as with Splat2 and Sizzle, it can sometimes tell us when two candidates are the same.
The range-set-blaze crate must handle integer types beyond u32. Moreover, we must pick a number of LANES, but we have no reason to think that 16 LANES is always best. To address these needs, in the next rule we'll generalize the code.
Rule 6: Generalize to all types and LANES with in-lined generics, (and when that doesn't work) macros, and (when that doesn't work) traits.
Let’s first generalize Splat1 with generics.
#[inline]
pub fn is_consecutive_splat1_gen<T, const N: usize>(
    chunk: Simd<T, N>,
    comparison_value: Simd<T, N>,
) -> bool
where
    T: SimdElement + PartialEq,
    Simd<T, N>: Sub<Simd<T, N>, Output = Simd<T, N>>,
    LaneCount<N>: SupportedLaneCount,
{
    let subtracted = chunk - comparison_value;
    Simd::splat(chunk[0]) == subtracted
}
First, note the #[inline] attribute. It's important for efficiency, and we'll apply it to just about every one of these small functions.
The function defined above, is_consecutive_splat1_gen, looks great except that it needs a second input, called comparison_value, that we have yet to define.
We could try to create a comparison_value_splat_gen that is generic and const. Sadly, neither From<usize> nor the alternative T::One is const, so this doesn't work:
// DOESN'T WORK BECAUSE From<usize> is not const
pub const fn comparison_value_splat_gen<T, const N: usize>() -> Simd<T, N>
where
    T: SimdElement + Default + From<usize> + AddAssign,
    LaneCount<N>: SupportedLaneCount,
{
    let mut arr: [T; N] = [T::from(0usize); N];
    let mut i_usize = 0;
    while i_usize < N {
        arr[i_usize] = T::from(i_usize);
        i_usize += 1;
    }
    Simd::from_array(arr)
}
Macros are the last refuge of scoundrels. So, let’s use macros:
#[macro_export]
macro_rules! define_is_consecutive_splat1 {
    ($function:ident, $type:ty) => {
        #[inline]
        pub fn $function<const N: usize>(chunk: Simd<$type, N>) -> bool
        where
            LaneCount<N>: SupportedLaneCount,
        {
            define_comparison_value_splat!(comparison_value_splat, $type);

            let subtracted = chunk - comparison_value_splat();
            Simd::splat(chunk[0]) == subtracted
        }
    };
}
#[macro_export]
macro_rules! define_comparison_value_splat {
    ($function:ident, $type:ty) => {
        pub const fn $function<const N: usize>() -> Simd<$type, N>
        where
            LaneCount<N>: SupportedLaneCount,
        {
            let mut arr: [$type; N] = [0; N];
            let mut i = 0;
            while i < N {
                arr[i] = i as $type;
                i += 1;
            }
            Simd::from_array(arr)
        }
    };
}
This lets us run on any particular element type and on all numbers of LANES (Rust Playground):
define_is_consecutive_splat1!(is_consecutive_splat1_i32, i32);

let a: Simd<i32, 16> = black_box(Simd::from_array(array::from_fn(|i| 100 + i as i32)));
let ninety_nines: Simd<i32, 16> = black_box(Simd::from_array([99; 16]));

assert!(is_consecutive_splat1_i32(a));
assert!(!is_consecutive_splat1_i32(ninety_nines));
Sadly, this still isn't enough for range-set-blaze. It needs to run on all element types (not just one) and (ideally) on all numbers of LANES (not just one).
Happily, there's a workaround, and it again depends on macros. It also exploits the fact that we only need to support a finite list of element types, namely: i8, i16, i32, i64, isize, u8, u16, u32, u64, and usize. If you need to also (or instead) support f32 and f64, that's fine.
The workaround defines a new trait, here called IsConsecutive. We then use a macro (that calls a macro, that calls a macro) to implement the trait on the ten types of interest.
pub trait IsConsecutive {
    fn is_consecutive<const N: usize>(chunk: Simd<Self, N>) -> bool
    where
        Self: SimdElement,
        Simd<Self, N>: Sub<Simd<Self, N>, Output = Simd<Self, N>>,
        LaneCount<N>: SupportedLaneCount;
}

macro_rules! impl_is_consecutive {
    ($type:ty) => {
        impl IsConsecutive for $type {
            #[inline] // very important
            fn is_consecutive<const N: usize>(chunk: Simd<Self, N>) -> bool
            where
                Self: SimdElement,
                Simd<Self, N>: Sub<Simd<Self, N>, Output = Simd<Self, N>>,
                LaneCount<N>: SupportedLaneCount,
            {
                define_is_consecutive_splat1!(is_consecutive_splat1, $type);
                is_consecutive_splat1(chunk)
            }
        }
    };
}
impl_is_consecutive!(i8);
impl_is_consecutive!(i16);
impl_is_consecutive!(i32);
impl_is_consecutive!(i64);
impl_is_consecutive!(isize);
impl_is_consecutive!(u8);
impl_is_consecutive!(u16);
impl_is_consecutive!(u32);
impl_is_consecutive!(u64);
impl_is_consecutive!(usize);
We can now call fully generic code (Rust Playground):
// Works on i32 and 16 lanes
let a: Simd<i32, 16> = black_box(Simd::from_array(array::from_fn(|i| 100 + i as i32)));
let ninety_nines: Simd<i32, 16> = black_box(Simd::from_array([99; 16]));

assert!(IsConsecutive::is_consecutive(a));
assert!(!IsConsecutive::is_consecutive(ninety_nines));

// Works on i8 and 64 lanes
let a: Simd<i8, 64> = black_box(Simd::from_array(array::from_fn(|i| 10 + i as i8)));
let ninety_nines: Simd<i8, 64> = black_box(Simd::from_array([99; 64]));

assert!(IsConsecutive::is_consecutive(a));
assert!(!IsConsecutive::is_consecutive(ninety_nines));
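If you prefer a free function over calling the trait method directly, a thin generic wrapper (a convenience sketch, not required by the trait) also works:
// A thin, fully generic wrapper over the trait method.
#[inline]
pub fn is_consecutive<T, const N: usize>(chunk: Simd<T, N>) -> bool
where
    T: SimdElement + IsConsecutive,
    Simd<T, N>: Sub<Simd<T, N>, Output = Simd<T, N>>,
    LaneCount<N>: SupportedLaneCount,
{
    IsConsecutive::is_consecutive(chunk)
}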
With this technique, we can create multiple candidate algorithms that are fully generic over type and LANES. Next, it's time to benchmark and see which algorithms are fastest.
Those are the first six rules for adding SIMD code to Rust. In Part 2, we look at Rules 7 to 9. Those rules cover how to pick an algorithm and set LANES, how to integrate SIMD operations into your existing code, and (importantly) how to make them optional. Part 2 concludes with a discussion of when/if you should use SIMD and ideas for improving Rust's SIMD experience. I hope to see you there.