Nine Rules for SIMD Acceleration of Your Rust Code (Part 1)


SIMD (Single Instruction, Multiple Data) operations have been a feature of Intel/AMD and ARM CPUs since the early 2000s. These operations enable you to, for example, add an array of eight i32s to another array of eight i32s with only one CPU operation on a single core. Using SIMD operations can greatly speed up certain tasks. If you're not using SIMD, you may not be fully using your CPU's capabilities.

Is this "Yet Another Rust and SIMD" article? Yes and no. Yes, I did apply SIMD to a programming problem and then felt compelled to write an article about it. No, I hope that this article also goes into enough depth that it can guide you through your own project. It explains the newly available SIMD capabilities and settings in Rust nightly. It includes a Rust SIMD cheatsheet. It shows how you can make your SIMD code generic without leaving safe Rust. It gets you started with tools such as Godbolt and Criterion. Finally, it introduces new cargo commands that make the process easier.


The range-set-blaze crate uses its RangeSetBlaze::from_iter method to ingest potentially long sequences of integers. When the integers are "clumpy", it can do this 30 times faster than Rust's standard HashSet::from_iter. Can we do even better if we use SIMD operations? Yes!


On clumpy integers, RangeSetBlaze::from_slice — a new method based on SIMD operations — is 7 times faster than RangeSetBlaze::from_iter. That makes it more than 200 times faster than HashSet::from_iter. (When the integers are not clumpy, it is still 2 to 3 times slower than HashSet.)

Over the course of implementing this speed-up, I learned nine rules that can help you accelerate your own projects with SIMD operations.

The rules are:

  1. Use nightly Rust and core::simd, Rust’s experimental standard SIMD module.
  2. CCC: Check, Control, and Choose your computer’s SIMD capabilities.
  3. Learn core::simd, but selectively.
  4. Brainstorm candidate algorithms.
  5. Use Godbolt and AI to understand your code’s assembly, even if you don’t know assembly language.
  6. Generalize to all types and LANES with in-lined generics, (and when that doesn’t work) macros, and (when that doesn’t work) traits.

See Part 2 for rules 7 through 9.

Rule 1: Use nightly Rust and core::simd, Rust’s experimental standard SIMD module.

Rust can access SIMD operations either via the stable core::arch module or via nightly’s core::simd module. Let’s compare them:

core::arch

  • Stable.
  • Tied to specific CPU architectures (intrinsics).

core::simd

  • Nightly.
  • Delightfully easy and portable.
  • Limits downstream users to nightly.

I decided to go with “easy”. If you decide to take the harder road, starting first with the easier path may still be worthwhile.


In either case, before we try to use SIMD operations in a larger project, let’s make sure we can get them working at all. Here are the steps:

First, create a project called simd_hello:

cargo new simd_hello
cd simd_hello

Edit src/main.rs to contain (Rust Playground):

// Tell nightly Rust to enable 'portable_simd'
#![feature(portable_simd)]
use core::simd::prelude::*;

// constant Simd structs
const LANES: usize = 32;
const THIRTEENS: Simd<u8, LANES> = Simd::from_array([13; LANES]);
const TWENTYSIXS: Simd<u8, LANES> = Simd::from_array([26; LANES]);
const ZEES: Simd<u8, LANES> = Simd::from_array([b'Z'; LANES]);

fn main() {
    // create a Simd struct from a slice of LANES bytes
    let mut data = Simd::<u8, LANES>::from_slice(b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY");

    data += THIRTEENS; // add 13 to each byte

    // compare each byte to 'Z'; where the byte is greater than 'Z', subtract 26
    let mask = data.simd_gt(ZEES); // compare each byte to 'Z'
    data = mask.select(data - TWENTYSIXS, data);

    let output = String::from_utf8_lossy(data.as_array());
    assert_eq!(output, "HELLOWORLDIDOHOPEITSALLGOINGWELL");
    println!("{}", output);
}

Next — full SIMD capabilities require the nightly version of Rust. Assuming you have Rust installed, install nightly (rustup install nightly). Make sure you have the latest nightly version (rustup update nightly). Finally, set this project to use nightly (rustup override set nightly).

You can now run the program with cargo run. The program applies ROT13 decryption to 32 bytes of upper-case letters. With SIMD, the program can decrypt all 32 bytes simultaneously.
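For contrast, here is how the same ROT13 decryption might look in scalar Rust, one byte at a time. This sketch is my own illustration (rot13_scalar is not from the original program); it is what the SIMD version replaces with a handful of vector instructions:

```rust
// Scalar ROT13 decryption: process one upper-case byte at a time.
fn rot13_scalar(input: &[u8]) -> Vec<u8> {
    input
        .iter()
        .map(|&b| {
            let shifted = b + 13; // assumes input is A-Z, so no u8 overflow
            if shifted > b'Z' { shifted - 26 } else { shifted }
        })
        .collect()
}

fn main() {
    let output = rot13_scalar(b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY");
    assert_eq!(&output, b"HELLOWORLDIDOHOPEITSALLGOINGWELL");
    println!("{}", String::from_utf8_lossy(&output));
}
```

The per-byte branch (`if shifted > b'Z'`) is exactly what the SIMD version expresses branch-free with a mask and select.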

Let’s look at each section of the program to see how it works. It starts with:

#![feature(portable_simd)]
use core::simd::prelude::*;

Rust nightly offers its extra capabilities (or “features”) only on request. The #![feature(portable_simd)] statement requests that Rust nightly make available the new experimental core::simd module. The use statement then imports the module’s most important types and traits.

In the code’s next section, we define useful constants:

const LANES: usize = 32;
const THIRTEENS: Simd<u8, LANES> = Simd::from_array([13; LANES]);
const TWENTYSIXS: Simd<u8, LANES> = Simd::from_array([26; LANES]);
const ZEES: Simd<u8, LANES> = Simd::from_array([b'Z'; LANES]);

The Simd struct is a special kind of Rust array. (It is, for example, always memory-aligned.) The constant LANES tells the length of the Simd array. The from_array constructor copies a regular Rust array to create a Simd. In this case, because we want const Simds, the arrays we construct from must also be const.

The next two lines copy our encrypted text into data and then add 13 to each letter.

let mut data = Simd::<u8, LANES>::from_slice(b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY");
data += THIRTEENS;

What if you make an error and your encrypted text isn’t exactly length LANES (32)? Sadly, the compiler won’t tell you. Instead, when you run the program, from_slice will panic. What if the encrypted text contains non-upper-case letters? In this example program, we’ll ignore that possibility.

The += operator does element-wise addition between the Simd data and the Simd THIRTEENS. It puts the result in data. Recall that debug builds of regular Rust addition check for overflows. Not so with SIMD. Rust defines SIMD arithmetic operators to always wrap. Values of type u8 wrap after 255.
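You can see the same wrapping behavior in scalar Rust by opting in explicitly with wrapping_add (a quick illustration, not part of the original program):

```rust
fn main() {
    // Scalar Rust panics on overflow in debug builds unless you opt in to wrapping.
    // 250 + 10 = 260, which wraps past 255 down to 4.
    assert_eq!(250u8.wrapping_add(10), 4);
    // SIMD arithmetic in core::simd always behaves like wrapping_add.
    println!("250u8 + 10 wraps to {}", 250u8.wrapping_add(10));
}
```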

Coincidentally, ROT13 decryption also requires wrapping, but after ‘Z’ rather than after 255. Here is one approach to coding the needed ROT13 wrapping. It subtracts 26 from any values beyond ‘Z’.

let mask = data.simd_gt(ZEES);
data = mask.select(data - TWENTYSIXS, data);

This says to find the element-wise places beyond ‘Z’. Then, subtract 26 from all values. At the places of interest, use the subtracted values. At the other places, use the original values. Does subtracting from all values and then using only some seem wasteful? With SIMD, this takes no extra computer time and avoids jumps. This strategy is, thus, efficient and common.

This system ends like so:

let output = String::from_utf8_lossy(data.as_array());
assert_eq!(output, "HELLOWORLDIDOHOPEITSALLGOINGWELL");
println!("{}", output);

Notice the .as_array() method. It safely transmutes a Simd struct into a regular Rust array reference without copying.

Surprisingly to me, this program runs fine on computers without SIMD extensions. Rust nightly compiles the code to regular (non-SIMD) instructions. But we don’t just want to run “fine”, we want to run fast. That requires us to activate our computer’s SIMD power.

Rule 2: CCC: Check, Control, and Choose your computer’s SIMD capabilities.

To make SIMD programs run faster on your machine, you must first discover which SIMD extensions your machine supports. If you have an Intel/AMD machine, you can use my simd-detect cargo command.

Run with:

rustup override set nightly
cargo install cargo-simd-detect --force
cargo simd-detect

On my machine, it outputs:

extension       width                   available       enabled
sse2            128-bit/16-bytes        true            true
avx2            256-bit/32-bytes        true            false
avx512f         512-bit/64-bytes        true            false

This says that my machine supports the sse2, avx2, and avx512f SIMD extensions. Of these, by default, Rust enables only the ubiquitous twenty-year-old sse2 extension.

The SIMD extensions form a hierarchy with avx512f above avx2 above sse2. Enabling a higher-level extension also enables the lower-level extensions.

Most Intel/AMD computers also support the ten-year-old avx2 extension. You enable it by setting an environment variable:

# For Windows Command Prompt
set RUSTFLAGS=-C target-feature=+avx2

# For Unix-like shells (like Bash)
export RUSTFLAGS="-C target-feature=+avx2"

“Force install” and run simd-detect again and you should see that avx2 is enabled.

# Force install each time to see changes to 'enabled'
cargo install cargo-simd-detect --force
cargo simd-detect
extension         width                   available       enabled
sse2            128-bit/16-bytes        true            true
avx2            256-bit/32-bytes        true            true
avx512f         512-bit/64-bytes        true            false

Alternatively, you can turn on every SIMD extension that your machine supports:

# For Windows Command Prompt
set RUSTFLAGS=-C target-cpu=native

# For Unix-like shells (like Bash)
export RUSTFLAGS="-C target-cpu=native"

On my machine this enables avx512f, a newer SIMD extension supported by some Intel computers and a few AMD computers.

You can set SIMD extensions back to their default (sse2 on Intel/AMD) with:

# For Windows Command Prompt
set RUSTFLAGS=

# For Unix-like shells (like Bash)
unset RUSTFLAGS

You may wonder why target-cpu=native isn’t Rust’s default. The problem is that binaries created using avx2 or avx512f won’t run on computers missing those SIMD extensions. So, if you are compiling only for your own use, use target-cpu=native. If, however, you are compiling for others, choose your SIMD extensions thoughtfully and let people know which SIMD extension level you are assuming.

Happily, whatever level of SIMD extension you pick, Rust’s SIMD support is so flexible you can easily change your decision later. Next, let’s learn the details of programming with SIMD in Rust.

Rule 3: Learn core::simd, but selectively.

To build with Rust’s new core::simd module you should learn selected building blocks. Here’s a cheatsheet with the structs, methods, etc., that I’ve found most useful.

Structs

  • Simd – a special, aligned, fixed-length array of SimdElement. We refer to a position in the array and the element stored at that position as a “lane”. By default, we copy Simd structs rather than reference them.
  • Mask – a special Boolean array showing inclusion/exclusion on a per-lane basis.

SimdElements

  • Floating-Point Types: f32, f64
  • Integer Types: i8, u8, i16, u16, i32, u32, i64, u64, isize, usize

Simd constructors

  • Simd::from_array – creates a Simd struct by copying a fixed-length array.
  • Simd::from_slice – creates a Simd struct by copying the first LANES elements of a slice.
  • Simd::splat – replicates a single value across all lanes of a Simd struct.
  • slice::as_simd – without copying, safely transmutes a regular slice into an aligned slice of Simd (plus unaligned leftovers).
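slice::as_simd splits a slice into an unaligned prefix, an aligned middle of Simd blocks, and a leftover suffix. As a stable-Rust analogy of that full-blocks-plus-leftovers shape (using chunks_exact, which ignores alignment and is not the real as_simd API):

```rust
fn main() {
    let data: Vec<u32> = (0..10).collect();
    // Full blocks of 4 "lanes" plus leftovers, analogous to as_simd's middle + suffix.
    let chunks = data.chunks_exact(4);
    let remainder = chunks.remainder();
    assert_eq!(chunks.len(), 2);    // two full 4-element blocks
    assert_eq!(remainder, &[8, 9]); // leftover elements
    println!("{} full blocks, {} leftovers", chunks.len(), remainder.len());
}
```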

Simd conversion

  • Simd::as_array – without copying, safely transmutes a Simd struct into a regular array reference.

Simd methods and operators

  • simd[i] – extracts the value from a lane of a Simd.
  • simd + simd – performs element-wise addition of two Simd structs. Also supported: -, *, /, % (remainder), bitwise-and, -or, -xor, -not, and shifts.
  • simd += simd – adds another Simd struct to the current one, in place. Other operators are supported, too.
  • Simd::simd_gt – compares two Simd structs, returning a Mask indicating which elements of the first are greater than those of the second. Also supported: simd_lt, simd_le, simd_ge, simd_eq, simd_ne.
  • Simd::rotate_elements_left – rotates the elements of a Simd struct to the left by a specified amount. Also: rotate_elements_right.
  • simd_swizzle!(simd, indexes) – rearranges the elements of a Simd struct based on the specified const indexes.
  • simd == simd – checks for equality between two Simd structs, returning a regular bool result.
  • Simd::reduce_and – performs a bitwise AND reduction across all lanes of a Simd struct. Also supported: reduce_or, reduce_xor, reduce_max, reduce_min, reduce_sum (but no reduce_eq).

Mask methods and operators

  • Mask::select – selects elements from two Simd structs based on a mask.
  • Mask::all – tells if the mask is all true.
  • Mask::any – tells if the mask contains any true.
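As a scalar mental model of Mask::select (my own illustration with a hypothetical select4 helper, not the real API), a mask picks per-lane between two sources:

```rust
// Scalar model of Mask::select: pick from `a` where mask is true, else from `b`.
fn select4(mask: [bool; 4], a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    std::array::from_fn(|i| if mask[i] { a[i] } else { b[i] })
}

fn main() {
    let result = select4([true, false, true, false], [1, 2, 3, 4], [10, 20, 30, 40]);
    assert_eq!(result, [1, 20, 3, 40]);
    println!("{:?}", result);
}
```

The real Mask::select does this for all lanes in one instruction, with no branches.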

All about lanes

  • Simd::LANES – a constant indicating the number of elements (lanes) in a Simd struct.
  • SupportedLaneCount – tells the allowed values of LANES. Used by generics.
  • simd.lanes – const method that tells a Simd struct’s number of lanes.

Low-level alignment, offsets, etc.

More, perhaps of interest

With these building blocks at hand, it’s time to build something.

Rule 4: Brainstorm candidate algorithms.

What do you want to speed up? You won’t know ahead of time which SIMD approach (if any) will work best. You should, therefore, create many candidate algorithms that you can then analyze (Rule 5) and benchmark (Rule 7).

I wanted to speed up range-set-blaze, a crate for manipulating sets of “clumpy” integers. I hoped that creating is_consecutive, a function to detect blocks of consecutive integers, would be useful.

Background: “Clumpy”, here, means that the number of ranges needed to represent the data is small compared to the number of input integers. For example, these 1002 input integers

100, 101, …, 489, 499, 501, 502, …, 998, 999, 999, 100, 0

ultimately become three Rust ranges:

0..=0, 100..=499, 501..=999.

(Internally, the RangeSetBlaze struct represents a set of integers as a sorted list of disjoint ranges stored in a cache-efficient BTreeMap.)

Although the input integers are allowed to be unsorted and redundant, we expect them to often be “nice”. RangeSetBlaze’s from_iter constructor already exploits this expectation by grouping up adjacent integers. For example, from_iter first turns the 1002 input integers into four ranges

with minimal, constant memory usage, independent of input size. It then sorts and merges these reduced ranges.
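A minimal sketch of that grouping idea in plain Rust (my own illustration with a hypothetical group_adjacent function, not range-set-blaze’s actual code):

```rust
// Group a sequence of integers into maximal runs of consecutive values.
fn group_adjacent(input: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for &x in input {
        match ranges.last_mut() {
            // extend the current run if x continues it (checked_add avoids overflow)
            Some((_, end)) if Some(x) == end.checked_add(1) => *end = x,
            // otherwise start a new run
            _ => ranges.push((x, x)),
        }
    }
    ranges
}

fn main() {
    let ranges = group_adjacent(&[100, 101, 102, 499, 501, 502, 0]);
    assert_eq!(ranges, vec![(100, 102), (499, 499), (501, 502), (0, 0)]);
    println!("{:?}", ranges);
}
```

On clumpy input, the output stays tiny regardless of how many integers stream in, which is what makes the later sort-and-merge cheap.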

I wondered if a new from_slice method could speed construction from array-like inputs by quickly finding (some) consecutive integers. For example, could it — with minimal, constant memory — turn the 1002 input integers into a small number of ranges?

Let’s start by writing is_consecutive in regular Rust:

pub const LANES: usize = 16;
pub fn is_consecutive_regular(chunk: &[u32; LANES]) -> bool {
    for i in 1..LANES {
        if chunk[i - 1].checked_add(1) != Some(chunk[i]) {
            return false;
        }
    }
    true
}

The algorithm just loops through the array sequentially, checking that each value is one more than its predecessor. It also avoids overflow.
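A quick usage sketch of the regular version (the function repeated for self-containedness; the checks are my additions):

```rust
pub const LANES: usize = 16;
pub fn is_consecutive_regular(chunk: &[u32; LANES]) -> bool {
    for i in 1..LANES {
        if chunk[i - 1].checked_add(1) != Some(chunk[i]) {
            return false;
        }
    }
    true
}

fn main() {
    // 100, 101, ..., 115 is consecutive; all-99s is not.
    let consecutive: [u32; LANES] = std::array::from_fn(|i| 100 + i as u32);
    assert!(is_consecutive_regular(&consecutive));
    assert!(!is_consecutive_regular(&[99; LANES]));
    // A run ending exactly at u32::MAX is handled, not wrapped.
    let near_max: [u32; LANES] = std::array::from_fn(|i| u32::MAX - 15 + i as u32);
    assert!(is_consecutive_regular(&near_max));
}
```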

Looping over the items seemed so simple that I wasn’t sure SIMD could do any better. Here was my first attempt:

Splat0

use std::simd::prelude::*;

const COMPARISON_VALUE_SPLAT0: Simd<u32, LANES> =
    Simd::from_array([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]);

pub fn is_consecutive_splat0(chunk: Simd<u32, LANES>) -> bool {
    if chunk[0].overflowing_add(LANES as u32 - 1) != (chunk[LANES - 1], false) {
        return false;
    }
    let added = chunk + COMPARISON_VALUE_SPLAT0;
    Simd::splat(added[0]) == added
}

Here is an outline of its calculations:

Source: This and all following images by the author.

It first (needlessly) checks that the first and last items are 15 apart. It then creates added by adding 15 to the 0th item, 14 to the next, etc. Finally, to see if all items in added are the same, it creates a new Simd based on added’s 0th item and then compares. Recall that splat creates a Simd struct from one value.

Splat1 & Splat2

When I mentioned the is_consecutive problem to Ben Lichtman, he independently came up with this, Splat1:

const COMPARISON_VALUE_SPLAT1: Simd<u32, LANES> =
    Simd::from_array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);

pub fn is_consecutive_splat1(chunk: Simd<u32, LANES>) -> bool {
    let subtracted = chunk - COMPARISON_VALUE_SPLAT1;
    Simd::splat(chunk[0]) == subtracted
}

Splat1 subtracts the comparison value from chunk and checks if the result is the same as the first element of chunk, splatted.

He also came up with a variation called Splat2 that splats the first element of subtracted rather than chunk. That would seemingly avoid one memory access.

I’m sure you are wondering which of these is best, but before we discuss that let’s look at two more candidates.

Swizzle

Swizzle is like Splat2 but uses simd_swizzle! instead of splat. The simd_swizzle! macro creates a new Simd by rearranging the lanes of an old Simd according to an array of indexes.

pub fn is_consecutive_sizzle(chunk: Simd<u32, LANES>) -> bool {
    let subtracted = chunk - COMPARISON_VALUE_SPLAT1;
    simd_swizzle!(subtracted, [0; LANES]) == subtracted
}

Rotate

This one is different. I had high hopes for it.

const COMPARISON_VALUE_ROTATE: Simd<u32, LANES> =
    Simd::from_array([4294967281, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]);

pub fn is_consecutive_rotate(chunk: Simd<u32, LANES>) -> bool {
    let rotated = chunk.rotate_elements_right::<1>();
    chunk - rotated == COMPARISON_VALUE_ROTATE
}

The idea is to rotate all the elements one to the right. We then subtract rotated from the original chunk. If the input is consecutive, the result should be “-15” followed by all 1’s. (Using wrapped subtraction, -15 is 4294967281u32.)
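You can verify that “-15” value with scalar wrapping subtraction (a quick aside, not part of the candidate):

```rust
fn main() {
    // In wrapping u32 arithmetic, 0 - 15 wraps around to 2^32 - 15.
    assert_eq!(0u32.wrapping_sub(15), 4294967281);
    // Equivalently, -15i32 reinterpreted as u32 gives the same bits.
    assert_eq!((-15i32) as u32, 4294967281);
    println!("-15 as wrapped u32 = {}", 0u32.wrapping_sub(15));
}
```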

Now that we have candidates, let’s start to evaluate them.

Rule 5: Use Godbolt and AI to understand your code’s assembly, even if you don’t know assembly language.

We’ll evaluate the candidates in two ways. First, in this rule, we’ll look at the assembly language generated from our code. Second, in Rule 7, we’ll benchmark the code’s speed.

The easiest way to see the generated assembly language is with the Compiler Explorer, AKA Godbolt. It works best on short bits of code that don’t use outside crates. It looks like this:

Referring to the numbers in the figure above, follow these steps to use Godbolt:

  1. Open godbolt.org with your web browser.
  2. Add a new source editor.
  3. Select Rust as your language.
  4. Paste in the code of interest. Make the functions of interest public (pub fn). Don’t include a main or unneeded functions. The tool doesn’t support external crates.
  5. Add a new compiler.
  6. Set the compiler version to nightly.
  7. Set options (for now) to -C opt-level=3 -C target-feature=+avx512f.
  8. If there are errors, look at the output.
  9. If you want to share or save the state of the tool, click “Share”.

From the image above, you can see that Splat2 and Sizzle are exactly the same, so we can remove Sizzle from consideration. If you open up a copy of my Godbolt session, you’ll also see that most of the functions compile to about the same number of assembly operations. The exceptions are Regular — which is much longer — and Splat0 — which includes the early check.

In the assembly, 512-bit registers start with ZMM. 256-bit registers start with YMM. 128-bit registers start with XMM. If you want to better understand the generated assembly, use AI tools to generate annotations. For example, here I ask Bing Chat about Splat2:

Try different compiler settings, including -C target-feature=+avx2 and then leaving target-feature completely off.

Fewer assembly operations don’t necessarily mean faster speed. Looking at the assembly does, however, give us a sanity check that the compiler is at least trying to use SIMD operations, inlining const references, etc. Also, as with Splat2 and Sizzle, it can sometimes tell us when two candidates are the same.

The range-set-blaze crate must handle integer types beyond u32. Moreover, we must pick a number of LANES, but we have no reason to think that 16 LANES is always best. To address these needs, in the next rule we’ll generalize the code.

Rule 6: Generalize to all types and LANES with in-lined generics, (and when that doesn’t work) macros, and (when that doesn’t work) traits.

Let’s first generalize Splat1 with generics.

#[inline]
pub fn is_consecutive_splat1_gen<T, const N: usize>(
    chunk: Simd<T, N>,
    comparison_value: Simd<T, N>,
) -> bool
where
    T: SimdElement + PartialEq,
    Simd<T, N>: Sub<Simd<T, N>, Output = Simd<T, N>>,
    LaneCount<N>: SupportedLaneCount,
{
    let subtracted = chunk - comparison_value;
    Simd::splat(chunk[0]) == subtracted
}

First, note the #[inline] attribute. It’s important for efficiency, and we’ll apply it to nearly every one of these small functions.

The function defined above, is_consecutive_splat1_gen, looks great except that it needs a second input, called comparison_value, that we have yet to define.

We can try to create a comparison_value_splat_gen that’s generic and const. Sadly, neither From<usize> nor the alternative T::One is const, so this doesn’t work:

// DOESN'T WORK BECAUSE From is not const
pub const fn comparison_value_splat_gen<T, const N: usize>() -> Simd<T, N>
where
    T: SimdElement + Default + From<usize> + AddAssign,
    LaneCount<N>: SupportedLaneCount,
{
    let mut arr: [T; N] = [T::from(0usize); N];
    let mut i_usize = 0;
    while i_usize < N {
        arr[i_usize] = T::from(i_usize);
        i_usize += 1;
    }
    Simd::from_array(arr)
}

Macros are the last refuge of scoundrels. So, let’s use macros:

#[macro_export]
macro_rules! define_is_consecutive_splat1 {
    ($function:ident, $type:ty) => {
        #[inline]
        pub fn $function<const N: usize>(chunk: Simd<$type, N>) -> bool
        where
            LaneCount<N>: SupportedLaneCount,
        {
            define_comparison_value_splat!(comparison_value_splat, $type);

            let subtracted = chunk - comparison_value_splat();
            Simd::splat(chunk[0]) == subtracted
        }
    };
}
#[macro_export]
macro_rules! define_comparison_value_splat {
    ($function:ident, $type:ty) => {
        pub const fn $function<const N: usize>() -> Simd<$type, N>
        where
            LaneCount<N>: SupportedLaneCount,
        {
            let mut arr: [$type; N] = [0; N];
            let mut i = 0;
            while i < N {
                arr[i] = i as $type;
                i += 1;
            }
            Simd::from_array(arr)
        }
    };
}

This lets us run on any particular element type and any number of LANES (Rust Playground):

define_is_consecutive_splat1!(is_consecutive_splat1_i32, i32);

let a: Simd<i32, 16> = black_box(Simd::from_array(array::from_fn(|i| 100 + i as i32)));
let ninety_nines: Simd<i32, 16> = black_box(Simd::from_array([99; 16]));
assert!(is_consecutive_splat1_i32(a));
assert!(!is_consecutive_splat1_i32(ninety_nines));

Sadly, this still isn’t enough for range-set-blaze. It needs to run on many element types (not just one) and (ideally) any number of LANES (not just one).

Happily, there’s a workaround that again depends on macros. It also exploits the fact that we only need to support a finite list of types, namely: i8, i16, i32, i64, isize, u8, u16, u32, u64, and usize. If you need to also (or instead) support f32 and f64, that’s fine.

The workaround defines a new trait, here called IsConsecutive. We then use a macro (that calls a macro, that calls a macro) to implement the trait on the ten types of interest.

pub trait IsConsecutive {
    fn is_consecutive<const N: usize>(chunk: Simd<Self, N>) -> bool
    where
        Self: SimdElement,
        Simd<Self, N>: Sub<Simd<Self, N>, Output = Simd<Self, N>>,
        LaneCount<N>: SupportedLaneCount;
}

macro_rules! impl_is_consecutive {
    ($type:ty) => {
        impl IsConsecutive for $type {
            #[inline] // very important
            fn is_consecutive<const N: usize>(chunk: Simd<Self, N>) -> bool
            where
                Self: SimdElement,
                Simd<Self, N>: Sub<Simd<Self, N>, Output = Simd<Self, N>>,
                LaneCount<N>: SupportedLaneCount,
            {
                define_is_consecutive_splat1!(is_consecutive_splat1, $type);
                is_consecutive_splat1(chunk)
            }
        }
    };
}

impl_is_consecutive!(i8);
impl_is_consecutive!(i16);
impl_is_consecutive!(i32);
impl_is_consecutive!(i64);
impl_is_consecutive!(isize);
impl_is_consecutive!(u8);
impl_is_consecutive!(u16);
impl_is_consecutive!(u32);
impl_is_consecutive!(u64);
impl_is_consecutive!(usize);

We can now call fully generic code (Rust Playground):

// Works on i32 and 16 lanes
let a: Simd<i32, 16> = black_box(Simd::from_array(array::from_fn(|i| 100 + i as i32)));
let ninety_nines: Simd<i32, 16> = black_box(Simd::from_array([99; 16]));

assert!(IsConsecutive::is_consecutive(a));
assert!(!IsConsecutive::is_consecutive(ninety_nines));

// Works on i8 and 64 lanes
let a: Simd<i8, 64> = black_box(Simd::from_array(array::from_fn(|i| 10 + i as i8)));
let ninety_nines: Simd<i8, 64> = black_box(Simd::from_array([99; 64]));

assert!(IsConsecutive::is_consecutive(a));
assert!(!IsConsecutive::is_consecutive(ninety_nines));

With this technique, we can create multiple candidate algorithms that are fully generic over type and LANES. Next, it’s time to benchmark and see which algorithms are fastest.


Those are the first six rules for adding SIMD code to Rust. In Part 2, we look at rules 7 to 9. These rules cover how to pick an algorithm and set LANES, how to integrate SIMD operations into your existing code, and (importantly) how to make them optional. Part 2 concludes with a discussion of when/if you should use SIMD and ideas for improving Rust’s SIMD experience. I hope to see you there.
