
Nine Rules for Accessing Cloud Files from Your Rust Code


Practical lessons from upgrading Bed-Reader, a bioinformatics library

Rust and Python reading DNA data directly from the cloud — Source: https://openai.com/dall-e-2/. All other figures by the author.

Would you like your Rust program to seamlessly access data from files in the cloud? When I say "files in the cloud," I mean data housed on web servers or in cloud storage services such as AWS S3, Azure Blob Storage, or Google Cloud Storage. The term "read", here, covers both the sequential retrieval of file contents — be they text or binary, from beginning to end — and the ability to pinpoint and extract specific sections of the file as needed.

Upgrading your program to access cloud files can reduce annoyance and complication: the annoyance of downloading to local storage and the complication of periodically checking that a local copy is up to date.

Sadly, upgrading your program to access cloud files can also increase annoyance and complication: the annoyance of URLs and credential information, and the complication of asynchronous programming.

Bed-Reader is a Python package and Rust crate for reading PLINK Bed Files, a binary format used in bioinformatics to store genotype (DNA) data. At a user's request, I recently updated Bed-Reader to optionally read data directly from cloud storage. Along the way, I learned nine rules that can help you add cloud-file support to your programs. The rules are:

  1. Use crate object_store (and, perhaps, cloud-file) to sequentially read the bytes of a cloud file.
  2. Sequentially read text lines from cloud files via two nested loops.
  3. Randomly access cloud files, even large ones, with "range" methods, while respecting server-imposed limits.
  4. Use URL strings and option strings to access HTTP, local files, AWS S3, Azure, and Google Cloud.
  5. Test via tokio::test on http and local files.

If other programs call your program — in other words, if your program offers an API (application program interface) — four additional rules apply:

6. For maximum performance, add cloud-file support to your Rust library via an async API.

7. Alternatively, for maximum convenience, add cloud-file support to your Rust library via a traditional ("synchronous") API.

8. Follow the rules of good API design in part by using hidden lines in your doc tests.

9. Include a runtime, but optionally.

Aside: To avoid wishy-washiness, I call these "rules", but they are, of course, just suggestions.

The powerful object_store crate provides full content access to files stored on http, AWS S3, Azure, Google Cloud, and local files. It is part of the Apache Arrow project and has over 2.4 million downloads.

For this article, I also created a new crate called cloud-file. It simplifies the use of the object_store crate. It wraps and focuses on a useful subset of object_store's features. You can either use it directly or pull out its code for your own use.

Let's look at an example. We'll count the lines of a cloud file by counting the number of newline characters it contains.

use cloud_file::{CloudFile, CloudFileError};
use futures_util::StreamExt; // Enables `.next()` on streams.

async fn count_lines(cloud_file: &CloudFile) -> Result<usize, CloudFileError> {
    let mut chunks = cloud_file.stream_chunks().await?;
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let cloud_file = CloudFile::new_with_options(url, options)?;
    let line_count = count_lines(&cloud_file).await?;
    println!("line_count: {line_count}");
    Ok(())
}

When we run this code, it returns:

line_count: 500

Some points of interest:

  • We use async (and, here, tokio). We'll discuss this choice more in Rules 6 and 7.
  • We turn a URL string and string options into a CloudFile instance with CloudFile::new_with_options(url, options)?. (We use ? to catch malformed URLs.)
  • We create a stream of binary chunks with cloud_file.stream_chunks().await?. This is the first place that the code tries to access the cloud file. If the file doesn't exist or we can't open it, the ? will return an error.
  • We use chunks.next().await to retrieve the file's next binary chunk. (Note the use futures_util::StreamExt;.) The next method returns None after all chunks have been retrieved.
  • What if there is a next chunk but also a problem retrieving it? We'll catch any problem with let chunk = chunk?;.
  • Finally, we use the fast bytecount crate to count newline characters.

In contrast with this cloud solution, think about how you would write a simple line counter for a local file. You might write this:

use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    let path = "examples/line_counts_local.rs";
    let reader = BufReader::new(File::open(path)?);
    let mut line_count = 0;
    for line in reader.lines() {
        let _line = line?;
        line_count += 1;
    }
    println!("line_count: {line_count}");
    Ok(())
}

Between the cloud-file version and the local-file version, three differences stand out. First, we can easily read local files as text. By default, we read cloud files as binary (but see Rule 2). Second, by default, we read local files synchronously, blocking program execution until completion. In contrast, we usually access cloud files asynchronously, allowing other parts of the program to continue running while waiting for the relatively slow network access to finish. Third, iterators such as lines() support for. However, streams such as stream_chunks() do not, so we use while let.

I mentioned earlier that you don't need to use the cloud-file wrapper and that you could use the object_store crate directly. Let's see what it looks like when we count the newlines in a cloud file using only object_store methods:

use futures_util::StreamExt; // Enables `.next()` on streams.
pub use object_store::path::Path as StorePath;
use object_store::{parse_url_opts, ObjectStore};
use std::sync::Arc;
use url::Url;

async fn count_lines(
    object_store: &Arc<Box<dyn ObjectStore>>,
    store_path: StorePath,
) -> Result<usize, anyhow::Error> {
    let mut chunks = object_store.get(&store_path).await?.into_stream();
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];

    let url = Url::parse(url)?;
    let (object_store, store_path) = parse_url_opts(&url, options)?;
    let object_store = Arc::new(object_store); // enables cloning and borrowing
    let line_count = count_lines(&object_store, store_path).await?;
    println!("line_count: {line_count}");
    Ok(())
}

You'll see the code is very similar to the cloud-file code. The differences are:

  • Instead of one CloudFile input, most methods take two inputs: an ObjectStore and a StorePath. Because ObjectStore is a non-cloneable trait, here the count_lines function specifically uses &Arc<Box<dyn ObjectStore>>. Alternatively, we could make the function generic and use &Arc<TObjectStore>.
  • Creating the ObjectStore instance, the StorePath instance, and the stream requires a few extra steps compared to creating a CloudFile instance and a stream.
  • Instead of dealing with one error type (namely, CloudFileError), multiple error types are possible, so we fall back to using the anyhow crate.

Whether you use object_store (with 2.4 million downloads) directly or indirectly via cloud-file (currently, with 124 downloads 😀) is up to you.

For the rest of this article, I'll focus on cloud-file. If you want to translate a cloud-file method into pure object_store code, look up the cloud-file method's documentation and follow the "source" link. The source is usually just a line or two.

We've seen how to sequentially read the bytes of a cloud file. Let's look next at sequentially reading its lines.

We often want to sequentially read the lines of a cloud file. Doing that with cloud-file (or object_store) requires two nested loops.

The outer loop yields binary chunks, as before, but with a key modification: we now ensure that each chunk contains only complete lines, starting from the first character of a line and ending with a newline character. In other words, chunks may consist of one or more complete lines but no partial lines. The inner loop turns the chunk into text and iterates over the resulting one or more lines.

In this example, given a cloud file and a number n, we find the line at index position n:

use cloud_file::CloudFile;
use futures::StreamExt; // Enables `.next()` on streams.
use std::str::from_utf8;

async fn nth_line(cloud_file: &CloudFile, n: usize) -> Result<String, anyhow::Error> {
    // Each binary line_chunk contains one or more lines, that is, each chunk ends with a newline.
    let mut line_chunks = cloud_file.stream_line_chunks().await?;
    let mut index_iter = 0usize..;
    while let Some(line_chunk) = line_chunks.next().await {
        let line_chunk = line_chunk?;
        let lines = from_utf8(&line_chunk)?.lines();
        for line in lines {
            let index = index_iter.next().unwrap(); // safe because we know the iterator is infinite
            if index == n {
                return Ok(line.to_string());
            }
        }
    }
    Err(anyhow::anyhow!("Not enough lines in the file"))
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let n = 4;

    let cloud_file = CloudFile::new(url)?;
    let line = nth_line(&cloud_file, n).await?;
    println!("line at index {n}: {line}");
    Ok(())
}

The code prints:

line at index 4: per4 per4 0 0 2 0.452591

Some points of interest:

  • The key method is .stream_line_chunks().
  • We must also call std::str::from_utf8 to create text. (This may return a Utf8Error.) Also, we call the .lines() method to create an iterator of lines.
  • If we want a line index, we must make it ourselves. Here we use:
let mut index_iter = 0usize..;
...
let index = index_iter.next().unwrap(); // safe because we know the iterator is infinite

Aside: Why two loops? Why doesn't cloud-file define a new stream that returns one line at a time? Because I don't know how. If anyone can figure it out, please send me a pull request with the solution!

I wish this were simpler, but I'm happy it's efficient. Let's return to simplicity by next looking at randomly accessing cloud files.

I work with a genomics file format called PLINK Bed 1.9. Files can be as large as 1 TB. Too big for web access? Not necessarily. We sometimes need only a fraction of the file. Moreover, modern cloud services (including most web servers) can efficiently retrieve regions of interest from a cloud file.

Let's look at an example. This test code uses a CloudFile method called read_range_and_file_size. It reads a *.bed file's first 3 bytes, checks that the file starts with the expected bytes, and then checks for the expected length.

#[tokio::test]
async fn check_file_signature() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/plink_sim_10s_100v_10pmiss.bed";
    let cloud_file = CloudFile::new(url)?;
    let (bytes, size) = cloud_file.read_range_and_file_size(0..3).await?;

    assert_eq!(bytes.len(), 3);
    assert_eq!(bytes[0], 0x6c);
    assert_eq!(bytes[1], 0x1b);
    assert_eq!(bytes[2], 0x01);
    assert_eq!(size, 303);
    Ok(())
}

Notice that in one web call, this method returns not only the bytes requested but also the size of the whole file. (With HTTP, for example, a range request's Content-Range response header reports the file's total size.)

Here is a list of high-level CloudFile methods and what they can retrieve. These are the ones used in this article; see the cloud-file documentation for the full list:

  • read_file_size — the size of the file, in one web call
  • read_range_and_file_size — the bytes of one region plus the size of the file, in one web call
  • read_ranges — the bytes of multiple regions

These methods can run into two problems if we ask for too much data at a time. First, our cloud service may limit the number of bytes we can retrieve in one call. Second, we may get faster results by making multiple simultaneous requests rather than single requests one at a time.

Consider this example: We want to gather statistics on the frequency of adjacent ASCII characters in a file of any size. For example, in a random sample of 10,000 adjacent character pairs, perhaps "th" appears 171 times.

Suppose our web server is happy with 10 concurrent requests but only wants us to retrieve 750 bytes per call. (8 MB would be a more typical limit.)

Thanks to Ben Lichtman (B3NNY) at the Seattle Rust Meetup for pointing me in the right direction on adding limits to async streams.

Our main function could look like this:

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://www.gutenberg.org/cache/epub/100/pg100.txt";
    let options = [("timeout", "30s")];
    let cloud_file = CloudFile::new_with_options(url, options)?;

    let seed = Some(0u64);
    let sample_count = 10_000;
    let max_chunk_bytes = 750; // 8_000_000 is a good default when chunks are larger.
    let max_concurrent_requests = 10; // 10 is a good default

    count_bigrams(
        cloud_file,
        sample_count,
        seed,
        max_concurrent_requests,
        max_chunk_bytes,
    )
    .await?;

    Ok(())
}

The count_bigrams function can start by creating a random number generator and making a call to find the size of the cloud file:

#[cfg(not(target_pointer_width = "64"))]
compile_error!("This code requires a 64-bit target architecture.");

use cloud_file::CloudFile;
use futures::pin_mut;
use futures_util::StreamExt; // Enables `.next()` on streams.
use rand::{rngs::StdRng, Rng, SeedableRng};
use std::{cmp::max, collections::HashMap, ops::Range};

async fn count_bigrams(
    cloud_file: CloudFile,
    sample_count: usize,
    seed: Option<u64>,
    max_concurrent_requests: usize,
    max_chunk_bytes: usize,
) -> Result<(), anyhow::Error> {
    // Create a random number generator
    let mut rng = if let Some(s) = seed {
        StdRng::seed_from_u64(s)
    } else {
        StdRng::from_entropy()
    };

    // Find the file size
    let file_size = cloud_file.read_file_size().await?;
    //...

Next, based on the file size, the function can create a vector of 10,000 random two-byte ranges:

    // Randomly choose the two-byte ranges to sample
    let range_samples: Vec<Range<usize>> = (0..sample_count)
        .map(|_| rng.gen_range(0..file_size - 1))
        .map(|start| start..start + 2)
        .collect();

For example, it might produce the vector [4122418..4122420, 4361192..4361194, 145726..145728, …]. But retrieving 20,000 bytes at once (we're pretending) is too much. So, we divide the vector into 27 chunks of no more than 750 bytes each. (Each bigram needs 2 bytes, so a chunk holds at most 750 / 2 = 375 ranges, and ⌈10,000 / 375⌉ = 27 chunks.)

    // Divide the ranges into chunks respecting the max_chunk_bytes limit
    const BYTES_PER_BIGRAM: usize = 2;
    let chunk_count = max(1, max_chunk_bytes / BYTES_PER_BIGRAM);
    let range_chunks = range_samples.chunks(chunk_count);

Using a little async magic, we create an iterator of future work for each of the 27 chunks and then turn that iterator into a stream. We tell the stream to make up to 10 simultaneous calls. Also, we say that out-of-order results are fine.

    // Create an iterator of future work
    let work_chunks_iterator = range_chunks.map(|chunk| {
        let cloud_file = cloud_file.clone(); // by design, clone is cheap
        async move { cloud_file.read_ranges(chunk).await }
    });

    // Create a stream of futures to run out-of-order and with constrained concurrency.
    let work_chunks_stream =
        futures_util::stream::iter(work_chunks_iterator).buffer_unordered(max_concurrent_requests);
    pin_mut!(work_chunks_stream); // The compiler says we need this

In the last section of code, we first do the work in the stream and, as we get results, tabulate. Finally, we sort and print the top results.

    // Run the futures and, as result bytes come in, tabulate.
    let mut bigram_counts = HashMap::new();
    while let Some(result) = work_chunks_stream.next().await {
        let bytes_vec = result?;
        for bytes in bytes_vec.iter() {
            let bigram = (bytes[0], bytes[1]);
            let count = bigram_counts.entry(bigram).or_insert(0);
            *count += 1;
        }
    }

    // Sort the bigrams by count and print the top 10
    let mut bigram_count_vec: Vec<(_, usize)> = bigram_counts.into_iter().collect();
    bigram_count_vec.sort_by(|a, b| b.1.cmp(&a.1));
    for (bigram, count) in bigram_count_vec.into_iter().take(10) {
        let char0 = (bigram.0 as char).escape_default();
        let char1 = (bigram.1 as char).escape_default();
        println!("Bigram ('{}{}') occurs {} times", char0, char1, count);
    }
    Ok(())
}

The output is:

Bigram ('\r\n') occurs 367 times
Bigram ('e ') occurs 221 times
Bigram (' t') occurs 184 times
Bigram ('th') occurs 171 times
Bigram ('he') occurs 158 times
Bigram ('s ') occurs 143 times
Bigram ('.\r') occurs 136 times
Bigram ('d ') occurs 133 times
Bigram (', ') occurs 127 times
Bigram (' a') occurs 121 times

The code for the Bed-Reader genomics crate uses the same technique to retrieve information from scattered DNA regions of interest. As the DNA information comes in, perhaps out of order, the code fills in the correct columns of an output array.

Aside: This method uses an iterator, a stream, and a loop. I wish it were simpler. If you can figure out a simpler way to retrieve a vector of regions while limiting the maximum chunk size and the maximum number of concurrent requests, please send me a pull request.

That covers access to files stored on an HTTP server, but what about AWS S3 and other cloud services? What about local files?

The object_store crate (and the cloud-file wrapper crate) supports specifying files either via a URL string or via structs. I recommend sticking with URL strings, but the choice is yours.

Let's consider an AWS S3 example. As you can see, AWS access requires credential information.

use cloud_file::CloudFile;
use rusoto_credential::{CredentialsError, ProfileProvider, ProvideAwsCredentials};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // get credentials from ~/.aws/credentials
    let credentials = if let Ok(provider) = ProfileProvider::new() {
        provider.credentials().await
    } else {
        Err(CredentialsError::new("No credentials found"))
    };

    let Ok(credentials) = credentials else {
        eprintln!("Skipping example because no AWS credentials found");
        return Ok(());
    };

    let url = "s3://bedreader/v1/toydata.5chrom.bed";
    let options = [
        ("aws_region", "us-west-2"),
        ("aws_access_key_id", credentials.aws_access_key_id()),
        ("aws_secret_access_key", credentials.aws_secret_access_key()),
    ];
    let cloud_file = CloudFile::new_with_options(url, options)?;

    assert_eq!(cloud_file.read_file_size().await?, 1_250_003);
    Ok(())
}

The key part is:

    let url = "s3://bedreader/v1/toydata.5chrom.bed";
    let options = [
        ("aws_region", "us-west-2"),
        ("aws_access_key_id", credentials.aws_access_key_id()),
        ("aws_secret_access_key", credentials.aws_secret_access_key()),
    ];
    let cloud_file = CloudFile::new_with_options(url, options)?;

If we want to use structs instead of URL strings, this becomes:

    use object_store::{aws::AmazonS3Builder, path::Path as StorePath};

    let s3 = AmazonS3Builder::new()
        .with_region("us-west-2")
        .with_bucket_name("bedreader")
        .with_access_key_id(credentials.aws_access_key_id())
        .with_secret_access_key(credentials.aws_secret_access_key())
        .build()?;
    let store_path = StorePath::parse("v1/toydata.5chrom.bed")?;
    let cloud_file = CloudFile::from_structs(s3, store_path);

I prefer the URL approach over structs. I find URLs slightly simpler, much more uniform across cloud services, and vastly easier for interop (with, for example, Python).

Here are example URLs for the three web services I have used:

  • HTTP: https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam
  • Local file: a file:// URL, such as one produced by cloud_file::abs_path_to_url_string
  • AWS S3: s3://bedreader/v1/toydata.5chrom.bed

Local files don't need options. For the other services, see the object_store documentation for their supported options, for example, ClientConfigKey for HTTP and AmazonS3ConfigKey for AWS S3.

Now that we can specify and read cloud files, we should create tests.

The object_store crate (and cloud-file) supports any async runtime. For testing, the Tokio runtime makes it easy to test your code on cloud files. Here is a test on an http file:

#[tokio::test]
async fn cloud_file_extension() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/plink_sim_10s_100v_10pmiss.bed";
    let mut cloud_file = CloudFile::new(url)?;
    assert_eq!(cloud_file.read_file_size().await?, 303);
    cloud_file.set_extension("fam")?;
    assert_eq!(cloud_file.read_file_size().await?, 130);
    Ok(())
}

Run this test with:

cargo test

If you don't want to hit an outside web server with your tests, you can instead test against local files as if they were in the cloud.

#[tokio::test]
async fn local_file() -> Result<(), CloudFileError> {
    use std::env;

    let apache_url = abs_path_to_url_string(
        env::var("CARGO_MANIFEST_DIR").unwrap() + "/LICENSE-APACHE",
    )?;
    let cloud_file = CloudFile::new(&apache_url)?;
    assert_eq!(cloud_file.read_file_size().await?, 9898);
    Ok(())
}

This uses the standard Rust environment variable CARGO_MANIFEST_DIR to find the full path to a text file. It then uses cloud_file::abs_path_to_url_string to correctly encode that full path into a URL.

Whether you test on http files or local files, the power of object_store means that your code should work on any cloud service, including AWS S3, Azure, and Google Cloud.

If you only need to access cloud files for your own use, you can stop reading the rules here and skip to the conclusion. If you are adding cloud access to a library (Rust crate) for others, keep reading.

If you offer a Rust crate to others, supporting cloud files offers great convenience to your users, but not without a cost. Let's look at Bed-Reader, the genomics crate to which I added cloud support.

As previously mentioned, Bed-Reader is a library for reading and writing PLINK Bed Files, a binary format used in bioinformatics to store genotype (DNA) data. Files in Bed format can be as large as a terabyte. Bed-Reader gives users fast, random access to large subsets of the data. It returns a 2-D array in the user's choice of int8, float32, or float64. Bed-Reader also gives users access to 12 pieces of metadata, six related to individuals and six related to SNPs (roughly speaking, DNA locations). The genotype data is often 100,000 times larger than the metadata.

PLINK stores genotype data and metadata. (Figure by author.)

Aside: In this context, an "API" refers to an Application Programming Interface. It is the public structs, methods, etc., provided by library code such as Bed-Reader for another program to call.

Here is some sample code using Bed-Reader's original "local file" API. This code lists the first five individual ids, the first five SNP ids, and every unique chromosome number. It then reads every genomic value in chromosome 5:

#[test]
fn lib_intro() -> Result<(), Box<dyn std::error::Error>> {
    let file_name = sample_bed_file("some_missing.bed")?;

    let mut bed = Bed::new(file_name)?;
    println!("{:?}", bed.iid()?.slice(s![..5])); // Outputs ndarray: ["iid_0", "iid_1", "iid_2", "iid_3", "iid_4"]
    println!("{:?}", bed.sid()?.slice(s![..5])); // Outputs ndarray: ["sid_0", "sid_1", "sid_2", "sid_3", "sid_4"]
    println!("{:?}", bed.chromosome()?.iter().collect::<HashSet<_>>());
    // Outputs: {"12", "10", "4", "8", "19", "21", "9", "15", "6", "16", "13", "7", "17", "18", "1", "22", "11", "2", "20", "3", "5", "14"}
    let _ = ReadOptions::builder()
        .sid_index(bed.chromosome()?.map(|elem| elem == "5"))
        .f64()
        .read(&mut bed)?;

    Ok(())
}

And here is the same code using the new cloud file API:

#[tokio::test]
async fn cloud_lib_intro() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/some_missing.bed";
    let cloud_options = [("timeout", "10s")];

    let mut bed_cloud = BedCloud::new_with_options(url, cloud_options).await?;
    println!("{:?}", bed_cloud.iid().await?.slice(s![..5])); // Outputs ndarray: ["iid_0", "iid_1", "iid_2", "iid_3", "iid_4"]
    println!("{:?}", bed_cloud.sid().await?.slice(s![..5])); // Outputs ndarray: ["sid_0", "sid_1", "sid_2", "sid_3", "sid_4"]
    println!(
        "{:?}",
        bed_cloud.chromosome().await?.iter().collect::<HashSet<_>>()
    );
    // Outputs: {"12", "10", "4", "8", "19", "21", "9", "15", "6", "16", "13", "7", "17", "18", "1", "22", "11", "2", "20", "3", "5", "14"}
    let _ = ReadOptions::builder()
        .sid_index(bed_cloud.chromosome().await?.map(|elem| elem == "5"))
        .f64()
        .read_cloud(&mut bed_cloud)
        .await?;

    Ok(())
}

When switching to cloud data, a Bed-Reader user must make these changes:

  • They must run in an async environment, here #[tokio::test].
  • They must use a new struct, BedCloud, instead of Bed. (Also, not shown, BedCloudBuilder rather than BedBuilder.)
  • They provide a URL string and optional string options rather than a local file path.
  • They must use .await in many, somewhat unpredictable, places. (Happily, the compiler gives a good error message if they miss one.)
  • The ReadOptionsBuilder gets a new method, read_cloud, to go with its previous read method.

From the library developer's point of view, adding the new BedCloud and BedCloudBuilder structs costs many lines of main and test code. In my case, 2,200 lines of new main code and 2,400 lines of new test code.

Aside: Also, see Mario Ortiz Manero's article "The bane of my existence: Supporting both async and sync code in Rust".

The benefit users get from these changes is the ability to read data from cloud files with async's high efficiency.

Is this benefit worth it? If not, there is an alternative that we'll look at next.

If adding an efficient async API seems like too much work for you or seems too confusing for your users, there is an alternative. Namely, you can offer a traditional ("synchronous") API. I do this for the Python version of Bed-Reader and for the Rust code that supports the Python version.

Aside: See Nine Rules for Writing Python Extensions in Rust: Practical Lessons from Upgrading Bed-Reader, a Python Bioinformatics Package in Towards Data Science.

Here is the Rust function that Python calls to check if a *.bed file starts with the correct file signature.

use tokio::runtime;
// ...
#[pyfn(m)]
fn check_file_cloud(location: &str, options: HashMap<&str, String>) -> Result<(), PyErr> {
    runtime::Runtime::new()?.block_on(async {
        BedCloud::new_with_options(location, options).await?;
        Ok(())
    })
}

Notice that this is not an async function. It is a normal "synchronous" function. Inside this synchronous function, Rust makes an async call:

BedCloud::new_with_options(location, options).await?;

We make the async call synchronous by wrapping it in a Tokio runtime:

use tokio::runtime;
// ...

runtime::Runtime::new()?.block_on(async {
    BedCloud::new_with_options(location, options).await?;
    Ok(())
})

Bed-Reader's Python users could previously open a local file for reading with the command open_bed(file_name_string). Now, they can also open a cloud file for reading with the same command open_bed(url_string). The only difference is the format of the string they pass in.

Here is the example from Rule 6, in Python, using the updated Python API:

with open_bed(
    "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/some_missing.bed",
    cloud_options={"timeout": "30s"},
) as bed:
    print(bed.iid[:5])
    print(bed.sid[:5])
    print(np.unique(bed.chromosome))
    val = bed.read(index=np.s_[:, bed.chromosome == "5"])
    print(val.shape)

Notice the Python API also offers a new optional parameter called cloud_options. Also, behind the scenes, a tiny bit of new code distinguishes between strings representing local files and strings representing URLs.

In Rust, you can use the same trick to make calls to object_store synchronous. Specifically, you can wrap async calls in a runtime. The benefit is a simpler interface and less library code. The cost is less efficiency compared to offering an async API.

If you decide against the "synchronous" alternative and choose to offer an async API, you'll discover a new problem: providing async examples in your documentation. We will look at that issue next.

All the rules from the article Nine Rules for Elegant Rust Library APIs: Practical Lessons from Porting Bed-Reader, a Bioinformatics Library, from Python to Rust in Towards Data Science apply. Of particular importance are these two:

Write good documentation to maintain your design honest.
Create examples that don’t embarrass you.

These suggest that we should give examples in our documentation, but how can we do that with async methods and awaits? The trick is "hidden lines" in our doc tests. For example, here is the documentation for CloudFile::read_ranges:

    /// Return the `Vec` of [`Bytes`](https://docs.rs/bytes/latest/bytes/struct.Bytes.html) from specified ranges.
    ///
    /// # Example
    /// ```
    /// use cloud_file::CloudFile;
    ///
    /// # Runtime::new().unwrap().block_on(async {
    /// let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/plink_sim_10s_100v_10pmiss.bim";
    /// let cloud_file = CloudFile::new(url)?;
    /// let bytes_vec = cloud_file.read_ranges(&[0..10, 1000..1010]).await?;
    /// assert_eq!(bytes_vec.len(), 2);
    /// assert_eq!(bytes_vec[0].as_ref(), b"1\t1:1:A:C\t");
    /// assert_eq!(bytes_vec[1].as_ref(), b":A:C\t0.0\t4");
    /// # Ok::<(), CloudFileError>(())}).unwrap();
    /// # use {tokio::runtime::Runtime, cloud_file::CloudFileError};
    /// ```

The doc test starts with ```. Within the doc test, lines starting with /// # disappear from the rendered documentation.

The hidden lines, however, will still be run by cargo test.

In my library crates, I try to include a working example with every method. If such an example seems overly complex or otherwise embarrassing, I try to fix the problem by improving the API.

Notice that in this rule and the previous Rule 7, we added a runtime to the code. Unfortunately, including a runtime can easily double the size of your user's programs, even if they don't read files from the cloud. Making this extra size optional is the topic of Rule 9.

If you follow Rule 6 and provide async methods, your users gain the freedom to choose their own runtime. Choosing a runtime like Tokio may significantly increase their compiled program's size. However, if they use no async methods, choosing a runtime becomes unnecessary, keeping the compiled program lean. This embodies the "zero cost principle", where one incurs costs only for the features one uses.

On the other hand, if you follow Rule 7 and wrap async calls inside traditional, "synchronous" methods, then you must provide a runtime. This will increase the size of the resulting program. To mitigate this cost, you should make the inclusion of any runtime optional.

Bed-Reader includes a runtime under two conditions. First, when used as a Python extension. Second, when testing the async methods. To handle the first condition, we create a Cargo feature called extension-module that pulls in the optional dependencies pyo3 and tokio. Here are the relevant sections of Cargo.toml:

[features]
extension-module = ["pyo3/extension-module", "tokio/full"]
default = []

[dependencies]
#...
pyo3 = { version = "0.20.0", features = ["extension-module"], optional = true }
tokio = { version = "1.35.0", features = ["full"], optional = true }

Also, because I’m using Maturin to create a Rust extension for Python, I include this text in pyproject.toml:

[tool.maturin]
features = ["extension-module"]

I put all of the Rust code related to extending Python in a file called python_modules.rs. It starts with this conditional compilation attribute:

#![cfg(feature = "extension-module")] // ignore file if feature not 'on'

This starting line ensures that the compiler includes the extension code only when needed.
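To see why this works, here is a self-contained sketch (with made-up function names, not Bed-Reader's actual code): items stripped by cfg are removed before name resolution, so a file full of calls into an optional crate like pyo3 still compiles when the feature, and therefore the crate, is absent.

```rust
// With the feature off, this whole item is stripped before name
// resolution, so the call to an undefined helper never causes an error.
// This is the same mechanism that lets python_modules.rs reference pyo3
// only when the `extension-module` feature is on.
#[cfg(feature = "extension-module")]
fn build_mode() -> &'static str {
    pyo3_specific_setup(); // hypothetical; only compiled with the feature on
    "python extension included"
}

#[cfg(not(feature = "extension-module"))]
fn build_mode() -> &'static str {
    "core library only"
}

fn main() {
    // A plain `rustc` build has no Cargo features enabled, so this
    // prints the fallback.
    println!("{}", build_mode());
}
```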

With the Python extension code taken care of, we turn next to providing an optional runtime for testing our async methods. I again choose Tokio as the runtime. I put the tests for the async code in their own file called tests_api_cloud.rs. To ensure that the async tests are run only when the tokio dependency feature is "on", I start the file with this line:

#![cfg(feature = "tokio")]

As per Rule 5, we should also include examples in our documentation of the async methods. These examples also function as "doc tests". The doc tests need conditional compilation attributes. Below is the documentation for the method that retrieves chromosome metadata. Notice that the example includes two hidden lines that start with:
/// # #[cfg(feature = "tokio")]

/// Chromosome of every SNP (variant)
/// [...]
///
/// # Example:
/// ```
/// use ndarray as nd;
/// use bed_reader::{BedCloud, ReadOptions};
/// use bed_reader::assert_eq_nan;
///
/// # #[cfg(feature = "tokio")] Runtime::new().unwrap().block_on(async {
/// let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/small.bed";
/// let mut bed_cloud = BedCloud::new(url).await?;
/// let chromosome = bed_cloud.chromosome().await?;
/// println!("{chromosome:?}"); // Outputs ndarray ["1", "1", "5", "Y"]
/// # Ok::<(), Box<BedErrorPlus>>(())}).unwrap();
/// # #[cfg(feature = "tokio")] use {tokio::runtime::Runtime, bed_reader::BedErrorPlus};
/// ```

In this doc test, when the tokio feature is "on", the example uses Tokio and runs four lines of code inside a Tokio runtime. When the tokio feature is "off", the code inside the #[cfg(feature = "tokio")] block disappears, effectively skipping the asynchronous operations.

When formatting the documentation, Rust includes documentation for all features by default, so we see all four lines of code.

To summarize Rule 9: By using Cargo features and conditional compilation, we can ensure that users only pay for the features that they use.

So, there you have it: nine rules for reading cloud files in your Rust program. Thanks to the power of the object_store crate, your programs can move beyond your local drive and load data from the web, AWS S3, Azure, and Google Cloud. To make this a little easier, you can also use the new cloud-file wrapping crate that I wrote for this article.

I should also mention that this article explored only a subset of object_store's features. In addition to what we've seen, the object_store crate also handles writing files and working with folders and subfolders. The cloud-file crate, on the other hand, only handles reading files. (But, hey, I'm open to pull requests.)

Should you add cloud file support to your program? It, of course, depends. Supporting cloud files offers a big convenience to your program's users. The cost is the added complexity of using/providing an async interface. The cost also includes the increased file size of runtimes like Tokio. On the other hand, I think the tools for adding such support are good, and trying them is easy, so give it a try!

Thanks for joining me on this journey into the cloud. I hope that if you decide to support cloud files, these steps will help you do it.

Please follow Carl on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.
