
Managing Your Cloud-Based Data Storage with Rclone


How to optimize data transfer across multiple object storage systems

Photo by Tom Podmore on Unsplash

As companies become increasingly dependent on cloud-based storage solutions, it is imperative that they have the appropriate tools and techniques for effective management of their big data. In previous posts (e.g., here and here) we have explored several different methods for retrieving data from cloud storage and demonstrated their effectiveness at different types of tasks. We found that the optimal tool can vary based on the specific task at hand (e.g., file format, size of the data files, data access pattern) and the metrics we wish to optimize (e.g., latency, speed, or cost). In this post, we explore one more popular tool for cloud-based storage management, sometimes referred to as "the Swiss army knife of cloud storage": the rclone command-line utility. Supporting more than 70 storage service providers, rclone offers functionality similar to vendor-specific storage management applications such as AWS CLI (for Amazon S3) and gsutil (for Google Storage). But does it perform well enough to constitute a viable alternative? Are there situations in which rclone would be the tool of choice? In the following sections we will demonstrate rclone's usage, assess its performance, and highlight its value in a particular use case: transferring data across different object storage systems.

Disclaimers

This post is not, by any means, intended to replace the official rclone documentation. Nor is it intended to be an endorsement of rclone or any of the other tools we mention. The best choice for your cloud-based data management will greatly depend on the details of your project and should be made following thorough, use-case-specific testing. Please be sure to re-evaluate the statements we make against the most up-to-date tools available at the time you are reading this.

The following command line uses rclone sync to synchronize the contents of a cloud-based object-storage path with a local directory. This example demonstrates the use of the Amazon S3 storage service but could just as easily have used a different cloud storage service.

rclone sync -P \
    --transfers 4 \
    --multi-thread-streams 4 \
    S3store:my-bucket/my_files ./my_files

The rclone command has dozens of flags for tuning its behavior. The -P flag outputs the progress of the data transfer, including the transfer rate and overall time. In the command above we included two (of the many) controls that can impact rclone's runtime performance: the transfers flag determines the maximum number of files to download concurrently, and multi-thread-streams determines the maximum number of threads to use when transferring a single file. Here we have left both at their default values (4).

Rclone's functionality relies on the appropriate definition of the rclone configuration file. Below we demonstrate the definition of the remote S3store object storage location used in the command line above.

[S3store]
type = s3
provider = AWS
access_key_id =
secret_access_key =
region = us-east-1
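Once the remote is defined, a quick sanity check is to list the contents of the bucket through it. The minimal sketch below assumes the my-bucket bucket used above is accessible with the configured credentials:

rclone lsf S3store:my-bucket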

Now that we have seen rclone in action, the question that arises is whether it provides any value over the other cloud storage management tools out there, such as the popular AWS CLI. In the next two sections we will evaluate the performance of rclone compared to some of its alternatives in two scenarios that we have explored in detail in our previous posts: 1) downloading a 2 GB file and 2) downloading hundreds of 1 MB files.

Use Case 1: Downloading a Large File

The command line below uses the AWS CLI to download a 2 GB file from Amazon S3. This is just one of many methods we evaluated in a previous post. We use the Linux time command to measure the performance.

time aws s3 cp s3://my-bucket/2GB.bin .

The reported download time amounted to roughly 26 seconds (i.e., ~79 MB/s). Keep in mind that this value was calculated on our own local PC and could vary greatly from one runtime environment to another. The equivalent rclone command appears below:

rclone sync -P S3store:my-bucket/2GB.bin .

In our setup, we found the rclone download time to be more than twice as slow as that of the standard AWS CLI. It is likely that this could be improved significantly through appropriate tuning of the rclone control flags.
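For example, here is one way such tuning might look. This is a minimal sketch, not a verified optimum: it raises the number of parallel streams used for a single large file via the multi-thread-streams flag and lowers the file-size threshold at which rclone switches to multi-threaded downloads via multi-thread-cutoff. The specific values are assumptions and should be tuned empirically for your environment.

rclone copy -P \
    --multi-thread-streams 16 \
    --multi-thread-cutoff 64M \
    S3store:my-bucket/2GB.bin .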

Use Case 2: Downloading a Large Number of Small Files

In this use case we evaluate the runtime performance of downloading 800 relatively small files of 1 MB each. In a previous blog post we discussed this use case in the context of streaming data samples to a deep-learning training workload and demonstrated the superior performance of s5cmd beast mode. In beast mode we create a file with a list of object-file operations which s5cmd performs using multiple parallel workers (256 by default). The s5cmd beast mode option is demonstrated below:

time s5cmd --run cmds.txt

The cmds.txt file contains a list of 800 lines of the form:

cp s3://my-bucket/small_files/<i>.jpg <local-path>/<i>.jpg
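For completeness, here is one possible way such a command file might be generated. The paths are hypothetical and simply mirror the pattern above:

for i in $(seq 1 800); do
  echo "cp s3://my-bucket/small_files/$i.jpg /my-local/$i.jpg"
done > cmds.txt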

The s5cmd command took 9.3 seconds on average (over ten trials).

Rclone supports functionality similar to s5cmd's beast mode via the files-from command-line option. Below we run rclone copy on our 800 files with the transfers value set to 256 to match the default concurrency settings of s5cmd.

rclone copy -P --transfers 256 --files-from files.txt S3store:my-bucket /my-local

The files.txt file contains 800 lines of the form:

small_files/<i>.jpg

The rclone copy of our 800 files took 8.5 seconds on average (over ten trials), slightly less than s5cmd.
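For reference, here is a rough sketch of how such an averaged measurement might be reproduced. The local directory is cleared between trials, since rclone would otherwise skip files that already exist at the destination:

for i in $(seq 1 10); do
  rm -rf /my-local && mkdir -p /my-local
  time rclone copy --transfers 256 --files-from files.txt \
      S3store:my-bucket /my-local
done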

We acknowledge that the results demonstrated so far may not be enough to persuade you to prefer rclone over your existing tools. In the next section we will describe a use case that highlights one of the potential advantages of rclone.

These days it is not unusual for development teams to maintain their data in more than one object store. The motivation behind this could be the need to protect against the possibility of a storage failure, or the decision to use data-processing offerings from multiple cloud service providers. For example, your solution for AI development might rely on training your models in AWS using data in Amazon S3 and running data analytics in Microsoft Azure using the same data stored in Azure Storage. Furthermore, you may want to maintain a copy of your data in a local storage infrastructure such as FlashBlade, Cloudian, or VAST. These circumstances require the ability to transfer and synchronize your data between multiple object stores in a secure, reliable, and timely fashion.

Some cloud service providers offer dedicated services for such purposes. However, these do not always address the precise needs of your project, or may not give you the level of control you desire. For example, Google Storage Transfer excels at speedy migration of all of the data within a specified storage folder, but does not (as of the time of this writing) support transferring a specific subset of files from within it.

Another option we could consider would be to apply our existing data management tools toward this purpose. The problem with this is that tools such as AWS CLI and s5cmd do not (as of the time of this writing) support specifying different access settings and security credentials for the source and target storage systems. Thus, migrating data between storage locations requires transferring it to an intermediate (temporary) location. In the command below we combine the use of s5cmd and AWS CLI to copy a file from Amazon S3 to Google Storage via system memory, using Linux piping:

s5cmd cat s3://my-bucket/file \
    | aws s3 cp --endpoint-url https://storage.googleapis.com \
      --profile gcp - s3://gs-bucket/file

While this is a legitimate, albeit clumsy, way of transferring a single file (note that it assumes an AWS CLI profile, here named gcp, configured with HMAC credentials for Google Storage), in practice we may need the ability to transfer many millions of files. To support this, we would need to add an additional layer for spawning and managing multiple parallel workers/processors. Things could get ugly pretty quickly.

Data Transfer with Rclone

In contrast to tools like AWS CLI and s5cmd, rclone enables us to specify different access settings for the source and the target. In the following rclone config file we add settings for Google Cloud Storage access:

[S3store]
type = s3
provider = AWS
access_key_id =
secret_access_key =

# Google Cloud Storage accessed through its S3-compatible API
[GSstore]
type = s3
provider = GCS
access_key_id =
secret_access_key =
endpoint = https://storage.googleapis.com

Transferring a single file between storage systems follows the same format as copying it to a local directory:

rclone copy -P S3store:my-bucket/file GSstore:gs-bucket/file

However, the real power of rclone comes from combining this feature with the files-from option described above. Rather than having to orchestrate a custom solution for parallelizing the data migration, we can transfer a long list of files using a single command:

rclone copy -P --transfers 256 --files-from files.txt \
    S3store:my-bucket GSstore:gs-bucket
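Once the migration completes, rclone can also verify that the source and target agree. Here is a minimal sketch using the rclone check command, which compares the files in the two remotes (restricted here to the same file list):

rclone check --files-from files.txt S3store:my-bucket GSstore:gs-bucket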

In practice, we can further accelerate the data migration by splitting the list of object files into smaller lists (e.g., with 10,000 files each) and running each list on a separate compute resource, as sketched below. While the precise impact of this kind of solution will vary from project to project, it can provide a significant boost to the speed and efficiency of your development.
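The sketch below illustrates the idea on a single machine; in a real setup each chunk would typically be assigned to a separate worker or instance. The chunk size and file names are assumptions:

split -l 10000 files.txt chunk_

for f in chunk_*; do
  rclone copy --transfers 256 --files-from "$f" \
      S3store:my-bucket GSstore:gs-bucket &
done
wait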

In this post we have explored cloud-based storage management using rclone and demonstrated its application to the challenge of maintaining and synchronizing data across multiple storage systems. There are undoubtedly many alternative solutions for data transfer. But there is no questioning the convenience and elegance of the rclone-based method.

This is just one of many posts we have written on the topic of maximizing the efficiency of cloud-based storage solutions. Be sure to check out some of our other posts on this important topic.
