Migrate Apache Spark Workloads to GPUs at Scale on Amazon EMR with Project Aether



Data is the fuel of modern business, but relying on legacy CPU-based Apache Spark pipelines carries a heavy toll. They’re inherently slow, require large infrastructure footprints, and drive up cloud expenditure. Consequently, GPU-accelerated Spark is becoming a leading solution, delivering dramatically faster performance through parallel processing. This improved efficiency reduces cloud bills and saves valuable development hours.

Building on this foundation, we introduce a practical and efficient method to migrate existing CPU-based Spark workloads running on Amazon Elastic MapReduce (EMR). Project Aether is an NVIDIA tool engineered to automate this transition. It works by taking existing CPU jobs and optimizing them to run on GPU-accelerated EMR with the RAPIDS Accelerator for performance gains.

What’s Project Aether?

Figure 1. Project Aether overview showing workflow phases and services

Project Aether is a collection of microservices and processes designed to automate migration and optimization for the RAPIDS Accelerator, effectively eliminating manual friction. It aims to reduce the time needed to migrate Spark jobs from CPU to GPU through:

  • A prediction model for potential GPU speedup using recommended bootstrap configurations.
  • Out-of-the-box testing and tuning of GPU jobs in a sandbox environment.
  • Smart optimization for cost and runtime.
  • Full integration with Amazon EMR-supported workloads.

Amazon EMR Integration

Now supporting the Amazon EMR platform, Project Aether automates the management of GPU test clusters and the conversion and optimization of Spark steps. You can use the provided services to migrate existing EMR CPU Spark workloads to GPUs.

Setup and configuration

To get started, you’ll need to fulfill the following prerequisites.

  • Amazon EMR on EC2: AWS account with GPU instance quotas
  • AWS CLI: Configured with aws configure
  • Aether NGC: Request access, configure credentials with ngc config set, and follow the Aether installation instructions.

Configure Aether for EMR

Once the Aether package is installed, configure the Aether client for the EMR platform using the following commands:

# Initialize and list config
$ aether config init
$ aether config list

# Select EMR platform and region
$ aether config set core.selected_platform emr
$ aether config set platform.emr.region 

# Set required EMR S3 paths
$ aether config set platform.emr.spark_event_log_dir 
$ aether config set platform.emr.cluster.artifacts_path 
$ aether config set platform.emr.cluster.log_path 

Example Aether EMR migration workflow

The Aether CLI tool provides several modular commands for running the services. Each command displays a summary table and tracks each run in the job history database. At any point, refer to “4. Migrate: Report and advice” to view the tracked jobs. Use the --help option for more details on each aether command.

The example EMR workflow requires starting with an existing Spark step (step ID s-XXX) that ran on a CPU EMR cluster (cluster ID j-XXX). For more information on submitting steps to EMR clusters, refer to the Amazon EMR documentation.
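As a reference point, `aws emr add-steps` returns the new step IDs as a JSON `StepIds` array. A minimal sketch of extracting one, assuming a sample response with a placeholder ID:

```python
import json

# Sample response shape returned by `aws emr add-steps`
# (the step ID here is a placeholder, not a real one).
response = '{"StepIds": ["s-EXAMPLE12345"]}'
step_id = json.loads(response)["StepIds"][0]
print(step_id)
```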

The migration process is broken down into four core phases: predict, optimize, validate, and migrate.

1. Predict: Qualification

Determine a CPU Spark job’s viability for GPU acceleration and generate initial optimization recommendations. 

The qualification tool uses the QualX machine learning system’s XGBoost model to predict potential GPU speedup and compatibility based on workload characteristics derived from the CPU event log.

Input:

  • CPU event log obtained from EMR step and cluster API, or provided directly.

Output:

  • Recommended Spark configuration parameters generated by the AutoTuner.
  • Recommended GPU cluster shape with instance types and counts optimized for cost savings.
  • Aether job ID to track this job and any subsequent job runs.

Commands:

# Option 1: Use Platform IDs
$ aether qualify --platform_job_id  --cluster_id 

# Option 2: Provide event log path directly
$ aether qualify --event_log 

2. Optimize: Automatic testing and tuning

Achieve optimal performance and cost savings by testing the job on a GPU cluster and iteratively tuning the Spark configuration parameters.

Create the GPU test cluster with the Cluster service, then optimize the GPU job with the tune service, which iteratively runs submit and profile:

  1. Submit: The job submission service submits the Spark job to a GPU cluster with the required configurations.
  2. Profile: The profile service uses the profiling tool to process the GPU event logs, investigate bottlenecks, and generate new Spark configuration parameters to increase performance and/or reduce cost.

Input:

  • Recommended Spark configuration parameters from qualify output for the GPU job.
  • Recommended GPU cluster shape from qualify output to create the GPU cluster.

Output: 

  • The best GPU configuration is chosen from the run with the lowest duration among all tuning iterations.
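The submit-and-profile loop and the final selection can be sketched conceptually as follows; the `submit` and `profile` callables and the run-record shape are stand-ins for the Aether services, not their actual interfaces:

```python
def tune(initial_config, iterations, submit, profile):
    """Iteratively submit and profile a GPU job, then pick the fastest run.

    submit(config) -> metrics dict with a 'duration_s' key
    profile(metrics) -> new candidate config
    (Hypothetical shapes for illustration only.)
    """
    config, history = initial_config, []
    for _ in range(iterations):
        metrics = submit(config)
        history.append((config, metrics))
        config = profile(metrics)
    # Best configuration = run with the lowest duration among all iterations.
    return min(history, key=lambda cm: cm[1]["duration_s"])

# Toy stand-ins that replay canned durations for three iterations.
durations = iter([420.0, 305.5, 388.2])
best_config, best_metrics = tune(
    initial_config={"spark.executor.cores": 8},
    iterations=3,
    submit=lambda cfg: {"duration_s": next(durations)},
    profile=lambda m: {"spark.executor.cores": 12},
)
# best_metrics["duration_s"] == 305.5
```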

Commands:

A. Create a test EMR GPU cluster:

# Option 1: Use the recommended cluster shape ID with a default cluster configuration
$ aether cluster create --cluster_shape_id 

# Option 2: Provide a custom configuration file
$ aether cluster create --cluster_shape_id  --config_file 

B. Submit the GPU step to the cluster:

# Submit the job to the cluster using config_id and cluster_id
$ aether submit --config_id  --cluster_id 

C. Profile the GPU run to generate new recommended Spark configs:

# Profile the job using the step_id and cluster_id
$ aether profile --platform_job_id  --cluster_id 

D. Tune the job iteratively (submit + profile loop):

# Tune the job for 3 iterations
$ aether tune --aether_job_id  --cluster_id  --min_tuning_iterations 3

3. Validate: Data integrity check

Confirm the GPU job’s output integrity by ensuring its results are identical to the original CPU job.

The validate service compares key row metrics retrieved from the event logs, specifically focusing on rows read and rows written, between the best GPU run and the original CPU run.
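The row-metric comparison amounts to an equality check between the two runs. A minimal sketch, assuming a simplified metrics dict rather than Aether’s actual event-log parsing:

```python
def validate_outputs(cpu_metrics, gpu_metrics):
    """Compare rows read/written between the CPU run and the best GPU run.

    Returns (is_valid, mismatches). The metric dict shape is illustrative.
    """
    keys = ("rows_read", "rows_written")
    mismatches = {
        k: (cpu_metrics[k], gpu_metrics[k])
        for k in keys
        if cpu_metrics[k] != gpu_metrics[k]
    }
    return len(mismatches) == 0, mismatches

ok, diff = validate_outputs(
    {"rows_read": 1_000_000, "rows_written": 250_000},
    {"rows_read": 1_000_000, "rows_written": 250_000},
)
# ok is True; diff is empty
```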

Command:

# Validate the CPU and GPU job metrics
$ aether validate --aether_job_id 

4. Migrate: Report and advice

View detailed reports of the tracked jobs in the job history database, and see per-job migration recommendations with the optimal Spark configuration parameters and GPU cluster configurations.

The report service provides CLI and UI options to display:

  • Key performance indicators (KPIs): The overall speedup and total cost savings across all jobs.
  • Job list: Per-job speedup, cost savings, and migration recommendations.
  • Job details: All job run (original CPU run and GPU tuning runs) metrics and details for a job.
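Per-job speedup and cost savings reduce to simple ratios; a sketch with illustrative formulas (not necessarily Aether’s exact accounting):

```python
def job_kpis(cpu_duration_s, gpu_duration_s, cpu_cost, gpu_cost):
    """Compute speedup (x) and cost savings (%) for one migrated job."""
    speedup = cpu_duration_s / gpu_duration_s
    savings_pct = (1 - gpu_cost / cpu_cost) * 100
    return speedup, savings_pct

# Example: a 1-hour CPU job that runs in 15 minutes on GPUs at half the cost.
speedup, savings = job_kpis(3600, 900, cpu_cost=12.0, gpu_cost=6.0)
# speedup == 4.0, savings == 50.0
```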

Commands:

# List all job reports
$ aether report list

# View all job runs for a particular job
$ aether report job --aether_job_id 

# Start the Aether UI to view the reports in a browser
$ aether report ui
Figure 2. Example screenshot of Aether report UI job details
Figure 3. Example screenshot of Aether report UI GPU config details

5. Automated run

Combine all of the individual services above into a single automated Aether run command:

# Run full Aether workflow on CPU event log
$ aether run --event_log 

Conclusion

Project Aether is a powerful tool for accelerating big data processing, reducing the time and cost associated with migrating and running large-scale Apache Spark workloads on GPUs.

To try it out for large-scale migrations of Apache Spark workloads, apply for Project Aether access. To learn more about the RAPIDS plugin, see the documentation for RAPIDS Accelerator for Apache Spark.


