Using SageMaker Managed Warm Pools
This article shares a recipe for speeding up your hyperparameter tuning with cross-validation in SageMaker Pipelines by leveraging SageMaker Managed Warm Pools. With Warm Pools, the runtime of a Tuning step with 120 sequential jobs is reduced.
Improving and evaluating the performance of a machine learning model often requires many ingredients. Hyperparameter tuning and cross-validation are two such ingredients. The first finds the best version of a model, while the second estimates how a model will generalize to unseen data. Combined, these steps introduce computing challenges, as they require training and validating a model multiple times, in parallel and/or in sequence.
In this article, you will learn:
- What Warm Pools are and how to leverage them to speed up hyperparameter tuning with cross-validation.
- How to design a production-grade SageMaker Pipeline that includes Processing, Tuning, Training, and Lambda steps.
We will consider Bayesian optimization for hyperparameter tuning, which leverages the scores of the hyperparameter combinations already tested to choose the hyperparameter set to test in the next round. We will use k-fold cross-validation to score each combination of hyperparameters, in which the splits are as follows:
The full dataset is partitioned into 𝑘 validation folds; the model is trained on 𝑘-1 folds and validated on the corresponding held-out fold. The overall score is the average of the individual validation scores obtained for each validation fold.
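The splitting and averaging described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validation_score`, `fit_and_score`) are hypothetical, the folds are contiguous and not shuffled, and `fit_and_score` stands in for whatever trains a model on the training indices and returns its validation score.

```python
from statistics import mean
from typing import Callable, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_indices(n_samples: int, k: int) -> List[Split]:
    """Return (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits: List[Split] = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, validation))
        start += size
    return splits


def cross_validation_score(
    n_samples: int,
    k: int,
    fit_and_score: Callable[[List[int], List[int]], float],
) -> float:
    """Average the validation scores returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return mean(scores)
```

In the pipeline discussed in this article, each fold would be a separate SageMaker training job rather than a local function call, but the averaging logic is the same.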
1. What are Warm Pools?
2. End-to-end SageMaker Pipeline
3. What happens inside the Tuning step?
4. What do we gain from using Warm Pools?
5. Summary
Every time a training job is launched in AWS, the provisioned instance takes roughly 3 minutes to bootstrap before the training script is executed. This startup time adds up when running multiple jobs sequentially, which is the case when performing hyperparameter tuning with a Bayesian optimization strategy. Here, dozens or even hundreds of jobs are run in sequence, resulting in a total startup time that can be on par with or even higher than the actual execution time of the scripts.
SageMaker Managed Warm Pools make it possible to retain training infrastructure for a desired number of seconds after a job completes, which saves the instance startup time for each subsequent job.
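To get a feel for how the startup time adds up, here is a back-of-the-envelope sketch. It assumes an idealized situation in which, with a Warm Pool, only the first job in a chain pays the startup cost; the function name is hypothetical and real savings depend on the keep-alive period and on how quickly consecutive jobs are submitted.

```python
def startup_overhead_minutes(sequential_jobs: int, startup_minutes: float, warm_pool: bool) -> float:
    """Total instance startup time paid across a chain of sequential jobs."""
    if sequential_jobs <= 0:
        return 0.0
    # Assumption: with a Warm Pool, only the first job pays the startup cost.
    cold_starts = 1 if warm_pool else sequential_jobs
    return cold_starts * startup_minutes
```

For example, with 120 sequential jobs and a 3-minute startup, `startup_overhead_minutes(120, 3, warm_pool=False)` gives 360 minutes of pure bootstrapping, while the idealized Warm Pool case pays only 3.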
Enabling Warm Pools is simple. You just add an extra parameter (keep_alive_period_in_seconds) when creating a training job in SageMaker:
estimator = Estimator(
    entry_point='training.py',
    keep_alive_period_in_seconds=600,
    ...
)
To learn more about SageMaker Managed Warm Pools, see the official AWS documentation.
Now that we know what Warm Pools are, in Section 2 we will dive into how to leverage them to speed up the overall runtime of a SageMaker Pipeline that includes hyperparameter tuning with cross-validation.
The following figure depicts an end-to-end SageMaker Pipeline that performs hyperparameter tuning with cross-validation.
We will create the pipeline using the SageMaker Python SDK, an open-source library that simplifies the process of training, tuning, and deploying machine learning models in AWS SageMaker. The pipeline steps in the diagram are summarized as follows:
- ProcessingStep: Data is retrieved from the source, transformed, and split into k cross-validation folds. An additional full dataset is saved for final training.
- TuningStep: This is the step we will focus on. It finds the combination of hyperparameters that achieves the best average performance across validation folds.
- LambdaStep: Fires a Lambda function that retrieves the optimal set of hyperparameters by accessing the results of the hyperparameter tuning job using Boto3.
- TrainingStep: Trains the model on the full dataset (train_full.csv) with the optimal hyperparameters.
- ModelStep: Registers the final trained model in the SageMaker Model Registry.
- TransformStep: Generates predictions using the registered model.
Detailed documentation on how to implement these steps is available in the SageMaker Developer Guide.
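As an illustration of the LambdaStep described in the list above, the handler below sketches one way to read the hyperparameters of the best training job with Boto3's `describe_training_job`. The output names mirror those used in the pipeline.py snippet later in this article; the optional `sagemaker_client` argument is an assumption added only to make the function testable without AWS access, and stripping quotes assumes the values may have been JSON-encoded when the job was created.

```python
from typing import Any, Dict, Optional

# Output names mirror the LambdaOutput names declared in the pipeline definition.
OUTPUT_NAMES = ("hyperparameter_a", "hyperparameter_b", "hyperparameter_c")


def lambda_handler(
    event: Dict[str, Any],
    context: Any,
    sagemaker_client: Optional[Any] = None,
) -> Dict[str, str]:
    """Return the hyperparameters of the best training job found by the tuner."""
    if sagemaker_client is None:
        import boto3  # bundled with the AWS Lambda Python runtime

        sagemaker_client = boto3.client("sagemaker")

    job_name = event["best_training_job_name"]
    description = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    hyperparameters = description["HyperParameters"]  # all values are strings

    # Keep only the outputs the pipeline expects; drop quotes left by JSON encoding.
    return {name: hyperparameters[name].strip('"') for name in OUTPUT_NAMES}
```

The returned dictionary keys must match the LambdaOutput names declared on the step, since that is how downstream steps reference the values.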
Let's now dig deeper into the Tuning step, which iteratively tries and cross-validates multiple hyperparameter combinations in parallel and in sequence. The solution is represented in the following diagram:
The solution relies on SageMaker Automatic Model Tuning to create and orchestrate the training jobs that test multiple hyperparameter combinations. The Automatic Model Tuning job can be launched using the HyperparameterTuner available in the SageMaker Python SDK. It creates MxN hyperparameter tuning training jobs, M of which run in parallel over N sequential rounds that progressively search for the best hyperparameters. Each of these jobs launches and monitors a set of K cross-validation jobs. At each tuning round, MxK instances in a Warm Pool are retained for reuse, so in the following rounds there is no instance startup time.
SageMaker's HyperparameterTuner already makes use of Warm Pools, as announced on the AWS News Blog. However, the cross-validation training jobs created in each tuning job (those that cross-validate a specific combination of hyperparameters) must be created and monitored manually, with Warm Pools enabled explicitly. Each hyperparameter tuning training job only finishes when all of its underlying cross-validation training jobs have completed.
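The job counts implied by M, N, and K can be made explicit with a small helper. This is just the arithmetic from the paragraphs above wrapped in a hypothetical `TuningPlan` class; the numbers in the usage note are illustrative and not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningPlan:
    parallel_jobs: int      # M: tuning jobs run in parallel in each round
    sequential_rounds: int  # N: sequential tuning rounds
    folds: int              # K: cross-validation jobs launched by each tuning job

    @property
    def tuning_jobs(self) -> int:
        """Total hyperparameter tuning training jobs (M x N)."""
        return self.parallel_jobs * self.sequential_rounds

    @property
    def cross_validation_jobs(self) -> int:
        """Total cross-validation training jobs (M x N x K)."""
        return self.tuning_jobs * self.folds

    @property
    def warm_pool_instances(self) -> int:
        """Cross-validation instances kept warm between rounds (M x K)."""
        return self.parallel_jobs * self.folds
```

For instance, `TuningPlan(parallel_jobs=2, sequential_rounds=10, folds=5)` yields 20 tuning jobs, 100 cross-validation jobs, and 10 warm instances, which is a useful number to compare against your Warm Pool service quotas.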
To bring the architecture above to life and enable Warm Pools for the training jobs, we need to create three main scripts: pipeline.py, cross_validation.py, and training.py:
- pipeline.py: Defines the SageMaker Pipeline steps described in Section 2, including SageMaker's HyperparameterTuner:
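# pipeline.py script
from sagemaker.model import Model
from sagemaker.tuner import HyperparameterTuner
from sagemaker.workflow.lambda_step import LambdaOutput, LambdaStep
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.steps import TrainingStep, TuningStep
...
# Steps 2 to 5
tuner = HyperparameterTuner(
    estimator=estimator,
    # Search space (required argument), assumed to be defined elsewhere
    hyperparameter_ranges=hyperparameter_ranges,
    # Each regex extracts a metric value from the training job logs
    metric_definitions=[
        {
            "Name": "training:score",
            "Regex": "average model training score:(.*?);"
        },
        {
            "Name": "validation:score",
            "Regex": "average model validation score:(.*?);"
        }
    ],
    # Must match one of the metric names defined above
    objective_metric_name="validation:score",
    strategy="Bayesian",
    max_jobs=max_jobs,  # M x N
    max_parallel_jobs=max_parallel_jobs  # M
)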
# Step 2 - Hyperparameter tuning with cross-validation step
step_tune = TuningStep(
    name="tuning-step",
    step_args=tuner.fit({
        "train": "",
        "validation": ""
    })
)
# Step 3 - Optimal hyperparameter retrieval step
step_lambda = LambdaStep(
    name="get-optimal-hyperparameters-step",
    lambda_func=lambda_get_optimal_hyperparameters,
    inputs={
        "best_training_job_name": step_tune.properties.BestTrainingJob.TrainingJobName,
    },
    outputs=[
        LambdaOutput(output_name="hyperparameter_a"),
        LambdaOutput(output_name="hyperparameter_b"),
        LambdaOutput(output_name="hyperparameter_c")
    ]
)
# Step 4 - Final training step
step_train = TrainingStep(
    name="final-training-step",
    step_args=estimator.fit({"train": ""})
)
model = Model(
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    ...
)
# Step 5 - Model registration step
step_model_registration = ModelStep(
    name="model-registration-step",
    step_args=model.register(...)
)
- cross_validation.py: Serves as the entry point of SageMaker's HyperparameterTuner. It launches multiple cross-validation training jobs. It is inside this script that the keep_alive_period_in_seconds parameter must be specified when calling the SageMaker Training Job API. The script computes and logs the average validation score across all validation folds. Logging the value lets the HyperparameterTuner read that metric easily (as in the code snippet above). This metric is then attached to each combination of hyperparameters.
Add a small delay of a few seconds between the calls to the SageMaker APIs that create and monitor the training jobs to prevent the "Rate Exceeded" error, as in the example:
# cross_validation.py script
import time
...
training_jobs = []
for fold_index in range(number_of_folds):
    # Create cross-validation training jobs (one per fold)
    job = train_model(
        training_data="",
        validation_data="",
        fold_index=fold_index,
        hyperparameters={
            "hyperparameter_a": "",
            "hyperparameter_b": "",
            "hyperparameter_c": ""
        })
    training_jobs.append(job)
    # Add a delay to prevent the Rate Exceeded error.
    time.sleep(5)
...
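The monitoring half of the script can follow the same pattern of spacing out API calls. The sketch below is one possible shape, not the original implementation: `wait_for_jobs` and its parameters are hypothetical, and `describe_status` is assumed to wrap something like `boto3.client("sagemaker").describe_training_job(TrainingJobName=name)["TrainingJobStatus"]`.

```python
import time
from typing import Callable, Dict, Iterable, List

# Terminal values of TrainingJobStatus; "InProgress" and "Stopping" keep polling.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(
    job_names: Iterable[str],
    describe_status: Callable[[str], str],
    poll_seconds: float = 30.0,
    api_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until every job reaches a terminal status and return the final statuses."""
    final_statuses: Dict[str, str] = {}
    pending: List[str] = list(job_names)
    while pending:
        still_running: List[str] = []
        for name in pending:
            status = describe_status(name)
            if status in TERMINAL_STATUSES:
                final_statuses[name] = status
            else:
                still_running.append(name)
            sleep(api_delay_seconds)  # space out API calls to avoid throttling
        pending = still_running
        if pending:
            sleep(poll_seconds)  # wait before the next polling pass
    return final_statuses
```

Passing `sleep` as a parameter is a design choice made here so the loop can be exercised in tests without actually waiting.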
Disable the Debugger profiler when launching your SageMaker training jobs. There are as many profiler instances as training instances, which can increase the overall cost significantly. You can do so by simply setting disable_profiler=True in the Estimator definition.
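If the cross-validation jobs are created through Boto3's `create_training_job` rather than through an Estimator, the same two settings map to `ResourceConfig.KeepAlivePeriodInSeconds` and `ProfilerConfig.DisableProfiler`. The request builder below is a sketch under assumptions: the function name, the instance type, the volume size, and the runtime limit are placeholders chosen for illustration, not values from the article.

```python
from typing import Any, Dict


def build_training_job_request(
    job_name: str,
    image_uri: str,
    role_arn: str,
    train_s3_uri: str,
    output_s3_uri: str,
    hyperparameters: Dict[str, Any],
    keep_alive_seconds: int = 600,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's sagemaker create_training_job call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image_uri, "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        # The API expects every hyperparameter value to be a string.
        "HyperParameters": {key: str(value) for key, value in hyperparameters.items()},
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # placeholder instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,            # placeholder volume size
            "KeepAlivePeriodInSeconds": keep_alive_seconds,  # enables the Warm Pool
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # placeholder limit
        "ProfilerConfig": {"DisableProfiler": True},  # avoid extra profiler cost
    }
```

The resulting dictionary would be passed as `boto3.client("sagemaker").create_training_job(**request)`.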
- training.py: Trains a model on a given input training set. The hyperparameters being cross-validated are passed as arguments to this script.
Write a general-purpose training.py script and reuse it both for training the model on the cross-validation sets and for training the final model with the optimal hyperparameters on the full training set.
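One way to make training.py general-purpose is to treat the validation channel as optional, so the same script serves both the cross-validation jobs and the final training job. The argument parsing below is a sketch under assumptions: the hyperparameter names follow the placeholders used in this article, their types and defaults are invented for illustration, and the `SM_CHANNEL_*` environment variables are the ones set by SageMaker's script-mode training containers.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hyperparameters and data locations passed to the training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments (types here are illustrative).
    parser.add_argument("--hyperparameter_a", type=float, default=0.1)
    parser.add_argument("--hyperparameter_b", type=int, default=3)
    parser.add_argument("--hyperparameter_c", type=str, default="auto")
    # Each input channel is exposed through an SM_CHANNEL_<NAME> environment variable.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))
    return parser.parse_args(argv)


def should_validate(args: argparse.Namespace) -> bool:
    """Validate only when a validation channel was provided (cross-validation jobs)."""
    return bool(args.validation)
```

With this layout, the final training job simply omits the validation channel and the script skips the validation branch.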
To manage each parallel set of cross-validation jobs, as well as to compute a final validation metric for each hyperparameter combination tested, several custom functions must be implemented inside the cross_validation.py script. This example provides good inspiration, although it does not enable Warm Pools or Lambda.
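One of those custom functions has to aggregate the per-fold scores and log them in a form the tuner can parse. A minimal sketch follows, assuming the per-fold scores have already been collected; the function names are hypothetical, while the log format and regular expressions are the ones used in the metric definitions of the pipeline.py snippet.

```python
import re
from statistics import mean
from typing import List, Sequence

# Same patterns as the metric definitions passed to the HyperparameterTuner.
TRAINING_REGEX = r"average model training score:(.*?);"
VALIDATION_REGEX = r"average model validation score:(.*?);"


def format_metric_lines(training_scores: Sequence[float], validation_scores: Sequence[float]) -> List[str]:
    """Build log lines carrying the fold-averaged scores in the expected format."""
    return [
        f"average model training score:{mean(training_scores):.6f};",
        f"average model validation score:{mean(validation_scores):.6f};",
    ]


def log_metrics(training_scores: Sequence[float], validation_scores: Sequence[float]) -> None:
    """Print the metric lines to stdout so they end up in the job's logs."""
    for line in format_metric_lines(training_scores, validation_scores):
        print(line, flush=True)


def parse_metric(pattern: str, log_line: str) -> float:
    """Extract a metric value the same way the regex in a metric definition would."""
    match = re.search(pattern, log_line)
    if match is None:
        raise ValueError("metric not found in log line")
    return float(match.group(1))
```

The trailing semicolon matters: the non-greedy `(.*?)` group captures everything up to the first `;`, so omitting it from the log line would leave the metric unmatched.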


