In Part 1 of this series we spoke about creating reusable code assets that can be deployed across multiple projects. Leveraging a centralised repository of common data science steps ensures that experiments can be carried out faster and with greater confidence in the outcomes. A streamlined experimentation phase is critical to delivering value to the business as quickly as possible.
In this article I want to focus on how you can increase the rate at which you experiment. You will have tens to hundreds of ideas for different setups that you want to try, and carrying them out efficiently will greatly increase your productivity. Carrying out a full retraining when model performance decays, or exploring the inclusion of new features when they become available, are just a couple of situations where being able to iterate quickly over experiments becomes a great boon.
We Need To Talk About Notebooks (Again)
While Jupyter Notebooks are a great way to teach yourself about libraries and concepts, they can easily be misused and become a crutch that actively stands in the way of fast model development. Consider a data scientist moving onto a new project. The first step is typically to open up a new notebook and begin some exploratory data analysis: understanding what kind of data is available, computing some simple summary statistics, understanding the outcome variable and finally producing some simple visualisations of the relationship between the features and the outcome. These steps are a worthwhile endeavour, as better understanding your data is critical before you begin the experimentation process.
The issue is not with the EDA itself, but with what comes after. What usually happens is that the data scientist immediately opens a new notebook to begin writing their experiment framework, typically starting with data transformations, and often by copying code snippets across from the EDA notebook. Once the first notebook is ready it is executed, and the results are either saved locally or written to an external location. That data is then picked up by another notebook and processed further, for example by feature selection, and written back out. This process repeats itself until the experiment pipeline consists of 5-6 notebooks that must be triggered sequentially by a data scientist for a single experiment to run.
With such a manual approach to experimentation, iterating over ideas and trying out different scenarios becomes a labour-intensive task. You end up with parallelisation at the human level, where whole teams of data scientists devote themselves to running experiments by keeping local copies of the notebooks and diligently editing their code to try different setups. The results are then added to a report, and once experimentation has finished the best performing setup is picked out from all the others.
None of this is sustainable. Team members going off sick or taking holidays, running experiments overnight while hoping the notebook doesn't crash, forgetting which experimental setups you have already tried and which are still to do: these should not be worries you have when running an experiment. Thankfully there is a better way, one that lets you iterate over ideas in a structured and methodical manner at scale. All of this will greatly simplify the experimentation phase of your project and greatly decrease its time to value.
Embrace Scripting To Create Your Experimental Pipeline
The first step in accelerating your ability to experiment is to move beyond notebooks and start scripting. This should be the easiest part of the process: you simply put your code into a .py file instead of the cell blocks of a .ipynb. From there you can invoke your script from the command line, for example:
python src/main.py
if __name__ == "__main__":
    # Placeholder locations and per-step configuration for the experiment
    input_data = ""
    output_loc = ""
    dataprep_config = {}
    featureselection_config = {}
    hyperparameter_config = {}

    # Run each stage of the pipeline in sequence and persist the artefacts
    data = DataLoader().load(input_data)
    data_train, data_val = DataPrep().run(data, dataprep_config)
    features_to_keep = FeatureSelection().run(data_train, data_val, featureselection_config)
    model_hyperparameters = HyperparameterTuning().run(data_train, data_val, features_to_keep, hyperparameter_config)
    evaluation_metrics = Evaluation().run(data_train, data_val, features_to_keep, model_hyperparameters)
    ArtifactSaver(output_loc).save([data_train, data_val, features_to_keep, model_hyperparameters, evaluation_metrics])
Note that adhering to the principle of controlling your workflow by passing arguments into functions greatly simplifies the layout of your experimental pipeline. Having a script like this has already improved your ability to run experiments: you now only need a single invocation, as opposed to the stop-start nature of running multiple notebooks in sequence.
You may want to add some input arguments to this script, such as being able to point to a particular data location, or specifying where to store output artefacts. You can easily extend your script to take some command line arguments:
python src/main_with_arguments.py --input_data <data_location> --output_loc <artifact_location>
if __name__ == "__main__":
    # Read the data and artefact locations from the command line
    input_data, output_loc = parse_input_arguments()
    dataprep_config = {}
    featureselection_config = {}
    hyperparameter_config = {}

    data = DataLoader().load(input_data)
    data_train, data_val = DataPrep().run(data, dataprep_config)
    features_to_keep = FeatureSelection().run(data_train, data_val, featureselection_config)
    model_hyperparameters = HyperparameterTuning().run(data_train, data_val, features_to_keep, hyperparameter_config)
    evaluation_metrics = Evaluation().run(data_train, data_val, features_to_keep, model_hyperparameters)
    ArtifactSaver(output_loc).save([data_train, data_val, features_to_keep, model_hyperparameters, evaluation_metrics])
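The parse_input_arguments helper is not shown above; a minimal sketch of what it could look like using Python's built-in argparse module follows, with the argument names mirroring the script and everything else an assumption.
import argparse


def parse_input_arguments() -> tuple[str, str]:
    """Read the data and artefact locations from the command line."""
    parser = argparse.ArgumentParser(description="Run the experiment pipeline")
    parser.add_argument("--input_data", type=str, help="Location of the input data")
    parser.add_argument("--output_loc", type=str, help="Where to store output artefacts")
    args = parser.parse_args()
    return args.input_data, args.output_loc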
At this point you have the beginnings of a good pipeline: you can set the input and output locations and invoke your script with a single command. However, trying out new ideas is still a relatively manual endeavour; you need to go into your codebase and make changes. As previously mentioned, switching between different experiment setups should ideally be as simple as modifying the input argument to a wrapper function that controls what needs to be carried out. We can bring all of these different arguments into a single location to ensure that modifying your experimental setup becomes trivial. The simplest way of implementing this is with a configuration file.
Configure Your Experiments With a Separate File
Storing all of your relevant function arguments in a separate file comes with several advantages. Splitting the configuration from the main codebase makes it easier to try out different experimental setups: you simply edit the relevant fields with whatever your new idea is and you are ready to go. You can even swap out entire configuration files with ease. You also have complete oversight over exactly what your experimental setup was; if you maintain a separate file per experiment then you can go back to previous experiments and see exactly what was carried out.
So what does a configuration file look like, and how does it interface with the experiment pipeline script you have created? A simple implementation of a config file is to use yaml notation and set it up in the following manner:
- Top level boolean flags to turn the different parts of your pipeline on and off
- For each step in your pipeline, define what calculations you want to perform
file_locations:
  input_data: ""
  output_loc: ""

pipeline_steps:
  data_prep: True
  feature_selection: False
  hyperparameter_tuning: True
  evaluation: True

data_prep:
  nan_treatment: "drop"
  numerical_scaling: "normalize"
  categorical_encoding: "ohe"
This is a flexible and lightweight way of controlling how your experiments are run. You can then modify your script to load in this configuration and use it to control the workflow of your pipeline:
python src/main_with_config.py --config_loc <config_location>
if __name__ == "__main__":
    config_loc = parse_input_arguments()
    config = load_config(config_loc)

    data = DataLoader().load(config["file_locations"]["input_data"])

    if config["pipeline_steps"]["data_prep"]:
        data_train, data_val = DataPrep().run(data, config["data_prep"])

    if config["pipeline_steps"]["feature_selection"]:
        features_to_keep = FeatureSelection().run(
            data_train, data_val, config["feature_selection"]
        )

    if config["pipeline_steps"]["hyperparameter_tuning"]:
        model_hyperparameters = HyperparameterTuning().run(
            data_train, data_val, features_to_keep, config["hyperparameter_tuning"]
        )

    if config["pipeline_steps"]["evaluation"]:
        evaluation_metrics = Evaluation().run(
            data_train, data_val, features_to_keep, model_hyperparameters
        )

    ArtifactSaver(config["file_locations"]["output_loc"]).save(
        [data_train, data_val, features_to_keep, model_hyperparameters, evaluation_metrics]
    )
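The load_config helper referenced above is similarly straightforward; a minimal sketch, assuming the YAML layout shown earlier and that PyYAML is installed, could look like this:
import yaml


def load_config(config_loc: str) -> dict:
    """Read the experiment configuration from a YAML file into a dictionary."""
    with open(config_loc) as f:
        return yaml.safe_load(f)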
We have now completely decoupled the setup of our experiment from the code that executes it. The experimental setup we want to try is determined entirely by the configuration file, making it trivial to test new ideas. We can even control which steps we want to perform, allowing scenarios like:
- Running data preparation and feature selection only, to generate an initial processed dataset that can form the basis of more detailed experimentation on different models and their hyperparameters
Leverage Automation and Parallelism
We now have the ability to configure different experimental setups via a configuration file and launch a full end-to-end experiment with a single command line invocation. All that is left to do is scale our capability to iterate over different experiment setups as quickly as possible. The key to this is:
- Automation to programmatically modify the configuration file
- Parallel execution of experiments
Step 1) is relatively trivial. We can write a shell script, or even a secondary Python script (a sketch of each follows), whose job is to iterate over the different experimental setups that the user supplies and then launch a pipeline with each new setup.
#!/bin/bash

# Iterate over the NaN treatment options we want to test and launch a pipeline run for each
for nan_treatment in drop impute_zero impute_mean
do
    update_config_file "$nan_treatment"   # assumed helper that edits the config file
    python3 ./src/main_with_config.py --config_loc
done
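If you prefer to keep everything in Python, the same loop can be expressed as a secondary script that rewrites the relevant config field and launches each run. The sketch below assumes the YAML layout from earlier, PyYAML being available, and an illustrative config path.
import subprocess

import yaml

CONFIG_LOC = "config.yaml"  # illustrative path to the experiment config

for nan_treatment in ["drop", "impute_zero", "impute_mean"]:
    # Load the config, change the field we are experimenting with, write it back
    with open(CONFIG_LOC) as f:
        config = yaml.safe_load(f)
    config["data_prep"]["nan_treatment"] = nan_treatment
    with open(CONFIG_LOC, "w") as f:
        yaml.safe_dump(config, f)

    # Launch a full pipeline run with the updated configuration
    subprocess.run(
        ["python3", "./src/main_with_config.py", "--config_loc", CONFIG_LOC],
        check=True,
    )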
Step 2) is a more interesting proposition and is very much situation dependent. All of the experiments you run are self-contained and have no dependency on one another, which means we could theoretically launch them all at the same time. Practically, it depends on you having access to external compute, either in-house or through a cloud service provider. If that is the case then each experiment can be launched as a separate job on your compute, assuming you are entitled to use those resources. This does involve other considerations, however, such as deploying Docker images to ensure a consistent environment across experiments and determining how to embed your code within the external compute. Once this is solved you are ready to launch as many experiments as you like, limited only by the resources of your compute provider; a simplified local sketch of the same idea follows.
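As a stand-in for external compute, the sketch below shows the same principle locally: because each experiment is self-contained, separate pipeline runs can be launched in parallel processes. The config file paths are purely illustrative.
import subprocess
from concurrent.futures import ProcessPoolExecutor

# Illustrative configs, one fully specified experiment per file
CONFIG_FILES = [
    "configs/exp_drop.yaml",
    "configs/exp_impute_zero.yaml",
    "configs/exp_impute_mean.yaml",
]


def run_experiment(config_loc: str) -> int:
    """Launch a single self-contained pipeline run and return its exit code."""
    result = subprocess.run(["python3", "./src/main_with_config.py", "--config_loc", config_loc])
    return result.returncode


if __name__ == "__main__":
    # The experiments have no dependency on one another, so run them concurrently
    with ProcessPoolExecutor() as executor:
        exit_codes = list(executor.map(run_experiment, CONFIG_FILES))
    print(exit_codes)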
Embed Loggers and Experiment Trackers for Easy Oversight
Having the ability to launch hundreds of parallel experiments on external compute is a clear victory on the path to reducing the time to value of data science projects. However, abstracting away this process comes at the cost of it not being as easy to interrogate, especially if something goes wrong. The interactive nature of notebooks made it possible to execute a cell and immediately look at the result.
Tracking the progress of your pipeline can be achieved by using a logger in your experiment. You can capture key results, such as the features chosen by the selection process, or use it to signpost what is currently executing in the pipeline. If something goes wrong you can reference the log entries you have created to work out where the issue occurred, and then possibly embed more logs to better understand and resolve it.
logger.info("Splitting data into train and validation set")
df_train, df_val = create_data_split(df, method = 'random')
logger.info(f"training data size: {df_train.shape[0]}, validation data size: {df_val.shape[0]}")
logger.info(f"treating missing data via: {missing_method}")
df_train = treat_missing_data(df_train, method = missing_method)
logger.info(f"scaling numerical data via: {scale_method}")
df_train = scale_numerical_features(df_train, method = scale_method)
logger.info(f"encoding categorical data via: {encode_method}")
df_train = encode_categorical_features(df_train, method = encode_method)
logger.info(f"variety of features after encoding: {df_train.shape[1]}")
The final aspect of launching large-scale parallel experiments is finding efficient ways of analysing them to quickly find the best performing setup. Reading through event logs or opening up performance files for each experiment individually will quickly undo all of the hard work you have put into ensuring a streamlined experimental process.
The simplest thing to do is to embed an experiment tracker into your pipeline script. There is a range of first and third party tooling available that lets you set up a project space and then log the relevant performance metrics of every experimental setup you consider. These tools usually come with a configurable front end that allows users to create simple plots for comparison. This makes finding the best performing experiment a much simpler endeavour.
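As an illustration only, a tracker such as MLflow could be embedded at the end of the pipeline script along these lines; the experiment name and metric key are assumptions, and config and evaluation_metrics come from the pipeline script shown earlier.
import mlflow

mlflow.set_experiment("experiment_pipeline")  # assumed project/experiment name

with mlflow.start_run():
    # config and evaluation_metrics are produced by the pipeline script above
    mlflow.log_params(config["data_prep"])
    mlflow.log_metric("validation_auc", evaluation_metrics["auc"])  # assumed metric key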
Conclusion
In this article we have explored how to create pipelines that make the experimentation process effortless. This involved moving out of notebooks and converting your experiment process into a single script. That script is then backed by a configuration file that controls the setup of your experiment, making it trivial to try out different setups. External compute is then leveraged in order to parallelise the execution of the experiments. Finally, we spoke about using loggers and experiment trackers to maintain oversight of your experiments and more easily track their outcomes. All of this allows data scientists to greatly speed up their ability to run experiments, enabling them to reduce the time to value of their projects and deliver results to the business quicker.