Automate Machine Learning Workflow with Continuous Integration
As a data scientist, you might be responsible for improving the model currently in production. After spending months fine-tuning it, you discover a model with better accuracy than the original.
Excited by your breakthrough, you create a pull request to merge your model into the main branch.
Unfortunately, because of the many changes, your team takes over a week to evaluate and analyze them, which ultimately impedes project progress.
Moreover, after deploying the model, you discover unexpected behaviors caused by code errors, costing the company money.
In hindsight, thoroughly testing the code and model before merging and deploying them would have prevented these problems and saved both time and money.
Continuous Integration (CI) offers a simple solution for this issue.
CI is the practice of continuously merging and testing code changes in a shared repository. In a machine learning project, CI is very useful for several reasons:
- Early error detection: CI facilitates the early identification of errors by automatically testing any code changes made, enabling timely problem detection during the development phase.
- Reproducibility: CI helps ensure reproducibility by establishing clear and consistent testing procedures, making it easier to replicate machine learning project results.
- Faster feedback: By providing clear metrics and parameters, CI enables faster feedback and decision-making, freeing up reviewer time for more critical tasks.
This article will show you how to create a CI pipeline for a machine learning project.
Feel free to play with and fork the source code of this article here:
The approach to building a CI pipeline for a machine learning project can vary depending on each company's workflow. In this project, we will build a CI pipeline around one of the most common workflows:
- Data scientists make changes to the code, creating a new model locally.
- Data scientists push the new model to remote storage.
- Data scientists create a pull request for the changes.
- A CI pipeline is triggered to test the code and model.
- If all tests pass, the changes are merged into the main branch.
Let's walk through an example based on this workflow.
Suppose experiment C performs exceptionally well after trying out various processing techniques and ML models. As a result, we aim to merge its code and model into the main branch.
To accomplish this, we need to perform the following steps:
- Version the inputs and outputs of the experiment.
- Upload the model and data to remote storage.
- Create test files to test the code and model.
- Create a GitHub workflow.
Now, let's explore each of these steps in detail.
Version inputs and outputs of an experiment
We'll use DVC to version the inputs and outputs of an experiment pipeline, including the code, data, and model.
The pipeline is defined based on the file locations in the project:
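The exact layout isn't reproduced here, but reconstructed from the paths referenced throughout this article, the project looks roughly like this:

```
.
├── .dvc/config              # remote storage configuration
├── .github/workflows/
│   └── run_test.yaml        # CI workflow
├── conf/
├── data/
│   ├── raw/                 # input data (tracked by DVC)
│   └── intermediate/        # processed data
├── model/
│   └── svm.pkl              # trained model
├── src/
│   ├── process_data.py
│   ├── train.py
│   └── evaluate.py
├── tests/                   # test files
├── dvclive/                 # metrics from the evaluate stage
├── dvc.yaml                 # pipeline definition
├── dvc.lock
├── params.yaml              # parameters
└── requirements.txt
```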
We'll describe the stages of the pipeline and the data dependencies between them in the dvc.yaml file:
stages:
  process:
    cmd: python src/process_data.py
    deps:
      - data/raw
      - src/process_data.py
    params:
      - process
      - data
    outs:
      - data/intermediate
  train:
    cmd: python src/train.py
    deps:
      - data/intermediate
      - src/train.py
    params:
      - data
      - model
      - train
    outs:
      - model/svm.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - model
      - data/intermediate
      - src/evaluate.py
    params:
      - data
      - model
    metrics:
      - dvclive/metrics.json
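The evaluate stage writes its metrics to dvclive/metrics.json, which DVC reads as the pipeline's metrics file. As a rough, self-contained sketch of what src/evaluate.py might do (the real script loads model/svm.pkl and the held-out split from data/intermediate; here both are replaced with a tiny synthetic stand-in):

```python
# Hypothetical sketch of src/evaluate.py: score a model and write
# dvclive/metrics.json, the metrics file declared in dvc.yaml.
import json
import os

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for loading model/svm.pkl and data/intermediate:
# a small linearly separable dataset and a freshly trained SVM.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = SVC(kernel="linear").fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Write the metrics where the evaluate stage's `metrics` entry expects them.
os.makedirs("dvclive", exist_ok=True)
with open("dvclive/metrics.json", "w") as f:
    json.dump({"accuracy": accuracy}, f)

print(f"The model's accuracy is {accuracy:.2f}")
```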
To run the experiment pipeline defined in dvc.yaml, type the following command in your terminal:
dvc exp run
We'll get the following output:
'data/raw.dvc' didn't change, skipping
Running stage 'process':
> python src/process_data.py
Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> python src/evaluate.py
The model's accuracy is 0.65
Updating lock file 'dvc.lock'
Ran experiment(s): drear-cusp
Experiment results have been applied to your workspace.
To promote an experiment to a Git branch run:
dvc exp branch
The run will automatically generate the dvc.lock file, which stores the versions of the data, code, and the dependencies between them. Using the same versions of the inputs and outputs ensures that the same experiment can be reproduced in the future.
schema: '2.0'
stages:
  process:
    cmd: python src/process_data.py
    deps:
      - path: data/raw
        md5: 84a0e37242f885ea418b9953761d35de.dir
        size: 84199
        nfiles: 2
      - path: src/process_data.py
        md5: 8c10093c63780b397c4b5ebed46c1154
        size: 1157
    params:
      params.yaml:
        data:
          raw: data/raw/winequality-red.csv
          intermediate: data/intermediate
        process:
          feature: quality
          test_size: 0.2
    outs:
      - path: data/intermediate
        md5: 3377ebd11434a04b64fe3ca5cb3cc455.dir
        size: 194875
        nfiles: 4
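The parameters recorded in the lock file (feature: quality, test_size: 0.2) suggest what src/process_data.py does: separate the target column from the raw wine-quality data and split it into train and test sets. A minimal, self-contained sketch (using a small in-memory frame in place of data/raw/winequality-red.csv):

```python
# Hypothetical sketch of src/process_data.py: split off the target column
# named by the `feature` parameter and create train/test sets.
import pandas as pd
from sklearn.model_selection import train_test_split

# Parameters as they appear in params.yaml
feature = "quality"
test_size = 0.2

# Stand-in for pd.read_csv("data/raw/winequality-red.csv")
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 9.6, 9.5, 10.0, 10.5, 9.2],
    "quality": [5, 5, 5, 6, 5, 5, 7, 7, 5, 5],
})

X = df.drop(columns=[feature])
y = df[feature]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=0
)
# The real script would then write the splits to data/intermediate.
print(len(X_train), len(X_test))  # prints: 8 2
```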
Upload data and model to remote storage
DVC makes it easy to upload the data files and models produced by the pipeline stages in the dvc.yaml file to a remote storage location.
Before uploading our files, we will specify the remote storage location in the file .dvc/config:
['remote "read"']
url = https://winequality-red.s3.amazonaws.com/
['remote "read-write"']
url = s3://winequality-red/
Make sure to replace the "read-write" remote storage URI with the URI of your own S3 bucket.
Push files to the remote storage location named "read-write":
dvc push -r read-write
Create tests
We will also create tests that verify the code responsible for processing the data, the code for training the model, and the model itself, ensuring that the code and model meet our expectations.
View all test files here.
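The repository's actual test files are linked above; as an illustration of the kind of checks they might contain, here is a pytest-style sketch (function names, data, and thresholds are hypothetical, and a synthetic dataset stands in for data/intermediate and model/svm.pkl):

```python
# Hypothetical tests in the spirit of tests/: sanity-check the processed
# data and verify the trained model beats a basic accuracy threshold.
import numpy as np
from sklearn.svm import SVC


def make_data(n=100, seed=0):
    # Stand-in for loading data/intermediate: a small separable dataset.
    rng = np.random.default_rng(seed)
    X = rng.random((n, 2))
    y = (X[:, 0] > 0.5).astype(int)
    return X, y


def test_processed_data_is_clean():
    X, y = make_data()
    assert X.shape[0] == y.shape[0]  # features and labels aligned
    assert not np.isnan(X).any()     # no missing values


def test_model_beats_baseline():
    # The real test would load model/svm.pkl and a held-out split.
    X, y = make_data()
    model = SVC(kernel="linear").fit(X, y)
    assert model.score(X, y) > 0.7


if __name__ == "__main__":
    test_processed_data_is_clean()
    test_model_beats_baseline()
    print("all tests passed")
```

Running `pytest` discovers and executes these `test_*` functions automatically.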
Create a GitHub workflow
Now comes the exciting part: creating a GitHub workflow to automate the testing of your data and model! If you are not familiar with GitHub workflows, I recommend reading this article for a quick overview.
We'll create a workflow called Test code and model in the file .github/workflows/run_test.yaml:
name: Test code and model
on:
  pull_request:
    paths:
      - conf/**
      - src/**
      - tests/**
      - params.yaml
jobs:
  test_model:
    name: Test processed code and model
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        id: checkout
        uses: actions/checkout@v2
      - name: Environment setup
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Pull data and model
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull -r read-write
      - name: Run tests
        run: pytest
      - name: Evaluate model
        run: dvc exp run evaluate
      - name: Iterative CML setup
        uses: iterative/setup-cml@v1
      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.TOKEN_GITHUB }}
        run: |
          # Add the metrics to the report
          dvc metrics show --show-md >> report.md
          # Add the parameters to the report
          cat dvclive/params.yaml >> report.md
          # Create a report in PR
          cml comment create report.md
The on
field specifies that the pipeline is triggered on a pull request event.
The test_model job includes the following steps:
- Checking out the code
- Setting up the Python environment
- Installing dependencies
- Pulling data and models from a distant storage location using DVC
- Running tests using pytest
- Evaluating the model using DVC experiments
- Setting up the Iterative CML (Continuous Machine Learning) environment
- Creating a report with metrics and parameters, and commenting on the pull request with the report using CML.
Note that for the job to operate properly, it requires the following:
- AWS credentials to pull the data and model
- GitHub token to comment on the pull request.
To store this sensitive information securely in our repository while still making it accessible to GitHub Actions, we will use encrypted secrets.
That's it! Now let's try out this project and see if it works as expected.
Setup
To try out this project, start by cloning the repository to your local machine:
git clone https://github.com/khuyentran1401/cicd-mlops-demo
Set up the environment:
# Go to the project directory
cd cicd-mlops-demo

# Install dependencies
pip install -r requirements.txt
Pull data from the remote storage location called "read":
dvc pull -r read
Create experiments
The GitHub workflow will be triggered if any changes are made to the params.yaml file or to files in the src and tests directories. To illustrate this, we will make some minor changes to the params.yaml file:
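The exact diff isn't reproduced here, but given the commit message below ("add 100 for C"), the change presumably adds the value 100 to the search space for the SVM's C hyperparameter. Under that assumption (the key names are hypothetical), it might look like:

```yaml
train:
  hyperparameters:
    C: [0.1, 1, 10, 100]  # 100 newly added
```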
Next, let's create a new experiment with the change:
dvc exp run
Push the modified data and model to the remote storage called "read-write":
dvc push -r read-write
Add, commit, and push changes to the repository:
git add .
git commit -m 'add 100 for C'
git push origin main
Create a pull request
Next, create a pull request by clicking the Contribute button.
After you create a pull request in the repository, a GitHub workflow will be triggered to run tests on the code and model.
If all the tests pass, a comment will be added to the pull request containing the metrics and parameters of the new experiment.
This information makes it easier for reviewers to understand the changes made to the code and model. As a result, they can quickly evaluate whether the changes meet the expected performance criteria and decide whether to approve the PR for merging into the main branch. How cool is that?
Congratulations! You have just learned how to create a CI pipeline for your machine learning project. I hope this article motivates you to create your own CI pipeline to ensure a reliable machine learning workflow.