Managing deep learning models can be difficult because of the large variety of parameters and settings needed across modules. The training module might need parameters like batch_size or num_epochs, or parameters for the learning rate scheduler. Similarly, the data preprocessing module might need train_test_split or parameters for image augmentation.
A naive approach to introduce these parameters into the pipeline is to pass them as CLI arguments while running the scripts. Command-line arguments can be tedious to type, and they offer no single place to manage all parameters. TOML files provide a cleaner way to manage configurations, and scripts can load the necessary parts of the configuration as a Python dict without boilerplate code to read/parse command-line arguments.
In this blog, we'll explore using TOML for configuration files and how we can use them efficiently across training/deployment scripts.
TOML, which stands for Tom's Obvious Minimal Language, is a file format designed specifically for configuration files. A TOML file is quite similar to a YAML/YML file, with the ability to store key-value pairs in a tree-like hierarchy. An advantage of TOML over YAML is its readability, which becomes important when there are multiple nested levels.
Personally, apart from enhanced readability, I find no practical reason to prefer TOML over YAML. Using YAML is totally fine; PyYAML is a popular Python package for parsing YAML.
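For a side-by-side comparison, the [train] table used later in this post would look like this in YAML:

```yaml
train:
  num_epochs: 10
  batch_size: 32
  learning_rate: 0.001
  checkpoint_path: "auto"
```

With a single level of nesting the two formats are equally readable; TOML's bracketed table headers mainly pay off once sections grow long, since each key's group stays visible without tracking indentation.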
There are two benefits of using TOML for storing model/data/deployment configuration for ML models:
Managing all configurations in a single file: With TOML files, we can create multiple groups of settings required by different modules. For example, in figure 1, the settings related to the model's training procedure are nested under the [train] attribute; similarly, the port and host required for deploying the model are stored under [deploy]. We need not jump between train.py and deploy.py to alter their parameters; instead we can globalize all settings in a single TOML configuration file.
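Figure 1 is not reproduced here; a configuration along those lines might look like the following (the host/port values are illustrative):

```toml
[train]
num_epochs = 10
batch_size = 32

[deploy]
host = "0.0.0.0"
port = 8080
```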
This can be super helpful if we're training the model on a virtual machine, where code editors or IDEs are not available for editing files. A single config file is straightforward to edit with vim or nano, available on most VMs.
To read the configuration from a TOML file, we'll use two Python packages, toml and munch. toml will help us read the TOML file and return its contents as a Python dict. munch will convert the contents of the dict to enable attribute-style access of its elements. For example, instead of writing config[ "train" ][ "num_epochs" ], we can just write config.train.num_epochs, which improves readability.
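To see what munch's attribute-style access amounts to, here is a minimal standard-library sketch (munchify_like is a hypothetical helper of my own, not part of munch):

```python
from types import SimpleNamespace

def munchify_like(obj):
    """Recursively convert nested dicts into objects with attribute access,
    roughly mimicking munch.munchify (illustrative sketch only)."""
    if isinstance(obj, dict):
        return SimpleNamespace(**{k: munchify_like(v) for k, v in obj.items()})
    if isinstance(obj, list):
        return [munchify_like(v) for v in obj]
    return obj

config = munchify_like({"train": {"num_epochs": 10, "batch_size": 32}})
print(config.train.num_epochs)  # -> 10
```

The real munch objects additionally remain dict-like (they support keys, items, and serialization back to a plain dict), which is why the library is preferable to a hand-rolled helper like this.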
Consider the following file structure:
- config.py
- train.py
- project_config.toml
project_config.toml contains the configuration for our ML project:
[data]
vocab_size = 5589
seq_length = 10
test_split = 0.3
data_path = "dataset/"
data_tensors_path = "data_tensors/"

[model]
embedding_dim = 256
num_blocks = 5
num_heads_in_block = 3
[train]
num_epochs = 10
batch_size = 32
learning_rate = 0.001
checkpoint_path = "auto"
In config.py, we create a function which returns the munchified version of this configuration, using toml and munch:
$> pip install toml munch
import toml
import munch

def load_global_config( filepath : str = "project_config.toml" ):
    return munch.munchify( toml.load( filepath ) )
def save_global_config( new_config , filepath : str = "project_config.toml" ):
    with open( filepath , "w" ) as file:
        toml.dump( new_config , file )
Now, in any of our project files, like train.py or predict.py, we can load this configuration:
from config import load_global_config

config = load_global_config()
batch_size = config.train.batch_size
lr = config.train.learning_rate

if config.train.checkpoint_path == "auto":
    # Make a directory with name as current timestamp
    pass
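That branch could be filled in along these lines (a sketch; make_checkpoint_dir and the "checkpoints" root directory are my own names, not from the original post):

```python
import os
from datetime import datetime

def make_checkpoint_dir(checkpoint_path: str, root: str = "checkpoints") -> str:
    """If checkpoint_path is "auto", create a directory named with the
    current timestamp under `root`; otherwise create/use the path as given."""
    if checkpoint_path == "auto":
        checkpoint_path = os.path.join(root, datetime.now().strftime("%Y%m%d-%H%M%S"))
    os.makedirs(checkpoint_path, exist_ok=True)
    return checkpoint_path
```

In train.py this would be called as path = make_checkpoint_dir(config.train.checkpoint_path), so a fixed path in the TOML file is respected while "auto" yields a fresh timestamped directory per run.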
The output of print( toml.load( filepath ) ) is:
{'data': {'data_path': 'dataset/',
'data_tensors_path': 'data_tensors/',
'seq_length': 10,
'test_split': 0.3,
'vocab_size': 5589},
'model': {'embedding_dim': 256, 'num_blocks': 5, 'num_heads_in_block': 3},
'train': {'batch_size': 32,
'checkpoint_path': 'auto',
'learning_rate': 0.001,
'num_epochs': 10}}
If you're using MLOps tools like W&B Tracking or MLflow, maintaining the configuration as a dict can be helpful, as we can pass it directly as an argument.
Hope you'll consider using TOML configurations in your next ML project! It's a clean way of managing settings that are either global or local to your training, deployment, or inference scripts.
Instead of writing long CLI arguments, the scripts can directly load the configuration from the TOML file. If we wish to train two versions of a model with different hyperparameters, we just need to change project_config.toml. I have started using TOML files in my recent projects and experimentation has become faster. MLOps tools can also manage versions of a model together with their configurations, but the simplicity of the approach discussed above is unique and requires minimal changes to existing projects.
Hope you’ve enjoyed reading. Have a pleasant day ahead!