High-Performance Computing (HPC) environments, especially in research and academic institutions, often restrict outbound TCP connections. Running a simple command-line check (for example, with curl) against the MLflow tracking URL from the HPC login shell may succeed. However, communication fails and times out when jobs run on compute nodes.
This makes it impossible to track and manage experiments on MLflow. I faced this issue and built a workaround that bypasses direct communication. We will cover:
- Setting up a local MLflow server on the HPC, on a free port, with local directory storage.
- Using the local tracking URI while running machine learning experiments.
- Exporting the experiment data to a local temporary folder.
- Transferring the experiment data from the local temp folder on the HPC to the remote MLflow server.
- Importing the experiment data into the databases of the remote MLflow server.
I have deployed Charmed MLflow (MLflow server, MySQL, MinIO) using Juju, with everything hosted on a local MicroK8s cluster. You can find the installation guide from Canonical here.
Prerequisites
Make sure you have Python loaded on your HPC and MLflow installed on your MLflow server. Throughout this article, I assume Python 3.12, as in the library paths below; you can make changes according to your version.
On HPC:
1) Create a virtual environment
python3 -m venv mlflow
source mlflow/bin/activate
2) Install MLflow
pip install mlflow
On both the HPC and the MLflow server:
1) Install mlflow-export-import
pip install git+https://github.com/mlflow/mlflow-export-import/#egg=mlflow-export-import
On HPC:
1) Choose a port where you want the local MLflow server to run. You can use the command below to check whether the port is free (the output should not contain any process IDs):
lsof -i :<port>
2) Set the environment variable for applications that want to use MLflow:
export MLFLOW_TRACKING_URI=http://localhost:<port>
3) Start the MLflow server using the below command:
mlflow server \
  --backend-store-uri file:/path/to/local/storage/mlruns \
  --default-artifact-root file:/path/to/local/storage/mlruns \
  --host 0.0.0.0 \
  --port 5000
Here, we set the path to the local storage in a folder called mlruns. Metadata like experiments, runs, parameters, metrics, and tags, and artifacts like model files, loss curves, and other images will be stored inside the mlruns directory. We can set the host to 0.0.0.0 or 127.0.0.1 (safer); since the whole process is short-lived, I went with 0.0.0.0. Finally, assign a port number that is not used by any other application.
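Once the server is up, your training code only needs the tracking URI from the environment for its runs to land on the local server. Here is a minimal sketch; the experiment name, parameter, and metric are made-up placeholders for illustration:
import os
import mlflow

# Pick up the local tracking URI exported earlier (falls back to port 5000, as in the command above)
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))

# "hpc-demo" is a hypothetical experiment name; reuse whatever you choose here when exporting later
mlflow.set_experiment("hpc-demo")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # example hyperparameter
    mlflow.log_metric("loss", 0.42)          # example metric
# Everything above is written to the local mlruns directory, not the remote server.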
(Optional) Sometimes, your HPC won't detect libpython, the shared library that essentially makes Python run. You can follow the steps below to find it and add it to your path.
Search for libpython:
find /hpc/packages -name "libpython3.12*.so*" 2>/dev/null
Returns something like: /path/to/python/3.12/lib/libpython3.12.so.1.0
Set the path as an environment variable:
export LD_LIBRARY_PATH=/path/to/python/3.12/lib:$LD_LIBRARY_PATH
4) We will now export the experiment data from the mlruns local storage directory to a temp folder:
python3 -m mlflow_export_import.experiment.export_experiment --experiment "<experiment-name>" --output-dir /tmp/exported_runs
(Optional) Running the export command on the HPC bash shell may cause thread utilisation errors like:
OpenBLAS blas_thread_init: pthread_create failed for thread X of 64: Resource temporarily unavailable
This happens because MLflow internally uses numerical libraries for artifact and metadata handling, and these request threads through OpenBLAS, more than the limit allowed by your HPC. If you run into this issue, limit the number of threads by setting the following environment variables.
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
If the issue persists, try reducing the thread limit to 2.
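If you cannot easily change the shell environment (for example, inside a batch job), the same limits can be applied from Python, provided they are set before the numerical libraries are imported. A short sketch:
import os

# These must be set before NumPy/OpenBLAS load, or they have no effect
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

import numpy as np  # imported only after the thread limits are in place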
5) Transfer experiment runs to the MLflow server:
Move everything from the HPC to the temporary folder on the MLflow server.
rsync -avz /tmp/exported_runs <user>@<mlflow-server>:/tmp
6) Stop the local MLflow server and clean up the port:
lsof -i :<port>
kill -9 <PID>
On MLflow Server:
Our goal is to transfer the experiment data from the tmp folder into MySQL (metadata) and MinIO (artifacts).
1) Since MinIO is Amazon S3 compatible, it can be accessed with boto3 (the AWS Python SDK). So, we'll set up AWS-style credentials and use them to communicate with MinIO through boto3.
juju config mlflow-minio access-key=<access-key> secret-key=<secret-key>
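Note that boto3 reads these credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, so export them alongside the variables below. As a quick sanity check, here is a sketch that lists the MinIO buckets; the endpoint and credentials are placeholders for the values from your setup:
import boto3

# Placeholder endpoint and credentials; substitute your MinIO address and the
# access/secret keys configured through juju above
s3 = boto3.client(
    "s3",
    endpoint_url="http://<minio-ip>:<port>",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # should include the MLflow artifact bucket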
2) Below are the commands to transfer the data.
First, set the MLflow server and MinIO addresses in the environment. To avoid repeating this, we can add these lines to our .bashrc file.
export MLFLOW_TRACKING_URI="http://<mlflow-server-ip>:<port>"
export MLFLOW_S3_ENDPOINT_URL="http://<minio-ip>:<port>"
All the experiment files can be found under the exported_runs folder in the tmp directory. The import_experiment command finishes our job:
python3 -m mlflow_export_import.experiment.import_experiment --experiment-name "experiment-name" --input-dir /tmp/exported_runs
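To confirm that the import worked, you can query the remote server with the MLflow client. A short sketch, assuming the same experiment name passed to the import command:
import mlflow

# MLFLOW_TRACKING_URI from the environment now points at the remote server
exp = mlflow.get_experiment_by_name("experiment-name")
if exp is None:
    raise SystemExit("Experiment not found - check the import step")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
print(runs[["run_id", "status"]])  # one row per imported run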
Conclusion
This workaround helped me track experiments even when communications and data transfers were restricted on my HPC cluster. Spinning up a local MLflow server instance, exporting experiments, and then importing them into my remote MLflow server gave me flexibility without having to change my workflow.
However, if you are dealing with sensitive data, make sure your transfer method is secure. Creating cron jobs and automation scripts could remove the manual overhead. Also, be mindful of your local storage, as it is easy to fill up.
Ultimately, if you are working in a similar environment, this article offers you a solution that requires no admin privileges and takes little time to set up. Hopefully, this helps teams who are stuck with the same issue. Thanks for reading!
You may connect with me on LinkedIn.