Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge



“I train models, analyze data and create dashboards — why should I care about Containers?”

Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on your laptop. Yet error messages keep popping up in the cloud when others access it, for example because they are using different library versions.

This is where containers come into play: they allow us to make machine learning models, data pipelines and development environments stable, portable and scalable, no matter where they are executed.

Let’s take a closer look.

Table of Contents
1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs
2 — Containers & Data Science: Do I actually need Containers? And 4 explanation why the reply is yes.
3 — First Practice, then Theory: Container creation even without much prior knowledge
4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance
Final Thoughts: Key takeaways as a data scientist
Where Can You Continue Learning?

1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs

Containers are lightweight, isolated environments. They contain applications with all their dependencies. They also share the kernel of the host operating system, making them fast, portable and resource-efficient.

I have written extensively about virtual machines (VMs) and virtualization in ‘Virtualization & Containers for Data Science Newbies’. The important thing is that VMs simulate complete computers and have their own operating system with their own kernel, running on a hypervisor. This means that they require more resources, but also offer greater isolation.

Both containers and VMs are virtualization technologies.

Both make it possible to run applications in an isolated environment.

But in these two descriptions, you can also see the three most important differences:

  • Architecture: While each VM has its own operating system (OS) and runs on a hypervisor, containers share the kernel of the host operating system. Nevertheless, containers still run in isolation from one another. A hypervisor is the software or firmware layer that manages VMs and abstracts the operating system of the VMs from the physical hardware. This makes it possible to run multiple VMs on a single physical server.
  • Resource consumption: As each VM contains an entire OS, it requires a lot of memory and CPU. Containers, on the other hand, are more lightweight because they share the host OS.
  • Portability: You have to customize a VM for different environments because it requires its own operating system with specific drivers and configurations that depend on the underlying hardware. A container, on the other hand, can be created once and runs anywhere a container runtime is available (Linux, Windows, cloud, on-premise). A container runtime is the software that creates, starts and manages containers; the best-known example is Docker.

You can experiment faster with Docker, whether you are testing a new ML model or setting up a data pipeline. You can package everything in a container and run it immediately. And you don’t have any “It works on my machine” problems. Your container runs the same everywhere, so you can simply share it.

2 — Containers & Data Science: Do I actually need Containers? And 4 explanation why the reply is yes.

As a data scientist, your main task is to analyze, process and model data to gain valuable insights and predictions, which in turn are essential for management.

Of course, you don’t have to have the same in-depth knowledge of containers, Docker or Kubernetes as a DevOps Engineer or a Site Reliability Engineer (SRE). Nevertheless, it is worth having container knowledge at a basic level, because these are 4 examples of where you will come into contact with it sooner or later:

Model deployment

You are training a model. You not only want to use it locally but also make it available to others. To do this, you can pack it into a container and make it available via a REST API.

Let’s take a look at a concrete example: Your trained model runs in a Docker container with FastAPI or Flask. The server receives the requests, processes the data and returns ML predictions in real time.
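As a minimal sketch of what such a server could look like: the file name app.py, the model file model.pkl and the flat feature vector format are assumptions for illustration only, not part of the original example.

# app.py — minimal FastAPI server wrapping a trained model (sketch)
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the trained model once at startup (assumed: a scikit-learn estimator saved with pickle)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]  # one flat feature vector per request

@app.post("/predict")
def predict(features: Features):
    # scikit-learn expects a 2D array: one row per sample
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}

Started inside a container with uvicorn app:app --host 0.0.0.0 --port 8000, the /predict endpoint returns predictions as JSON.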

Reproducibility and easier collaboration

ML models and pipelines require specific libraries. For example, if you want to use a deep learning model like a Transformer, you need TensorFlow or PyTorch. If you want to train and evaluate classic machine learning models, you need Scikit-Learn, NumPy and Pandas. A Docker container now ensures that your code runs with exactly the same dependencies on every computer, server or in the cloud. You can also deploy a Jupyter Notebook environment as a container so that other people can access it and use exactly the same packages and settings.
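A common way to make this concrete (a sketch with illustrative version numbers, not taken from the example below) is to pin the exact library versions in a requirements.txt that the Dockerfile installs:

# requirements.txt — pinned versions so every build installs identical libraries
scikit-learn==1.4.2
numpy==1.26.4
pandas==2.2.2

A line such as RUN pip install -r requirements.txt in the Dockerfile then installs exactly these versions into the image.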

Cloud integration

Containers include all packages, dependencies and configurations that an application requires. They therefore run uniformly on local computers, servers or cloud environments. This means you don’t need to reconfigure the environment.

For example, you write a data pipeline script. It works locally for you. As soon as you deploy it as a container, you can be sure that it will run in exactly the same way on AWS, Azure, GCP or the IBM Cloud.
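Assuming the pipeline has been packaged into a hypothetical image called my-data-pipeline and pushed to a registry, the very same commands work locally and on any cloud VM with a container runtime:

docker pull your-registry/my-data-pipeline:latest
docker run your-registry/my-data-pipeline:latest   # identical behavior on every machine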

Scaling with Kubernetes

Kubernetes allows you to orchestrate containers. But more on that below. If you now get a lot of requests for your ML model, you can scale it automatically with Kubernetes. This means that more instances of the container are started.
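As a small, hedged example, assuming your model container already runs as a Kubernetes deployment called my-model, scaling can look like this:

kubectl scale deployment my-model --replicas=5                              # run five instances of the container
kubectl autoscale deployment my-model --min=2 --max=10 --cpu-percent=80     # let Kubernetes add instances under load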

3 — First Practice, then Theory: Container creation even without much prior knowledge

Let’s take a look at an example that anyone can run through with minimal time, even if you haven’t heard much about Docker and containers. It took me half an hour.

We’ll set up a Jupyter Notebook inside a Docker container, creating a portable, reproducible data science environment. Once it’s up and running, we can easily share it with others and make sure that everyone works with exactly the same setup.

0. Install Docker Desktop and create a project directory

To be able to use containers, we need Docker Desktop. To do this, we download Docker Desktop from the official website.

Now we create a new folder for the project. You can do this directly in the desired location. I do this via the terminal, on Windows with Windows + R to open CMD.

We use the following command:

Screenshot taken by the author
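The screenshot shows the command; assuming the project folder is called jupyter-docker (the name used in the rest of this example), it is simply:

mkdir jupyter-docker   # create the project folder for the Dockerfile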

1. Create a Dockerfile

Now we open VS Code or another editor and create a new file with the name ‘Dockerfile’. We save this file without an extension in the same directory; Docker looks for a file named exactly ‘Dockerfile’ by default, so no extension is needed.

We add the following code to this file:

# Use the official Jupyter notebook image with SciPy
FROM jupyter/scipy-notebook:latest  

# Set the working directory contained in the container
WORKDIR /home/jovyan/work  

# Copy all local files into the container
COPY . .

# Start Jupyter Notebook without token
CMD ["start-notebook.sh", "--NotebookApp.token=''"]

We have thus defined a container environment for Jupyter Notebook that is based on the official Jupyter SciPy Notebook image.

First, with FROM we define which base image the container is built on. jupyter/scipy-notebook:latest is a preconfigured Jupyter notebook image and contains libraries such as NumPy, SciPy, Matplotlib and Pandas. Alternatively, we could also use a different image here.

With WORKDIR we set the working directory within the container. /home/jovyan/work is the default path used by Jupyter. The user jovyan is the default user in Jupyter Docker images. Another directory could be chosen, but this directory is best practice for Jupyter containers.

With COPY . . we copy all files from the local directory (in this case the Dockerfile, which is located in the jupyter-docker directory) to the working directory /home/jovyan/work in the container.

With CMD ["start-notebook.sh", "--NotebookApp.token=''"] we specify the default start command for the container, set the start script for Jupyter Notebook, and define that the notebook is started without a token, which allows us to access it directly via the browser.

2. Create the Docker image

Next, we’ll build the Docker image. Make sure that the previously installed Docker Desktop is open. We now return to the terminal and use the following commands:

cd jupyter-docker
docker build -t my-jupyter .

With cd jupyter-docker we navigate to the folder we created earlier. With docker build we create a Docker image from the Dockerfile. With -t my-jupyter we give the image a name. The dot means that Docker uses the current directory as the build context, i.e. the Dockerfile and the files to be copied are taken from there. Note the space between the image name and the dot.

The Docker image is the template for the container. This image contains everything needed for the application, such as the operating system base (e.g. Ubuntu, Python, Jupyter), dependencies such as Pandas, NumPy and Jupyter Notebook, the application code and the startup commands. When we “build” a Docker image, this means that Docker reads the Dockerfile and executes the steps that we have defined there. The container can then be started from this template (Docker image).

We can now watch the Docker image being built in the terminal.

Screenshot taken by the author

We use docker images to check whether the image exists. If my-jupyter appears in the output, the creation was successful.

docker images

If so, we see the data for the created Docker image:

Screenshot taken by the author

3. Start Jupyter container

Next, we want to start the container and use this command to do so:

docker run -p 8888:8888 my-jupyter

We start a container with docker run. First, we enter the name of the image (my-jupyter) from which the container should be started. And with -p 8888:8888 we connect the local port (8888) with the port inside the container (8888). Jupyter runs on this port.
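If local port 8888 is already in use, you can map a different host port to the container’s port 8888; this is just a sketch, the host port 8080 is an arbitrary example:

docker run -p 8080:8888 my-jupyter   # host port 8080 -> container port 8888, then open http://localhost:8080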

Alternatively, you can also perform this step in Docker Desktop:

Screenshot taken by the author

4. Open Jupyter Notebook & create a test notebook

Now we open the URL http://localhost:8888 in the browser. You should now see the Jupyter Notebook interface.

Here we’ll now create a Python 3 notebook and insert the following Python code into it.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.title("Sine Wave")
plt.show()

Running the code will display the sine curve:

Screenshot taken by the author

5. Terminate the container

At the end, we stop the container either with ‘CTRL + C’ in the terminal or in Docker Desktop.
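You can also stop it from a second terminal with the Docker CLI; the container ID or name comes from docker ps, which is described next:

docker stop <container-id-or-name>   # gracefully stops the running Jupyter container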

With docker ps we can check in the terminal whether containers are still running, and with docker ps -a we can display the container that has just been terminated:

Screenshot taken by the author

6. Share your Docker image

If you now want to upload your Docker image to a registry, you can do this with the following commands. This will upload your image to Docker Hub (you need a Docker Hub account for this). You can also upload it to a private registry such as AWS Elastic Container Registry, Google Container Registry, Azure Container Registry or IBM Cloud Container Registry.

docker login

docker tag my-jupyter your-dockerhub-name/my-jupyter:latest

docker push your-dockerhub-name/my-jupyter:latest

If you then open Docker Hub and go to the repositories in your profile, the image should be visible.
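Anyone with Docker installed can then pull and run the shared image, for example:

docker pull your-dockerhub-name/my-jupyter:latest
docker run -p 8888:8888 your-dockerhub-name/my-jupyter:latest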

This was a very simple example to get started with Docker. If you want to dive a little deeper, you can deploy a trained ML model with FastAPI via a container.
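As a hedged sketch of what that could look like, building on the hypothetical app.py from section 2 and assuming fastapi, uvicorn and your ML library are listed in requirements.txt:

# Dockerfile sketch for a FastAPI model server
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Built with docker build -t my-model-api . and started with docker run -p 8000:8000 my-model-api, the model would be reachable at http://localhost:8000/predict.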

4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance

You can actually think of a container like a shipping container. No matter whether you load it onto a ship (local computer), a truck (cloud server) or a train (data center), the content always stays the same.

The most important Docker terms

  • Container: Lightweight, isolated environment for applications that contains all dependencies.
  • Docker: The most popular container platform, which allows you to create and manage containers.
  • Docker Image: A read-only template that contains code, dependencies and system libraries.
  • Dockerfile: Text file with commands to create a Docker image.
  • Kubernetes: Orchestration tool to manage many containers automatically.

The essential concepts behind containers

  • Isolation: Each container contains its own processes, libraries and dependencies.
  • Portability: Containers run wherever a container runtime is installed.
  • Reproducibility: You can create a container once and it runs exactly the same everywhere.

The most basic Docker commands

docker --version # Check if Docker is installed
docker ps # Show running containers
docker ps -a # Show all containers (including stopped ones)
docker images # List of all available images
docker info # Show system information about the Docker installation

docker run hello-world # Start a test container
docker run -d -p 8080:80 nginx # Start Nginx in the background (-d) with port forwarding
docker run -it ubuntu bash # Start interactive Ubuntu container with bash

docker pull ubuntu # Pull an image from Docker Hub
docker build -t my-app . # Build an image from a Dockerfile

Final Thoughts: Key takeaways as a data scientist

👉 With containers you can solve the “It works on my machine” problem. Containers ensure that ML models, data pipelines, and environments run identically everywhere, independent of OS or dependencies.

👉 Containers are more lightweight and flexible than virtual machines. While VMs include their own operating system and consume more resources, containers share the host operating system and start faster.

👉 There are three key steps when working with containers: create a Dockerfile to define the environment, use docker build to create an image, and run it with docker run, optionally pushing it to a registry with docker push.

And then there’s Kubernetes.

A term that comes up a lot in this context: an orchestration tool that automates container management, ensuring scalability, load balancing and fault recovery. This is particularly useful for microservices and cloud applications.

Before Docker, VMs were the go-to solution (see more in ‘Virtualization & Containers for Data Science Newbies’). VMs offer strong isolation, but require more resources and start slower.

So Docker was developed in 2013 by Solomon Hykes to solve this problem. Instead of virtualizing entire operating systems, containers run independently of the environment, whether on your laptop, a server or in the cloud. They contain all the necessary dependencies so that they work consistently everywhere.

I simplify tech for curious minds 🚀 If you enjoy my tech insights on Python, data science, data engineering, machine learning and AI, consider subscribing to my Substack.

Where Can You Continue Learning?


