Many data scientists, myself included, start their coding journey using a Jupyter Notebook. These files have the extension .ipynb, which stands for Interactive Python Notebook. As the extension name suggests, it has an intuitive and interactive user interface. The notebook is broken down into ‘cells’: small blocks of separate code or markdown (text). Outputs are displayed underneath each cell once the code inside that cell has been executed. This provides a versatile and interactive environment for coders to build their skills and begin working on data science projects.
A typical example of a Jupyter Notebook is below:
This all sounds great. And don’t get me wrong: for use cases such as conducting solo research or exploratory data analysis (EDA), Jupyter Notebooks are great. The problems arise when you ask the following questions:
- How do you turn a Jupyter Notebook into code that can be leveraged by a business?
- Can you collaborate with other developers on the same project using a version control system?
- How do you deploy code to a production environment?
Pretty soon, the limitations of exclusively using Jupyter Notebooks within a business context will begin to cause problems. They’re simply not designed for these purposes. The general solution is to organise code in a modular fashion.
By the end of this article, you should have a clear understanding of how to structure a small data science project as a Python program and appreciate the advantages of transitioning to a programming approach. You can check out an example template to accompany this article in my GitHub here.
Disclaimer
The contents of this article are based on my experience of migrating away from solely using Jupyter Notebooks to write code. Do notebooks still have a purpose? Yes. Are there alternative ways to organise and execute code beyond the methods I discuss in this article? Yes.
I wanted to share this information to help anyone looking to move away from notebooks and towards writing scripts and programs. If I’ve missed any features of Jupyter Notebooks that mitigate the limitations I’ve mentioned, please drop a comment!
Let’s get back to it.
Programming: what’s the big deal?
For the purpose of this article, I’ll be focusing on the Python programming language, as that is the language I use for data science projects. Structuring code as a Python program unlocks a range of functionality that is difficult to achieve when working exclusively within a Jupyter Notebook. These advantages include collaboration, versatility and portability – you’re simply able to do more with your code. I’ll explain these advantages further down – stick with me a little longer!
Python programs are typically organised into modules and packages. A module is a Python script (a file with a .py extension) that contains Python code which can be imported into other files. A package is a directory that contains Python modules. I’ll discuss the purpose of the __init__.py file later in the article.

Anytime you import a Python library into your code, whether a built-in library like os or a third-party library like pandas, you are interacting with a Python program that has been organised into a package and modules.
For example, let’s say you want to use the randint function from numpy. This function allows you to generate a random integer based on specified parameters. You would write:
from numpy.random import randint
Let’s annotate that import statement to show what you’re actually importing.

In this instance, numpy is a package, random is a module and randint is a function.
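A quick way to reproduce that annotation in comment form, following the terminology above:

```python
from numpy.random import randint
# numpy   -> package (a directory of modules)
# random  -> module (a .py file inside the package)
# randint -> function (defined inside that module)
```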
So, it seems you probably interact with Python programs regularly. This poses the question: what does the journey towards becoming a Python programmer look like?
The great transition: where do you even start?
The trick to building a functional Python program is all in the file structure and organisation. It sounds boring, but it plays a really important part in setting yourself up for success!
Let me use an analogy to explain: every house has a drawer that has nearly everything in it; tools, elastic bands, medicine, your hopes and dreams, the lot. There’s no rhyme or reason; it’s a dumping ground of nearly everything. Think of this as a Jupyter Notebook. This one file typically contains all stages of a project, from importing data, exploring what the data looks like, visualising trends, extracting features, training a model, etc. For a project that’s destined to be deployed on a production system or co-developed with colleagues, this will cause chaos. What’s needed is some organisation: put all the tools in one compartment, the medicine in another, and so on.
A great way to do that with code is to use a project template. One that I use frequently is the Cookie Cutter Data Science template. You can create an entire directory for your project, with all the relevant files needed to do just about anything, in a few simple operations in a terminal window – see the link above for information on how to install and run Cookie Cutter.
Below are some of the key features of the project template:
- package or src directory — directory for Python scripts/modules, equipped with examples to get you started
- readme.md — file describing usage, setup and how to run the package
- docs directory — containing files that enable seamless autodocumentation
- Makefile — for writing OS-agnostic bespoke run commands
- pyproject.toml/requirements.txt — for dependency management
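For reference (and to preview the module structure discussed later), a trimmed-down layout in this style might look something like the following; the directory and file names are illustrative rather than an exact copy of the template:

```
my_project/
├── README.md           usage, setup and how to run the package
├── Makefile            bespoke run commands
├── pyproject.toml      dependency management
├── docs/               autodocumentation source files
├── notebooks/          exploratory notebooks (EDA)
└── src/                the Python package itself
    ├── __init__.py
    ├── data.py
    ├── preprocess.py
    └── train.py
```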

Top tip. Make sure to keep Cookie Cutter up to date. With every release, new features are added in line with the ever-evolving data science universe. I’ve learnt quite a few things from exploring a new file or feature within the template!
Alternatively, you can use other templates to build your project, such as the one provided by Poetry. Poetry is a package manager which you can use to generate a project template that’s more lightweight than Cookie Cutter.
The best way to interact with your project is through an IDE (Integrated Development Environment). Software such as Visual Studio Code (VS Code) or PyCharm comes with a variety of features and tools that enable you to code, test, debug and package your work efficiently. My personal preference is VS Code!
From cells to scripts: let’s get coding
Now that we have a development environment and a nicely structured project template, how exactly do you write code in a Python script if you’ve only ever coded in a Jupyter Notebook? To answer that question, let’s first consider a few industry-standard coding best practices.
- Modular — follow the software engineering philosophy of the ‘Single Responsibility Principle’. All code should be encapsulated in functions, with each function performing a single task. The Zen of Python states: ‘Simple is better than complex’.
- Readable — if code is readable, then there’s a good chance it will be maintainable. Make sure the code is full of docstrings and comments!
- Stylish — format code in a consistent and clear way. The PEP 8 guidelines are designed for this purpose, advising how code should be presented. You can install autoformatters such as Black in an IDE so that code is automatically formatted in compliance with PEP 8 every time the Python script is saved. For example, the correct level of indentation and spacing will be applied so that you don’t even have to think about it!
- Versatile — if code is encapsulated into functions or classes, these can be reused throughout a project.
For a deeper dive into coding best practice, this article is a fantastic overview of principles to adhere to as a data scientist, so be sure to check it out!
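To make those ideas concrete, here is a small, hypothetical function written with these practices in mind (the function name and behaviour are purely for illustration):

```python
import pandas as pd


def drop_missing_rows(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `df` with rows removed where any of `columns` is missing.

    Args:
        df: Input DataFrame.
        columns: Column names that must not contain missing values.

    Returns:
        A new DataFrame with the offending rows dropped.
    """
    # Single responsibility: this function only handles missing values, nothing else.
    # Layout follows PEP 8 (an autoformatter such as Black would produce the same result).
    return df.dropna(subset=columns).reset_index(drop=True)
```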
With those best practices in mind, let’s return to the question: how do you write code in a Python script?
Module structure
First, separate the different stages of your notebook or project into different Python files, and be sure to name them according to the task. For example, you might have the following scripts in a typical machine learning package: data.py, preprocess.py, features.py, train.py, predict.py, evaluate.py, etc. Depending on your project structure, these would sit within the package or src directory.
Within each script, code should be organised or ‘encapsulated’ into classes and/or functions. A function is a reusable block of code that performs a single, well-defined task. A class is a blueprint for creating an object, with its own set of attributes (variables) and methods (functions). Encapsulating code in this way enables reusability and avoids duplication, thus keeping code concise.
A script might only need one function if the task is simple. For example, a data loading module (e.g. data.py) may only contain a single function, load_data, which loads data from a csv file into a pandas DataFrame. Other scripts, such as a data processing module (e.g. preprocess.py), will inherently involve more tasks and hence require more functions or a class to encapsulate those tasks.
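As a rough sketch of what such a data.py might contain (the docstring style and error handling are illustrative, not prescriptive):

```python
"""data.py - utilities for loading raw data."""

from pathlib import Path

import pandas as pd


def load_data(csv_path: str | Path) -> pd.DataFrame:
    """Load a csv file into a pandas DataFrame.

    Args:
        csv_path: Path to the csv file on disk.

    Returns:
        The file contents as a DataFrame.
    """
    csv_path = Path(csv_path)
    if not csv_path.exists():
        raise FileNotFoundError(f"No such file: {csv_path}")
    return pd.read_csv(csv_path)
```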

Top tip. Transitioning from Jupyter Notebooks to scripts may take some time, and everyone’s journey will look different. Some data scientists I know write code as Python scripts straight away and don’t touch a notebook. Personally, I use a notebook for EDA, then encapsulate the code into functions or classes before porting it to a script. Do whatever feels best for you.
There are a few tools that can help with the transition. 1) In VS Code, you can select one or more lines, right click and choose Run Python > Run Selection/Line in Python Terminal. This is similar to running a cell in a Jupyter Notebook. 2) You can convert a notebook to a Python script by clicking File > Download as > Python (.py). I wouldn’t recommend that approach with large notebooks for fear of creating monster scripts, but the option is there!
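If you prefer the terminal, the nbconvert tool that ships with Jupyter can do the same conversion (the notebook filename here is just an example):

jupyter nbconvert --to script my_notebook.ipynb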
The ‘__main__’ event
At this point, we’ve established that code should be encapsulated into functions and stored within clearly named scripts. The next logical question is: how can you tie all these scripts together so code gets executed in the right order?
The answer is to import these scripts into a single entry point and execute the code in one place. Within the context of developing a simple project, this entry point is typically a script named main.py (but it can be called anything). At the top of main.py, just as you would import needed built-in packages or third-party packages from PyPI, you import your own modules or specific classes/functions from those modules. Any classes or functions defined in these modules will be available to use by the script they’ve been imported into.
To do that, the package directory inside your project must contain an __init__.py file, which is typically left blank for simple projects. This file tells the Python interpreter to treat the directory as a package, meaning that any files with a .py extension get treated as modules and can therefore be imported into other files.
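For example, with an (empty) __init__.py sitting alongside the scripts in a src directory, the imports at the top of main.py might look like this (module and object names are illustrative):

```python
from src.data import load_data
from src.preprocess import Preprocessor
from src.train import train_model
```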
The structure of main.py is project dependent, but it will generally be dictated by the required order of code execution. For a typical machine learning project, you would first need to use the load_data function from the module data.py. You might then instantiate the preprocessor class imported from the module preprocess.py and apply a variety of class methods to the preprocessor object. You would then move on to feature engineering and so forth, until you have the whole workflow written out. This workflow would typically be contained or referenced within a conditional statement at the bottom of main.py.
Wait… who mentioned anything about a conditional statement? The conditional statement is as follows:
if __name__ == '__main__':
# add code here
__name__ is a special Python variable that can have two different values depending on how the script is run:
- If the script is run directly in the terminal, the interpreter assigns the __name__ variable the value '__main__'. Because the statement if __name__ == '__main__': evaluates to true, any code that sits inside this statement is executed.
- If the script is run as an imported module, the interpreter assigns the name of the module (as a string) to the __name__ variable. Because the statement if __name__ == '__main__': evaluates to false, the contents of this statement are not executed.
Some more information on this can be found here.
Given this behaviour, you’ll need to reference the master function within the if __name__ == '__main__': conditional statement so that it’s executed when main.py is run. Alternatively, you can place the workflow code directly underneath if __name__ == '__main__': to achieve the same outcome.
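Putting it all together, a minimal main.py might look something like the sketch below. It assumes the illustrative modules and names used above (data.py, preprocess.py, train.py) and a made-up csv path; your own project will differ:

```python
"""main.py - single entry point for the workflow."""

from src.data import load_data
from src.preprocess import Preprocessor
from src.train import train_model


def main() -> None:
    """Run the end-to-end workflow in the required order."""
    # 1. Load the raw data
    df = load_data("data/raw/example.csv")

    # 2. Preprocess it using the class imported from preprocess.py
    preprocessor = Preprocessor()
    df_clean = preprocessor.transform(df)

    # 3. Train a model on the processed data
    train_model(df_clean)


if __name__ == "__main__":
    main()  # only runs when the script is executed directly
```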

main.py (or any Python script) can be executed in the terminal using the following syntax:
python3 main.py
Upon running main.py, code will be executed from all the imported modules in the specified order. This is the same as clicking the ‘run all’ button in a Jupyter Notebook, where each cell is executed in sequential order. The difference now is that the code is organised into individual scripts in a logical manner and encapsulated within classes and functions.
You can also add CLI (command-line interface) arguments to your code using tools such as argparse or typer, allowing you to toggle specific variables when running main.py in the terminal. This provides a great deal of flexibility during code execution.
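As a minimal sketch with argparse (the argument names and defaults are made up for illustration):

```python
import argparse


def parse_args() -> argparse.Namespace:
    """Parse command-line arguments for main.py."""
    parser = argparse.ArgumentParser(description="Run the training workflow.")
    parser.add_argument(
        "--data-path",
        default="data/raw/example.csv",
        help="Path to the input csv file.",
    )
    parser.add_argument(
        "--test-size",
        type=float,
        default=0.2,
        help="Fraction of the data held out for evaluation.",
    )
    return parser.parse_args()


# Example usage from the terminal:
#   python3 main.py --data-path data/raw/other.csv --test-size 0.3
```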
So we’ve now reached the best part. The pièce de résistance. The real reasons why, beyond having beautifully organised and readable code, you should go to the effort of programming.
The end game: what’s the point of programming?
Let’s walk through some of the key advantages of moving beyond Jupyter Notebooks and transitioning to writing Python scripts instead.

- Packaging & distribution — you can package and distribute your Python program so it can be shared, installed and run on another computer. Package managers such as pip, poetry or conda can be used to install the package, just as you would install packages from PyPI, such as pandas or numpy. The trick to successfully distributing your package is to make sure that the dependencies are managed appropriately, which is where the files pyproject.toml or requirements.txt come in. Some useful resources can be found here and here.
- Deployment — whilst there are multiple methods and platforms to deploy code, using a modular approach will put you in good stead to get your code production ready. Tools such as Docker enable the deployment of programs or applications in isolated environments called containers, which can be easily managed through CI/CD (continuous integration & deployment) pipelines. It’s worth noting that while Jupyter Notebooks can be deployed using JupyterLab, this approach lacks the flexibility and scalability of a modular, script-based workflow.
- Version control — moving away from Jupyter Notebooks opens up the wonderful worlds of version control and collaboration. Version control systems such as Git are very much industry standard and offer a wealth of advantages, provided you use them appropriately! Follow the motto ‘incremental changes are key’ and make small, regular commits with logical commit messages in the imperative whenever you make functional changes during development. This will make it far easier to keep track of changes and test code. Here is a really useful guide to using git as a data scientist.
Fun fact. It’s generally discouraged to commit Jupyter Notebooks to version control systems, as it is difficult to track changes!
- (Auto)Documentation — we all know that documenting code increases its readability, helping the reader understand what the code is doing. It’s considered best practice to add docstrings to functions and classes within Python scripts. What’s really cool is that we can use these docstrings to build an index of formatted documentation for your whole project in the form of html files. Tools such as Sphinx enable you to do this quickly and easily. You can read my previous article, which takes you through this process step by step.
- Reusability — adopting a modular approach promotes the reuse of code. There are many common tasks within data science projects, such as cleaning data or scaling features. There’s little point in reinventing the wheel, so if you can reuse functions or classes with minor modification from previous projects, and there are no confidentiality restrictions, then save yourself that time! You might have a utils.py or classes.py module which contains general-purpose code that can be used across modules.
- Configuration management — whilst this is possible with a Jupyter Notebook, it is common practice to use configuration management for a Python program. Configuration management refers to organising and managing a project’s parameters and variables in a centralised way. Instead of defining variables throughout the code, they are stored in a file that sits within the project directory. This means that you don’t need to interrogate the code to change a parameter. An overview of this can be found here.
Note. If you use a YAML file (.yml) for configuration, this requires the Python package yaml. Make sure to install the pyyaml package (not ‘yaml’) using pip install pyyaml. Forgetting this can result in “package not found” errors – I’ve made this mistake, perhaps more than once.
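For illustration, assuming a config file named config.yml with a couple of made-up keys, loading it with pyyaml might look like this:

```python
import yaml  # installed via `pip install pyyaml`

# config.yml might contain, for example:
#   data_path: data/raw/example.csv
#   test_size: 0.2
with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

data_path = config["data_path"]
test_size = config["test_size"]
```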
- Logging — using loggers within a Python program allows you to easily track code execution, provide debugging information and monitor a program or application. Whilst this functionality is possible within a Jupyter Notebook, it is often considered overkill there and the print() statement is used instead. By using Python’s logging module, you can format a logging object to your liking. It has five different messaging levels (debug, info, warning, error, critical) relative to the severity of the events being logged. You can include logging messages throughout the code to provide insight into code execution, which can be printed to the terminal and/or written to a file. You can learn more about logging here.
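A minimal sketch using the standard library logging module (the messages and log file name are illustrative):

```python
import logging

# Configure logging once, e.g. near the top of main.py:
# send messages to the terminal and to a file, with a timestamped format.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.StreamHandler(), logging.FileHandler("run.log")],
)

logger = logging.getLogger(__name__)

logger.info("Loading data...")
logger.warning("Column 'age' contains missing values.")  # example message
logger.error("Model training failed.")                    # example message
```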
When are Jupyter Notebooks useful?
As I alluded to at the start of this article, Jupyter Notebooks still have their place in data science projects. Their easy-to-use interface makes them great for exploratory and interactive tasks. Two key use cases are listed below:
- Conducting exploratory data analysis on a dataset during the initial stages of a project.
- Creating an interactive resource or report to showcase analytical findings. Note there are plenty of tools out there that you can use for this purpose, but a Jupyter Notebook can also do the trick.
Final thoughts
Thanks for sticking with me to the very end! I hope this discussion has been insightful and has shed some light on how and why to start programming. As with most things in data science, there isn’t a single ‘correct’ way to solve a problem, but rather a considered, multi-faceted approach depending on the task at hand.
Shout out to my colleague and fellow data scientist Hannah Alexander for reviewing this article 🙂
Thanks for reading!