This is a new article about Python for the series “Data Science: From School to Work.” Since the beginning, you have learned how to manage your Python project with UV, how to write clean code using PEP and SOLID principles, how to handle errors and use loguru to log your code, and how to write tests.
Now you are ready to create working, production-ready code. But code is never perfect and can always be improved. A final (optional, but highly recommended) step in creating code is optimization.
To optimize your code, you need to be able to trace what is going on in it. To do so, we use tools called Profilers. They generate profiles of your code, that is, sets of statistics that describe how often and for how long the various parts of the program are executed. They make it possible to identify bottlenecks and parts of the code that consume too many resources. In other words, they show where your code should be optimized.
Today, there is such a proliferation of profilers in Python that the default profiler in PyCharm is named yappi, for “Yet Another Python Profiler”.
This article is therefore not an exhaustive list of all existing profilers. In this article, I present a tool for each aspect of the code we want to profile: memory, time, and CPU/GPU consumption. Other packages will be mentioned with some references but will not be detailed.
I – Memory profilers
Memory profiling is the process of monitoring and evaluating a program’s memory usage while it runs. This method helps developers find memory leaks, optimize memory usage, and understand their programs’ memory consumption patterns. Memory profiling is crucial to prevent applications from using more memory than necessary, which causes sluggish performance or crashes.
1/ memory-profiler
memory_profiler is an easy-to-use Python module designed to profile the memory usage of a script. It depends on the psutil module. To install the package, simply type:
pip install memory_profiler # (in your virtual environment)
# or if you use uv (which I encourage)
uv add memory_profiler
Profiling an executable
One of the advantages of this package is that it is not limited to pythonic use. It installs the mprof command, which allows monitoring the activity of any executable.
For instance, you can monitor the memory consumption of applications like ollama by running this command:
mprof run ollama run gemma3:4b
# or with uv
uv run mprof run ollama run gemma3:4b
To see the result, you have to install matplotlib first. Then, you can plot the recorded memory profile of your executable by running:
mprof plot
# or with uv
uv run mprof plot
The graph then looks like this:
Profiling Python code
Let’s get back to what brings us here, the monitoring of Python code.
memory_profiler works in a line-by-line mode using a simple decorator, @profile. First, you decorate the function of interest, then you run the script. The output will be written directly to the terminal. Consider the following monitoring.py script:
@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()
It is important to note that it is not necessary to import the package with from memory_profiler import profile at the start of the script. In this case, you have to pass some specific arguments to the Python interpreter (an explicit-import variant is sketched after the output table below).
python -m memory_profiler monitoring.py # (with a space between python and -m)
# or
uv run -m memory_profiler monitoring.py
And you get the following output with line-by-line details:

The output is a table with five columns.
- Line #: The line number of the profiled code.
- Mem usage: The memory usage of the Python interpreter after executing that line.
- Increment: The change in memory usage compared to the previous line.
- Occurrences: The number of times that line was executed.
- Line Contents: The actual source code.
This output is very detailed and allows very fine monitoring of a specific function.
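Note that you can also import the decorator explicitly; here is a minimal sketch of that variant (same function as above), which can then be run with the plain interpreter:
from memory_profiler import profile

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()  # run simply with: python monitoring.py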
Important: Unfortunately, this package is no longer actively maintained. The author is looking for a successor.
2/ tracemalloc
tracemalloc
is a built-in module in Python that tracks memory allocations and deallocations. Tracemalloc provides an easy-to-use interface for capturing and analyzing memory usage snapshots, making it a useful tool for any Python developer.
It offers the following details:
- Shows where each object was allocated by providing a traceback.
- Gives memory allocation statistics by file and line number, including the overall size, count, and average size of memory blocks.
- Allows you to compare two snapshots to identify potential memory leaks.
The tracemalloc package may be useful to identify memory leaks in your code.
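To give an idea of how it works, here is a minimal sketch (not taken from the examples above) that compares two snapshots with the standard library API:
import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
leaky_list = [b"x" * 1_000 for _ in range(10_000)]  # allocate some memory on purpose
after = tracemalloc.take_snapshot()

# Show the five lines with the largest memory difference between the two snapshots
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)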
Personally, I find it less intuitive to set up than the other packages presented in this article. Here are some links to go further:
II – Time profilers
Time profiling is the process of measuring the time spent in different parts of a program. By identifying performance bottlenecks, you can focus your optimization efforts on the parts of the code that will have the most significant impact.
1/ line-profiler
The line_profiler package is quite similar to memory_profiler, but it serves a different purpose. It is designed to profile specific functions by measuring the execution time of each line within those functions. To use LineProfiler effectively, you need to explicitly specify which functions you want it to profile by simply adding the @profile decorator above them.
To install it, just type:
pip install line_profiler # (in your virtual environment)
# or
uv add line_profiler
Consider the following script named monitoring.py:
@profile
def create_list(lst_len: int):
    arr = []
    for i in range(0, lst_len):
        arr.append(i)

def print_statement(idx: int):
    if idx == 0:
        print("Starting array creation!")
    elif idx == 1:
        print("Array created successfully!")
    else:
        raise ValueError("Invalid index provided!")

@profile
def main():
    print_statement(0)
    create_list(400000)
    print_statement(1)

if __name__ == "__main__":
    main()
To measure the execution time of the functions main() and create_list(), we add the decorator @profile.
The easiest way to get a time profile of this script is to use the kernprof script.
kernprof -lv monitoring.py # (in your virtual environment)
# or
uv run kernprof -lv monitoring.py
It will create a binary file named your_script.py.lprof. The -v argument displays the output directly in the terminal.
Otherwise, you can view the results later like so:
python -m line_profiler monitoring.py.lprof # (in your virtual environment)
# or
uv run python -m line_profiler monitoring.py.lprof
It provides the following information:

There are two tables, one per profiled function. Each table contains the following information:
- Line #: The line number in the file.
- Hits: The number of times that line was executed.
- Time: The total amount of time spent executing the line, in the timer’s units. In the header information before the tables, you will see a line “Timer unit:” giving the conversion factor to seconds. It may be different on different systems.
- Per Hit: The average amount of time spent executing the line once, in the timer’s units.
- % Time: The percentage of time spent on that line relative to the total amount of recorded time spent in the function.
- Line Contents: The actual source code.
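If you prefer not to go through the kernprof script, a minimal sketch of the programmatic API (without the @profile decorators, and reusing the create_list() function above) could look like this:
from line_profiler import LineProfiler

def create_list(lst_len: int):
    arr = []
    for i in range(0, lst_len):
        arr.append(i)

profiler = LineProfiler()
profiled_create_list = profiler(create_list)  # wrap the function to time it line by line
profiled_create_list(400000)
profiler.print_stats()  # same kind of table as kernprof, printed to the terminal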
2/ cProfile
Python comes with two built-in profilers:
- cProfile: A C extension with reasonable overhead, which makes it suitable for profiling long-running programs. It is recommended for most users.
- profile: A pure Python module whose interface is imitated by cProfile, but which adds significant overhead to profiled programs. It can be a useful tool when you need to extend or customize the profiling functionality.
The base syntax is cProfile.run(statement, filename=None, sort=-1). The filename argument can be passed to save the output. And the sort argument can be used to specify how the output should be printed. By default, it is set to -1 (no value).
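As a quick illustration of these two arguments (here, main() stands for any entry-point function already defined in your script):
import cProfile

# Print the report sorted by cumulative time
cProfile.run("main()", sort="cumulative")

# Or save the raw statistics to a file for later analysis with pstats
cProfile.run("main()", filename="profile_output.prof")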
For instance, if you modify the monitoring script like this:
import cProfile

def create_list(lst_len: int):
    arr = []
    for i in range(0, lst_len):
        arr.append(i)

def print_statement(idx: int):
    if idx == 0:
        print("Starting array creation!")
    elif idx == 1:
        print("Array created successfully!")
    else:
        raise ValueError("Invalid index provided!")

def main():
    print_statement(0)
    create_list(400000)
    print_statement(1)

if __name__ == "__main__":
    cProfile.run("main()")
we have the following output:

First, we have the script outputs: those of print_statement(0) and print_statement(1).
Then, we have the profiler output: the first line shows the number of function calls and the time it took to run. The second line is a reminder of the sort parameter. Then, the profiler provides a table with six columns:
- ncalls: The number of calls made.
- tottime: The total time spent in the given function. Note that time spent in calls to sub-functions is excluded.
- percall: tottime divided by ncalls (the remainder is ignored).
- cumtime: Unlike tottime, this includes time spent in this function and in all the sub-functions it calls. It is the most useful figure, and it is accurate even for recursive functions.
- percall: The percall following cumtime is the quotient of cumtime divided by the number of primitive calls. Primitive calls include all the calls that were not induced via recursion.
- filename:lineno(function): The file name, line number, and name of the function.
The first and last rows of the table come from cProfile. The other rows concern the script.
You can customize the output by using the Profile() class. First, you have to initialize an instance of the Profile class and use the methods enable() and disable() to, respectively, start and stop collecting profiling data. Then, the pstats module can be used to manipulate the results collected by the profiler object.
To sort the output by cumulative time instead of the standard name ordering, the previous code can be rewritten like this:
import cProfile
import pstats

# ...
# Same as before

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats('cumtime')
    stats.print_stats()
And the output becomes:

As you can see, the table is now sorted by cumtime. And the two cProfile rows of the previous table are not in this one.
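The pstats module offers a few more ways to trim this report. Here is a minimal sketch (standard library only, reusing the profiler object above):
import pstats
from pstats import SortKey

stats = pstats.Stats(profiler)
stats.strip_dirs()                    # drop the long directory prefixes
stats.sort_stats(SortKey.CUMULATIVE)  # equivalent to 'cumtime'
stats.print_stats(10)                 # keep only the 10 most expensive entries
stats.print_callers("create_list")    # show who calls create_list()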
Visualize profiling with Snakeviz
The output is quite easy to analyze. But it can become unreadable if the profiled code grows too big.
Another way to analyze the output is to visualize the data instead of reading it. To do so, we use the Snakeviz package. To install it, simply type:
pip install snakeviz # (in your virtual environment)
# or
uv add snakeviz
Then, replace stats.print_stats() with stats.dump_stats("profile.prof") to save the profiling data. Now, you can get a visualization of your profiling by typing:
snakeviz profile.prof
It launches a web browser interface from which you can choose between two data visualizations: Icicle and Sunburst.


It is easier to read than the print_stats() output because you can interact with each element by moving your mouse over it. For instance, you can get more details about the function create_list().

(Example with the function evaluate_model(), from the author.)
Create a call graph with gprof2dot
A call graph is a visual representation of the relationships between functions or methods in a program, showing which functions call others and how long each function or method takes. It can be seen as a map of your code. To install gprof2dot, simply type:
pip install gprof2dot # (in your virtual environment)
# or
uv add gprof2dot
Then execute your script by typing:
python -m cProfile -o monitoring.pstats monitoring.py # (in your virtual environment)
# or
uv run python -m cProfile -o monitoring.pstats monitoring.py
It will create a monitoring.pstats file that can be turned into a call graph using the following command:
gprof2dot -f pstats monitoring.pstats | dot -Tpng -o monitoring.png # (in your virtual environment)
# or
uv run gprof2dot -f pstats monitoring.pstats | dot -Tpng -o monitoring.png
Then the call graph is saved into a PNG file named monitoring.png.

3/ Other interesting packages
a/ PyCallGraph
PyCallGraph is a Python module that creates call graph visualizations. To use it, you have to install the package and Graphviz, since it relies on the dot command to render the graph.
To create a call graph of your code, simply run it inside a PyCallGraph context like this:
from pycallgraph import PyCallGraph
from pycallgraph.output import GraphvizOutput

with PyCallGraph(output=GraphvizOutput()):
    ...  # code you want to profile
Then, you get a PNG of the call graph of your code, named pycallgraph.png by default.
I made the call graph of the previous example:

In each box, you have the name of the function, the time spent in it, and the number of calls. As with snakeviz, the graph can become very complex if your code has many dependencies, but the colors indicate the bottlenecks. In complex code, it is very interesting to study it to see the dependencies and relationships.
b/ PyInstrument
PyInstrument is also a very easy-to-use Python profiler. You can add the profiler to your script by surrounding the code like this:
from pyinstrument import Profiler
profiler = Profiler()
profiler.start()
# code you want to profile
profiler.stop()
print(profiler.output_text(unicode=True, color=True))
The output gives:

It is less detailed than cProfile, but it is also more readable. Your functions are highlighted and sorted by time.
But the real interest of PyInstrument comes with its HTML output. To get this HTML output, simply type in the terminal:
pyinstrument --html monitoring.py
# or
uv run pyinstrument --html monitoring.py
It launches a web browser interface from which you can choose between two data visualizations: Call stack and Timeline.


Here, the profile is more detailed and you have many filtering options.
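If you prefer to stay in Python, a minimal sketch (assuming a recent PyInstrument version) can produce the same HTML report programmatically:
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()
# ... code you want to profile ...
profiler.stop()

html_report = profiler.output_html()          # the report as an HTML string
with open("profile_report.html", "w") as f:   # save it for later
    f.write(html_report)
profiler.open_in_browser()                    # or open it directly in the browser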
III – CPU/GPU profilers
CPU and GPU profiling is the process of analyzing the usage and performance of a program on the central processing unit (CPU) and the graphics processing unit (GPU). By measuring how many resources are spent on different parts of the code on these processing units, developers can identify performance bottlenecks, understand where their code is being executed, and optimize their application to achieve better performance and efficiency.
As far as I know, there is only one package that can profile GPU power consumption.
1/ Scalene
Scalene is a high-performance CPU, GPU, and memory profiler designed specifically for Python. It is an open-source package that provides detailed insights. It is designed to be fast, accurate, and easy to use, making it an excellent tool for developers looking to optimize their code.
- CPU/GPU Profiling: Scalene provides detailed information on CPU/GPU usage, including the time spent in different parts of your code. It can help you identify performance bottlenecks and optimize your code for better execution times.
- Memory Profiling: Scalene tracks memory allocation and deallocation, helping you understand how your code uses memory. This is particularly useful for identifying memory leaks or optimizing memory-intensive applications.
- Line-by-Line Profiling: Scalene provides line-by-line profiling, which gives you a detailed breakdown of the time spent on each line of your code. This feature is invaluable for pinpointing performance issues.
- Visualization: Scalene includes a graphical interface for visualizing profiling results, making it easier to understand and navigate the data.
To highlight all the advantages of Scalene, I developed functions with the sole aim of consuming memory (memory_waster()), CPU (cpu_waster()), and GPU (gpu_convolution()). All of them are in a script named scalene_tuto.py.
import random
import copy
import math

import cupy as cp
import numpy as np


def memory_waster():
    """Wastes memory but in a controlled way"""
    memory_hogs = []

    # Create moderately sized redundant data structures
    for i in range(100):
        garbage_data = []
        for j in range(1000):
            waste = f"Useless string #{j} repeated " * 10
            garbage_data.append(waste)
            garbage_data.append(
                {
                    "id": j,
                    "data": waste,
                    "numbers": [random.random() for _ in range(50)],
                    "range_data": list(range(100)),
                }
            )
        memory_hogs.append(garbage_data)

    for iteration in range(4):
        print(f"Creating copy #{iteration}...")
        memory_copy = copy.deepcopy(memory_hogs)
        memory_hogs.extend(memory_copy)

    return memory_hogs


def cpu_waster():
    meaningless_result = 0

    for i in range(10000):
        for j in range(10000):
            temp = (i**2 + j**2) * random.random()
            temp = temp / (random.random() + 0.01)
            temp = abs(temp**0.5)
            meaningless_result += temp

            # Some trigonometric operations
            angle = random.random() * math.pi
            temp += math.sin(angle) * math.cos(angle)

        if i % 100 == 0:
            random_mess = [random.randint(1, 1000) for _ in range(1000)]  # Smaller list
            random_mess.sort()
            random_mess.reverse()
            random_mess.sort()

    return meaningless_result


def gpu_convolution():
    image_size = 128
    kernel_size = 64

    image = np.random.random((image_size, image_size)).astype(np.float32)
    kernel = np.random.random((kernel_size, kernel_size)).astype(np.float32)

    image_gpu = cp.asarray(image)
    kernel_gpu = cp.asarray(kernel)

    result = cp.zeros_like(image_gpu)
    for y in range(kernel_size // 2, image_size - kernel_size // 2):
        for x in range(kernel_size // 2, image_size - kernel_size // 2):
            pixel_value = 0
            for ky in range(kernel_size):
                for kx in range(kernel_size):
                    iy = y + ky - kernel_size // 2
                    ix = x + kx - kernel_size // 2
                    pixel_value += image_gpu[iy, ix] * kernel_gpu[ky, kx]
            result[y, x] = pixel_value

    result_cpu = cp.asnumpy(result)
    cp.cuda.Stream.null.synchronize()

    return result_cpu


def main():
    print("\n1/ Wasting some memory (controlled)...")
    _ = memory_waster()

    print("\n2/ Wasting CPU cycles (controlled)...")
    _ = cpu_waster()

    print("\n3/ Wasting GPU cycles (controlled)...")
    _ = gpu_convolution()


if __name__ == "__main__":
    main()
For the GPU function, you have to install cupy according to your CUDA version (run nvcc --version to get it):
pip install cupy-cuda12x # (in your virtual environment)
# or
uv add cupy-cuda12x
Further details on installing cupy can be found in the documentation.
To run Scalene, use the command:
scalene scalene_tuto.py
# or
uv run scalene scalene_tuto.py
It profiles CPU, GPU, and memory by default. If you only want one or some of these options, use the flags --cpu, --gpu, and --memory.
Scalene provides line-level and function-level profiling. And it has two interfaces: the Command Line Interface (CLI) and the web interface.
Important: It is better to use Scalene on Ubuntu through WSL. Otherwise, the profiler does not retrieve memory consumption information.
a) Command Line Interface
By default, Scalene’s output is the web interface. To obtain the CLI instead, add the --cli flag.
scalene scalene_tuto.py --cli
# or
uv run scalene scalene_tuto.py --cli
You get the following results:


The visualization is categorized into three distinct colors, each representing a different profiling metric.
- The blue section represents CPU profiling, which provides a breakdown of the time spent executing Python code, native code (such as C or C++), and system-related tasks (like I/O operations).
- The green section is dedicated to memory profiling, showing the proportion of memory allocated by Python code, as well as the overall memory usage over time and its peak values.
- The yellow section focuses on GPU profiling, displaying the GPU’s running time and the amount of data copied between the GPU and CPU, measured in MB/s. It is worth noting that GPU profiling is currently limited to NVIDIA GPUs.
b) The web interface
The web interface is divided into three parts.



The color code is the same as in the command-line interface, but some icons are added:
- 💥: Optimizable code region (performance indication in the Function Profile section).
- ⚡: Optimizable lines of code.
c) AI Suggestions
One of the great advantages of Scalene is the ability to use AI to fix the slowness and/or overconsumption you have identified. It currently supports the OpenAI API, Amazon Bedrock, Azure OpenAI, and ollama running locally.

After selecting your tool, you just have to click on 💥 or ⚡ if you want to optimize a part of the code or just a line.
I tested it with codellama:7b-python from ollama to optimize the gpu_convolution() function. Unfortunately, as mentioned in the interface:
None of the suggested optimizations worked. But the codebase was not conducive to optimization, as it was artificially complicated: just removing the unnecessary lines would save time and memory. Also, I used a small model, which could be the reason.
Even though my tests were inconclusive, I think this option can be interesting and will certainly continue to improve.
Conclusion
Nowadays, we are less concerned about the resource consumption of our developments, and very quickly these optimization deficits can accumulate, making the code slow, too slow for production, and sometimes even requiring the purchase of more powerful hardware.
Code profiling tools are indispensable when it comes to identifying areas in need of optimization.
The combination of memory_profiler and line_profiler provides a very good initial analysis: easy to set up, with easy-to-understand reports.
Tools such as cProfile and Scalene are more complete and come with graphical representations, but they require more time to analyze. Finally, the AI optimization option offered by Scalene is a real asset, even if in my case the model used was not sufficient to produce anything relevant.
Interested in Python & Data Science?
Follow me for more tutorials and insights!