Think Your Python Code Is Slow? Stop Guessing and Start Measuring


I was working on a script the other day, and it was driving me nuts. It worked, sure, but it was just… slow. Really slow. I had that nagging feeling that it could be much faster if I could just figure out where the hold-up was.

My first thought was to start tweaking things. Maybe I could optimise the data loading. Or rewrite that for loop? But I stopped myself. I’ve fallen into that trap before, spending hours “optimising” a piece of code only to find it made barely any difference to the overall runtime. Donald Knuth had a point when he said, “Premature optimisation is the root of all evil.”

I decided to take a more methodical approach. Instead of guessing, I was going to find out for certain. I needed to profile the code to get hard data on exactly which functions were consuming the majority of the clock cycles.

In this article, I’ll walk you through the exact process I used. We’ll take a deliberately slow Python script and use two fantastic tools to pinpoint its bottlenecks with surgical precision.

The first of these tools is cProfile, a powerful profiler built into Python. The other is snakeviz, a neat tool that transforms the profiler’s output into an interactive visual map.

Setting up a development environment

Before we start coding, let’s set up our development environment. Best practice is to create a separate Python environment where you can install any necessary software and experiment, knowing that anything you do won’t impact the rest of your system. I’ll be using conda for this, but you can use any method you’re familiar with.
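For example, if you’d rather not use conda, a sketch using the standard library’s venv module (assuming a Unix-like shell) would be:

```shell
# Create and activate a virtual environment with the standard library's venv
python3 -m venv profiling_lab
source profiling_lab/bin/activate
```

On Windows, the activation script is `profiling_lab\Scripts\activate` instead.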

# Create our test environment
conda create -n profiling_lab python=3.11 -y

# Now activate it
conda activate profiling_lab

Now that we have the environment set up, we need to install snakeviz for our visualisations and numpy for the example script. cProfile is already included with Python, so there’s nothing more to do there. As we’ll be running our scripts in a Jupyter Notebook, we’ll also install that.

# Install our visualization tool and numpy
pip install snakeviz numpy jupyter

Now type jupyter notebook into your command prompt. You should see a Jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command. Near the bottom of that, there will be a URL that you should copy and paste into your browser to launch the Jupyter Notebook.

Your URL will be different from mine, but it should look something like this:

http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da

With our tools ready, it’s time to look at the code we’re going to fix.

Our “Problem” Script

To properly test our profiling tools, we need a script that exhibits clear performance issues. I’ve written a simple program that simulates heavy memory, iteration and CPU workloads, making it an ideal candidate for our investigation.

# run_all_systems.py
import time
import math

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Task 1: A Calibrated CPU-Bound Bottleneck ---
def cpu_heavy_task(iterations):
    print("  -> Running CPU-bound task...")
    result = 0
    for i in range(iterations):
        result += math.sin(i) * math.cos(i) + math.sqrt(i)
    return result

# --- Task 2: A Calibrated Memory/String Bottleneck ---
def memory_heavy_string_task(iterations):
    print("  -> Running Memory/String-bound task...")
    report = ""
    chunk = "report_item_abcdefg_123456789_"
    for i in range(iterations):
        report += f"|{chunk}{i}"
    return report

# --- Task 3: A Calibrated "Thousand Cuts" Iteration Bottleneck ---
def simulate_tiny_op(n):
    pass

def iteration_heavy_task(iterations):
    print("  -> Running Iteration-bound task...")
    for i in range(iterations):
        simulate_tiny_op(i)
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    print("--- Starting FINAL SLOW Balanced Showcase ---")
    
    cpu_result = cpu_heavy_task(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task(iterations=LOOP_ITERATIONS)

    print("--- FINAL SLOW Balanced Showcase Finished ---")

Step 1: Collecting the Data with cProfile

Our first tool, cProfile, is a deterministic profiler built into Python. We can run it from code to execute our script and record detailed statistics about every function call.

import cProfile, pstats, io

pr = cProfile.Profile()
pr.enable()

# Run the function you should profile
run_all_systems()

pr.disable()

# Dump stats to a string and print the highest 10 by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())

Here is the output.

--- Starting FINAL SLOW Balanced Showcase ---
  -> Running CPU-bound task...
  -> Running Memory/String-bound task...
  -> Running Iteration-bound task...
--- FINAL SLOW Balanced Showcase Finished ---
         275455984 function calls in 30.497 seconds

   Ordered by: cumulative time
   List reduced from 47 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   30.520   15.260 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
        2    0.000    0.000   30.520   15.260 {built-in method builtins.exec}
        1    0.000    0.000   30.497   30.497 /tmp/ipykernel_173802/1743829582.py:41(run_all_systems)
        1    9.652    9.652   14.394   14.394 /tmp/ipykernel_173802/1743829582.py:34(iteration_heavy_task)
        1    7.232    7.232   12.211   12.211 /tmp/ipykernel_173802/1743829582.py:14(cpu_heavy_task)
171796964    4.742    0.000    4.742    0.000 /tmp/ipykernel_173802/1743829582.py:31(simulate_tiny_op)
        1    3.891    3.891    3.892    3.892 /tmp/ipykernel_173802/1743829582.py:22(memory_heavy_string_task)
 34552942    1.888    0.000    1.888    0.000 {built-in method math.sin}
 34552942    1.820    0.000    1.820    0.000 {built-in method math.cos}
 34552942    1.271    0.000    1.271    0.000 {built-in method math.sqrt}

We’ve got a bunch of numbers that can be difficult to interpret. This is where snakeviz comes into its own.
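Before reaching for the visual tool, it’s worth knowing that pstats can re-sort the same data. For example, ranking by tottime (the time spent inside each function itself, excluding its sub-calls) often tells a different story from cumtime. A minimal, self-contained sketch, with a toy work function of my own:

```python
import cProfile, pstats, io

def work():
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# Rank by "tottime": time spent inside each function itself,
# excluding time spent in the functions it calls
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("tottime").print_stats(5)
print(s.getvalue())
```

Other valid sort keys include "ncalls" and "cumtime"; see the pstats documentation for the full list.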

Step 2: Visualising the bottleneck with snakeviz

This is where the magic happens. Snakeviz takes the output of our profiling run and converts it into an interactive, browser-based chart, making it much easier to find bottlenecks.
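Incidentally, if you’re working outside a notebook, you can dump the profile to a file and point the snakeviz command-line tool at it. A minimal sketch (the filename profile.prof and the work function are my own choices):

```python
import cProfile

def work():
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# Write the stats to a file that snakeviz can open from the terminal:
#   snakeviz profile.prof
pr.dump_stats("profile.prof")
```

Running `snakeviz profile.prof` from the command line then opens the same interactive chart in your browser.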

So let’s use that tool to visualise what we have. As I’m using a Jupyter Notebook, we need to load the extension first.

%load_ext snakeviz

And we run it like this.

%%snakeviz
run_all_systems()

The output is available in two parts. First is a visualisation like this.

Image by Author

What you see is a top-down “icicle” chart. From top to bottom, it represents the call hierarchy.

At the very top: Python is executing our script ().

Next: the script’s __main__ execution (:1()). Then the function run_all_systems. Inside that, it calls two key functions: iteration_heavy_task and cpu_heavy_task.

The memory-intensive processing part isn’t labelled on the chart. That’s because the proportion of time associated with this task is much smaller than the times apportioned to the other two intensive functions. As a result, we see a much smaller, unlabelled block to the right of the cpu_heavy_task block.

Note that snakeviz also offers a second chart style called a Sunburst chart. It looks a bit like a pie chart, except it comprises a set of increasingly large concentric circles and arcs. The idea is that the time a function takes to run is represented by the angular extent of its arc. The root function is a circle in the middle of the viz; it runs by calling the sub-functions around it, and so on. We won’t be using that display type in this article.

Visual confirmation like this can be so much more impactful than looking at a table of numbers. I didn’t have to guess where to look anymore; the data was staring me right in the face.

The visualisation is immediately followed by a block of text detailing the timings for various parts of your code, much like the output of cProfile. I’m only showing the first dozen or so lines of this, as there were 30+ in total.

ncalls tottime percall cumtime percall filename:lineno(function)
----------------------------------------------------------------
1 9.581 9.581 14.3 14.3 1062495604.py:34(iteration_heavy_task)
1 7.868 7.868 12.92 12.92 1062495604.py:14(cpu_heavy_task)
171796964 4.717 2.745e-08 4.717 2.745e-08 1062495604.py:31(simulate_tiny_op)
1 3.848 3.848 3.848 3.848 1062495604.py:22(memory_heavy_string_task)
34552942 1.91 5.527e-08 1.91 5.527e-08 ~:0(<built-in method math.sin>)
34552942 1.836 5.313e-08 1.836 5.313e-08 ~:0(<built-in method math.cos>)
34552942 1.305 3.778e-08 1.305 3.778e-08 ~:0(<built-in method math.sqrt>)
1 0.02127 0.02127 31.09 31.09 :1()
4 0.0001764 4.409e-05 0.0001764 4.409e-05 socket.py:626(send)
10 0.000123 1.23e-05 0.0004568 4.568e-05 iostream.py:655(write)
4 4.594e-05 1.148e-05 0.0002735 6.838e-05 iostream.py:259(schedule)
...
...
...

Step 3: The Fix

Of course, tools like cProfile and snakeviz don’t tell you how to fix your performance issues, but now that I knew exactly where the problems were, I could apply targeted fixes.

# final_showcase_fixed_v2.py
import time
import math
import numpy as np

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Fix 1: Vectorization for the CPU-Bound Task ---
def cpu_heavy_task_fixed(iterations):
    """
    Fixed by using NumPy to perform the complex math on an entire array
    at once, in highly optimized C code instead of a Python loop.
    """
    print("  -> Running CPU-bound task...")
    # Create an array of numbers from 0 to iterations-1
    i = np.arange(iterations, dtype=np.float64)
    # The same calculation, but vectorized, is orders of magnitude faster
    result_array = np.sin(i) * np.cos(i) + np.sqrt(i)
    return np.sum(result_array)

# --- Fix 2: Efficient String Joining ---
def memory_heavy_string_task_fixed(iterations):
    """
    Fixed by using a list comprehension and a single, efficient ''.join() call.
    This avoids creating millions of intermediate string objects.
    """
    print("  -> Running Memory/String-bound task...")
    chunk = "report_item_abcdefg_123456789_"
    # A list comprehension is fast and memory-efficient
    parts = [f"|{chunk}{i}" for i in range(iterations)]
    return "".join(parts)

# --- Fix 3: Eliminating the "Thousand Cuts" Loop ---
def iteration_heavy_task_fixed(iterations):
    """
    Fixed by recognizing the task can be a no-op or a bulk operation.
    In a real-world scenario, you would find a way to avoid the loop entirely.
    Here, we demonstrate the fix by simply removing the unnecessary loop.
    The goal is to show that the cost of the loop itself was the problem.
    """
    print("  -> Running Iteration-bound task...")
    # The fix is to find a bulk operation or eliminate the need for the loop.
    # Since the original function did nothing, the fix is to do nothing, but faster.
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    """
    The main orchestrator now calls the FAST versions of the tasks.
    """
    print("--- Starting FINAL FAST Balanced Showcase ---")
    
    cpu_result = cpu_heavy_task_fixed(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task_fixed(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task_fixed(iterations=LOOP_ITERATIONS)

    print("--- FINAL FAST Balanced Showcase Finished ---")

Now we can rerun cProfile on our updated code.

import cProfile, pstats, io

pr = cProfile.Profile()
pr.enable()

# Run the function you should profile
run_all_systems()

pr.disable()

# Dump stats to a string and print the highest 10 by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())

#
# start of output
#

--- Starting FINAL FAST Balanced Showcase ---
  -> Running CPU-bound task...
  -> Running Memory/String-bound task...
  -> Running Iteration-bound task...
--- FINAL FAST Balanced Showcase Finished ---
         197 function calls in 6.063 seconds

   Ordered by: cumulative time
   List reduced from 52 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    6.063    3.031 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
        2    0.000    0.000    6.063    3.031 {built-in method builtins.exec}
        1    0.002    0.002    6.063    6.063 /tmp/ipykernel_173802/1803406806.py:1(<module>)
        1    0.402    0.402    6.061    6.061 /tmp/ipykernel_173802/3782967348.py:52(run_all_systems)
        1    0.000    0.000    5.152    5.152 /tmp/ipykernel_173802/3782967348.py:27(memory_heavy_string_task_fixed)
        1    4.135    4.135    4.135    4.135 /tmp/ipykernel_173802/3782967348.py:35(<listcomp>)
        1    1.017    1.017    1.017    1.017 {method 'join' of 'str' objects}
        1    0.446    0.446    0.505    0.505 /tmp/ipykernel_173802/3782967348.py:14(cpu_heavy_task_fixed)
        1    0.045    0.045    0.045    0.045 {built-in method numpy.arange}
        1    0.000    0.000    0.014    0.014 <__array_function__ internals>:177(sum)

That’s a fantastic result that demonstrates the power of profiling. We spent our effort on the parts of the code that mattered. To be thorough, I also ran snakeviz on the fixed script.

%%snakeviz
run_all_systems()
Image by Author

The most notable change is the reduction in total runtime, from roughly 30 seconds to roughly 6 seconds. That’s a 5x speedup, achieved by addressing the three main bottlenecks that were visible in the “before” profile.

Let’s look at each one individually.

1. The iteration_heavy_task

Before (The Problem)
In the first image, the large bar on the left, iteration_heavy_task, is the single biggest bottleneck, consuming 14.3 seconds.

  • Why was it slow? This task was a classic “death by a thousand cuts.” The function simulate_tiny_op did almost nothing, but it was called millions of times from inside a pure Python for loop. The immense overhead of the Python interpreter starting and stopping a function call repeatedly was the entire source of the slowness.
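You can see that overhead for yourself with a quick timeit comparison (a small sketch of my own; exact timings will vary by machine):

```python
import timeit

def tiny(n):
    # Stand-in for simulate_tiny_op: does nothing
    pass

N = 1_000_000

# A loop whose body is nothing but a no-op function call...
with_calls = timeit.timeit(lambda: [tiny(i) for i in range(N)], number=1)
# ...versus the same loop with no function call at all
without_calls = timeit.timeit(lambda: [i for i in range(N)], number=1)

print(f"with calls: {with_calls:.3f}s, without: {without_calls:.3f}s")
```

On my understanding, the version with the function call is consistently slower, and the entire difference is pure call overhead.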

The Fix
The fixed version, iteration_heavy_task_fixed, recognised that the goal could be achieved without the loop. In our showcase, this meant removing the unnecessary loop entirely. In a real-world application, this might involve finding a single “bulk” operation to replace the iterative one.

After (The Result)
In the second image, the iteration_heavy_task bar is completely gone. It’s now so fast that its runtime is a tiny fraction of a second and is invisible on the chart. We successfully eliminated a 14.3-second problem.

2. The cpu_heavy_task

Before (The Problem)
The second major bottleneck, clearly visible as the large orange bar on the right, is cpu_heavy_task, which took 12.9 seconds.

  • Why was it slow? Like the iteration task, this function was also limited by the speed of the Python for loop. While the maths operations inside were fast, the interpreter had to process each of the millions of calculations individually, which is highly inefficient for numerical tasks.

The Fix
The fix was vectorisation using the NumPy library. Instead of using a Python loop, cpu_heavy_task_fixed created a NumPy array and performed all the mathematical operations (np.sqrt, np.sin, etc.) on the entire array at once. These operations are executed in highly optimised, pre-compiled C code, completely bypassing the slow Python interpreter loop.

After (The Result)
Just like the first bottleneck, the cpu_heavy_task bar has vanished from the “after” diagram. Its runtime was reduced from 12.9 seconds to around half a second.

3. The memory_heavy_string_task

Before (The Problem)
In the first diagram, memory_heavy_string_task was running, but its runtime was small compared with the other two larger issues, so it was relegated to the small, unlabelled sliver of space on the far right. It was a relatively minor issue.

The Fix
The fix for this task was to replace the inefficient report += “…” string concatenation with a far more efficient method: building a list of all the string parts and then calling “”.join() a single time at the end.

After (The Result)
In the second diagram, we see the result of our success. Having eliminated the two 10+ second bottlenecks, memory_heavy_string_task_fixed is now the new dominant bottleneck, accounting for 4.34 seconds of the total 5.22-second runtime.

Snakeviz even lets us look inside this fixed function. The new biggest contributor is the orange bar labelled (list comprehension), which takes 3.52 seconds. This shows that even within the fixed code, the most time-consuming part is now the process of creating the large list of strings in memory before they can be joined.

Summary

This article provides a hands-on guide to identifying and resolving performance issues in Python code, arguing that developers should use profiling tools to measure performance instead of relying on intuition or guesswork to pinpoint the source of slowdowns.

I demonstrated a methodical workflow using two key tools:

  • cProfile: Python’s built-in profiler, used to gather detailed data on function calls and execution times.
  • snakeviz: A visualisation tool that turns cProfile’s data into an interactive “icicle” chart, making it easy to visually identify which parts of the code are consuming the most time.

The article uses a case study of a deliberately slow script engineered with three distinct and significant bottlenecks:

  1. An iteration-bound task: A function called millions of times in a loop, showcasing the performance cost of Python’s function call overhead (“death by a thousand cuts”).
  2. A CPU-bound task: A for loop performing millions of maths calculations, highlighting the inefficiency of pure Python for heavy numerical work.
  3. A memory-bound task: A large string built inefficiently using repeated += concatenation.

By analysing the snakeviz output, I pinpointed these three problems and applied targeted fixes.

  • The iteration bottleneck was fixed by eliminating the unnecessary loop.
  • The CPU bottleneck was resolved with vectorisation using NumPy, which executes mathematical operations in fast, compiled C code.
  • The memory bottleneck was fixed by appending string parts to a list and using a single, efficient “”.join() call.

These fixes resulted in a dramatic speedup, reducing the script’s runtime from over 30 seconds to just over 6 seconds. I concluded by demonstrating that, even after major issues are resolved, the profiler can be used again to identify new, smaller bottlenecks, illustrating that performance tuning is an iterative process guided by measurement.
