Make Python As much as 150× Faster with C

-

, eventually, you’re going to hit a block relating to your code execution speed. Should you’ve ever written a computationally heavy algorithm in Python, comparable to string distance, matrix math, or cryptographic hashing, you’ll know what I mean.

Sure, there are occasions when external libraries like NumPy can come to your rescue, but what happens when the algorithm is ? That was precisely my problem after I desired to benchmark a specific algorithm, which determines the variety of edits required to rework one string into one other.

I attempted Python. I attempted NumPy. After which I turned to C, a language I first learned in school many years ago but hadn’t utilized in anger for about 15 years. That’s where things got interesting.

I first had to reply the query, “Are you able to call C from Python?”. After some research, it quickly became clear that the reply was indeed yes. In truth, it seems you possibly can do it in several ways, and in this text, I’ll take a look at three of probably the most common ways of doing so. 

From easiest to most complex, we’ll take a look at using,

  • a subprocess
  • ctypes
  • Python C extensions

The algorithm that we’ll be testing with known as the Levenshtein Distance (LD) algorithm. The Levenshtein distance between two words is the minimum variety of single-character edits (insertions, deletions or substitutions) required to alter one word into the opposite. It is known as after Soviet mathematician Vladimir Levenshtein, who defined the metric in 1965. It has applications in various tools, comparable to spell checkers and optical character recognition systems.

To provide you a clearer picture of what we’re talking about, listed below are a few examples. 

Calculate the LD between the words “book” and “black”.

  1. book → baok (substitution of “a” for “o”),
  2. baok → back (substitution of “c” for “o”)
  3. back → black (add the letter “l”)

So, the LD on this case is three.

Calculate the LD between the words “superb” and “super”.

  1. superb → super (remove letter “b”)

The LD on this case is just one.

We’ll code up the LD algorithm in Python and C, then arrange benchmarks to check how long it takes to run it using pure Python code versus the time it takes to run it in C code called from Python. 

Prerequisites

As I used to be running this on MS Windows, I needed a way of compiling C programs. The easiest method I discovered to do that was to download the Visual Studio Construct tools for 2022. This means that you can compile C programs on the command line.

To put in, first go to the most important Visual Studio downloads page. On the second screen, you’ll see a search box. Type “Construct Tools” into the search field and click on Search. The search should return a screen that appears like this,

Image by Creator

Click the Download button and follow any installation instructions. Once it’s been installed, in a DOS terminal window, if you click on the little plus button to open a brand new terminal, it is best to see an choice to open a “Developer command prompt for VS 2022”. 

Image by Creator

Most of my Python code might be running on a Jupyter Notebook, so it is best to arrange a brand new development environment and install Jupyter. Do this now if you must follow along. I’m using the UV tool for this part, but be at liberty to make use of whichever method you’re most comfortable with.

c:> uv init pythonc
c:> cd pythonc
c:> uv venv pythonc
c:> source pythonc/bin/activate
(pythonc) c:> uv pip install jupyter

The LD algorithm in C

We want barely different versions of the LD algorithm in C, depending on the strategy used to call it. That is the version for our first example, where we use subprocessing to call a C executable.

1/ subprocessing: lev_sub.c

#include 
#include 
#include 

static int levenshtein(const char* a, const char* b) {
    size_t n = strlen(a), m = strlen(b);
    if (n == 0) return (int)m;
    if (m == 0) return (int)n;
    int* prev = (int*)malloc((m + 1) * sizeof(int));
    int* curr = (int*)malloc((m + 1) * sizeof(int));
    if (!prev || !curr) { free(prev); free(curr); return -1; }
    for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;
    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i; char ca = a[i - 1];
        for (size_t j = 1; j <= m; ++j) {
            int cost = (ca == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1, ins = curr[j - 1] + 1, sub = prev[j - 1] + cost;
            int d = del < ins ? del : ins; curr[j] = d < sub ? d : sub;
        }
        int* tmp = prev; prev = curr; curr = tmp;
    }
    int ans = prev[m]; free(prev); free(curr); return ans;
}

int most important(int argc, char** argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s  n", argv[0]); return 2; }
    int d = levenshtein(argv[1], argv[2]);
    if (d < 0) return 1;
    printf("%dn", d);
    return 0;
}

To compile this, start a brand new Developer Command Prompt for VS Code 2022 and sort the next to make sure we’re optimising the compilation for 64-bit architecture.

(pythonc) c:> "%VSINSTALLDIR%VCAuxiliaryBuildvcvarsall.bat" x64

Next, we are able to compile our C code using this command.

(pythonc) c:> cl /O2 /Fe:lev_sub.exe lev_sub.c

That may create an executable file.

Benchmarking the subprocessing code

In a Jupyter notebook, type in the next code, which might be common to all our benchmarking. It generates random lowercase strings of length N and calculates the variety of edits required to rework string1 into string2.

# Sub-process benchmark
import time, random, string, subprocess
import numpy as np

EXE = r"lev_sub.exe"  

def rnd_ascii(n):
    return ''.join(random.alternative(string.ascii_lowercase) for _ in range(n))

def lev_py(a: str, b: str) -> int:
    n, m = len(a), len(b)
    if n == 0: return m
    if m == 0: return n
    prev = list(range(m+1))
    curr = [0]*(m+1)
    for i, ca in enumerate(a, 1):
        curr[0] = i
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1, curr[j-1] + 1, prev[j-1] + cost)
        prev, curr = curr, prev
    return prev[m]

Next is the actual benchmarking code and run results. To run the C a part of the code, we spawn a subprocess that executes the compiled C code file we previously created and measures the time it takes to run, comparing it with the pure Python method. We run each method against a 2000 and a 4000 set of random words thrice and take the fastest of those times.

def lev_subprocess(a: str, b: str) -> int:
    out = subprocess.check_output([EXE, a, b], text=True)
    return int(out.strip())

def bench(fn, *args, repeat=3, warmup=1):
    for _ in range(warmup): fn(*args)
    best = float("inf"); out_best = None
    for _ in range(repeat):
        t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
        if dt < best: best, out_best = dt, out
    return out_best, best

if __name__ == "__main__":
    cases = [(2000,2000),(4000, 4000)]
    print("Benchmark: Pythonvs C (subprocess)n")
    for n, m in cases:
        a, b = rnd_ascii(n), rnd_ascii(m)
        py_out, py_t = bench(lev_py, a, b, repeat=3)
        sp_out, sp_t = bench(lev_subprocess, a, b, repeat=3)
        print(f"n={n} m={m}")
        print(f"  Python   : {py_t:.3f}s -> {py_out}")
        print(f"  Subproc  : {sp_t:.3f}s -> {sp_out}n")

 Listed here are the outcomes.

Benchmark: Python vs C (subprocess)

n=2000 m=2000 
  Python   : 1.276s -> 1768
  Subproc  : 0.024s -> 1768

n=4000 m=4000 
  Python   : 5.015s -> 3519
  Subproc  : 0.050s -> 3519

That’s a reasonably significant improvement within the run-time of C over Python.

2. ctypes: lev.c

ctypes is a foreign function interface (FFI) library built right into Python’s standard library. It helps you to load and call functions from shared libraries written in C (DLLs on Windows, .so files on Linux, .dylib on macOS) directly from Python, with no need to write down an entire extension module in C.

First, here is our C version of the LD algorithm, utilising ctypes. It’s almost equivalent to our subprocess C function, with the addition of a line that permits us to make use of Python to call the DLL after it has been compiled.

/* 
 * lev.c
*/

#include 
#include 

/* below line includes this function within the 
 * DLL's export table so other programs can use it.
 */
__declspec(dllexport)

int levenshtein(const char* a, const char* b) {
    size_t n = strlen(a), m = strlen(b);
    if (n == 0) return (int)m;
    if (m == 0) return (int)n;

    int* prev = (int*)malloc((m + 1) * sizeof(int));
    int* curr = (int*)malloc((m + 1) * sizeof(int));
    if (!prev || !curr) { free(prev); free(curr); return -1; }

    for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;

    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i;
        char ca = a[i - 1];
        for (size_t j = 1; j <= m; ++j) {
            int cost = (ca == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1;
            int ins = curr[j - 1] + 1;
            int sub = prev[j - 1] + cost;
            int d = del < ins ? del : ins;
            curr[j] = d < sub ? d : sub;
        }
        int* tmp = prev; prev = curr; curr = tmp;
    }
    int ans = prev[m];
    free(prev); free(curr);
    return ans;
}

When using ctypes to call C in Python, we’d like to convert our C code right into a dynamic link library (DLL) fairly than an executable. Here is the construct command you would like for that.

(pythonc) c:> cl /O2 /LD lev.c /Fe:lev.dll

Benchmarking the ctypes code

I’m omitting the lev_py and rnd_ascii Python functions on this code snippet, as they’re equivalent to those within the previous example. Type this into your notebook.

#ctypes benchmark

import time, random, string, ctypes
import numpy as np

DLL = r"lev.dll"  

levdll = ctypes.CDLL(DLL)
levdll.levenshtein.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
levdll.levenshtein.restype  = ctypes.c_int

def lev_ctypes(a: str, b: str) -> int:
    return int(levdll.levenshtein(a.encode('utf-8'), b.encode('utf-8')))

def bench(fn, *args, repeat=3, warmup=1):
    for _ in range(warmup): fn(*args)
    best = float("inf"); out_best = None
    for _ in range(repeat):
        t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
        if dt < best: best, out_best = dt, out
    return out_best, best

if __name__ == "__main__":
    cases = [(2000,2000),(4000, 4000)]
    print("Benchmark: Python vs NumPy vs C (ctypes)n")
    for n, m in cases:
        a, b = rnd_ascii(n), rnd_ascii(m)
        py_out, py_t = bench(lev_py, a, b, repeat=3)
        ct_out, ct_t = bench(lev_ctypes, a, b, repeat=3)
        print(f"n={n} m={m}")
        print(f"  Python   : {py_t:.3f}s -> {py_out}")
        print(f"  ctypes   : {ct_t:.3f}s -> {ct_out}n")

And the outcomes?

Benchmark: Python vs C (ctypes)

n=2000 m=2000  
  Python   : 1.258s -> 1769
  ctypes   : 0.019s -> 1769

n=4000 m=4000 
  Python   : 5.138s -> 3521
  ctypes   : 0.035s -> 3521

We’ve very similar results to the primary example.

3/ Python C extensions: lev_cext.c

When using Python C extensions, there’s barely more work involved. First, let’s examine the C code. The fundamental algorithm is unchanged. It’s just that we’d like so as to add a bit more scaffolding to enable the code to be called from Python. It uses CPython’s API (Python.h) to parse Python arguments, run the C code, and return the result as a Python integer.

The function levext_lev acts as a wrapper. It parses two string arguments from Python (PyArg_ParseTuple), calls the C function lev_impl to compute the gap, handles memory errors, and returns the result as a Python integer (PyLong_FromLong). The Methods table registers this function under the name “levenshtein”, so it could actually be called from Python code. Finally, PyInit_levext defines and initialises the module levext, making it importable in Python with the import levext command.

#include 
#include 
#include 

static int lev_impl(const char* a, const char* b) {
    size_t n = strlen(a), m = strlen(b);
    if (n == 0) return (int)m;
    if (m == 0) return (int)n;
    int* prev = (int*)malloc((m + 1) * sizeof(int));
    int* curr = (int*)malloc((m + 1) * sizeof(int));
    if (!prev || !curr) { free(prev); free(curr); return -1; }
    for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;
    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i; char ca = a[i - 1];
        for (size_t j = 1; j <= m; ++j) {
            int cost = (ca == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1, ins = curr[j - 1] + 1, sub = prev[j - 1] + cost;
            int d = del < ins ? del : ins; curr[j] = d < sub ? d : sub;
        }
        int* tmp = prev; prev = curr; curr = tmp;
    }
    int ans = prev[m]; free(prev); free(curr); return ans;
}

static PyObject* levext_lev(PyObject* self, PyObject* args) {
    const char *a, *b;
    if (!PyArg_ParseTuple(args, "ss", &a, &b)) return NULL;
    int d = lev_impl(a, b);
    if (d < 0) { PyErr_SetString(PyExc_MemoryError, "alloc failed"); return NULL; }
    return PyLong_FromLong(d);
}

static PyMethodDef Methods[] = {
    {"levenshtein", levext_lev, METH_VARARGS, "Levenshtein distance"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef mod = { PyModuleDef_HEAD_INIT, "levext", NULL, -1, Methods };
PyMODINIT_FUNC PyInit_levext(void) { return PyModule_Create(&mod); }

Because we aren’t just constructing an executable this time but a native Python extension module, we’ve to construct the C code in a different way.

Any such module should be compiled against Python’s headers and appropriately linked to Python’s runtime so it behaves like a built-in Python module. 

To realize this, we create a Python module called setup.py, which imports the setuptools library to facilitate this process. It automates:

  • Finding the precise include paths for Python.h
  • Passing the right compiler and linker flags
  • Producing a .pyd file with the precise naming convention in your Python version and platform

Doing this by hand with the cl compiler command can be tedious and error-prone, since you’d should specify all of those paths and flags manually.

Here is the code we’d like.

from setuptools import setup, Extension
setup(
    name="levext",
    version="0.1.0",
    ext_modules=[Extension("levext", ["lev_cext.c"], extra_compile_args=["/O2"])],
)

We run it using regular Python on the command line, as shown here.

(pythonc) c:> python setup.py build_ext --inplace

#output
running build_ext
copying buildlib.win-amd64-cpython-312levext.cp312-win_amd64.pyd ->

Benchmarking the Python C extensions code

Now, here is the Python code to call our C. Again, I even have omitted the 2 Python helper functions which are unchanged from the previous examples.

# c-ext benchmark

import time, random, string
import numpy as np
import levext  # make certain levext.cp312-win_amd64.pyd is built & importable

def lev_extension(a: str, b: str) -> int:
    return levext.levenshtein(a, b)

def bench(fn, *args, repeat=3, warmup=1):
    for _ in range(warmup): fn(*args)
    best = float("inf"); out_best = None
    for _ in range(repeat):
        t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
        if dt < best: best, out_best = dt, out
    return out_best, best

if __name__ == "__main__":
    cases = [(2000, 2000), (4000, 4000)]
    print("Benchmark: Python vs NumPy vs C (C extension)n")
    for n, m in cases:
        a, b = rnd_ascii(n), rnd_ascii(m)
        py_out, py_t = bench(lev_py, a, b, repeat=3)
        ex_out, ex_t = bench(lev_extension, a, b, repeat=3)
        print(f"n={n} m={m} ")
        print(f"  Python   : {py_t:.3f}s -> {py_out}")
        print(f"  C ext    : {ex_t:.3f}s -> {ex_out}n")

Here is the output.

Benchmark: Python vs C (C extension)

n=2000 m=2000  
  Python   : 1.204s -> 1768
  C ext    : 0.010s -> 1768

n=4000 m=4000  
  Python   : 5.039s -> 3526
  C ext    : 0.033s -> 3526

So this gave the fastest results. The C version is showing up as over 150 times faster than pure Python within the second test case above. 

Not too shabby.

But what about NumPy?

A few of you might be wondering why NumPy wasn’t used. Well, NumPy is improbable for vectorised numeric array operations, comparable to dot products, but not all algorithms map cleanly to vectorisation. Calculating Levenshtein distances is an inherently sequential process, so NumPy doesn’t provide much help. In those cases, dropping into C via subprocess, ctypes, or a native C extension provides real runtime speedups while still being callable from Python.

Summary

The article explores how Python developers can overcome performance bottlenecks in computationally intensive tasks, comparable to calculating the Levenshtein distance — a string similarity algorithm —by integrating C code into Python. While libraries like NumPy speed up vectorised operations, sequential algorithms like Levenshtein often remain impervious to NumPy’s optimisations.

To handle this, I demonstrated three integration patterns, starting from simplest to most advanced, that will let you call fast C code from Python.

Subprocess. Compile the C code into an executable (e.g., with gcc or Visual Studio Construct Tools) and run it from Python using the subprocess module. This is straightforward to establish and already shows an enormous speedup in comparison with pure Python.

ctypes. Using ctypes lets Python directly load and call functions from C shared libraries with no need to write down a full Python extension module. This makes it much simpler and faster to integrate performance-critical C code into Python, avoiding the overhead of running external processes while still keeping your code mostly in Python.

Python C Extensions. Write a full Python extension in C using the CPython API (python.h). This requires more setup but offers the fastest performance and smoothest integration, allowing you to call C functions as in the event that they were native Python functions.

The benchmarks show that C implementations of the Levenshtein algorithm run over 100× faster than pure Python. While an external library comparable to NumPy excels at vectorised numeric operations, it doesn’t significantly improve performance for inherently sequential algorithms like Levenshtein, making C integration a more sensible choice in such cases.

Should you hit performance limits in Python, offloading heavy computation to C can provide massive speed improvements and is value considering. You may start easy with subprocess, then move to ctypes or full C extensions for tighter integration and higher performance.

I’ve only outlined three of the preferred ways to integrate C code with Python, but there are a number of other methods which I like to recommend you read up on if this topic interests you.

ASK ANA

What are your thoughts on this topic?
Let us know in the comments below.

0 0 votes
Article Rating
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Share this article

Recent posts

0
Would love your thoughts, please comment.x
()
x