Python Can Now Call Mojo


As data scientists, ML engineers, and software developers, squeezing every last bit of performance out of our codebases is usually a key consideration. If you're a Python user, you'll be aware of a few of its shortcomings in this respect. Python is considered a slow language, and you've probably heard that much of the reason for this is its Global Interpreter Lock (GIL) mechanism.

It is what it is, but what can we do about it? There are several ways we can mitigate this issue when coding in Python, especially if you're using a fairly up-to-date version of Python.

  • The very latest releases of Python have a way of running code without using the GIL.
  • We can use high-performance third-party libraries, such as NumPy, to do the number crunching.
  • There are also many methods for parallel and concurrent processing built into the language now (see the short sketch after this list).
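
As a quick, hedged illustration of that last point, here is a minimal sketch using the standard library's concurrent.futures module; the count_primes function and the input values are invented purely for demonstration. Because each task runs in its own process, the GIL never becomes the bottleneck.

# prime_count_parallel.py (hypothetical file, illustration only)
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit: int) -> int:
    """Deliberately naive CPU-bound work."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [200_000, 300_000, 400_000, 500_000]
    # Each limit is handled in a separate Python process, so the GIL
    # never serialises the number crunching.
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(count_primes, limits)))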

Another method we can use is to call other, higher-performance languages from within Python for time-critical sections of our code. That's what we'll cover in this article as I show you how to call Mojo code from Python.

Have you heard of Mojo before? If not, here's a quick history lesson.

Mojo is a relatively new systems-level language developed by Modular Inc. (an AI infrastructure company co-founded in 2022 by compiler legend Chris Lattner, of LLVM and Swift fame, and former Google TPU lead Tim Davis) and first shown publicly in May 2023.

It was born from a simple pain point: Python's lack of performance, which we discussed earlier. Mojo tackles this head-on by grafting a superset of Python's syntax onto an LLVM/MLIR-based compiler pipeline that delivers zero-cost abstractions, static typing, ownership-based memory management, automatic vectorisation, and seamless code generation for CPUs and GPU accelerators.

Early benchmarks demoed at its launch ran kernel-dense workloads up to 35,000× faster than vanilla Python, suggesting that Mojo can match or even exceed the raw throughput of C/CUDA while letting developers stay in familiar "Pythonic" territory.

Nonetheless, there's always a stumbling block, and that's people's inertia when it comes to moving wholesale to a brand-new language. I'm one of those people, too, so I was delighted when I read that, as of a few weeks ago, it's now possible to call Mojo code directly from Python.

Does this mean we get the best of both worlds: the simplicity of Python and the performance of Mojo?

To test that claim, we'll write some code using vanilla Python. For each example, we'll also code a version using NumPy and, finally, a Python version that offloads some of its computation to a Mojo module. At the end, we'll compare the various run times.

Will we see significant performance gains? Read on to find out.

Setting up a development environment

I'll be using WSL2 Ubuntu for Windows for my development. Best practice is to set up a separate development environment for each project you work on. I usually use conda for this, but as everyone and their granny seems to be moving towards the new uv package manager, I'm going to give that a go instead. There are a couple of ways you can install uv.

$ curl -LsSf https://astral.sh/uv/install.sh | sh

or...

$ pip install uv

Next, initialise a project.

$ uv init mojo-test
Initialized project `mojo-test` at `/home/tom/projects/mojo-test`
$ cd mojo-test
$ uv venv
$ source .venv/bin/activate
(mojo-test) $ ls -al
total 28
drwxr-xr-x  3 tom tom 4096 Jun 27 09:20 .
drwxr-xr-x 15 tom tom 4096 Jun 27 09:20 ..
drwxr-xr-x  7 tom tom 4096 Jun 27 09:20 .git
-rw-r--r--  1 tom tom  109 Jun 27 09:20 .gitignore
-rw-r--r--  1 tom tom    5 Jun 27 09:20 .python-version
-rw-r--r--  1 tom tom    0 Jun 27 09:20 README.md
-rw-r--r--  1 tom tom   87 Jun 27 09:20 main.py
-rw-r--r--  1 tom tom  155 Jun 27 09:20 pyproject.toml

Now, add the external libraries we'll need.

(mojo-test) $ uv pip install modular numpy matplotlib

How does calling Mojo from Python work?

Let's assume we have the following simple Mojo function that takes a Python variable as an argument and adds two to its value. For example,

# mojo_func.mojo
#
fn add_two(py_obj: PythonObject) raises -> PythonObject:
    var n = Int(py_obj)
    return n + 2

When Python attempts to load this module, it looks for a function called PyInit_mojo_func() (PyInit_ followed by the module name). Inside PyInit_mojo_func(), we have to declare all the Mojo functions and types that are callable from Python, using the PythonModuleBuilder API. So, in fact, our Mojo code in its final form will resemble this.

from python import PythonObject
from python.bindings import PythonModuleBuilder
from os import abort

@export
fn PyInit_mojo_func() -> PythonObject:
    try:
        var m = PythonModuleBuilder("mojo_func")
        m.def_function[add_two]("add_two", docstring="Add 2 to n")
        return m.finalize()
    except e:
        return abort[PythonObject](String("Error creating Python Mojo module: ", e))

fn add_two(py_obj: PythonObject) raises -> PythonObject:
    var n = Int(py_obj)
    return n + 2

The Python side requires a little additional boilerplate code to work correctly, as shown here.

import max.mojo.importer
import sys

sys.path.insert(0, "")

import mojo_func

print(mojo_func.add_two(5))

# Should print 7

Code examples

For each of my examples, I'll show three different versions of the code. One will be written in pure Python, one will use NumPy to speed things up, and the other will offload calls to Mojo where appropriate.

Be warned that calling Mojo code from Python is in early development, so you can expect significant changes to the API and ergonomics.

Example 1 — Calculating a Mandelbrot set

For our first example, we'll compute and display a Mandelbrot set. This is quite computationally expensive and, as we'll see, the pure Python version takes a substantial amount of time to complete.

We'll need four files in total for this example.

1/ mandelbrot_pure_py.py

# mandelbrot_pure_py.py
def compute(width, height, max_iters):
    """Generates a Mandelbrot set image using pure Python."""
    image = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            c = complex(-2.0 + 3.0 * col / width, -1.5 + 3.0 * row / height)
            z = 0
            n = 0
            while abs(z) <= 2 and n < max_iters:
                z = z*z + c
                n += 1
            image[row][col] = n
    return image

2/ mandelbrot_numpy.py

# mandelbrot_numpy.py

import numpy as np

def compute(width, height, max_iters):
    """Generates a Mandelbrot set using NumPy for vectorized computation."""
    x = np.linspace(-2.0, 1.0, width)
    y = np.linspace(-1.5, 1.5, height)
    c = x[:, np.newaxis] + 1j * y[np.newaxis, :]
    z = np.zeros_like(c, dtype=np.complex128)
    image = np.zeros(c.shape, dtype=int)

    for n in range(max_iters):
        not_diverged = np.abs(z) <= 2
        image[not_diverged] = n
        z[not_diverged] = z[not_diverged]**2 + c[not_diverged]
        
    image[np.abs(z) <= 2] = max_iters
    return image.T

3/ mandelbrot_mojo.mojo

# mandelbrot_mojo.mojo 

from python import PythonObject, Python
from python.bindings import PythonModuleBuilder
from os import abort
from complex import ComplexFloat64

# This is the core logic that will run fast in Mojo
fn compute_mandel_pixel(c: ComplexFloat64, max_iters: Int) -> Int:
    var z = ComplexFloat64(0, 0)
    var n: Int = 0
    while n < max_iters:
        # abs(z) > 2 is the same as z.norm() > 4, which is faster
        if z.norm() > 4.0:
            break
        z = z * z + c
        n += 1
    return n

# This is the function that Python will call
fn mandelbrot_mojo_compute(width_obj: PythonObject, height_obj: PythonObject, max_iters_obj: PythonObject) raises -> PythonObject:
    
    var width = Int(width_obj)
    var height = Int(height_obj)
    var max_iters = Int(max_iters_obj)

    # We'll build a Python list in Mojo to return the results
    var image_list = Python.list()

    for row in range(height):
        # We create a nested list to represent the 2D image
        var row_list = Python.list()
        for col in range(width):
            var c = ComplexFloat64(
                -2.0 + 3.0 * col / width,
                -1.5 + 3.0 * row / height
            )
            var n = compute_mandel_pixel(c, max_iters)
            row_list.append(n)
        
        image_list.append(row_list)
            
    return image_list

# This is the special function that "exports" our Mojo function to Python
@export
fn PyInit_mandelbrot_mojo() -> PythonObject:
    try:
       
        var m = PythonModuleBuilder("mandelbrot_mojo")
        m.def_function[mandelbrot_mojo_compute]("compute", "Generates a Mandelbrot set.")
        return m.finalize()
    except e:
        return abort[PythonObject]("error creating mandelbrot_mojo module")

4/ main.py

This will call the other three programs and also let us plot the Mandelbrot set in a Jupyter Notebook. I'll only show the plot once; you'll have to take my word that it was plotted correctly on all three runs of the code.

# main.py (Final version with visualization)

import time
import numpy as np
import sys

import matplotlib.pyplot as plt # Now, import pyplot

# --- Mojo Setup ---
try:
    import max.mojo.importer
except ImportError:
    print("Mojo importer not found. Please make sure the MODULAR_HOME and PATH are set accurately.")
    sys.exit(1)

sys.path.insert(0, "")

# --- Import Our Modules ---
import mandelbrot_pure_py
import mandelbrot_numpy
import mandelbrot_mojo

# --- Visualization Function ---
def visualize_mandelbrot(image_data, title="Mandelbrot Set"):
    """Displays the Mandelbrot set data as a picture using Matplotlib."""
    print(f"Displaying image for: {title}")
    plt.figure(figsize=(10, 8))
    # 'hot', 'inferno', and 'plasma' are all great colormaps for this
    plt.imshow(image_data, cmap='hot', interpolation='bicubic')
    plt.colorbar(label="Iterations")
    plt.title(title)
    plt.xlabel("Width")
    plt.ylabel("Height")
    plt.show()

# --- Test Runner ---
def run_test(name, compute_func, *args):
    """A helper function to run and time a test."""
    print(f"Running {name} version...")
    start_time = time.time()
    # The compute function returns the image data
    result_data = compute_func(*args)
    duration = time.time() - start_time
    print(f"-> {name} version took: {duration:.4f} seconds")
    # Return the data so we can visualize it
    return result_data

if __name__ == "__main__":
    WIDTH, HEIGHT, MAX_ITERS = 800, 600, 5000
    
    print("Starting Mandelbrot performance comparison...")
    print("-" * 40)

    # Run Pure Python Test
    py_image = run_test("Pure Python", mandelbrot_pure_py.compute, WIDTH, HEIGHT, MAX_ITERS)
    visualize_mandelbrot(py_image, "Pure Python Mandelbrot")

    print("-" * 40)

    # Run NumPy Test
    np_image = run_test("NumPy", mandelbrot_numpy.compute, WIDTH, HEIGHT, MAX_ITERS)
    # uncomment the line below if you want to see the plot
    #visualize_mandelbrot(np_image, "NumPy Mandelbrot")

    print("-" * 40)

    # Run Mojo Test
    mojo_list_of_lists = run_test("Mojo", mandelbrot_mojo.compute, WIDTH, HEIGHT, MAX_ITERS)
    # Convert Mojo's list of lists into a NumPy array for visualization
    mojo_image = np.array(mojo_list_of_lists)
    # uncomment the line below if you want to see the plot
    #visualize_mandelbrot(mojo_image, "Mojo Mandelbrot")

    print("-" * 40)
    print("Comparison complete.")

Finally, here is the output.

Image by Author

Okay, so that's a strong start for Mojo. It was almost 20 times faster than the pure Python implementation and 5 times faster than the NumPy code.

Example 2 — Numerical integration

For this example, we'll perform numerical integration using Simpson's rule to determine the value of the integral of sin(x) over the interval 0 to π. Recall that Simpson's rule is a technique for calculating an approximate value of an integral and is defined as,

∫ f(x) dx ≈ (h/3) × [f(x₀) + 4f(x₁) + 2f(x₂) + 4f(x₃) + … + 2f(xₙ₋₂) + 4f(xₙ₋₁) + f(xₙ)]

Where:

  • h is the width of each step.
  • The weights are 1, 4, 2, 4, 2, …, 4, 1.
  • The first and last points have a weight of 1.
  • The interior points at odd indices have a weight of 4.
  • The interior points at even indices have a weight of 2.

The true value of the integral we're trying to calculate is exactly 2. Let's see how accurate (and fast) our methods are.
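
As a quick sanity check of the formula above (a throwaway sketch, not one of the four benchmark files), even a coarse grid of just n = 4 intervals lands close to that true value:

# simpson_check.py (illustration only)
import math

n = 4
h = math.pi / n                         # step width
xs = [i * h for i in range(n + 1)]      # x0 .. x4
weights = [1, 4, 2, 4, 1]               # Simpson weights for n = 4

approx = (h / 3) * sum(w * math.sin(x) for w, x in zip(weights, xs))
print(approx)   # ~2.0046, already within about 0.25% of the true value 2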

Once again, we need four files.

1/ integration_pure_py.py

# integration_pure_py.py
import math

def compute(start, end, n):
    """Calculates the definite integral of sin(x) using Simpson's rule."""
    if n % 2 != 0:
        n += 1 # Simpson's rule requires an even number of intervals
    
    h = (end - start) / n
    integral = math.sin(start) + math.sin(end)

    for i in range(1, n, 2):
        integral += 4 * math.sin(start + i * h)
    
    for i in range(2, n, 2):
        integral += 2 * math.sin(start + i * h)
        
    integral *= h / 3
    return integral

2/ integration_numpy.py

# integration_numpy.py
import numpy as np

def compute(start, end, n):
    """Calculates the definite integral of sin(x) using NumPy."""
    if n % 2 != 0:
        n += 1
    
    x = np.linspace(start, end, n + 1)
    y = np.sin(x)
    
    # Apply Simpson's rule weights: 1, 4, 2, 4, ..., 2, 4, 1
    integral = (y[0] + y[-1] + 4 * np.sum(y[1:-1:2]) + 2 * np.sum(y[2:-1:2]))
    
    h = (end - start) / n
    return integral * h / 3

3/ integration_mojo.mojo

# integration_mojo.mojo
from python import PythonObject, Python
from python.bindings import PythonModuleBuilder
from os import abort
from math import sin

# Note: The 'fn' keyword is used here because it's compatible with all versions.
fn compute_integral_mojo(start_obj: PythonObject, end_obj: PythonObject, n_obj: PythonObject) raises -> PythonObject:
    # Bridge crossing happens ONCE at the start.
    var start = Float64(start_obj)
    var end = Float64(end_obj)
    var n = Int(n_obj)

    if n % 2 != 0:
        n += 1
    
    var h = (end - start) / n
    
    # All computation below is on NATIVE Mojo types. No Python interop.
    var integral = sin(start) + sin(end)

    # First loop for the '4 * f(x)' terms
    var i_1: Int = 1
    while i_1 < n:
        integral += 4 * sin(start + i_1 * h)
        i_1 += 2

    # Second loop for the '2 * f(x)' terms
    var i_2: Int = 2
    while i_2 < n:
        integral += 2 * sin(start + i_2 * h)
        i_2 += 2
        
    integral *= h / 3
    
    # Bridge crossing happens ONCE at the end.
    return Python.float(integral)

@export
fn PyInit_integration_mojo() -> PythonObject:
    try:
        var m = PythonModuleBuilder("integration_mojo")
        m.def_function[compute_integral_mojo]("compute", "Calculates a definite integral in Mojo.")
        return m.finalize()
    except e:
        return abort[PythonObject]("error creating integration_mojo module")

4/ main.py

import time
import sys
import numpy as np

# --- Mojo Setup ---
try:
    import max.mojo.importer
except ImportError:
    print("Mojo importer not found. Please ensure your environment is about up accurately.")
    sys.exit(1)
sys.path.insert(0, "")

# --- Import Our Modules ---
import integration_pure_py
import integration_numpy
import integration_mojo

# --- Test Runner ---
def run_test(name, compute_func, *args):
    print(f"Running {name} version...")
    start_time = time.time()
    result = compute_func(*args)
    duration = time.time() - start_time
    print(f"-> {name} version took: {duration:.4f} seconds")
    print(f"   Result: {result}")

# --- Main Test Execution ---
if __name__ == "__main__":
    # Use a very large number of steps to highlight loop performance
    START = 0.0
    END = np.pi 
    NUM_STEPS = 100_000_000 # 100 million steps
    
    print(f"Calculating integral of sin(x) from {START} to {END:.2f} with {NUM_STEPS:,} steps...")
    print("-" * 50)

    run_test("Pure Python", integration_pure_py.compute, START, END, NUM_STEPS)
    print("-" * 50)
    run_test("NumPy", integration_numpy.compute, START, END, NUM_STEPS)
    print("-" * 50)
    run_test("Mojo", integration_mojo.compute, START, END, NUM_STEPS)
    print("-" * 50)
    print("Comparison complete.")

And the results?

Calculating integral of sin(x) from 0.0 to 3.14 with 100,000,000 steps...
--------------------------------------------------
Running Pure Python version...
-> Pure Python version took: 4.9484 seconds
   Result: 2.0000000000000346
--------------------------------------------------
Running NumPy version...
-> NumPy version took: 0.7425 seconds
   Result: 1.9999999999999998
--------------------------------------------------
Running Mojo version...
-> Mojo version took: 0.8902 seconds
   Result: 2.0000000000000346
--------------------------------------------------
Comparison complete.

It's interesting that this time the NumPy code was marginally faster than the Mojo code, and its final value was more accurate. This highlights a key trade-off in high-performance computing: vectorisation versus JIT-compiled loops.

NumPy's strength lies in its ability to vectorise operations. It allocates a large block of memory and then calls highly optimised, pre-compiled C code that leverages modern CPU features, such as SIMD, to apply the sin() function to millions of values at once. This "burst processing" is incredibly efficient.
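
To make that concrete, here is a tiny standalone sketch (not part of the benchmark files) that contrasts a per-element Python loop with a single vectorised NumPy call over the same data; on most machines the vectorised call wins by a wide margin:

# vectorisation_demo.py (illustration only)
import math
import time

import numpy as np

x = np.linspace(0.0, math.pi, 5_000_000)

t0 = time.perf_counter()
loop_total = sum(math.sin(v) for v in x)   # one Python-level sin() call per element
loop_secs = time.perf_counter() - t0

t0 = time.perf_counter()
vec_total = float(np.sin(x).sum())         # one call into pre-compiled, SIMD-friendly C code
vec_secs = time.perf_counter() - t0

print(f"loop: {loop_secs:.3f}s  vectorised: {vec_secs:.3f}s  (sums ~equal: {loop_total:,.1f} vs {vec_total:,.1f})")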

Mojo, on the other hand, takes our simple while loop and JIT-compiles it into highly efficient machine code. While this avoids NumPy's large up-front memory allocation, in this specific case the raw power of NumPy's vectorisation gave it a slight edge.

Example 3 — The sigmoid function

The sigmoid function is a vital concept in AI because it’s the cornerstone of binary classification. 

Also known as the logistic function, it is defined as S(x) = 1 / (1 + e⁻ˣ).

The sigmoid function takes any real-valued input x and smoothly "squashes" it into the open interval (0, 1). In simple terms, no matter what is passed to the sigmoid function, it will always return a value between 0 and 1.

So, for instance,

S(-197865) = 0
S(-2) = 0.1192029
S(3) = 0.9525741
S(10776.87) = 1

This makes it perfect for representing quantities such as probabilities.
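
If you want to verify those figures yourself, a few lines of plain Python will do it (a throwaway sketch, not part of the benchmark code):

# sigmoid_check.py (illustration only)
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

print(f"{sigmoid(-2):.7f}")   # 0.1192029
print(f"{sigmoid(3):.7f}")    # 0.9525741
# Very large |x| saturates towards 0 or 1. Note that math.exp overflows for
# hugely negative inputs, so extreme values need a numerically stable variant.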

Since the Python and NumPy code is simpler this time, we can include it directly in the benchmarking script, so we only need two files.

sigmoid_mojo.mojo

from python               import Python, PythonObject
from python.bindings      import PythonModuleBuilder
from os                   import abort
from math                 import exp
from time                 import perf_counter

# ----------------------------------------------------------------------
#   Fast Mojo routine (no Python calls inside)
# ----------------------------------------------------------------------
fn sigmoid_sum(n: Int) -> (Float64, Float64):
    # deterministic fill, sized once
    var data = List[Float64](length = n, fill = 0.0)
    for i in range(n):
        data[i] = (Float64(i) / Float64(n)) * 10.0 - 5.0   # [-5, +5]

    var t0: Float64 = perf_counter()
    var total: Float64 = 0.0
    for x in data:                       # single tight loop
        total += 1.0 / (1.0 + exp(-x))
    var elapsed: Float64 = perf_counter() - t0
    return (total, elapsed)

# ----------------------------------------------------------------------
#   Python-visible wrapper
# ----------------------------------------------------------------------
fn py_sigmoid_sum(n_obj: PythonObject) raises -> PythonObject:
    var n: Int = Int(n_obj)                        # validates arg
    var (tot, secs) = sigmoid_sum(n)

    # safest container: construct a Python list (auto-boxes scalars)
    var out = Python.list()
    out.append(tot)
    out.append(secs)
    return out                                     # -> PythonObject

# ----------------------------------------------------------------------
#   Module initialiser  (name must match:  PyInit_sigmoid_mojo)
# ----------------------------------------------------------------------
@export
fn PyInit_sigmoid_mojo() -> PythonObject:
    try:
        var m = PythonModuleBuilder("sigmoid_mojo")
        m.def_function[py_sigmoid_sum](
            "sigmoid_sum",
            "Return [total_sigmoid, elapsed_seconds]"
        )
        return m.finalize()
    except e:
        # if anything raises, give Python an actual ImportError
        return abort[PythonObject]("error creating sigmoid_mojo module")

sigmoid_bench.py

# sigmoid_bench.py
import time, math, numpy as np

N = 50_000_000  

# --------------------------- pure-Python -----------------------------------
py_data = [(i / N) * 10.0 - 5.0 for i in range(N)]
t0 = time.perf_counter()
py_total = sum(1 / (1 + math.exp(-x)) for x in py_data)
print(f"Pure-Python : {time.perf_counter()-t0:6.3f} s  - Σσ={py_total:,.1f}")

# --------------------------- NumPy -----------------------------------------
np_data = np.linspace(-5.0, 5.0, N, dtype=np.float64)
t0 = time.perf_counter()
np_total = float(np.sum(1 / (1 + np.exp(-np_data))))
print(f"NumPy       : {time.perf_counter()-t0:6.3f} s  - Σσ={np_total:,.1f}")

# --------------------------- Mojo ------------------------------------------
import max.mojo.importer          # installs .mojo import hook
import sigmoid_mojo               # compiles & loads shared object

mj_total, mj_secs = sigmoid_mojo.sigmoid_sum(N)
print(f"Mojo        : {mj_secs:6.3f} s  - Σσ={mj_total:,.1f}")

Here is the output.

$ python sigmoid_bench.py
Pure-Python :  1.847 s  - Σσ=24,999,999.5
NumPy       :  0.323 s  - Σσ=25,000,000.0
Mojo        :  0.150 s  - Σσ=24,999,999.5

The Σσ=… outputs show the sum of all the calculated sigmoid values. Because sigmoid(x) + sigmoid(-x) = 1, the values across our symmetric [-5, 5] range pair up, so in theory this sum should tend towards exactly N divided by 2 as N grows.

But as we can see, the Mojo implementation represents a decent uplift of over 2x on the already fast NumPy code and is over 12x faster than the base Python implementation.

Not too shabby.

Summary

This article explored the exciting new capability of calling high-performance Mojo code directly from Python to speed up computationally intensive tasks. Mojo, a relatively new systems programming language from Modular, promises C-level performance with a familiar Pythonic syntax, aiming to solve Python's historical speed limitations.

To test this promise, we benchmarked three computationally expensive scenarios: Mandelbrot set generation, numerical integration, and the sigmoid function. We implemented each in pure Python, optimised NumPy, and a hybrid Python-Mojo approach.

The results reveal a nuanced performance landscape. For loop-heavy algorithms where the data can be processed entirely with native Mojo types, Mojo can significantly outperform both pure Python and even highly optimised NumPy code. However, we also saw that for tasks that align perfectly with NumPy's vectorised, pre-compiled C functions, NumPy can maintain a slight edge over Mojo.

This investigation demonstrates that while Mojo is a powerful new tool for Python acceleration, achieving maximum performance requires a thoughtful approach to minimising the "bridge-crossing" overhead between the two language runtimes.

As always, when considering performance enhancements to your code: test, test, test. That's the final arbiter of whether a change is worthwhile or not.
