Ever since I was a child, I have been fascinated by drawing. What struck me was not only the act of drawing itself, but also the idea that every drawing could be improved more and more. I remember reaching a very high level with my drawing style. Nevertheless, once I reached the peak of perfection, I would try to see how I could improve the drawing even further – alas, with disastrous results.
Since then I have always believed in the same mantra: “refine and iterate and you’ll reach perfection”. At university, my approach was to read books repeatedly, expanding my knowledge by looking for other sources and for hidden layers of meaning in each concept. Today, I apply the same philosophy to AI/ML and coding.
We all know that matrix multiplication (matmul for short) is the core part of any AI process. A while back I developed LLM.rust, a Rust mirror of Karpathy’s LLM.c. The hardest point in the Rust implementation was the matrix multiplication: since we have to perform thousands of iterations to fine-tune a GPT-based model, we need an efficient matmul operation. For this purpose, I had to use the BLAS library, adopting an unsafe strategy to overcome its boundaries and barriers. The use of unsafe goes against Rust’s philosophy, which is why I am always looking for safer ways to improve matmul in this context.
So, taking inspiration from Sam Altman’s statement – “ask GPT how to create value” – I decided to ask local LLMs to generate, benchmark, and iterate on their own algorithms to create a better, native Rust matmul implementation.
The challenge has some constraints:
- We need to use our local environment. In my case, a MacBook Pro M3 with 36GB of RAM;
- Overcome the token limits of the local model;
- Time and benchmark the code within the generation loop itself.
I know that achieving BLAS-level performance with this approach is nearly impossible, but I want to highlight how we can leverage AI for custom needs, even with our “tiny” laptops, so that we can unblock ideas and push boundaries in any field. This post wants to be an inspiration for practitioners and for people who want to get more familiar with Microsoft Autogen and local LLM deployment.
The full code implementation can be found in this Github repo. This is an ongoing experiment, and many changes/improvements will be committed.
General idea
The general idea is to have a roundtable of agents. The starting point is the MrAderMacher Mixtral 8x7B Q4_K_M local model. From this model we create 5 entities:
- the Proposer comes up with a new Strassen-like algorithm, to find a better and more efficient way to perform matmul;
- the Verifier reviews the matmul formulation through symbolic math;
- the Coder creates the underlying Rust code;
- the Tester executes it and saves all the data to the vector database;
- the Manager acts silently, controlling the overall workflow.
| Agent | Role |
| --- | --- |
| Proposer | Analyses benchmark times and proposes new tuning parameters and matmul formulations. |
| Verifier | (Currently disabled in the code.) Verifies the Proposer’s mathematical formulation through symbolic verification. |
| Coder | Takes the parameters and works out the Rust template code. |
| Tester | Runs the Rust code, saves it, and computes the benchmark timing. |
| Manager | Overall control of the workflow. |
The overall workflow is orchestrated through Microsoft Autogen, as depicted in fig.1.
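In practice, each of these roles boils down to the shared base agent plus a role-specific system prompt. The prompts below are illustrative placeholders, not the ones used in the repo:

```python
# Illustrative role prompts (placeholders, not the repo's actual prompts).
SYSTEM_PROMPTS = {
    "proposer": (
        "You study the latest benchmark times from the vector database and propose a new "
        "matmul formulation or new tuning parameters (block size, loop order, parallelism)."
    ),
    "verifier": (
        "You check the proposed formulation symbolically and confirm it really computes C = A * B."
    ),
    "coder": (
        "You turn the accepted formulation and parameters into a complete Rust implementation."
    ),
    "tester": (
        "You run the Rust code, record the benchmark timing, and archive code and results "
        "in the vector database."
    ),
    "manager": (
        "You route messages so that the agents act strictly in the order "
        "Proposer -> Verifier -> Coder -> Tester."
    ),
}
```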
Prepare the input data and vector database
The input data is collected from academic papers focused on matrix multiplication optimisation. Many of these papers are referenced in, and related to, DeepMind’s Strassen paper. I wanted to start simply, so I collected 50 papers, published from 2020 to 2025, that specifically address matrix multiplication.
Next, I used chroma to create the vector database. The critical aspect in generating a new vector database is how the PDFs are chunked. In this context, I used a semantic chunker. Unlike plain text-splitting methods, the semantic chunker uses the actual meaning of the text to determine where to cut. The goal is to keep related sentences together in a single chunk, making the final vector database more coherent and accurate. This is done using the local embedding model BAAI/bge-base-en-v1.5. The complete implementation is available as a Github gist.
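As a rough sketch of the idea (not the gist itself; the PDF file name and the similarity threshold are assumptions), a minimal semantic chunker built on pypdf, sentence-transformers and chromadb could look like this:

```python
import chromadb
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")

def semantic_chunks(text: str, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences into a chunk while their embeddings stay similar."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if not sentences:
        return []
    embs = embedder.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embs[i - 1], embs[i]))  # cosine similarity (embeddings are normalised)
        if similarity < threshold:  # the meaning shifted: close the current chunk
            chunks.append(". ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(". ".join(current))
    return chunks

client = chromadb.PersistentClient(path="./matmul_papers_db")
collection = client.get_or_create_collection(name="matmul_papers")

text = " ".join(page.extract_text() or "" for page in PdfReader("paper_01.pdf").pages)
chunks = semantic_chunks(text)
collection.add(
    ids=[f"paper_01_{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[e.tolist() for e in embedder.encode(chunks, normalize_embeddings=True)],
)
```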
The core code: autogen-core and GGML models
I have used Microsoft Autogen, in particular the autogen-core variant (version 0.7.5). Unlike the higher-level chat API, autogen-core gives us access to low-level, event-driven building blocks, which are necessary to create the state-machine-driven workflow we need. As a matter of fact, the challenge is to maintain a strict workflow: all the agents must act in a specific order, Proposer -> Verifier -> Coder -> Tester.
The core part is the BaseMatMulAgent, which inherits from AutoGen’s RoutedAgent. This base class allows us to standardise how the LLM agents participate in the chat and how they behave.
From the code in the repo, we can see the class is designed to take part in an asynchronous group chat, handling conversation history and calls to external tools, and generating responses through the local LLM.
The core component is @message_handler, a decorator that registers a method as a listener, or subscriber, based on the message type. The decorator automatically detects the type hint of the method’s first argument – in our case message: GroupChatMessage – and subscribes the agent to receive any events of that type sent to the agent’s topic. The handle_message async method is then responsible for updating the agent’s internal memory, without generating a response.
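A minimal sketch of this pattern with autogen-core (illustrative, not the repo’s exact class; the real BaseMatMulAgent also wires in the local LLM, tool calls and response generation):

```python
from dataclasses import dataclass
from autogen_core import MessageContext, RoutedAgent, default_subscription, message_handler

@dataclass
class GroupChatMessage:
    source: str   # which agent produced the message
    content: str  # the payload: a proposal, Rust code, a benchmark report, ...

@default_subscription
class BaseMatMulAgent(RoutedAgent):
    def __init__(self, description: str) -> None:
        super().__init__(description)
        self._chat_history: list[GroupChatMessage] = []

    @message_handler
    async def handle_message(self, message: GroupChatMessage, ctx: MessageContext) -> None:
        # The `GroupChatMessage` type hint is what @message_handler routes on:
        # every message of this type published to the agent's topic lands here.
        # The base handler only updates the agent's memory; the concrete agents
        # (Proposer, Coder, Tester, ...) decide when to call the LLM and publish a reply.
        self._chat_history.append(message)
```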
With the listener-subscriber mechanism in place, we can focus on the Manager class. The MatMulManager inherits from RoutedAgent and orchestrates the overall flow of the agents.
The manager code in the repo handles all the agents. We are skipping the Verifier part for the moment. The Coder publishes the final code, and the Tester takes care of saving both the code and the whole context to the vector database. In this way, we avoid consuming all the tokens of our local model: at each new run, the model catches up on the latest generated algorithms from the vector database and proposes a new solution.
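As an illustration of the Tester side of that loop (a simplified sketch: file names, the collection name and the timing protocol are assumptions, not the repo’s exact code):

```python
import subprocess
import time
import chromadb

client = chromadb.PersistentClient(path="./matmul_runs_db")
runs = client.get_or_create_collection(name="matmul_runs")

def test_candidate(run_id: int, rust_source: str) -> float:
    """Compile the generated Rust matmul, time one run, and archive code + result."""
    src = f"candidate_{run_id}.rs"
    with open(src, "w") as f:
        f.write(rust_source)

    # Compile with optimisations; a failed build gets a sentinel timing of -1.
    build = subprocess.run(
        ["rustc", "-O", src, "-o", f"candidate_{run_id}"],
        capture_output=True, text=True,
    )
    if build.returncode != 0:
        elapsed_ms = -1.0
    else:
        start = time.perf_counter()
        subprocess.run([f"./candidate_{run_id}"], check=True)
        elapsed_ms = (time.perf_counter() - start) * 1000.0  # includes process start-up

    # Persist code and benchmark so the next run's Proposer can retrieve them from
    # the vector database instead of re-reading the whole chat history.
    runs.add(
        ids=[f"run_{run_id}"],
        documents=[rust_source],
        metadatas=[{"run": run_id, "time_ms": elapsed_ms, "build_ok": build.returncode == 0}],
    )
    return elapsed_ms
```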
An important caveat: to make sure autogen-core can work with local GGUF (llama.cpp) models on macOS, install llama-cpp-python with Metal support:
```bash
#!/bin/bash
CMAKE_ARGS="-DGGML_METAL=on" FORCE_CMAKE=1 pip install --upgrade --verbose --force-reinstall llama-cpp-python --no-cache-dir
```
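Once llama-cpp-python is built with Metal, the quantised GGUF model can be loaded locally and plugged into the agents. A minimal loading sketch (the model path and generation parameters are illustrative, and the real agents wrap this behind their own client):

```python
from llama_cpp import Llama

# Path is illustrative; the Q4_K_M quantisation of Mixtral 8x7B fits in 36GB of unified memory.
llm = Llama(
    model_path="./models/mixtral-8x7b-instruct.Q4_K_M.gguf",
    n_ctx=8192,        # context window shared by the whole agent conversation
    n_gpu_layers=-1,   # offload all layers to Metal
    verbose=False,
)

reply = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are the Proposer in a matmul optimisation roundtable."},
        {"role": "user", "content": "Propose a blocked matmul formulation for 1024x1024 f32 matrices."},
    ],
    max_tokens=512,
    temperature=0.2,
)
print(reply["choices"][0]["message"]["content"])
```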
Fig.2 summarises the whole code. We can roughly subdivide it into 3 main blocks:
- the BaseAgent, which handles messages through the LLM agents, evaluating the mathematical formulation and generating code;
- the MatMulManager, which orchestrates the whole flow of the agents;
- autogen_core.SingleThreadedAgentRuntime, which allows us to make the whole workflow a reality (see the sketch below).
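A minimal sketch of that wiring, reusing the BaseMatMulAgent and GroupChatMessage from the sketch above (in the real code each role is a subclass with its own behaviour, and the Manager enforces the ordering):

```python
import asyncio
from autogen_core import DefaultTopicId, SingleThreadedAgentRuntime

async def main() -> None:
    runtime = SingleThreadedAgentRuntime()

    # One agent type per role; all of them subscribe to the default topic.
    for role in ("proposer", "coder", "tester", "manager"):
        await BaseMatMulAgent.register(runtime, role, lambda role=role: BaseMatMulAgent(f"{role} agent"))

    runtime.start()
    # Kick off one iteration; the Manager then routes work in Proposer -> Coder -> Tester order.
    await runtime.publish_message(
        GroupChatMessage(source="user", content="Propose a faster 1024x1024 matmul in Rust."),
        topic_id=DefaultTopicId(),
    )
    await runtime.stop_when_idle()

asyncio.run(main())
```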

autogen_core.SingleThreadedAgentRuntime makes all of this work on our MacBook Pro. [Image created with Nano Banana Pro.]

Results and benchmark
All the Rust code has been revised and re-run manually. While the workflow is powerful, working with LLMs requires a critical eye: several times the model confabulated*, generating code that looked optimised but did not perform the actual matmul work.
The very first iteration generates a sort of Strassen-like algorithm (the “Run 0” code in fig.3).
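For reference, a “Strassen-like” scheme means trading the eight block multiplications of the naive blocked recursion for seven. A plain NumPy illustration of one recursion level (this is not the generated Rust of Run 0):

```python
import numpy as np

def strassen_once(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """One level of Strassen recursion: 7 block multiplications instead of 8."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(1024, 1024)
B = np.random.rand(1024, 1024)
assert np.allclose(strassen_once(A, B), A @ B)
```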
The model then moves towards better, more Rust/NEON-like implementations, so that after 4 iterations it produces the code shown as “Run 3” in fig.3.
We can see the use of intrinsics like vaddq_f32, CPU instructions specific to ARM processors, coming from std::arch::aarch64. The model manages to use rayon to split the work across multiple CPU cores, and inside the parallel threads it uses NEON intrinsics. The code itself is not totally correct; furthermore, I noticed that we run into an out-of-memory error when dealing with 1024×1024 matrices. I had to manually rework the code to make it run.
This brings us back to my mantra, “iterating to perfection”, and we can ask ourselves: can a local agent autonomously refine Rust code to the point of mastering complex NEON intrinsics? The findings show that yes, even on consumer hardware, this level of optimisation is achievable.
Fig.3 shows the final results I obtained after all the iterations.

The 0th and 2nd benchmarks are erroneous, because it is physically impossible to achieve such results for a 1024×1024 matmul on a CPU (both failure modes are illustrated in the sketch after this list):
- the first code suffers from a diagonal fallacy: it computes only the diagonal blocks of the matrix and ignores the rest;
- the second code has a broken buffer: it repeatedly overwrites a small, cache-hot buffer of 1028 floats, rather than traversing the full 1 million elements.
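To make the two failure modes concrete, here is a NumPy illustration (not the generated Rust); a correctness check of this kind against a reference result is also how such confabulated benchmarks get caught:

```python
import numpy as np

n, block = 1024, 128
A = np.random.rand(n, n)
B = np.random.rand(n, n)
reference = A @ B

# "Diagonal fallacy": only the diagonal blocks of C are computed; the rest stays zero.
C_diag = np.zeros((n, n))
for i in range(0, n, block):
    s = slice(i, i + block)
    C_diag[s, s] = A[s, :] @ B[:, s]

# "Broken buffer": every row result lands in the same small buffer,
# so the full 1M-element output matrix is never stored.
row_buffer = np.zeros(n)
for i in range(n):
    row_buffer = A[i, :] @ B  # overwrites the same n floats at every step

print(np.allclose(C_diag, reference))    # False: most of C is missing
print(row_buffer.size, reference.size)   # 1024 vs 1,048,576 elements
```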
Nevertheless, the workflow produced two real implementations, run 1 and run 3. Run 1 achieves 760 ms and constitutes an actual baseline; it suffers from cache misses and a lack of SIMD vectorisation. Run 3 records 359 ms; the improvement comes from NEON SIMD and Rayon parallelism.
*: I wrote “the model confabulated” on purpose. From a medical point of view, LLMs are not hallucinating but confabulating: hallucinations are an entirely different phenomenon from what LLMs do when they babble and generate “wrong” answers.
Conclusions
This experiment began with a question that seemed an impossible challenge: can we use consumer-grade local LLMs to discover high-performance Rust algorithms that can compete with BLAS implementations?
We can say yes – or at least we now have a sound and solid foundation on which we can build better code towards a full BLAS-like implementation in Rust.
The post showed how to work with Microsoft Autogen, in particular autogen-core, and how to create a roundtable of agents.
The base model in use is a GGUF quantisation, and it can run on a MacBook Pro M3 with 36GB of RAM.
Of course, we have not (yet) found anything better than BLAS in a single simple piece of code. Nevertheless, we showed that a local agentic workflow, on a MacBook Pro, can achieve what was previously thought to require a large cluster and large models. Eventually, the model managed to find a reasonable Rust/NEON implementation (“Run 3” above) with a speed-up of over 50% over a standard Rayon implementation. We must highlight that the backbone implementation was AI-generated.
The frontier is open. I hope this blog post can inspire you to see what limits we can overcome with local LLM deployment.
I am writing this in a personal capacity; these views are my own.
