Kirill Solodskih, PhD, is the Co-Founder and CEO of TheStage AI, as well as a seasoned AI researcher and entrepreneur with over a decade of experience in optimizing neural networks for real-world business applications. In 2024, he co-founded TheStage AI, which secured $4.5 million in funding to fully automate neural network acceleration across any hardware platform.
Previously, as a Team Lead at Huawei, Kirill led the acceleration of AI camera applications for Qualcomm NPUs, contributing to the performance of the P50 and P60 smartphones and earning multiple patents for his innovations. His research has been featured at leading conferences such as CVPR and ECCV, where it received awards and industry-wide recognition. He also hosts a podcast on AI optimization and inference.
What inspired you to co-found TheStage AI, and how did you transition from academia and research to tackling inference optimization as a startup founder?
The foundations for what eventually became TheStage AI began with my work at Huawei, where I was deep into automating deployments and optimizing neural networks. Those projects became the basis for some of our groundbreaking innovations, and that's where I saw the real challenge. Training a model is one thing, but getting it to run efficiently in the real world and making it accessible to users is another. Deployment is the bottleneck that keeps a lot of great ideas from coming to life. To make something as easy to use as ChatGPT, there are a lot of back-end challenges involved. From a technical perspective, neural network optimization is about minimizing parameters while keeping performance high. It's a tricky math problem with plenty of room for innovation.
Manual inference optimization has long been a bottleneck in AI. Can you explain how TheStage AI automates this process and why it's a game-changer?
TheStage AI tackles a major bottleneck in AI: the manual compression and acceleration of neural networks. Neural networks have billions of parameters, and figuring out which ones to remove for better performance is virtually impossible by hand. ANNA (Automated Neural Networks Analyzer) automates this process, identifying which layers to exclude from optimization, much as ZIP compression was first automated.
This changes the game by making AI adoption faster and cheaper. Instead of relying on costly manual processes, startups can optimize models automatically. The technology gives businesses a clear view of performance and cost, ensuring efficiency and scalability without guesswork.
TheStage AI claims to reduce inference costs by up to 5x. What makes your optimization technology so effective compared to traditional methods?
TheStage AI cuts inference costs by up to 5x with an optimization approach that goes beyond traditional methods. Instead of applying the same algorithm to the entire neural network, ANNA breaks it down into smaller layers and decides which algorithm to apply to each part, delivering the desired compression while maximizing the model's quality. By combining smart mathematical heuristics with efficient approximations, our approach is highly scalable and makes AI adoption easier for businesses of all sizes. We also integrate flexible compiler settings to optimize networks for specific hardware like iPhones or NVIDIA GPUs. This gives us more control to fine-tune performance, increasing speed without losing quality.
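As a rough sketch of the idea (the helper names and thresholds here are illustrative, not ANNA's actual interface), a per-layer plan could be built by probing how sensitive each layer is and only applying aggressive compression where the model can tolerate it:

    import torch
    import torch.nn as nn

    # Illustrative only: a toy per-layer compression planner in the spirit described above.
    CANDIDATES = ["none", "prune-50%", "int8-quantize"]  # least to most aggressive

    def sensitivity(layer: nn.Linear) -> float:
        """Cheap proxy: how much the layer's output moves under a small weight perturbation."""
        x = torch.randn(32, layer.in_features)
        with torch.no_grad():
            base = layer(x)
            original = layer.weight.data.clone()
            layer.weight.data += 0.01 * original.std() * torch.randn_like(original)
            perturbed = layer(x)
            layer.weight.data = original
        return (base - perturbed).abs().mean().item()

    def plan_compression(model: nn.Module, threshold: float = 0.05) -> dict:
        """Assign an aggressive scheme to tolerant layers and exclude sensitive ones."""
        plan = {}
        for name, layer in model.named_modules():
            if isinstance(layer, nn.Linear):
                plan[name] = "none" if sensitivity(layer) > threshold else CANDIDATES[-1]
        return plan

    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
    print(plan_compression(model))  # e.g. {'0': 'int8-quantize', '2': 'none'}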
How does TheStage AI’s inference acceleration compare to PyTorch’s native compiler, and what benefits does it offer AI developers?
TheStage AI accelerates inference far beyond the native PyTorch compiler. PyTorch uses a “just-in-time” compilation approach, which compiles the model each time it is launched. This results in long startup times, sometimes taking minutes or even longer. In scalable environments, this can create inefficiencies, especially when new GPUs need to be brought online to handle increased user load, causing delays that impact the user experience.
In contrast, TheStage AI allows models to be pre-compiled, so once a model is ready, it can be deployed immediately. This leads to faster rollouts, improved service efficiency, and cost savings. Developers can deploy and scale AI models faster, without the bottlenecks of traditional compilation, making the process more efficient and responsive for high-demand use cases.
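To make the startup-time point concrete, here is a minimal sketch using PyTorch's own torch.compile: the first call through the compiled model pays the just-in-time compilation cost, which is what a freshly started replica experiences, while later calls reuse the compiled code. Exact timings depend on the model and hardware.

    import time
    import torch
    import torch.nn as nn

    # Minimal illustration of JIT warm-up cost.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).eval()
    compiled = torch.compile(model)   # compilation happens lazily, on the first call
    x = torch.randn(8, 1024)

    t0 = time.perf_counter()
    with torch.no_grad():
        compiled(x)                   # cold start: graph capture + code generation
    print(f"first call:  {time.perf_counter() - t0:.2f}s")

    t0 = time.perf_counter()
    with torch.no_grad():
        compiled(x)                   # warm: reuses the already-compiled code
    print(f"second call: {time.perf_counter() - t0:.4f}s")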
Can you share more about TheStage AI's QLIP toolkit and how it enhances model performance while maintaining quality?
QLIP, TheStage AI's toolkit, is a Python library that provides an essential set of primitives for quickly building new optimization algorithms tailored to different hardware, like GPUs and NPUs. The toolkit includes components like quantization, pruning, sparsification, compilation, and serving, all critical for developing efficient, scalable AI systems.
What sets QLIP apart is its flexibility. It lets AI engineers prototype and implement new algorithms with just a few lines of code. For example, a recent AI conference paper on quantizing neural networks can be turned into a working algorithm using QLIP's primitives in minutes. This makes it easy for developers to integrate the latest research into their models without being held back by rigid frameworks.
Unlike traditional open-source frameworks that restrict you to a fixed set of algorithms, QLIP allows anyone to add new optimization techniques. This adaptability helps teams stay ahead of the rapidly evolving AI landscape, improving performance while ensuring flexibility for future innovations.
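As an analogy rather than QLIP itself, stock PyTorch primitives can already be chained into a small optimization pipeline in a few lines; this is the kind of workflow a primitives-based toolkit generalizes across hardware targets:

    import torch
    import torch.nn as nn
    from torch.nn.utils import prune

    # Not QLIP: an analogous pipeline built from standard PyTorch primitives.
    model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

    for module in model.modules():                       # pruning primitive
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.5)
            prune.remove(module, "weight")                # make the sparsity permanent

    model = torch.ao.quantization.quantize_dynamic(       # quantization primitive
        model, {nn.Linear}, dtype=torch.qint8
    )
    print(model(torch.randn(1, 256)).shape)               # torch.Size([1, 10])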
You've contributed to AI quantization frameworks used in Huawei's P50 & P60 cameras. How did that experience shape your approach to AI optimization?
My experience working on AI quantization frameworks for Huawei's P50 and P60 gave me valuable insights into how optimization can be streamlined and scaled. When I first started with PyTorch, working with the whole execution graph of neural networks was rigid, and quantization algorithms had to be implemented manually, layer by layer. At Huawei, I built a framework that automated the process. You simply input the model, and it would automatically generate the code for quantization, eliminating manual work.
This led me to realize that automation in AI optimization is about enabling speed without sacrificing quality. One of the algorithms I developed and patented became essential for Huawei, particularly when they had to transition from Kirin processors to Qualcomm due to sanctions. It allowed the team to quickly adapt neural networks to Qualcomm's architecture without losing performance or accuracy.
By streamlining and automating the process, we cut development time from over a year to just a few months. This made a significant impact on a product used by millions and shaped my approach to optimization, focusing on speed, efficiency, and minimal quality loss. That's the mindset I bring to ANNA today.
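The core of that automation can be illustrated with a short, self-contained sketch (not the Huawei framework itself): walk the model graph once and apply weight quantization to every eligible layer, instead of hand-writing per-layer code.

    import torch
    import torch.nn as nn

    def fake_quantize_int8(w: torch.Tensor) -> torch.Tensor:
        """Symmetric int8 quantize-dequantize of a weight tensor."""
        scale = w.abs().max().clamp_min(1e-8) / 127.0
        return torch.clamp((w / scale).round(), -127, 127) * scale

    def quantize_all_weights(model: nn.Module) -> nn.Module:
        # The "automation": one generic pass instead of manual, layer-by-layer work.
        with torch.no_grad():
            for _, module in model.named_modules():
                if isinstance(module, (nn.Linear, nn.Conv2d)):
                    module.weight.copy_(fake_quantize_int8(module.weight))
        return model

    model = quantize_all_weights(nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 3, 3)))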
Your research has been featured at CVPR and ECCV. What are some of the key breakthroughs in AI efficiency that you're most proud of?
When I'm asked about my achievements in AI efficiency, I always think back to our paper that was selected for an oral presentation at CVPR 2023. Being selected for an oral presentation at such a conference is rare, as only 12 papers are chosen. Add to that the fact that generative AI typically dominates the spotlight, and our paper took a different approach, focusing on the mathematical side, specifically the analysis and compression of neural networks.
We developed a method that helped us understand how many parameters a neural network truly needs to operate efficiently. By applying techniques from functional analysis and moving from a discrete to a continuous formulation, we were able to achieve good compression results while keeping the ability to integrate these changes back into the model. The paper also introduced several novel algorithms that hadn't been used by the community and have since found further application.
This was one of my first papers in the field of AI, and importantly, it was the result of our team's collective effort, including my co-founders. It was a significant milestone for all of us.
Can you explain how Integral Neural Networks (INNs) work and why they're an important innovation in deep learning?
Traditional neural networks use fixed matrices, similar to Excel tables, where the dimensions and parameters are predetermined. INNs, however, describe networks as continuous functions, offering much more flexibility. Think of it like a blanket with pins at different heights: the pins are discrete samples, and the blanket is the continuous wave they trace out.
What makes INNs exciting is their ability to dynamically “compress” or “expand” based on available resources, similar to how an analog signal is digitized into sound. You can shrink the network without sacrificing quality and, when needed, expand it back without retraining.
We tested this, and while traditional compression methods lead to significant quality loss, INNs maintain close-to-original quality even under extreme compression. The math behind it is more unconventional for the AI community, but the real value lies in its ability to deliver solid, practical results with minimal effort.
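A toy way to see the resampling idea (not the paper's implementation) is to treat a weight matrix as samples of a smooth two-dimensional function, which can then be re-sampled on a coarser grid to compress and on a finer grid to expand, without retraining:

    import torch
    import torch.nn.functional as F

    # Illustration only: resampling a weight grid as if it were a continuous function.
    weight = torch.randn(1, 1, 32, 32)                                                 # weights viewed as a grid
    small = F.interpolate(weight, size=(8, 8), mode="bilinear", align_corners=True)    # "compress"
    big = F.interpolate(small, size=(32, 32), mode="bilinear", align_corners=True)     # "expand" back
    print(small.shape, big.shape)

How well the expanded grid matches the original depends on how smooth the underlying function is, which is the property INNs enforce by construction.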
TheStage AI has worked on quantum annealing algorithms. How do you see quantum computing playing a role in AI optimization in the near future?
When it comes to quantum computing and its role in AI optimization, the key takeaway is that quantum systems offer a completely different approach to solving problems like optimization. While we didn't invent quantum annealing algorithms from scratch, companies like D-Wave provide Python libraries for building quantum algorithms specifically for discrete optimization tasks, which are a perfect fit for quantum computers.
The idea here is that we are not directly loading a neural network into a quantum computer; that's impossible with current architectures. Instead, we approximate how neural networks behave under different kinds of degradation, making the problem fit into a form that a quantum chip can process.
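As a toy encoding (not TheStage AI's actual formulation), D-Wave's dimod library can express a "keep or drop this parameter group" decision as a binary quadratic model, the kind of discrete problem a quantum annealer is built to minimize:

    import dimod  # D-Wave's Python library for binary quadratic / QUBO models

    # Linear terms: a negative bias means the group is cheap to drop (low quality impact).
    keep_cost = {"g0": -1.0, "g1": -0.2, "g2": -0.8}
    # Quadratic penalty: dropping these two coupled groups together hurts quality.
    coupling = {("g0", "g2"): 1.5}
    bqm = dimod.BinaryQuadraticModel(keep_cost, coupling, 0.0, dimod.BINARY)

    # On real hardware this would go to a quantum annealer; ExactSolver brute-forces it here.
    best = dimod.ExactSolver().sample(bqm).first
    print(best.sample, best.energy)   # {'g0': 1, 'g1': 1, 'g2': 0} at energy -1.2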
In the future, quantum systems could scale and optimize networks with a precision that traditional systems struggle to match. The advantage of quantum systems lies in their built-in parallelism, something classical systems can only simulate with additional resources. This means quantum computing could significantly speed up the optimization process, especially as we figure out how to model larger and more complex networks effectively.
The real potential comes in using quantum computing to solve massive, intricate optimization tasks and to break parameters down into smaller, more manageable groups. With technologies like quantum and optical computing, there are vast possibilities for optimizing AI that go far beyond what traditional computing can offer.
What's your long-term vision for TheStage AI? Where do you see inference optimization heading in the next 5-10 years?
In the long term, TheStage AI aims to become a global Model Hub where anyone can easily access an optimized neural network with the desired characteristics, whether for a smartphone or any other device. The goal is to offer a drag-and-drop experience, where users input their parameters and the system automatically generates the network. If the network doesn't already exist, it will be created automatically using ANNA.
Our goal is to make neural networks run directly on user devices, cutting costs by 20 to 30 times. In the future, this could almost eliminate those costs entirely, as the user's device would handle the computation rather than relying on cloud servers. This, combined with advancements in model compression and hardware acceleration, could make AI deployment significantly more efficient.
We also plan to integrate our technology with hardware solutions, such as sensors, chips, and robotics, for applications in fields like autonomous driving and robotics. For instance, we aim to build AI cameras capable of functioning in any environment, whether in space or under extreme conditions like darkness or dust. This would make AI usable in a wide range of applications and allow us to create custom solutions for specific hardware and use cases.