
Running Local LLMs and VLMs on the Raspberry Pi


Get models like Phi-2, Mistral, and LLaVA running locally on a Raspberry Pi with Ollama

Host LLMs and VLMs using Ollama on the Raspberry Pi — Source: Creator

Ever considered running your own large language models (LLMs) or vision language models (VLMs) on your own device? You probably have, but the thought of setting things up from scratch, having to manage the environment, downloading the right model weights, and the lingering doubt of whether your device can even handle the model has probably given you some pause.

Let’s go one step further than that. Imagine running your own LLM or VLM on a device no larger than a credit card: a Raspberry Pi. Impossible? Not at all. I mean, I’m writing this post after all, so it definitely is possible.

Possible, yes. But why would you even do it?

LLMs at the edge still seem quite far-fetched at this point in time. But this particular niche use case should mature over time, and we will certainly see some cool edge solutions being deployed with an all-local generative AI solution running on-device at the edge.

It’s also about pushing the limits to see what’s possible. If it can be done at this extreme end of the compute scale, then it can be done at any level in between a Raspberry Pi and a big, powerful server GPU.

Traditionally, edge AI has been closely linked with computer vision. Exploring the deployment of LLMs and VLMs at the edge adds an exciting dimension to this field that is just emerging.

Most importantly, I just wanted to do something fun with my recently acquired Raspberry Pi 5.

So, how do we achieve all this on a Raspberry Pi? Using Ollama!

What’s Ollama?

Ollama has emerged as one of the best solutions for running local LLMs on your own personal computer without having to deal with the hassle of setting things up from scratch. With just a few commands, everything can be set up without any issues. Everything is self-contained and has worked wonderfully in my experience across several devices and models. It even exposes a REST API for model inference, so you can leave it running on the Raspberry Pi and call it from your other applications and devices if you want to.

Ollama’s Website
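To give you an idea of what that REST API looks like, here is a minimal sketch of a completion request with curl. It assumes Ollama is serving on its default port 11434 and that the phi model, which we pull later in this post, has already been downloaded.

# One-shot completion request against the local Ollama server
# (11434 is Ollama's default port; "phi" must already be pulled)
curl http://localhost:11434/api/generate -d '{
  "model": "phi",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Setting "stream" to false returns a single JSON response instead of a stream of partial tokens, which is easier to handle in simple scripts.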

There’s also Ollama Web UI, a beautiful piece of AI UI/UX that runs seamlessly with Ollama for those apprehensive about command-line interfaces. It’s essentially a local ChatGPT interface, if you will.

Together, these two pieces of open-source software provide what I feel is the best locally hosted LLM experience right now.

Both Ollama and Ollama Web UI support VLMs like LLaVA too, which opens up even more doors for this edge generative AI use case.

Technical Requirements

All you need is the following:

  • Raspberry Pi 5 (or 4 for a slower setup): go for the 8GB RAM variant to fit the 7B models.
  • SD card: minimally 16GB; the larger the size, the more models you can fit. Have it already loaded with an appropriate OS such as Raspbian Bookworm or Ubuntu.
  • An internet connection

Like I mentioned earlier, running Ollama on a Raspberry Pi is already near the extreme end of the hardware spectrum. Essentially, any device more powerful than a Raspberry Pi, provided it runs a Linux distribution and has a similar memory capacity, should theoretically be capable of running Ollama and the models discussed in this post.
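If you are unsure whether a device qualifies, a quick sanity check of its memory, CPU architecture, and OS only takes a few commands (a minimal sketch; the exact output will vary by device):

# Check available RAM, CPU architecture, and OS release
free -h
uname -m
cat /etc/os-release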

1. Installing Ollama

To install Ollama on a Raspberry Pi, we’ll avoid using Docker to conserve resources.

In the terminal, run

curl https://ollama.ai/install.sh | sh

You should see something similar to the image below after running the command above.

Source: Creator

Like the output says, go to 0.0.0.0:11434 to verify that Ollama is running. It’s normal to see ‘WARNING: No NVIDIA GPU detected. Ollama will run in CPU-only mode.’ since we’re using a Raspberry Pi. But if you’re following these instructions on something that’s supposed to have an NVIDIA GPU, something didn’t go right.
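If you prefer checking from the terminal rather than the browser, the same address can be queried with curl; the server should answer with a short status message (assuming the default port has not been changed):

# Verify the Ollama server is up; it should respond with "Ollama is running"
curl http://0.0.0.0:11434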

For any issues or updates, refer to the Ollama GitHub repository.

2. Running LLMs through the command line

Take a look at the official Ollama model library for a list of models that can be run using Ollama. On an 8GB Raspberry Pi, models larger than 7B won’t fit. Let’s use Phi-2, a 2.7B LLM from Microsoft, now under the MIT license.

We’ll use the default Phi-2 model, but feel free to use any of the other tags found here. Take a look at the model page for Phi-2 to see how you can interact with it.

In the terminal, run

ollama run phi

Once you see something similar to the output below, you already have an LLM running on the Raspberry Pi! It’s that simple.

Source: Creator
Here’s an interaction with Phi-2 2.7B. Obviously, you won’t get the same output, but you get the idea. | Source: Creator

You can try other models like Mistral, Llama-2, etc.; just make sure there’s enough space on the SD card for the model weights.
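Pulling an extra model, listing what is already on the device, and checking the remaining SD card space can all be done from the terminal (a minimal sketch, with mistral used as the example tag):

# Download another model from the Ollama library
ollama pull mistral

# List the models already downloaded to this device
ollama list

# Check how much space is left on the SD card
df -h /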

Naturally, the larger the model, the slower the output will be. On Phi-2 2.7B, I get around 4 tokens per second. But with Mistral 7B, the generation speed goes down to around 2 tokens per second. A token is roughly equivalent to a single word.

Here’s an interaction with Mistral 7B | Source: Creator
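If you want to measure the generation speed on your own setup, recent versions of Ollama can print timing statistics after each response when run with the --verbose flag; treat this as a sketch and check ollama run --help on your install:

# Run Phi-2 and print timing statistics after each response;
# the "eval rate" line is the tokens-per-second figure
ollama run phi --verbose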

Now we have LLMs running on the Raspberry Pi, but we are not done yet. The terminal isn’t for everyone. Let’s get Ollama Web UI running as well!

3. Installing and Running Ollama Web UI

We can follow the instructions on the official Ollama Web UI GitHub repository to install it without Docker. It recommends Node.js version >= 20.10, so we will follow that. It also recommends Python to be at least 3.11, but Raspbian OS already has that installed for us.

We have to install Node.js first. In the terminal, run

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - &&
sudo apt-get install -y nodejs

For future readers, change the 20.x to a more appropriate version if need be.
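Once the install finishes, it is worth confirming that the versions meet the Web UI’s recommendations before going further:

# Confirm Node.js and Python meet the recommended versions
node -v             # should report v20.x or newer
npm -v
python3 --version   # should report 3.11 or newer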

Then run the code block below.

git clone https://github.com/ollama-webui/ollama-webui.git
cd ollama-webui/

# Copying required .env file
cp -RPp example.env .env

# Building Frontend Using Node
npm i
npm run build

# Serving Frontend with the Backend
cd ./backend
pip install -r requirements.txt --break-system-packages
sh start.sh

It’s a slight modification of what’s provided on GitHub. Do note that for simplicity and brevity we are not following best practices like using virtual environments, and we’re using the --break-system-packages flag. If you encounter an error like uvicorn not being found, restart the terminal session. If you would rather keep the Pi’s system Python untouched, a virtual-environment variant is sketched below.
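Here is that variant: the same backend steps run inside a virtual environment instead of relying on --break-system-packages (a minimal sketch, assuming you are still inside the ollama-webui directory):

cd ./backend

# Create and activate an isolated Python environment
python3 -m venv venv
source venv/bin/activate

# Install the backend dependencies inside the venv
pip install -r requirements.txt

# Serving Frontend with the Backend
sh start.sh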

If all goes well, you should be able to access Ollama Web UI on port 8080 through http://0.0.0.0:8080 on the Raspberry Pi, or through http://<Raspberry Pi's IP address>:8080/ if you are accessing it from another device on the same network.

If you see this, yes, it worked | Source: Creator

Once you’ve created an account and logged in, you should see something similar to the image below.

Source: Creator

If you downloaded some model weights earlier, you should see them in the dropdown menu like below. If not, you can go to the settings to download a model.

Available models will appear here | Source: Creator
If you want to download new models, go to Settings > Models to pull models | Source: Creator

The whole interface is very clean and intuitive, so I won’t explain much about it. It’s truly a very well-done open-source project.

Here’s an interaction with Mistral 7B through Ollama Web UI | Source: Creator

4. Running VLMs through Ollama Web UI

Like I mentioned at the start of this article, we can also run VLMs. Let’s run LLaVA, a popular open-source VLM that also happens to be supported by Ollama. To do so, download the weights by pulling ‘llava’ through the interface.
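If you prefer the terminal, the same weights can be pulled with a single command instead of through the interface (this grabs the default llava tag; other tags are listed on the model page):

# Download the default LLaVA weights from the Ollama library
ollama pull llava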

Unfortunately, unlike LLMs, it takes quite a while for the setup to interpret the image on the Raspberry Pi. The example below took around 6 minutes to be processed. The bulk of the time is probably because the image side of things is not properly optimised yet, but this will definitely change in the future. The token generation speed is around 2 tokens/second.

Query Image Source: Pexels
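For completeness, image prompts can also be sent through Ollama’s REST API by base64-encoding the image, which is handy if you want to script this instead of using the Web UI. The file name test.jpg below is just a placeholder for whatever image you want to query.

# Base64-encode a local image and ask LLaVA to describe it
curl http://localhost:11434/api/generate -d "{
  \"model\": \"llava\",
  \"prompt\": \"Describe this image.\",
  \"images\": [\"$(base64 -w 0 test.jpg)\"],
  \"stream\": false
}"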

To wrap it all up

At this point we’re just about done with the goals of this article. To recap, we’ve managed to use Ollama and Ollama Web UI to run LLMs and VLMs like Phi-2, Mistral, and LLaVA on the Raspberry Pi.

I can definitely imagine quite a few use cases for locally hosted LLMs running on the Raspberry Pi (or another small edge device), especially since 4 tokens/second does seem like an acceptable speed with streaming for some use cases if we’re going for models around the size of Phi-2.

The field of ‘small’ LLMs and VLMs, somewhat paradoxically named given their ‘large’ designation, is an active area of research with quite a few model releases recently. Hopefully this emerging trend continues, and more efficient and compact models continue to be released! Definitely something to keep an eye on in the coming months.

Disclaimer: I have no affiliation with Ollama or Ollama Web UI. All views and opinions are my own and do not represent any organisation.
