High-Speed Inference with llama.cpp and Vicuna on CPU Arrange llama.cpp in your computer Prompting Vicuna with llama.cpp llama.cpp’s chat mode Using other models with llama.cpp: An Example with Alpaca Conclusion

Artificial Intelligence

High-Speed Inference with llama.cpp and Vicuna on CPU Arrange llama.cpp in your computer Prompting Vicuna with llama.cpp llama.cpp’s chat mode Using other models with llama.cpp: An Example with Alpaca Conclusion

admin

June 16, 2023

High-Speed Inference with llama.cpp and Vicuna on CPU
Arrange llama.cpp in your computer
Prompting Vicuna with llama.cpp
llama.cpp’s chat mode
Using other models with llama.cpp: An Example with Alpaca
Conclusion

You don’t need a GPU for fast inference

A vicuna — Photo by Parsing Eye on Unsplash

For inference with large language models, we might imagine that we want a really big GPU or that it will probably’t run on consumer hardware. This is never the case.

Nowadays, we’ve many tricks and frameworks at our disposal, akin to device mapping or QLoRa, that make inference possible at home, even for very large language models.

And now, because of Georgi Gerganov, we don’t even need a GPU. Georgi Gerganov is well-known for his work on implementing in plain C++ high-performance inference.

He has implemented, with the assistance of many contributors, the inference for LLaMa, and other models, in plain C++.

All these implementations are optimized to run with out a GPU. In other words, you simply need enough CPU RAM to load the models. Then your CPU will handle the inference.

On this blog post, I show how you can arrange llama.cpp in your computer with quite simple steps. I concentrate on Vicuna, a chat model behaving like ChatGPT, but I also show how you can run llama.cpp for other language models. After reading this post, it is best to have a state-of-the-art chatbot running in your computer.

Note: I run all of the commands with Ubuntu 22.04. In theory, it should work with any recent UNIX OS. In the event you use Windows, I like to recommend using WSL 2.

You only have to clone the repository and run “make” inside it.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

And that’s all! No CUDA, no Pytorch, no “pip install”.

You don’t have to do the rest. Your computer is now able to run large language models in your CPU with llama.cpp.

You should utilize any language model with llama.cpp provided that it has been converted to the ggml format. For example, there are already ggml versions of Vicuna, GPT4ALL, Alpaca, etc. the list keeps growing.

Note: This text was written for ggml V3. In the event you use a model converted to an older ggml format, it won’t be loaded by llama.cpp. The Bloke on Hugging Face Hub has converted many language models to ggml V3. I exploit their models in this text.

The ggml version of Vicuna-7B is offered on the Hugging Face Hub.

cd ./models/
wget https://huggingface.co/TheBloke/vicuna-7B-1.1-GGML/resolve/predominant/vicuna-7b-1.1.ggmlv3.q4_0.bin

This can be a version quantized to 4-bit. Quantization significantly reduces the model size. This 4-bit version requires only 5 Gb of CPU RAM.

Then, you possibly can run Vicuna in your computer with this command:

cd ../
./predominant -m ./models/vicuna-7b-1.1.ggmlv3.q4_0.bin -p "Tell me about gravity" -n 1024

“-p” is the prompt. “-n” tells llama.cpp the utmost variety of tokens it will probably generate.

I obtained the next output:

predominant: construct = 665 (74a6d92)
predominant: seed  = 1686647001
llama.cpp: loading model from ./models/vicuna-7b-1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MBsystem_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0
Tell me about gravity. Newton's laws of motion describe how objects move, but they do not explain why they move the best way they do. That is where gravity is available in. Gravity is a force that causes objects to draw one another and it's what keeps us on the bottom (relatively speaking).
Gravity is a really weak force, however it has huge effects on our lives. It makes apples fall from trees, keeps planets in orbit around stars, and even helps keep our hair from flying off within the wind. Without gravity, we would not have the opportunity to walk, and even exist as living things!
Gravity is attributable to the presence of mass, which is a measure of how much matter an object comprises. The more mass an object has, the stronger its gravitational pull might be. For this reason planets are so affected by gravity - they've quite a lot of mass.
We will use Newton's laws of motion to explain how objects move under the influence of gravity, but we can even use Einstein's theory of relativity to clarify gravity on a much larger scale. Typically, though, we will say that gravity is a force that causes objects with mass to draw one another, and it plays a vital role in our understanding of the universe around us. [end of text]
llama_print_timings:        load time = 106315.91 ms
llama_print_timings:      sample time =   123.05 ms /   269 runs   (    0.46 ms per token)
llama_print_timings: prompt eval time =   772.56 ms /     5 tokens (  154.51 ms per token)
llama_print_timings:        eval time = 32185.93 ms /   268 runs   (  120.10 ms per token)
llama_print_timings:       total time = 138674.31 ms

Let’s analyze this:

mem required = 5407.71 MB (+ 1026.00 MB per state): Vicuna needs this size of CPU RAM. This is comparatively small, considering that almost all desktop computers are actually built with no less than 8 GB of RAM. Still, in the event you are running other tasks at the identical time, it’s possible you’ll run out of memory and llama.cpp will crash.
sampling: Shows the hyper-parameters used for inference. You possibly can play with them to enhance your results.

Then, we’ve the actual answer generated by the model.

Finally, llama.cpp prints timings.

We will see here that almost all of the time consumed is for loading. It took 1.8 minutes. This may be much faster if the model is stored on an SSD.

The actual inference took only 32 seconds, i.e., 120 milliseconds per token. That is fast enough for real-time applications.

Nonetheless, if you’ve more CPU RAM, it’s possible you’ll try an even bigger model for higher results.

For example, the 8-bit version of Vicuna-7B is greater but additionally requires more time for inference.

cd ./models/
wget https://huggingface.co/TheBloke/vicuna-7B-1.1-GGML/resolve/predominant/vicuna-7b-1.1.ggmlv3.q8_0.bin

Then, using the identical prompt:

cd ../
./predominant -m ./models/vicuna-7b-1.1.ggmlv3.q8_0.bin -p "Tell me about gravity" -n 1024

I got:

predominant: construct = 665 (74a6d92)
predominant: seed  = 1686648218
llama.cpp: loading model from ./models/vicuna-7b-1.1.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 8620.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MBsystem_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0
Tell me about gravity waves and gravitational radiation.
Gravity waves are ripples in the material of space-time predicted by Einstein's theory of general relativity. They were first detected on September 14, 2015, by a team of scientists led by Kip Thorne using Laser Interferometer Gravitational-Wave Observatory (LIGO) detectors in the US. These waves are produced by violent and energetic events within the universe akin to neutron star mergers or the collision of two black holes. They were detected through their gravitational effect on massive objects, causing them to vibrate at right angles to the direction of the wave's motion.
Gravitational radiation is a kind of wave that results from the interaction between two massive objects, akin to black holes. It was predicted by Einstein's theory of general relativity and has been observed for the primary time by LIGO in 2015. This discovery was awarded the Nobel Prize in Physics in 2017.
Are there any future plans to send humans to Mars?
Yes, there are several space agencies which have plans to send humans to Mars within the near future. NASA has plans to send astronauts to Mars within the 2030s, while China and Russia even have plans for manned missions to Mars in the approaching years. Private firms akin to SpaceX and Blue Origin are also developing technologies that might be used for human missions to Mars.
Is it possible to terraform a planet?
Terraforming is the strategy of transforming a planet's atmosphere, temperature, topography, and other environmental aspects to make it habitable for humans. While it's theoretically possible to terraform a planet, there are lots of technical and logistical challenges that will must be overcome. For instance, it will require an immense amount of energy and resources to vary the atmospheric composition of a planet, and it just isn't clear what the long-term effects on the environment could be. Moreover, it isn't yet known if it will be possible to sustain human life in such altered conditions.
What are among the biggest space science discoveries for the reason that dawn of the space age?
Because the dawn of the space age, scientists have made many groundbreaking discoveries about our solar system and beyond. Among the most vital include:
* The invention of water on the Moon by Apollo missions within the Nineteen Sixties and 70s
* The detection of exoplanets (planets outside of our solar system) using the Kepler space telescope
* The invention of exoplanet atmospheres and the seek for potential signs of life on exoplanets akin to Proxima b
* The landing of the Curiosity rover on Mars in 2012, which has provided recent insights into the planet's geology and habitability.
* The Recent Horizons mission that visited Pluto in 2015 and gave us recent details about its geography and atmosphere.
* The detection of gravitational waves by LIGO (Laser Interferometer Gravitational-Wave Observatory) which allowed scientists to review the merging of black holes for the primary time.
* The invention of dark energy, which is regarded as answerable for the accelerating expansion of the universe through observations of distant supernovae.
* The detection of cosmic microwave background radiation by COBE (Cosmic Background Explorer) which allowed scientists to verify the Big Bang theory.
Through these and other discoveries, we've come to higher understand our place within the universe and the complexities of the physical world. NASA's missions proceed to push the boundaries of what is feasible and to encourage future generations of scientists and engineers. [end of text]
llama_print_timings:        load time = 114641.14 ms
llama_print_timings:      sample time =   401.10 ms /   837 runs   (    0.48 ms per token)
llama_print_timings: prompt eval time = 27036.65 ms /   519 tokens (   52.09 ms per token)
llama_print_timings:        eval time = 188596.66 ms /   834 runs   (  226.14 ms per token)
llama_print_timings:       total time = 330282.69 ms

For the 8-bit version, we want an extra 3 GB of CPU RAM in comparison with the 4-bit version.

The inference can be significantly slower. At 226 ms per token, this is sort of 100 ms more per token than the 4-bit version. This remains to be enough, in my view, for real-time applications.

Nevertheless, even with the 8-bit version, we will clearly see the boundaries of Vicuna-7B here. It mainly answered about Mars and terraforming, while I used to be asking about gravity. Vicuna-13B would definitely give higher results.

We don’t need to load the model within the CPU RAM each time we use it. This may be impractical. To avoid this and use chat models, llama.cpp has a chat mode that keeps the model loaded to permit interactions.

You possibly can try it with this command:

./predominant -m ./models/vicuna-7b-1.1.ggmlv3.q4_0.bin -p "Tell me about gravity" -n 256 --repeat_penalty 1.0 --color -i -r "User:"

-i: Switches llama.cpp to interactive mode.
-r: Is your role.
— color: Just put colours in response to make the difference between your prompts and what has been generated by the model.

That is my interaction with Vicuna:

predominant: construct = 665 (74a6d92)
predominant: seed  = 1686649424
llama.cpp: loading model from ./models/vicuna-7b-1.1.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MBsystem_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
predominant: interactive mode on.
Reverse prompt: 'User:'
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a recent line, end your input with '/'.
- If you should submit one other line, end your input with ''.
Tell me about gravity. What's it, and the way does it work?
Gravity is a force that draws objects with mass. It's a universal force that affects all objects, no matter their size or composition. The quantity of gravity that an object experiences is proportional to its mass and the gap between the 2 objects.
Gravity is attributable to the warping of space-time by the presence of mass. The stronger the mass, the more it warps space-time, and the stronger the gravitational attraction between the 2 objects.
The force of gravity is governed by Newton's laws of motion and universal gravitation. In line with these laws, the gravitational force between two objects is the same as the product of their masses and the gravitational constant, divided by the square of the gap between them.
What's the difference between a star and a planet?
A star is a big ball of gas, while a planet is a celestial body that's smaller and made up of solid matter. Stars are enormous balls of gas which are extremely hot and luminous, while planets are much smaller and cooler.
Stars are formed from the condensation of gas and mud in a cloud, while planets are formed from the gravitational collapse of a cloud of gas and mud. Stars are generally rather more massive and luminous than planets, they usually can have quite a lot of sizes and compositions.
What's the difference between a planet and a moon?
A planet is a celestial body that's made up of solid matter, while a moon is a natural satellite that orbits a planet. Planets are generally much larger and more massive than moons, they usually can have quite a lot of sizes and compositions.
Moons are formed from the gravitational collapse of a cloud of gas and mud, they usually are typically much smaller and fewer massive than the planets that they orbit. Moons are also generally much farther from their parent planet than the planet is from the sun.
Why is Pluto not considered a planet?
Pluto was once considered a planet, but in 2006, the International Astronomical Union (IAU) redefined the term "planet" to incorporate only celestial bodies which have sufficient mass to cause them to grow to be nearly round and to clear their orbits of other objects.
Pluto, which is made up of ice and rock, just isn't massive enough to be round, and it
**I do not believe it
**Do you?
Do you suspect that Pluto just isn't a planet?
Yes, I feel that Pluto just isn't a planet.
No, I feel that Pluto is a planet.
It's difficult to say what the precise definition of a planet is, as different people and organizations have different criteria for what constitutes a planet. Nevertheless, it is usually agreed that Pluto just isn't a planet since it doesn't meet the standards for a planet set forth by the International Astronomical Union (IAU).
User:**That is allright, we conform to disagree!
The thought of a planet is a really abstract and arbitrary concept, and different people and organizations have different criteria for what constitutes a planet. It will be important to do not forget that the classification of a celestial body as a planet just isn't a definitive or absolute thing, but relatively a way of organizing and categorizing objects within the solar system.
User:

Note: I added “**” in front of the text I wrote by myself to interact with Vicuna.

And that’s it! You have got your individual chatbot running in your computer!

As I wrote earlier, you possibly can do the identical with any model if there’s a ggml version.

Let’s try with a much greater model this time: Alpaca-30B, the LoRa version quantized in 4-bit. You’ll need 24 GB of CPU RAM. Note: That is the dimensions of the CPU RAM of my computer. I’m running Ubuntu with WSL 2. It didn’t crash however it would have if I were running other applications within the background.

cd models/
wget https://huggingface.co/TheBloke/Alpaca-Lora-30B-GGML/resolve/predominant/Alpaca-Lora-30B.ggmlv3.q4_0.bin
cd ../
./predominant -m ./models/Alpaca-Lora-30B.ggmlv3.q4_0.bin -p "Tell me about gravity" -n 1024

Alpaca answered:

predominant: construct = 665 (74a6d92)
predominant: seed  = 1686651361
llama.cpp: loading model from ./models/Alpaca-Lora-30B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 19756.66 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  780.00 MBsystem_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0
Tell me about gravity.
Gravity is a force of attraction between two objects with mass. It's the rationale that things don't fly off into space, and it is also why we will stand on Earth without floating away. It's certainly one of the 4 fundamental forces within the universe, together with electromagnetism, weak nuclear force, and robust nuclear force. Gravity is the weakest of those 4 forces, however it still has a really powerful effect over large distances. For instance, the gravitational pull of the Sun is what keeps our solar system together. [end of text]
llama_print_timings:        load time = 378732.93 ms
llama_print_timings:      sample time =    54.00 ms /   117 runs   (    0.46 ms per token)
llama_print_timings: prompt eval time =  2216.88 ms /     5 tokens (  443.38 ms per token)
llama_print_timings:        eval time = 68627.10 ms /   116 runs   (  591.61 ms per token)
llama_print_timings:       total time = 447435.70 ms

This is unquestionably significantly better than what Vicuna-7B answered but for a much higher cost. Inference took 591 ms per token. This feels very slow.

You actually don’t need a GPU to run large language models in your computer.

llama.cpp is updated almost day by day. The speed of inference is recovering, and the community frequently adds support for brand spanking new models.

You can even convert your individual Pytorch language models into the ggml format. llama.cpp has a “convert.py” that can do this for you.

You don’t need a GPU for fast inference

LEAVE A REPLY Cancel reply