Introduction to ggml




ggml is a machine learning (ML) library written in C and C++ with a focus on Transformer inference. The project is open-source and is being actively developed by a growing community. ggml is similar to ML libraries such as PyTorch and TensorFlow, though it is still in its early stages of development and some of its fundamentals are changing rapidly.

Over time, ggml has gained popularity alongside other projects like llama.cpp and whisper.cpp. Many other projects also use ggml under the hood to enable on-device LLM inference, including ollama, jan, LM Studio, and GPT4All.

The main reasons people choose to use ggml over other libraries are:

  1. Minimalism: The core library is self-contained in fewer than 5 files. You may want to include additional files for GPU support, but it's optional.
  2. Easy compilation: You don't need fancy build tools. Without GPU support, you just need GCC or Clang!
  3. Lightweight: The compiled binary size is less than 1MB, which is tiny compared to PyTorch (which often takes hundreds of MB).
  4. Good compatibility: It supports many types of hardware, including x86_64, ARM, Apple Silicon, CUDA, etc.
  5. Support for quantized tensors: Tensors can be quantized to save memory (similar to JPEG compression) and, in certain cases, to improve performance.
  6. Extremely memory efficient: The overhead for storing tensors and performing computations is minimal.

However, ggml also comes with some disadvantages that you need to keep in mind when using it (this list may change in future versions of ggml):

  • Not all tensor operations are supported on all backends. For example, some may work on the CPU but won't work on CUDA.
  • Development with ggml may not be straightforward and may require deep knowledge of low-level programming.
  • The project is in active development, so breaking changes are expected.

In this article, we'll focus on the fundamentals of ggml for developers looking to get started with the library. We don't cover higher-level tasks such as LLM inference with llama.cpp, which builds upon ggml. Instead, we'll explore the core concepts and basic usage of ggml to provide a solid foundation for further learning and development.



Getting started

Great, so how do you begin?

For simplicity, this guide will show you how to compile ggml on Ubuntu. In reality, you can compile ggml on virtually any platform (including Windows, macOS, and BSD).



sudo apt install build-essential cmake git gdb


git clone https://github.com/ggerganov/ggml.git
cd ggml


cmake -B build
cmake --build build --config Release --target simple-ctx


./build/bin/simple-ctx

Expected output:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]

If you see the expected result, we're good to go!



Terminology and concepts

Before diving deep into ggml, we should understand some key concepts. If you're coming from high-level libraries like PyTorch or TensorFlow, these may seem difficult to grasp. However, keep in mind that ggml is a low-level library. Understanding these terms can give you much more control over performance (a short sketch after this list shows how the pieces fit together in code):

  • ggml_context: A “container” that holds objects such as tensors, graphs, and optionally data
  • ggml_cgraph: Represents a computational graph. Think of it as the “order of computation” that will be transferred to the backend.
  • ggml_backend: Represents an interface for executing computation graphs. There are many types of backends: CPU (default), CUDA, Metal (Apple Silicon), Vulkan, RPC, etc.
  • ggml_backend_buffer_type: Represents a buffer type. Think of it as a “memory allocator” connected to each ggml_backend. For example, if you want to perform calculations on a GPU, you need to allocate memory on the GPU via a buffer_type (usually abbreviated as buft).
  • ggml_backend_buffer: Represents a buffer allocated by a buffer_type. Remember: a buffer can hold the data of multiple tensors.
  • ggml_gallocr: Represents a graph memory allocator, used to efficiently allocate the tensors used in a computation graph.
  • ggml_backend_sched: A scheduler that enables concurrent use of multiple backends. It can distribute computations across different hardware (e.g., GPU and CPU) when dealing with large models or multiple GPUs. The scheduler can also automatically assign GPU-unsupported operations to the CPU, ensuring optimal resource utilization and compatibility.
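
To make these terms concrete, here is a condensed sketch (CPU backend only, no error handling, ggml_backend_sched omitted) that touches each of the other pieces. It reuses only API calls that appear in the full examples later in this article, so treat it as a preview rather than a reference; depending on your ggml version, ggml_backend_cpu_init() may be declared in ggml-backend.h or ggml-cpu.h.

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"
#include <stdio.h>

int main(void) {
    // ggml_backend: the device that will execute the graph (CPU here)
    ggml_backend_t backend = ggml_backend_cpu_init();

    // ggml_context: holds tensor metadata; no_alloc=true means no data yet
    struct ggml_init_params params = {
        /*.mem_size   =*/ 2 * ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);

    // ggml_backend_buffer: allocated via the backend's default buffer type (buft);
    // this is where the tensor data actually lives
    ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);
    float data[4] = { 1, 2, 3, 4 };
    ggml_backend_tensor_set(a, data, 0, ggml_nbytes(a));
    ggml_backend_tensor_set(b, data, 0, ggml_nbytes(b));

    // ggml_cgraph: the "order of computation", built in its own context here
    struct ggml_init_params gparams = {
        /*.mem_size   =*/ ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx_graph = ggml_init(gparams);
    struct ggml_cgraph  * gf  = ggml_new_graph(ctx_graph);
    struct ggml_tensor  * out = ggml_mul_mat(ctx_graph, a, b);
    ggml_build_forward_expand(gf, out);

    // ggml_gallocr: allocates backend memory for the graph's output/intermediate tensors
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    // run the graph on the backend and copy one value back to RAM
    ggml_backend_graph_compute(backend, gf);
    float out_data[4];
    ggml_backend_tensor_get(out, out_data, 0, ggml_nbytes(out));
    printf("out[0] = %.2f\n", out_data[0]);

    // each object is freed by its own API
    ggml_gallocr_free(allocr);
    ggml_free(ctx_graph);
    ggml_free(ctx);
    ggml_backend_buffer_free(buffer);
    ggml_backend_free(backend);
    return 0;
}

Don't worry if some of these calls are unfamiliar; the rest of this article walks through them step by step.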



Simple example

In this example, we'll go through the steps to replicate the code we ran in Getting started. We need to create 2 matrices, multiply them, and get the result. Using PyTorch, the code looks like this:

import torch


matrix1 = torch.tensor([
  [2, 8],
  [5, 1],
  [4, 2],
  [8, 6],
])
matrix2 = torch.tensor([
  [10, 5],
  [9, 9],
  [5, 4],
])


result = torch.matmul(matrix1, matrix2.T)
print(result.T)

With ggml, the following steps must be done to achieve the same result:

  1. Allocate ggml_context to store tensor data
  2. Create tensors and set data
  3. Create a ggml_cgraph for mul_mat operation
  4. Run the computation
  5. Retrieve results (output tensors)
  6. Free memory and exit

NOTE: In this example, we'll allocate the tensor data inside the ggml_context for simplicity. In practice, memory should be allocated as a device buffer, as we'll see in the next section.

To start, let's create a new directory examples/demo:

cd ggml

# create the new directory and files
mkdir -p examples/demo
touch examples/demo/demo.c
touch examples/demo/CMakeLists.txt

The code for this example is based on simple-ctx.cpp.

Edit examples/demo/demo.c with the content below:

#include "ggml.h"
#include "ggml-cpu.h"
#include 
#include 

int fundamental(void) {
    
    const int rows_A = 4, cols_A = 2;
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    const int rows_B = 3, cols_B = 2;
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    
    
    size_t ctx_size = 0;
    ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); 
    ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); 
    ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); 
    ctx_size += 3 * ggml_tensor_overhead(); 
    ctx_size += ggml_graph_overhead(); 
    ctx_size += 1024; 

    
    struct ggml_init_params params = {
         ctx_size,
         NULL,
         false,
    };
    struct ggml_context * ctx = ggml_init(params);

    
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);
    memcpy(tensor_a->data, matrix_A, ggml_nbytes(tensor_a));
    memcpy(tensor_b->data, matrix_B, ggml_nbytes(tensor_b));


    
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    
    
    
    struct ggml_tensor * result = ggml_mul_mat(ctx, tensor_a, tensor_b);

    
    ggml_build_forward_expand(gf, result);

    
    int n_threads = 1; 
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);

    
    float * result_data = (float *) result->data;
    printf("mul mat (%d x %d) (transposed result):n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1] ; j++) {
        if (j > 0) {
            printf("n");
        }

        for (int i = 0; i < result->ne[0] ; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]n");

    
    ggml_free(ctx);
    return 0;
}

Write these lines in the examples/demo/CMakeLists.txt file you created:

set(TEST_TARGET demo)
add_executable(${TEST_TARGET} demo.c)
target_link_libraries(${TEST_TARGET} PRIVATE ggml)

Edit examples/CMakeLists.txt and add this line at the end:

add_subdirectory(demo)

Compile and run it:

cmake -B build
cmake --build build --config Release --target demo


./build/bin/demo

Expected result:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]



Example with a backend

“Backend” in ggml refers to an interface that can handle tensor operations. A backend can be CPU, CUDA, Vulkan, etc.

The backend abstracts the execution of computation graphs. Once defined, a graph can be computed on the available hardware by using the respective backend implementation. Note that ggml automatically reserves memory for any intermediate tensors needed for the computation and optimizes memory usage based on the lifetime of these intermediate results.

When doing a computation or inference with a backend, the common steps are:

  1. Initialize ggml_backend
  2. Allocate ggml_context to store tensor metadata (we don't need to allocate tensor data right away)
  3. Create tensor metadata (only their shapes and data types)
  4. Allocate a ggml_backend_buffer to store all tensors
  5. Copy tensor data from fundamental memory (RAM) to backend buffer
  6. Create a ggml_cgraph for mul_mat operation
  7. Create a ggml_gallocr for cgraph allocation
  8. Optionally: schedule the cgraph using ggml_backend_sched (skipped in this example; a short sketch appears after the expected result below)
  9. Run the computation
  10. Retrieve results (output tensors)
  11. Free memory and exit

The code for this example is based on simple-backend.cpp:

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

#include 
#include 
#include 

int fundamental(void) {
    
    const int rows_A = 4, cols_A = 2;
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    const int rows_B = 3, cols_B = 2;
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    
    ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
    fprintf(stderr, "%s: using CUDA backendn", __func__);
    backend = ggml_backend_cuda_init(0); 
    if (!backend) {
        fprintf(stderr, "%s: ggml_backend_cuda_init() failedn", __func__);
    }
#endif
    
    if (!backend) {
        backend = ggml_backend_cpu_init();
    }

    
    size_t ctx_size = 0;
    ctx_size += 2 * ggml_tensor_overhead(); 
    

    
    struct ggml_init_params params = {
         ctx_size,
         NULL,
         true, 
    };
    struct ggml_context * ctx = ggml_init(params);

    
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);

    
    ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);

    
    ggml_backend_tensor_set(tensor_a, matrix_A, 0, ggml_nbytes(tensor_a));
    ggml_backend_tensor_set(tensor_b, matrix_B, 0, ggml_nbytes(tensor_b));

    
    struct ggml_cgraph * gf = NULL;
    struct ggml_context * ctx_cgraph = NULL;
    {
        
        struct ggml_init_params params0 = {
             ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
             NULL,
             true, 
        };
        ctx_cgraph = ggml_init(params0);
        gf = ggml_new_graph(ctx_cgraph);

        
        
        
        struct ggml_tensor * result0 = ggml_mul_mat(ctx_cgraph, tensor_a, tensor_b);

        
        ggml_build_forward_expand(gf, result0);
    }

    
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    

    
    int n_threads = 1; 
    if (ggml_backend_is_cpu(backend)) {
        ggml_backend_cpu_set_n_threads(backend, n_threads);
    }
    ggml_backend_graph_compute(backend, gf);

    
    
    struct ggml_tensor * result = ggml_graph_node(gf, -1);
    float * result_data = malloc(ggml_nbytes(result));
    
    ggml_backend_tensor_get(result, result_data, 0, ggml_nbytes(result));
    printf("mul mat (%d x %d) (transposed result):n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1] ; j++) {
        if (j > 0) {
            printf("n");
        }

        for (int i = 0; i < result->ne[0] ; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]n");
    free(result_data);

    
    ggml_free(ctx_cgraph);
    ggml_gallocr_free(allocr);
    ggml_free(ctx);
    ggml_backend_buffer_free(buffer);
    ggml_backend_free(backend);
    return 0;
}

Compile and run it; you should get the same result as the last example:

cmake -B build
cmake --build build --config Release --target demo


./build/bin/demo

Expected result:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]
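
Step 8 (ggml_backend_sched) was skipped above because a single backend doesn't need a scheduler. For completeness, here is a rough sketch of what it could look like, assuming the backend and gf variables from the example; the exact signature of ggml_backend_sched_new() has changed across ggml versions, so check ggml-backend.h for your version. When a scheduler is used, it typically takes over the graph allocation done by ggml_gallocr in step 7.

// Sketch only, not part of the example above; verify the signatures against your ggml version.
ggml_backend_t backends[1] = { backend };    // with multiple backends, e.g. { cuda_backend, cpu_backend }
ggml_backend_sched_t sched = ggml_backend_sched_new(
    backends,
    NULL,                     // buffer types: NULL = use each backend's default buft
    1,                        // number of backends
    GGML_DEFAULT_GRAPH_SIZE,  // maximum graph size
    false);                   // whether to run backends in parallel
ggml_backend_sched_graph_compute(sched, gf); // instead of ggml_backend_graph_compute()
ggml_backend_sched_free(sched);

The scheduler pays off when more than one backend is available: it splits the graph across them and automatically falls back to the CPU for operations a GPU backend doesn't support.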



Printing the computational graph

The ggml_cgraph represents the computational graph, which defines the order of operations that will be executed by the backend. Printing the graph can be a useful debugging tool, especially when working with more complex models and computations.

You can add ggml_graph_print to print the cgraph:

...

// Mark the "result0" tensor to be computed
ggml_build_forward_expand(gf, result0);

// Print the cgraph
ggml_graph_print(gf);

Run it:

=== GRAPH ===
n_nodes = 1
 -   0: [     4,     3,     1]          MUL_MAT  
n_leafs = 2
 -   0: [     2,     4]     NONE           leaf_0
 -   1: [     2,     3]     NONE           leaf_1
========================================

Additionally, you can dump the cgraph in Graphviz dot format:

ggml_graph_dump_dot(gf, NULL, "debug.dot");

You can use the dot command or an online Graphviz viewer to render debug.dot into a final image, like the one shown below.
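
For example, assuming Graphviz is installed (sudo apt install graphviz), a command like this renders the file to SVG (any output format supported by dot works):

dot -Tsvg debug.dot -o debug.svg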

(Image: the rendered debug.dot graph for this example)



Conclusion

This article has provided an introductory overview of ggml, covering the key concepts, a simple usage example, and an example using a backend. While we've covered the basics, there is much more to explore when it comes to ggml.

In upcoming articles, we'll dive deeper into other ggml-related subjects, such as the GGUF format, quantization, and how the different backends are organized and utilized. Additionally, you can visit the ggml examples directory to see more advanced use cases and sample code. Stay tuned for more ggml content in the future!


