|
# Simple autogenerated Python bindings for ggml |
|
|
|
This folder contains: |
|
|
|
- Scripts to generate full Python bindings from ggml headers (+ stubs for autocompletion in IDEs) |
|
- Some barebones utils (see [ggml/utils.py](./ggml/utils.py)): |
|
- `ggml.utils.init` builds a context that's freed automatically when the pointer gets GC'd |
|
- `ggml.utils.copy` **copies between same-shaped tensors (numpy or ggml), w/ automatic (de/re)quantization** |
|
- `ggml.utils.numpy` returns a numpy view over a ggml tensor; if it's quantized, it returns a copy instead (requires `allow_copy=True`). See the follow-on sketch after the example below.
|
- Very basic examples (anyone want to port [llama2.c](https://github.com/karpathy/llama2.c)?)
|
|
|
Provided you set `GGML_LIBRARY=.../path/to/libggml_shared.so` (see instructions below), it's trivial to do some operations on quantized tensors: |
|
|
|
```python
# Make sure libllama.so is in your [DY]LD_LIBRARY_PATH, or set GGML_LIBRARY=.../libggml_shared.so

from ggml import lib, ffi
from ggml.utils import init, copy, numpy
import numpy as np

ctx = init(mem_size=12*1024*1024)
n = 256
n_threads = 4

a = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_Q5_K, n)
b = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_F32, n) # Can't both be quantized
sum = lib.ggml_add(ctx, a, b) # All zeroes for now. Will be quantized too!

gf = ffi.new('struct ggml_cgraph*')
lib.ggml_build_forward_expand(gf, sum)

copy(np.array([i for i in range(n)], np.float32), a)
copy(np.array([i*100 for i in range(n)], np.float32), b)

lib.ggml_graph_compute_with_ctx(ctx, gf, n_threads)

print(numpy(a, allow_copy=True))
# 0. 1.0439453 2.0878906 3.131836 4.1757812 5.2197266 ...
print(numpy(b))
# 0. 100. 200. 300. 400. 500. ...
print(numpy(sum, allow_copy=True))
# 0. 105.4375 210.875 316.3125 421.75 527.1875 ...
```
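
To see the transparent (de/re)quantization at work, here's a small follow-on sketch. It assumes the snippet above has just run, and that `ggml.utils.numpy` returns a writable zero-copy view for non-quantized tensors, as described in the list above:

```python
# Follow-on sketch, continuing from the snippet above.
orig = np.arange(n, dtype=np.float32)      # what we copied into `a`
dequantized = numpy(a, allow_copy=True)    # Q5_K data, dequantized into a fresh array
print(np.max(np.abs(dequantized - orig)))  # non-zero: Q5_K quantization is lossy

# For non-quantized tensors, `numpy` should return a zero-copy view,
# so writing through the array updates the tensor's data in place.
view = numpy(b)
view[0] = 42.0
print(numpy(b)[0])  # 42.0
```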
|
|
|
### Prerequisites |
|
|
|
You'll need a shared library of ggml to use the bindings. |
|
|
|
#### Build libggml_shared.so or libllama.so |
|
|
|
As of this writing, the simplest option is to use [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)'s generated `libggml_shared.so` or `libllama.so`, which you can build as follows:
|
|
|
```bash
git clone https://github.com/ggerganov/llama.cpp
# On a CUDA-enabled system, add -DLLAMA_CUBLAS=1
# On a Mac, add -DLLAMA_METAL=1
cmake llama.cpp \
  -B llama_build \
  -DCMAKE_C_FLAGS=-Ofast \
  -DLLAMA_NATIVE=1 \
  -DLLAMA_LTO=1 \
  -DBUILD_SHARED_LIBS=1 \
  -DLLAMA_MPI=1 \
  -DLLAMA_BUILD_TESTS=0 \
  -DLLAMA_BUILD_EXAMPLES=0
( cd llama_build && make -j )

# On a Mac, this will be libggml_shared.dylib instead
export GGML_LIBRARY=$PWD/llama_build/libggml_shared.so
# Alternatively, you can just copy it to your system's lib dir, e.g. /usr/local/lib
```
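
To sanity-check the build, a quick (hypothetical) smoke test is to load the library through the bindings and call a trivial function; `ggml_type_name` is part of ggml's public C API:

```python
# Hypothetical smoke test: run with GGML_LIBRARY pointing at the freshly built library.
from ggml import ffi, lib

# Should print a recognizable type name (e.g. "q5_K") if the library loaded correctly.
print(ffi.string(lib.ggml_type_name(lib.GGML_TYPE_Q5_K)).decode())
```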
|
|
|
#### (Optional) Regenerate the bindings and stubs |
|
|
|
If you add or change any signatures in the C API, you'll want to regenerate the bindings ([ggml/cffi.py](./ggml/cffi.py)) and stubs ([ggml/__init__.pyi](./ggml/__init__.pyi)).
|
|
|
Luckily it's a one-liner using [regenerate.py](./regenerate.py): |
|
|
|
```bash
pip install -q cffi

python regenerate.py
```
|
|
|
By default it assumes `llama.cpp` was cloned at `../../../llama.cpp` (alongside the `ggml` folder). You can override this with:
|
|
|
```bash
C_INCLUDE_DIR=$LLAMA_CPP_DIR python regenerate.py
```
|
|
|
You can also edit [api.h](./api.h) to control which files are included in the generated bindings (defaults to `llama.cpp/ggml*.h`).
|
|
|
In fact, if you wanted to generate bindings only for the current version of the `ggml` repo itself (instead of `llama.cpp`; you'd lose support for k-quants), you could run:
|
|
|
```bash
API=../../include/ggml/ggml.h python regenerate.py
```
|
|
|
## Develop |
|
|
|
Run tests: |
|
|
|
```bash
pytest
```
|
|
|
### Alternatives |
|
|
|
This example's goal is to showcase [cffi](https://cffi.readthedocs.io/)-generated bindings that are trivial to use and update, but there are already alternatives in the wild: |
|
|
|
- https://github.com/abetlen/ggml-python: these bindings seem to be hand-written and use [ctypes](https://docs.python.org/3/library/ctypes.html). They come with [high-quality API reference docs](https://ggml-python.readthedocs.io/en/latest/api-reference/#ggml.ggml) that can also be used with the bindings in this folder, but they don't expose Metal, CUDA, MPI or OpenCL calls, don't support transparent (de/re)quantization like this example does (see the [ggml.utils](./ggml/utils.py) module), and won't pick up your local changes.
|
|
|
- https://github.com/abetlen/llama-cpp-python: these bindings expose the C++ `llama.cpp` interface, which this example cannot easily be extended to support (`cffi` only generates bindings for C libraries).
|
|
|
- [pybind11](https://github.com/pybind/pybind11) and [nanobind](https://github.com/wjakob/nanobind) are two alternatives to cffi that support binding C++ libraries, but neither seems to have an automatic generator (writing bindings by hand is rather time-consuming).
|
|