These are quick GGUF quantizations of DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1.

They were done for testing purposes and include:

  • one made with an older llama.cpp version (without the BPE pre-tokenizer fix), converted via the fp16 binary
  • one made with an older llama.cpp version (without the BPE pre-tokenizer fix), converted via the fp32 binary
  • one made with a recent llama.cpp version that includes the BPE pre-tokenizer fix
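The BPE pre-tokenizer fix in llama.cpp records the pre-tokenizer type in the file's metadata, so one way to tell the older and newer conversions apart is to read the GGUF header. The sketch below is a minimal GGUF metadata reader written against the GGUF v3 layout (little-endian magic/version/counts, length-prefixed strings, typed key/value pairs); it is an illustration of the format, not part of llama.cpp itself.

```python
import struct

def _read_string(buf, off):
    # GGUF v2+ strings: uint64 length followed by UTF-8 bytes
    (n,) = struct.unpack_from("<Q", buf, off)
    off += 8
    return buf[off:off + n].decode("utf-8"), off + n

def _read_value(buf, off, vtype):
    # Scalar value types from the GGUF spec
    scalar = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
              6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
    if vtype in scalar:
        fmt = scalar[vtype]
        (v,) = struct.unpack_from(fmt, buf, off)
        return v, off + struct.calcsize(fmt)
    if vtype == 8:  # string
        return _read_string(buf, off)
    if vtype == 9:  # array: element type (uint32) + count (uint64) + elements
        (etype,) = struct.unpack_from("<I", buf, off)
        (count,) = struct.unpack_from("<Q", buf, off + 4)
        off += 12
        items = []
        for _ in range(count):
            v, off = _read_value(buf, off, etype)
            items.append(v)
        return items, off
    raise ValueError(f"unknown GGUF value type {vtype}")

def read_gguf_metadata(buf):
    # Header: magic "GGUF", version (uint32), tensor count, metadata kv count
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    off = 24
    meta = {}
    for _ in range(n_kv):
        key, off = _read_string(buf, off)
        (vtype,) = struct.unpack_from("<I", buf, off)
        off += 4
        meta[key], off = _read_value(buf, off, vtype)
    return meta
```

A file converted after the fix should carry a `tokenizer.ggml.pre` key (e.g. `"llama-bpe"` for Llama 3 style models); older conversions lack it, which is one plausible source of the quality gap described below.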

Currently the GGUFs perform below expectations; the MLX version performs best in comparison. Any ideas why?

Format: GGUF
Model size: 8.03B params
Architecture: llama

Available quantizations: 4-bit, 32-bit
