
NEW - I exported and added mmproj-BF16.gguf to properly support llama.cpp, ollama, and LM Studio.

Devstral-Vision-Small-2507 GGUF

Quantized GGUF versions of cognitivecomputations/Devstral-Vision-Small-2507 - the multimodal coding specialist that combines Devstral's exceptional coding abilities with vision understanding.

Model Description

This is the first vision-enabled version of Devstral, created by transplanting Devstral's language model weights into Mistral-Small-3.2's multimodal architecture. It enables:

  • Converting UI screenshots to code
  • Debugging visual rendering issues
  • Implementing designs from mockups
  • Understanding codebases with visual context

Quantization Selection Guide

| Quantization | Size | Min RAM | Recommended For | Quality | Notes |
|---|---|---|---|---|---|
| Q8_0 | 23GB | 24GB | RTX 3090/4090/A6000 users wanting maximum quality | ★★★★★ | Near-lossless, best for production use |
| Q6_K | 18GB | 20GB | High-end GPUs with a focus on quality | ★★★★☆ | Excellent quality/size balance |
| Q5_K_M | 16GB | 18GB | RTX 3080 Ti/4070 Ti users | ★★★★☆ | Great balance of quality and performance |
| Q4_K_M | 13GB | 16GB | Most users - RTX 3060 12GB/3070/4060 | ★★★☆☆ | The sweet spot, minimal quality loss |
| IQ4_XS | 12GB | 14GB | Experimental - newer compression method | ★★★☆☆ | Good alternative to Q4_K_M |
| Q3_K_M | 11GB | 12GB | 8-12GB GPUs, quality-conscious users | ★★☆☆☆ | Noticeable quality drop for complex code |

Choosing the Right Quantization

For vision-assisted coding tasks, I recommend:

  • Production/Professional use: Q8_0 or Q6_K
  • General development: Q4_K_M (best balance)
  • Limited VRAM: Q5_K_M if you can fit it, otherwise Q4_K_M
  • Experimental: Try IQ4_XS for potentially better quality at similar size to Q4_K_M

Avoid Q3_K_M unless you're VRAM-constrained - the quality degradation becomes noticeable for complex coding tasks and visual understanding.
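As a rough rule of thumb, you can let free VRAM drive the choice. The snippet below is only an illustrative helper, not part of this repo: the thresholds mirror the table above, and it assumes nvidia-smi is available on your system.

# Rough quantization picker based on free VRAM (thresholds follow the table above)
free_mb=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
if   [ "$free_mb" -ge 24000 ]; then echo "Use Q8_0"
elif [ "$free_mb" -ge 18000 ]; then echo "Use Q5_K_M or Q6_K"
elif [ "$free_mb" -ge 14000 ]; then echo "Use Q4_K_M (or try IQ4_XS)"
else echo "Use Q3_K_M, or plan on partial CPU offload"
fi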

Usage Examples

With llama.cpp

# Download the model
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --local-dir .
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  mmproj-BF16.gguf \
  --local-dir .

# Run with llama.cpp's multimodal CLI (the mmproj file is required for image input)
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -p "Analyze this UI and generate React code" \
  --image screenshot.png \
  -c 8192
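
If you prefer an HTTP endpoint, recent llama.cpp builds also accept the projector in llama-server. A minimal sketch, assuming a build with multimodal server support:

# Serve an OpenAI-compatible endpoint with vision support (assumes a recent llama.cpp build)
./llama-server -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 8192 -ngl 999 --port 8080

Clients can then send images as base64 data URLs through the standard chat completions API.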

With LM Studio

  1. Download your chosen quantization
  2. Load in LM Studio
  3. Enable multimodal/vision mode in settings
  4. Drag and drop images into the chat
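
LM Studio can also expose the model through its local OpenAI-compatible server. The sketch below assumes the default port 1234 and uses a hypothetical model identifier; substitute whatever name LM Studio assigns after loading.

# Query LM Studio's local OpenAI-compatible server with an image (defaults assumed)
IMG=$(base64 < screenshot.png | tr -d '\n')
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-vision-small-2507",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Convert this UI into a React component"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
      ]
    }]
  }'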

With ollama

# Create Modelfile
cat > Modelfile << EOF
FROM ./Devstral-Small-Vision-2507-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF

# Create and run
ollama create devstral-vision -f Modelfile
ollama run devstral-vision
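
Once the model is created, one way to pass an image is through ollama's REST API, which accepts base64-encoded images in the generate request. This is a sketch assuming the default port 11434; depending on your ollama version you may also need to reference mmproj-BF16.gguf when creating the model for vision to work.

# Send an image to the running model via ollama's REST API (default port 11434)
IMG=$(base64 < screenshot.png | tr -d '\n')
curl http://localhost:11434/api/generate -d '{
  "model": "devstral-vision",
  "prompt": "Analyze this UI and generate React code",
  "images": ["'"$IMG"'"],
  "stream": false
}'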

With koboldcpp

python koboldcpp.py --model Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --contextsize 8192 \
  --gpulayers 999 \
  --mmproj mmproj-BF16.gguf

Performance Tips

  1. Context Size: This model supports up to 128k context, but start with 8k-16k for better performance
  2. GPU Layers: Offload all layers to GPU if possible (--gpulayers 999 or -ngl 999)
  3. Batch Size: Increase batch size for better throughput if you have VRAM headroom
  4. Temperature: Use lower temperatures (0.1-0.3) for code generation, higher (0.7-0.9) for creative tasks
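
Putting those tips together, a reasonable starting invocation for code generation might look like the sketch below. The flag values are illustrative starting points, not tuned settings.

# Full GPU offload, 16k context, larger batch, low temperature for code generation
# (adjust values to your VRAM; 128k context is supported but slower)
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 16384 -ngl 999 -b 512 --temp 0.2 \
  --image screenshot.png \
  -p "Implement this mockup as a React component"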

Hardware Requirements

| Quantization | Single GPU | Partial Offload | CPU Only |
|---|---|---|---|
| Q8_0 | 24GB VRAM | 16GB VRAM + 16GB RAM | 32GB RAM |
| Q6_K | 20GB VRAM | 12GB VRAM + 16GB RAM | 24GB RAM |
| Q5_K_M | 18GB VRAM | 12GB VRAM + 12GB RAM | 24GB RAM |
| Q4_K_M | 16GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| IQ4_XS | 14GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| Q3_K_M | 12GB VRAM | 6GB VRAM + 12GB RAM | 16GB RAM |
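
For the partial-offload rows, you control the split with the GPU-layer flag. A sketch for running Q4_K_M on roughly an 8GB card; the layer count is a starting guess, so lower it if you hit out-of-memory errors.

# Partial offload: put some layers on an 8GB GPU, keep the rest in system RAM
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 8192 -ngl 24 \
  --image screenshot.png \
  -p "Explain what is wrong with this rendered layout"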

Model Capabilities

✅ Strengths:

  • Exceptional at converting visual designs to code
  • Strong debugging abilities with visual context
  • Maintains Devstral's 53.6% SWE-Bench performance
  • Handles multiple programming languages
  • 128k token context window

⚠️ Limitations:

  • Not specifically fine-tuned for vision-to-code tasks
  • Vision performance bounded by Mistral-Small-3.2's capabilities
  • Requires decent hardware for optimal performance
  • Quantization impacts both vision and coding quality

License

Apache 2.0 (inherited from base models)


Acknowledgments

Links


For issues or questions about these quantizations, please open an issue in the repository.
