NEW - I exported and added mmproj-BF16.gguf to properly support llama.cpp, ollama, and LM Studio.
Devstral-Vision-Small-2507 GGUF
Quantized GGUF versions of cognitivecomputations/Devstral-Vision-Small-2507 - the multimodal coding specialist that combines Devstral's exceptional coding abilities with vision understanding.
Model Description
This is the first vision-enabled version of Devstral, created by transplanting Devstral's language model weights into Mistral-Small-3.2's multimodal architecture. It enables:
- Converting UI screenshots to code
- Debugging visual rendering issues
- Implementing designs from mockups
- Understanding codebases with visual context
Quantization Selection Guide
Quantization | Size | Min RAM | Recommended For | Quality | Notes |
---|---|---|---|---|---|
Q8_0 | 23GB | 24GB | RTX 3090/4090/A6000 users wanting maximum quality | ★★★★★ | Near-lossless, best for production use |
Q6_K | 18GB | 20GB | High-end GPUs with focus on quality | ★★★★★ | Excellent quality/size balance |
Q5_K_M | 16GB | 18GB | RTX 3080 Ti/4070 Ti users | ★★★★★ | Great balance of quality and performance |
Q4_K_M | 13GB | 16GB | Most users - RTX 3060 12GB/3070/4060 | ★★★★☆ | The sweet spot, minimal quality loss |
IQ4_XS | 12GB | 14GB | Experimental - newer compression method | ★★★★☆ | Good alternative to Q4_K_M |
Q3_K_M | 11GB | 12GB | 8-12GB GPUs, quality-conscious users | ★★★☆☆ | Noticeable quality drop for complex code |
Choosing the Right Quantization
For vision-assisted coding tasks, I recommend:
- Production/Professional use: Q8_0 or Q6_K
- General development: Q4_K_M (best balance)
- Limited VRAM: Q5_K_M if you can fit it, otherwise Q4_K_M
- Experimental: Try IQ4_XS for potentially better quality at similar size to Q4_K_M
Avoid Q3_K_M unless you're VRAM-constrained - the quality degradation becomes noticeable for complex coding tasks and visual understanding.
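If you're not sure what your GPU can hold, it helps to check free VRAM before downloading. A minimal sketch for NVIDIA cards, assuming `nvidia-smi` is installed (the sample output is hypothetical):

```bash
# Query total and free VRAM before picking a quantization.
# Rough rule of thumb: file size from the table above plus 2-3GB of headroom
# for context, KV cache, and the vision projector.
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

# Hypothetical output:
# name, memory.total [MiB], memory.free [MiB]
# NVIDIA GeForce RTX 3060, 12288 MiB, 11706 MiB
```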
Usage Examples
With llama.cpp
```bash
# Download the model and the vision projector
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --local-dir .

huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  mmproj-BF16.gguf \
  --local-dir .

# Run with llama.cpp's multimodal CLI, pointing it at both the model and the projector
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  --image screenshot.png \
  -p "Analyze this UI and generate React code" \
  -c 8192
```
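If you prefer an HTTP endpoint instead of the interactive CLI, recent llama.cpp builds can serve the model and the projector together through llama-server. A minimal sketch, with the host and port chosen arbitrarily:

```bash
# Serve the model with vision support over llama.cpp's OpenAI-compatible server
./llama-server -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 8192 \
  -ngl 999 \
  --host 127.0.0.1 --port 8080
```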
With LM Studio
- Download your chosen quantization
- Load in LM Studio
- Enable multimodal/vision mode in settings
- Drag and drop images into the chat
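Beyond the GUI, LM Studio can expose the loaded model through its local OpenAI-compatible server. The sketch below assumes the default endpoint `http://localhost:1234/v1` and uses a placeholder model name; substitute whatever identifier LM Studio shows for your loaded model.

```bash
# Send a screenshot plus a prompt to LM Studio's local server.
# (GNU coreutils base64 shown; on macOS use `base64 -i screenshot.png`.)
# The model name below is a placeholder.
IMG_B64=$(base64 -w0 screenshot.png)

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "model": "devstral-vision-small-2507",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Analyze this UI and generate React code"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
      ]
    }
  ],
  "temperature": 0.2
}
EOF
```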
With ollama
```bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./Devstral-Small-Vision-2507-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF

# Create and run
ollama create devstral-vision -f Modelfile
ollama run devstral-vision
```
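Once the model is created, you can query it from the CLI or through ollama's local REST API (port 11434 by default). A minimal sketch:

```bash
# Quick test from the command line
ollama run devstral-vision "Write a Python function that parses a CSV file into a list of dicts."

# Same request through the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "devstral-vision",
  "prompt": "Explain what a race condition is and show a minimal example in Go.",
  "stream": false
}'
```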
With koboldcpp
```bash
python koboldcpp.py --model Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --contextsize 8192 \
  --gpulayers 999 \
  --mmproj mmproj-BF16.gguf
```
Performance Tips
- Context Size: This model supports up to 128k context, but start with 8k-16k for better performance
- GPU Layers: Offload all layers to GPU if possible (`--gpulayers 999` or `-ngl 999`)
- Batch Size: Increase batch size for better throughput if you have VRAM headroom
- Temperature: Use lower temperatures (0.1-0.3) for code generation, higher (0.7-0.9) for creative tasks; a combined example applying these tips follows below
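Putting these tips together, a tuned invocation for generating code from a mockup might look like the sketch below. The flags assume a recent llama.cpp build with multimodal support; the batch size and layer count are starting points, not measured optima.

```bash
# 16k context, full GPU offload, larger batch, low temperature for deterministic code
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 16384 \
  -ngl 999 \
  -b 1024 \
  --temp 0.2 \
  --image mockup.png \
  -p "Implement this design as a responsive HTML/CSS page"
```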
Hardware Requirements
Quantization | Single GPU | Partial Offload | CPU Only |
---|---|---|---|
Q8_0 | 24GB VRAM | 16GB VRAM + 16GB RAM | 32GB RAM |
Q6_K | 20GB VRAM | 12GB VRAM + 16GB RAM | 24GB RAM |
Q5_K_M | 18GB VRAM | 12GB VRAM + 12GB RAM | 24GB RAM |
Q4_K_M | 16GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
IQ4_XS | 14GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
Q3_K_M | 12GB VRAM | 6GB VRAM + 12GB RAM | 16GB RAM |
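For the partial-offload column, llama.cpp lets you keep only some layers on the GPU via `-ngl`. A sketch for fitting Q4_K_M onto roughly 8GB of VRAM; the layer count is a guess to tune against your own out-of-memory threshold:

```bash
# Split the model between an ~8GB GPU and system RAM; lower -ngl if you run out
# of VRAM, raise it if memory is left over.
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 24 \
  -c 8192 \
  --image screenshot.png \
  -p "What bug is visible in this rendered page, and how would you fix the CSS?"
```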
Model Capabilities
✅ Strengths:
- Exceptional at converting visual designs to code
- Strong debugging abilities with visual context
- Maintains Devstral's 53.6% SWE-Bench performance
- Handles multiple programming languages
- 128k token context window
⚠️ Limitations:
- Not specifically fine-tuned for vision-to-code tasks
- Vision performance bounded by Mistral-Small-3.2's capabilities
- Requires decent hardware for optimal performance
- Quantization impacts both vision and coding quality
License
Apache 2.0 (inherited from base models)
Acknowledgments
- Original model by Eric Hartford at Cognitive Computations
- Built on Mistral AI's Devstral and Mistral-Small models
- Quantized using llama.cpp
Links
For issues or questions about these quantizations, please open an issue in the repository.