
NEW - I exported and added mmproj-BF16.gguf to properly support llama.cpp, ollama, and LM Studio.

Devstral-Vision-Small-2507 GGUF

Quantized GGUF versions of cognitivecomputations/Devstral-Vision-Small-2507 - the multimodal coding specialist that combines Devstral's exceptional coding abilities with vision understanding.

Model Description

This is the first vision-enabled version of Devstral, created by transplanting Devstral's language model weights into Mistral-Small-3.2's multimodal architecture. It enables:

  • Converting UI screenshots to code
  • Debugging visual rendering issues
  • Implementing designs from mockups
  • Understanding codebases with visual context

Quantization Selection Guide

| Quantization | Size | Min RAM | Recommended For | Quality | Notes |
|---|---|---|---|---|---|
| Q8_0 | 23GB | 24GB | RTX 3090/4090/A6000 users wanting maximum quality | ★★★★★ | Near-lossless, best for production use |
| Q6_K | 18GB | 20GB | High-end GPUs with a focus on quality | ★★★★☆ | Excellent quality/size balance |
| Q5_K_M | 16GB | 18GB | RTX 3080 Ti/4070 Ti users | ★★★★☆ | Great balance of quality and performance |
| Q4_K_M | 13GB | 16GB | Most users - RTX 3060 12GB/3070/4060 | ★★★☆☆ | The sweet spot, minimal quality loss |
| IQ4_XS | 12GB | 14GB | Experimental - newer compression method | ★★★☆☆ | Good alternative to Q4_K_M |
| Q3_K_M | 11GB | 12GB | 8-12GB GPUs, quality-conscious users | ★★☆☆☆ | Noticeable quality drop for complex code |

Choosing the Right Quantization

For vision-assisted coding tasks, I recommend:

  • Production/Professional use: Q8_0 or Q6_K
  • General development: Q4_K_M (best balance)
  • Limited VRAM: Q5_K_M if you can fit it, otherwise Q4_K_M
  • Experimental: Try IQ4_XS for potentially better quality at similar size to Q4_K_M

Avoid Q3_K_M unless you're VRAM-constrained - the quality degradation becomes noticeable for complex coding tasks and visual understanding.
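As a rough rule of thumb, you can let free VRAM drive the choice. The snippet below is only an illustrative helper, not part of this repo: the thresholds mirror the table above, and it assumes nvidia-smi is available on your system.

# Rough quantization picker based on free VRAM (thresholds follow the table above)
free_mb=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
if   [ "$free_mb" -ge 24000 ]; then echo "Use Q8_0"
elif [ "$free_mb" -ge 18000 ]; then echo "Use Q5_K_M or Q6_K"
elif [ "$free_mb" -ge 14000 ]; then echo "Use Q4_K_M (or try IQ4_XS)"
else echo "Use Q3_K_M, or plan on partial CPU offload"
fi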

Usage Examples

With llama.cpp

# Download the model
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --local-dir .
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  mmproj-BF16.gguf \
  --local-dir .

# Run with llama.cpp's multimodal CLI (the mmproj file is required for image input)
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -p "Analyze this UI and generate React code" \
  --image screenshot.png \
  -c 8192
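
If you prefer an HTTP endpoint, recent llama.cpp builds also accept the projector in llama-server. A minimal sketch, assuming a build with multimodal server support:

# Serve an OpenAI-compatible endpoint with vision support (assumes a recent llama.cpp build)
./llama-server -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 8192 -ngl 999 --port 8080

Clients can then send images as base64 data URLs through the standard chat completions API.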

With LM Studio

  1. Download your chosen quantization
  2. Load in LM Studio
  3. Enable multimodal/vision mode in settings
  4. Drag and drop images into the chat
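
LM Studio can also expose the model through its local OpenAI-compatible server. The sketch below assumes the default port 1234 and uses a hypothetical model identifier; substitute whatever name LM Studio assigns after loading.

# Query LM Studio's local OpenAI-compatible server with an image (defaults assumed)
IMG=$(base64 < screenshot.png | tr -d '\n')
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-vision-small-2507",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Convert this UI into a React component"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
      ]
    }]
  }'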

With ollama

# Create Modelfile
cat > Modelfile << EOF
FROM ./Devstral-Small-Vision-2507-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF

# Create and run
ollama create devstral-vision -f Modelfile
ollama run devstral-vision
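
Once the model is created, one way to pass an image is through ollama's REST API, which accepts base64-encoded images in the generate request. This is a sketch assuming the default port 11434; depending on your ollama version you may also need to reference mmproj-BF16.gguf when creating the model for vision to work.

# Send an image to the running model via ollama's REST API (default port 11434)
IMG=$(base64 < screenshot.png | tr -d '\n')
curl http://localhost:11434/api/generate -d '{
  "model": "devstral-vision",
  "prompt": "Analyze this UI and generate React code",
  "images": ["'"$IMG"'"],
  "stream": false
}'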

With koboldcpp

python koboldcpp.py --model Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --contextsize 8192 \
  --gpulayers 999 \
  --mmproj mmproj-BF16.gguf

Performance Tips

  1. Context Size: This model supports up to 128k context, but start with 8k-16k for better performance
  2. GPU Layers: Offload all layers to GPU if possible (--gpulayers 999 or -ngl 999)
  3. Batch Size: Increase batch size for better throughput if you have VRAM headroom
  4. Temperature: Use lower temperatures (0.1-0.3) for code generation, higher (0.7-0.9) for creative tasks
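
Putting those tips together, a reasonable starting invocation for code generation might look like the sketch below. The flag values are illustrative starting points, not tuned settings.

# Full GPU offload, 16k context, larger batch, low temperature for code generation
# (adjust values to your VRAM; 128k context is supported but slower)
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 16384 -ngl 999 -b 512 --temp 0.2 \
  --image screenshot.png \
  -p "Implement this mockup as a React component"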

Hardware Requirements

| Quantization | Single GPU | Partial Offload | CPU Only |
|---|---|---|---|
| Q8_0 | 24GB VRAM | 16GB VRAM + 16GB RAM | 32GB RAM |
| Q6_K | 20GB VRAM | 12GB VRAM + 16GB RAM | 24GB RAM |
| Q5_K_M | 18GB VRAM | 12GB VRAM + 12GB RAM | 24GB RAM |
| Q4_K_M | 16GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| IQ4_XS | 14GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| Q3_K_M | 12GB VRAM | 6GB VRAM + 12GB RAM | 16GB RAM |
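
For the partial-offload rows, you control the split with the GPU-layer flag. A sketch for running Q4_K_M on roughly an 8GB card; the layer count is a starting guess, so lower it if you hit out-of-memory errors.

# Partial offload: put some layers on an 8GB GPU, keep the rest in system RAM
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 8192 -ngl 24 \
  --image screenshot.png \
  -p "Explain what is wrong with this rendered layout"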

Model Capabilities

✅ Strengths:

  • Exceptional at converting visual designs to code
  • Strong debugging abilities with visual context
  • Maintains Devstral's 53.6% SWE-Bench performance
  • Handles multiple programming languages
  • 128k token context window

⚠️ Limitations:

  • Not specifically fine-tuned for vision-to-code tasks
  • Vision performance bounded by Mistral-Small-3.2's capabilities
  • Requires decent hardware for optimal performance
  • Quantization impacts both vision and coding quality

License

Apache 2.0 (inherited from base models)


Acknowledgments

Links


For issues or questions about these quantizations, please open an issue in the repository.
