---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models with llama.cpp
---

This Streamlit app enables chat-based inference on various GGUF models using `llama.cpp` and `llama-cpp-python`.
Supported Models:
- Qwen/Qwen2.5-7B-Instruct-GGUF → `qwen2.5-7b-instruct-q2_k.gguf`
- unsloth/gemma-3-4b-it-GGUF → `gemma-3-4b-it-Q4_K_M.gguf`
- unsloth/Phi-4-mini-instruct-GGUF → `Phi-4-mini-instruct-Q4_K_M.gguf`
- MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF → `Meta-Llama-3.1-8B-Instruct.Q2_K.gguf`
- unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF → `DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf`
- MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF → `Mistral-7B-Instruct-v0.3.IQ3_XS.gguf`
- Qwen/Qwen2.5-Coder-7B-Instruct-GGUF → `qwen2.5-coder-7b-instruct-q2_k.gguf`
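For illustration, the table above could be exposed as a sidebar selector with a plain dict; the `MODELS` registry below is a sketch, not the app's actual code:

```python
import streamlit as st

# Hypothetical registry: display name -> (repo_id, GGUF filename).
# Names and structure are illustrative, not taken from app.py.
MODELS = {
    "Qwen2.5-7B-Instruct": ("Qwen/Qwen2.5-7B-Instruct-GGUF", "qwen2.5-7b-instruct-q2_k.gguf"),
    "Gemma-3-4B-IT": ("unsloth/gemma-3-4b-it-GGUF", "gemma-3-4b-it-Q4_K_M.gguf"),
    "Phi-4-mini-instruct": ("unsloth/Phi-4-mini-instruct-GGUF", "Phi-4-mini-instruct-Q4_K_M.gguf"),
}

choice = st.sidebar.selectbox("Model", list(MODELS))
repo_id, filename = MODELS[choice]
```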
⚙️ Features:
- Model selection in the sidebar
- Customizable system prompt and generation parameters
- Chat-style UI with streaming responses
- Markdown output rendering for readable, styled output
- DeepSeek-compatible `<think>` tag handling: shows the model's reasoning in a collapsible expander (see the sketch below)
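A minimal sketch of how the `<think>` handling could work, assuming the full response text has already been accumulated (`render_reply` is an illustrative name, not necessarily the app's actual helper):

```python
import re

import streamlit as st

def render_reply(text: str) -> None:
    """Render a DeepSeek-style reply: reasoning in an expander, answer as Markdown."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        with st.expander("Model reasoning"):
            st.markdown(match.group(1).strip())
        text = text[match.end():]  # keep only the part after </think>
    st.markdown(text)  # Markdown rendering for styled output
```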
🧠 Memory-Safe Design (for HuggingFace Spaces):
- Loads only one model at a time to prevent memory bloat
- Manually unloads the previous model and calls `gc.collect()` to free memory when switching models (see the sketch after this list)
- Adjusts the `n_ctx` context length to operate within a 16 GB RAM limit
- Automatically downloads models as needed
- Limits history to the last 8 user-assistant turns to prevent context overflow
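A minimal sketch of this single-resident-model pattern, assuming `llama-cpp-python` and `huggingface_hub`; `load_model`, `trim_history`, and the `_loaded` cache are illustrative names, not the app's actual code:

```python
import gc

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

_loaded = {"key": None, "llm": None}  # at most one model resident in RAM

def load_model(repo_id: str, filename: str, n_ctx: int = 4096) -> Llama:
    """Download the GGUF file if needed and load it, unloading any previous model."""
    key = (repo_id, filename)
    if _loaded["key"] == key:
        return _loaded["llm"]
    if _loaded["llm"] is not None:
        _loaded["llm"] = None  # drop the only reference to the old model
        gc.collect()           # prompt Python to release llama.cpp buffers
    path = hf_hub_download(repo_id=repo_id, filename=filename)  # cached on disk
    _loaded["key"], _loaded["llm"] = key, Llama(model_path=path, n_ctx=n_ctx)
    return _loaded["llm"]

def trim_history(messages: list[dict], max_turns: int = 8) -> list[dict]:
    """Keep the system prompt plus only the last `max_turns` user/assistant turns."""
    return messages[:1] + messages[1:][-2 * max_turns:]
```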
Ideal for deploying multiple GGUF chat models on free-tier HuggingFace Spaces!
Refer to the configuration guide at https://huggingface.co/docs/hub/spaces-config-reference