Spaces: Luigi/ZeroGPU-LLM-Inference

Luigi committed 15 days ago

Commit c1bc514 (1 parent: df40b1d)

Add comprehensive documentation and user guide

📚 Documentation Enhancements:
- Complete rewrite of README with modern formatting
- Organized sections for features, models, and technical details
- Added comprehensive USER_GUIDE.md with tutorials
- Quick start guide for beginners
- Advanced configuration guide for power users

📖 README Updates:
- Modern layout with clear sections
- Feature highlights with emojis for easy scanning
- Model categorization by size and purpose
- Technical flow explanation
- Performance and customization info
- Contributing guidelines

📝 User Guide Includes:
- 5-minute quick start tutorial
- Detailed feature explanations
- Advanced parameter guide with use cases
- Preset configurations for common tasks
- Tips & tricks for better results
- Troubleshooting section
- Best practices for different user levels
- Keyboard shortcuts reference

🎯 Content Organization:
- Beginner-friendly introduction
- Progressive complexity
- Practical examples throughout
- Visual tables for quick reference
- Clear explanations of technical concepts

Files changed (3)

README.md +177 -62
README_OLD.md +80 -0
USER_GUIDE.md +300 -0

README.md CHANGED

@@ -1,80 +1,195 @@
 ---
 title: ZeroGPU-LLM-Inference
 emoji: 🧠
-colorFrom: pink
 colorTo: purple
 sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Streaming LLM chat with web search and debug
 ---
-This Gradio app provides **token-streaming, chat-style inference** on a wide variety of Transformer models—leveraging ZeroGPU for free GPU acceleration on HF Spaces.
-Key features:
-- **Real-time DuckDuckGo web search** (background thread, configurable timeout) with results injected into the system prompt.
-- **Prompt preview panel** for debugging and prompt-engineering insights—see exactly what’s sent to the model.
-- **Thought vs. Answer streaming**: any `<think>…</think>` blocks emitted by the model are shown as separate “💭 Thought.”
-- **Cancel button** to immediately stop generation.
-- **Dynamic system prompt**: automatically inserts today’s date when you toggle web search.
-- **Extensive model selection**: over 30 LLMs (from Phi-4 mini to Qwen3-14B, SmolLM2, Taiwan-ELM, Mistral, Meta-Llama, MiMo, Gemma, DeepSeek-R1, etc.).
-- **Memory-safe design**: loads one model at a time, clears cache after each generation.
-- **Customizable generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty.
-- **Web-search settings**: max results, max chars per result, search timeout.
-- **Requirements pinned** to ensure reproducible deployment.
 ## 🔄 Supported Models
-Use the dropdown to select any of these:
-| Name                                  | Repo ID                                            |
-| ------------------------------------- | -------------------------------------------------- |
-| Taiwan-ELM-1_1B-Instruct              | liswei/Taiwan-ELM-1_1B-Instruct                    |
-| Taiwan-ELM-270M-Instruct              | liswei/Taiwan-ELM-270M-Instruct                    |
-| Qwen3-0.6B                            | Qwen/Qwen3-0.6B                                    |
-| Qwen3-1.7B                            | Qwen/Qwen3-1.7B                                    |
-| Qwen3-4B                              | Qwen/Qwen3-4B                                      |
-| Qwen3-8B                              | Qwen/Qwen3-8B                                      |
-| Qwen3-14B                             | Qwen/Qwen3-14B                                     |
-| Gemma-3-4B-IT                         | unsloth/gemma-3-4b-it                              |
-| SmolLM2-135M-Instruct-TaiwanChat      | Luigi/SmolLM2-135M-Instruct-TaiwanChat             |
-| SmolLM2-135M-Instruct                 | HuggingFaceTB/SmolLM2-135M-Instruct                |
-| SmolLM2-360M-Instruct-TaiwanChat      | Luigi/SmolLM2-360M-Instruct-TaiwanChat             |
-| Llama-3.2-Taiwan-3B-Instruct          | lianghsun/Llama-3.2-Taiwan-3B-Instruct             |
-| MiniCPM3-4B                           | openbmb/MiniCPM3-4B                                |
-| Qwen2.5-3B-Instruct                   | Qwen/Qwen2.5-3B-Instruct                           |
-| Qwen2.5-7B-Instruct                   | Qwen/Qwen2.5-7B-Instruct                           |
-| Phi-4-mini-Reasoning                  | microsoft/Phi-4-mini-reasoning                     |
-| Phi-4-mini-Instruct                   | microsoft/Phi-4-mini-instruct                      |
-| Meta-Llama-3.1-8B-Instruct            | MaziyarPanahi/Meta-Llama-3.1-8B-Instruct            |
-| DeepSeek-R1-Distill-Llama-8B          | unsloth/DeepSeek-R1-Distill-Llama-8B               |
-| Mistral-7B-Instruct-v0.3              | MaziyarPanahi/Mistral-7B-Instruct-v0.3              |
-| Qwen2.5-Coder-7B-Instruct             | Qwen/Qwen2.5-Coder-7B-Instruct                     |
-| Qwen2.5-Omni-3B                       | Qwen/Qwen2.5-Omni-3B                               |
-| MiMo-7B-RL                            | XiaomiMiMo/MiMo-7B-RL                              |
-*(…and more can easily be added in `MODELS` in `app.py`.)*
-## ⚙️ Generation & Search Parameters
-- **Max Tokens**: 64–16384
-- **Temperature**: 0.1–2.0
-- **Top-K**: 1–100
-- **Top-P**: 0.1–1.0
-- **Repetition Penalty**: 1.0–2.0
-- **Enable Web Search**: on/off
-- **Max Results**: integer
-- **Max Chars/Result**: integer
-- **Search Timeout (s)**: 0.0–30.0
 ## 🚀 How It Works
-1. **User message** enters chat history.
-2. If search is enabled, a background DuckDuckGo thread fetches snippets.
-3. After up to *Search Timeout* seconds, snippets merge into the system prompt.
-4. The selected model pipeline is loaded (bf16→f16→f32 fallback) on ZeroGPU.
-5. Prompt is formatted—any `<think>…</think>` blocks will be streamed as separate “💭 Thought.”
-6. Tokens stream to the Chatbot UI. Press **Cancel** to stop mid-generation.

 ---
 title: ZeroGPU-LLM-Inference
 emoji: 🧠
+colorFrom: indigo
 colorTo: purple
 sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
 pinned: false
 license: apache-2.0
+short_description: Streaming LLM chat with web search and controls
 ---
+# 🧠 ZeroGPU LLM Inference
+A modern, user-friendly Gradio interface for **token-streaming, chat-style inference** across a wide variety of Transformer models—powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.
+## ✨ Key Features
+### 🎨 Modern UI/UX
+- **Clean, intuitive interface** with organized layout and visual hierarchy
+- **Collapsible advanced settings** for both simple and power users
+- **Smooth animations and transitions** for better user experience
+- **Responsive design** that works on all screen sizes
+- **Copy-to-clipboard** functionality for easy sharing of responses
+### 🔍 Web Search Integration
+- **Real-time DuckDuckGo search** with background threading
+- **Configurable timeout** and result limits
+- **Automatic context injection** into system prompts
+- **Smart toggle** - search settings auto-hide when disabled
+### 💡 Smart Features
+- **Thought vs. Answer streaming**: `<think>…</think>` blocks shown separately as "💭 Thought"
+- **Working cancel button** - immediately stops generation without errors
+- **Debug panel** for prompt engineering insights
+- **Duration estimates** based on model size and settings
+- **Example prompts** to help users get started
+- **Dynamic system prompts** with automatic date insertion
+### 🎯 Model Variety
+- **30+ LLM options** from leading providers (Qwen, Microsoft, Meta, Mistral, etc.)
+- Models ranging from **135M to 32B+** parameters
+- Specialized models for **reasoning, coding, and general chat**
+- **Efficient model loading** - one at a time with automatic cache clearing
+### ⚙️ Advanced Controls
+- **Generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty
+- **Web search settings**: max results, chars per result, timeout
+- **Custom system prompts** with dynamic date insertion
+- **Organized in collapsible sections** to keep interface clean
 ## 🔄 Supported Models
+### Compact Models (< 2B)
+- **SmolLM2-135M-Instruct** - Tiny but capable
+- **SmolLM2-360M-Instruct** - Lightweight conversation
+- **Taiwan-ELM-270M/1.1B** - Multilingual support
+- **Qwen3-0.6B/1.7B** - Fast inference
+### Mid-Size Models (2B-8B)
+- **Qwen3-4B/8B** - Balanced performance
+- **Phi-4-mini** (4.3B) - Reasoning & Instruct variants
+- **MiniCPM3-4B** - Efficient mid-size
+- **Gemma-3-4B-IT** - Instruction-tuned
+- **Llama-3.2-Taiwan-3B** - Regional optimization
+- **Mistral-7B-Instruct** - Classic performer
+- **DeepSeek-R1-Distill-Llama-8B** - Reasoning specialist
+### Large Models (14B+)
+- **Qwen3-14B** - Strong general purpose
+- **Apriel-1.5-15b-Thinker** - Multimodal reasoning
+- **gpt-oss-20b** - Open GPT-style
+- **Qwen3-32B** - Top-tier performance
 ## 🚀 How It Works
+1. **Select Model** - Choose from 30+ pre-configured models
+2. **Configure Settings** - Adjust generation parameters or use defaults
+3. **Enable Web Search** (optional) - Get real-time information
+4. **Start Chatting** - Type your message or use example prompts
+5. **Stream Response** - Watch as tokens are generated in real-time
+6. **Cancel Anytime** - Stop generation mid-stream if needed
+### Technical Flow
+1. User message enters chat history
+2. If search enabled, background thread fetches DuckDuckGo results
+3. Search snippets merge into system prompt (within timeout limit)
+4. Selected model pipeline loads on ZeroGPU (bf16→f16→f32 fallback)
+5. Prompt formatted with thinking mode detection
+6. Tokens stream to UI with thought/answer separation
+7. Cancel button available for immediate interruption
+8. Memory cleared after generation for next request
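Steps 2 and 3 of the technical flow can be pictured with a small sketch. This is not the code from `app.py`: `fake_search`, `gather_snippets`, and `build_system_prompt` are hypothetical names, and a stub stands in for the real DuckDuckGo query. It only illustrates the idea of waiting on a background thread for at most the configured timeout and merging whatever arrived.

```python
import threading
import time


def fake_search(query, results, delay=0.0):
    """Hypothetical stand-in for a web search; appends snippets to a shared list."""
    time.sleep(delay)
    results.append(f"snippet about {query}")


def gather_snippets(query, timeout_s, delay=0.0):
    """Run the search in a background thread and wait at most timeout_s seconds."""
    results = []
    worker = threading.Thread(
        target=fake_search, args=(query, results, delay), daemon=True
    )
    worker.start()
    worker.join(timeout=timeout_s)  # returns early if the search finishes first
    return list(results)  # empty if the search did not finish in time


def build_system_prompt(base_prompt, snippets):
    """Merge whatever snippets arrived into the system prompt."""
    if not snippets:
        return base_prompt
    lines = "\n".join(f"- {s}" for s in snippets)
    return base_prompt + "\n\nSearch results:\n" + lines
```

With this shape, a slow search never delays generation by more than the timeout; the model simply runs without search context when nothing arrives in time.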
+## ⚙️ Generation Parameters
+| Parameter | Range | Default | Description |
+|-----------|-------|---------|-------------|
+| Max Tokens | 64-16384 | 1024 | Maximum response length |
+| Temperature | 0.1-2.0 | 0.7 | Creativity vs focus |
+| Top-K | 1-100 | 40 | Token sampling pool size |
+| Top-P | 0.1-1.0 | 0.9 | Nucleus sampling threshold |
+| Repetition Penalty | 1.0-2.0 | 1.2 | Reduce repetition |
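One way to read the table above: each parameter has a valid range and a default, so a value can be clamped into range before it reaches the model. The sketch below assumes exactly that. Whether `app.py` clamps values this way or relies on the sliders alone is not stated here, and `GENERATION_LIMITS` and `resolve_generation_params` are hypothetical names.

```python
# (low, high, default) per parameter, copied from the table above.
# The dict layout and the key names are assumptions for illustration.
GENERATION_LIMITS = {
    "max_tokens": (64, 16384, 1024),
    "temperature": (0.1, 2.0, 0.7),
    "top_k": (1, 100, 40),
    "top_p": (0.1, 1.0, 0.9),
    "repetition_penalty": (1.0, 2.0, 1.2),
}


def resolve_generation_params(overrides=None):
    """Fill in defaults and clamp any override into its documented range."""
    overrides = overrides or {}
    resolved = {}
    for name, (low, high, default) in GENERATION_LIMITS.items():
        value = overrides.get(name, default)
        resolved[name] = min(max(value, low), high)
    return resolved
```

Calling `resolve_generation_params({"temperature": 5.0})` would return the defaults with the temperature pulled back to the upper bound of its range.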
+## 🌐 Web Search Settings
+| Setting | Range | Default | Description |
+|---------|-------|---------|-------------|
+| Max Results | Integer | 4 | Number of search results |
+| Max Chars/Result | Integer | 50 | Character limit per result |
+| Search Timeout | 0-30s | 5s | Maximum wait time |
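To make the effect of Max Results and Max Chars/Result concrete, here is a minimal sketch of trimming search snippets before they are placed in the prompt. The function name `format_search_results` and the numbered output format are assumptions; the actual formatting in `app.py` may differ.

```python
def format_search_results(snippets, max_results=4, max_chars=50):
    """Keep the first max_results snippets and cut each to max_chars characters."""
    trimmed = [s[:max_chars] for s in snippets[:max_results]]
    return "\n".join(f"{i}. {text}" for i, text in enumerate(trimmed, start=1))
```

With the defaults from the table, at most 4 snippets of at most 50 characters each are kept, which bounds how much of the context window the search results can occupy.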
+## 💻 Local Development
+```bash
+# Clone the repository
+git clone https://huggingface.co/spaces/Luigi/ZeroGPU-LLM-Inference
+cd ZeroGPU-LLM-Inference
+# Install dependencies
+pip install -r requirements.txt
+# Run the app
+python app.py
+```
+## 🎨 UI Design Philosophy
+The interface follows these principles:
+1. **Simplicity First** - Core features immediately visible
+2. **Progressive Disclosure** - Advanced options hidden but accessible
+3. **Visual Hierarchy** - Clear organization with groups and sections
+4. **Feedback** - Status indicators and helpful messages
+5. **Accessibility** - Responsive, keyboard-friendly, with tooltips
+## 🔧 Customization
+### Adding New Models
+Edit the `MODELS` dictionary in `app.py`:
+```python
+MODELS = {
+    # ... existing entries ...
+    "Your-Model-Name": {
+        "repo_id": "org/model-name",
+        "description": "Model description",
+        "params_b": 7.0,  # Size in billions of parameters
+    },
+}
+```
+### Modifying UI Theme
+Adjust theme parameters in `gr.Blocks()`:
+```python
+import gradio as gr
+with gr.Blocks(
+    theme=gr.themes.Soft(
+        primary_hue="indigo",
+        secondary_hue="purple",
+        # ... more options
+    )
+) as demo:
+    ...  # UI components go here
+```
+## 📊 Performance
+- **Token streaming** for responsive feel
+- **Background search** doesn't block UI
+- **Efficient memory** management with cache clearing
+- **ZeroGPU acceleration** for fast inference
+- **Optimized loading** with dtype fallbacks
+## 🤝 Contributing
+Contributions welcome! Areas for improvement:
+- Additional model integrations
+- UI/UX enhancements
+- Performance optimizations
+- Bug fixes and testing
+- Documentation improvements
+## 📝 License
+Apache 2.0 - See LICENSE file for details
+## 🙏 Acknowledgments
+- Built with [Gradio](https://gradio.app)
+- Powered by [Hugging Face Transformers](https://huggingface.co/transformers)
+- Uses [ZeroGPU](https://huggingface.co/zero-gpu-explorers) for acceleration
+- Search via [DuckDuckGo](https://duckduckgo.com)
+---
+**Made with ❤️ for the open source community**

README_OLD.md ADDED

	@@ -0,0 +1,80 @@

+---
+title: ZeroGPU-LLM-Inference
+emoji: 🧠
+colorFrom: pink
+colorTo: purple
+sdk: gradio
+sdk_version: 5.49.1
+app_file: app.py
+pinned: false
+license: apache-2.0
+short_description: Streaming LLM chat with web search and debug
+---
+This Gradio app provides **token-streaming, chat-style inference** on a wide variety of Transformer models—leveraging ZeroGPU for free GPU acceleration on HF Spaces.
+Key features:
+- **Real-time DuckDuckGo web search** (background thread, configurable timeout) with results injected into the system prompt.
+- **Prompt preview panel** for debugging and prompt-engineering insights—see exactly what’s sent to the model.
+- **Thought vs. Answer streaming**: any `<think>…</think>` blocks emitted by the model are shown as separate “💭 Thought.”
+- **Cancel button** to immediately stop generation.
+- **Dynamic system prompt**: automatically inserts today’s date when you toggle web search.
+- **Extensive model selection**: over 30 LLMs (from Phi-4 mini to Qwen3-14B, SmolLM2, Taiwan-ELM, Mistral, Meta-Llama, MiMo, Gemma, DeepSeek-R1, etc.).
+- **Memory-safe design**: loads one model at a time, clears cache after each generation.
+- **Customizable generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty.
+- **Web-search settings**: max results, max chars per result, search timeout.
+- **Requirements pinned** to ensure reproducible deployment.
+## 🔄 Supported Models
+Use the dropdown to select any of these:
+| Name                                  | Repo ID                                            |
+| ------------------------------------- | -------------------------------------------------- |
+| Taiwan-ELM-1_1B-Instruct              | liswei/Taiwan-ELM-1_1B-Instruct                    |
+| Taiwan-ELM-270M-Instruct              | liswei/Taiwan-ELM-270M-Instruct                    |
+| Qwen3-0.6B                            | Qwen/Qwen3-0.6B                                    |
+| Qwen3-1.7B                            | Qwen/Qwen3-1.7B                                    |
+| Qwen3-4B                              | Qwen/Qwen3-4B                                      |
+| Qwen3-8B                              | Qwen/Qwen3-8B                                      |
+| Qwen3-14B                             | Qwen/Qwen3-14B                                     |
+| Gemma-3-4B-IT                         | unsloth/gemma-3-4b-it                              |
+| SmolLM2-135M-Instruct-TaiwanChat      | Luigi/SmolLM2-135M-Instruct-TaiwanChat             |
+| SmolLM2-135M-Instruct                 | HuggingFaceTB/SmolLM2-135M-Instruct                |
+| SmolLM2-360M-Instruct-TaiwanChat      | Luigi/SmolLM2-360M-Instruct-TaiwanChat             |
+| Llama-3.2-Taiwan-3B-Instruct          | lianghsun/Llama-3.2-Taiwan-3B-Instruct             |
+| MiniCPM3-4B                           | openbmb/MiniCPM3-4B                                |
+| Qwen2.5-3B-Instruct                   | Qwen/Qwen2.5-3B-Instruct                           |
+| Qwen2.5-7B-Instruct                   | Qwen/Qwen2.5-7B-Instruct                           |
+| Phi-4-mini-Reasoning                  | microsoft/Phi-4-mini-reasoning                     |
+| Phi-4-mini-Instruct                   | microsoft/Phi-4-mini-instruct                      |
+| Meta-Llama-3.1-8B-Instruct            | MaziyarPanahi/Meta-Llama-3.1-8B-Instruct            |
+| DeepSeek-R1-Distill-Llama-8B          | unsloth/DeepSeek-R1-Distill-Llama-8B               |
+| Mistral-7B-Instruct-v0.3              | MaziyarPanahi/Mistral-7B-Instruct-v0.3              |
+| Qwen2.5-Coder-7B-Instruct             | Qwen/Qwen2.5-Coder-7B-Instruct                     |
+| Qwen2.5-Omni-3B                       | Qwen/Qwen2.5-Omni-3B                               |
+| MiMo-7B-RL                            | XiaomiMiMo/MiMo-7B-RL                              |
+*(…and more can easily be added in `MODELS` in `app.py`.)*
+## ⚙️ Generation & Search Parameters
+- **Max Tokens**: 64–16384
+- **Temperature**: 0.1–2.0
+- **Top-K**: 1–100
+- **Top-P**: 0.1–1.0
+- **Repetition Penalty**: 1.0–2.0
+- **Enable Web Search**: on/off
+- **Max Results**: integer
+- **Max Chars/Result**: integer
+- **Search Timeout (s)**: 0.0–30.0
+## 🚀 How It Works
+1. **User message** enters chat history.
+2. If search is enabled, a background DuckDuckGo thread fetches snippets.
+3. After up to *Search Timeout* seconds, snippets merge into the system prompt.
+4. The selected model pipeline is loaded (bf16→f16→f32 fallback) on ZeroGPU.
+5. Prompt is formatted—any `<think>…</think>` blocks will be streamed as separate “💭 Thought.”
+6. Tokens stream to the Chatbot UI. Press **Cancel** to stop mid-generation.

USER_GUIDE.md ADDED

	@@ -0,0 +1,300 @@

+# 📖 User Guide - ZeroGPU LLM Inference
+## Quick Start (5 Minutes)
+### 1. Choose Your Model
+The model dropdown shows 30+ options organized by size:
+- **Compact (<2B)**: Fast, lightweight - great for quick responses
+- **Mid-size (2-8B)**: Best balance of speed and quality
+- **Large (14B+)**: Highest quality, slower but more capable
+**Recommendation for beginners**: Start with `Qwen3-4B-Instruct-2507`
+### 2. Try an Example Prompt
+Click on any example below the chat box to get started:
+- "Explain quantum computing in simple terms"
+- "Write a Python function..."
+- "What are the latest developments..." (requires web search)
+### 3. Start Chatting!
+Type your message and press Enter or click "📤 Send"
+## Core Features
+### 💬 Chat Interface
+The main chat area shows:
+- Your messages on one side
+- AI responses with a 🤖 avatar
+- Copy button on each message
+- Smooth streaming as tokens generate
+**Tips:**
+- Press Enter to send (Shift+Enter for new line)
+- Click Copy button to save responses
+- Scroll up to review history
+- Use Clear Chat to start fresh
+### 🤖 Model Selection
+**When to use each size:**
+| Model Size | Best For | Speed | Quality |
+|------------|----------|-------|---------|
+| <2B | Quick questions, testing | ⚡⚡⚡ | ⭐⭐ |
+| 2-8B | General chat, coding help | ⚡⚡ | ⭐⭐⭐ |
+| 14B+ | Complex reasoning, long-form | ⚡ | ⭐⭐⭐⭐ |
+**Specialized Models:**
+- **Phi-4-mini-Reasoning**: Math, logic problems
+- **Qwen2.5-Coder**: Programming tasks
+- **DeepSeek-R1-Distill**: Step-by-step reasoning
+- **Apriel-1.5-15b-Thinker**: Multimodal understanding
+### 🔍 Web Search
+Enable this when you need:
+- Current events and news
+- Recent information (after model training cutoff)
+- Facts that change frequently
+- Real-time data
+**How it works:**
+1. Toggle "🔍 Enable Web Search"
+2. Web search settings accordion appears
+3. System prompt updates automatically
+4. Search runs in background (won't block chat)
+5. Results injected into context
+**Settings explained:**
+- **Max Results**: How many search results to fetch (4 is good default)
+- **Max Chars/Result**: Limit length per result (50 prevents overwhelming context)
+- **Search Timeout**: Maximum wait time (5s recommended)
+### 📝 System Prompt
+This defines the AI's personality and behavior.
+**Default prompts:**
+- Without search: Helpful, creative assistant
+- With search: Includes search results and current date
+**Customization ideas:**
+```
+You are a professional code reviewer...
+You are a creative writing coach...
+You are a patient tutor explaining concepts simply...
+You are a technical documentation writer...
+```
+## Advanced Features
+### 🎛️ Advanced Generation Parameters
+Click the accordion to reveal these controls:
+#### Max Tokens (64-16384)
+- **What it does**: Sets maximum response length
+- **Lower (256-512)**: Quick, concise answers
+- **Medium (1024)**: Balanced (default)
+- **Higher (2048+)**: Long-form content, detailed explanations
+#### Temperature (0.1-2.0)
+- **What it does**: Controls randomness/creativity
+- **Low (0.1-0.3)**: Focused, near-deterministic (good for facts, code)
+- **Medium (0.7)**: Balanced creativity (default)
+- **High (1.2-2.0)**: Very creative, unpredictable (stories, brainstorming)
+#### Top-K (1-100)
+- **What it does**: Limits token choices to the K most likely tokens
+- **Lower (10-20)**: More focused
+- **Medium (40)**: Balanced (default)
+- **Higher (80-100)**: More varied vocabulary
+#### Top-P (0.1-1.0)
+- **What it does**: Nucleus sampling; samples only from the smallest set of tokens whose cumulative probability reaches P
+- **Lower (0.5-0.7)**: Conservative choices
+- **Medium (0.9)**: Balanced (default)
+- **Higher (0.95-1.0)**: Full vocabulary range
+#### Repetition Penalty (1.0-2.0)
+- **What it does**: Reduces repeated words/phrases
+- **Low (1.0-1.1)**: Allows some repetition
+- **Medium (1.2)**: Balanced (default)
+- **High (1.5+)**: Strongly avoids repetition (may hurt coherence)
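The interaction between Temperature, Top-K, and Top-P is easier to see on a toy distribution. The sketch below is a simplified illustration in plain Python, not the sampling code used by the app or by any specific library; the function names are hypothetical, and real implementations differ in details such as tie-breaking and the order in which filters are applied.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k, top_p):
    """Return the token indices that survive top-k, then top-p, filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]  # keep only the K most likely tokens
    mass = sum(probs[i] for i in ranked)  # renormalize over the survivors
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:  # stop once the nucleus covers top_p
            break
    return kept
```

For example, with probabilities `[0.5, 0.3, 0.15, 0.05]`, `top_k=3`, and `top_p=0.8`, only the first two tokens remain eligible for sampling, which is why lowering either setting makes output more focused.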
+### Preset Configurations
+**For Creative Writing:**
+```
+Temperature: 1.2
+Top-P: 0.95
+Top-K: 80
+Max Tokens: 2048
+```
+**For Code Generation:**
+```
+Temperature: 0.3
+Top-P: 0.9
+Top-K: 40
+Max Tokens: 1024
+Repetition Penalty: 1.1
+```
+**For Factual Q&A:**
+```
+Temperature: 0.5
+Top-P: 0.85
+Top-K: 30
+Max Tokens: 512
+Enable Web Search: Yes
+```
+**For Reasoning Tasks:**
+```
+Model: Phi-4-mini-Reasoning or DeepSeek-R1
+Temperature: 0.7
+Max Tokens: 2048
+```
+## Tips & Tricks
+### 🎯 Getting Better Results
+1. **Be Specific**: Instead of "Write a Python function to sort a list", try "Write a Python function that sorts a list of dictionaries by a specific key"
+2. **Provide Context**: Instead of "Explain recursion", try "Explain recursion to someone learning programming for the first time, with a simple example"
+3. **Use System Prompts**: Define the role or expertise once in the system prompt instead of repeating it in every message
+4. **Iterate**: Use follow-up questions to refine responses
+5. **Experiment with Models**: Try different models for the same task
+### ⚡ Performance Tips
+1. **Start Small**: Test with smaller models first
+2. **Adjust Max Tokens**: Don't request more than you need
+3. **Use Cancel**: Stop bad generations early
+4. **Clear Chat**: Clear the chat history if you experience slowdowns
+5. **One Task at a Time**: Don't send multiple requests simultaneously
+### 🔍 When to Use Web Search
+**✅ Good use cases:**
+- "What happened in the latest SpaceX launch?"
+- "Current cryptocurrency prices"
+- "Recent AI research papers"
+- "Today's weather in Paris"
+**❌ Don't need search for:**
+- General knowledge questions
+- Code writing/debugging
+- Math problems
+- Creative writing
+- Theoretical explanations
+### 💭 Understanding Thinking Mode
+Some models output `<think>...</think>` blocks:
+```
+<think>
+Let me break this down step by step...
+First, I need to consider...
+</think>
+Here's the answer: ...
+```
+**In the UI:**
+- Thinking shows as "💭 Thought"
+- Answer shows separately
+- Helps you see the reasoning process
+**Best for:**
+- Complex math problems
+- Multi-step reasoning
+- Debugging logic
+- Learning how AI thinks
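The separation of thought and answer described in this section can be sketched with a regular expression over a completed response. This is an assumption about the general approach, not the app's actual implementation: `split_thought_and_answer` is a hypothetical helper, and it does not handle the streaming case where a `<think>` tag is still open.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thought_and_answer(text):
    """Return (thoughts, answer) for a completed model response."""
    thoughts = [t.strip() for t in THINK_PATTERN.findall(text)]
    answer = THINK_PATTERN.sub("", text).strip()
    return thoughts, answer
```

A streaming UI would additionally need to track whether it is currently inside an unclosed `<think>` block and route incoming tokens accordingly.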
+## Troubleshooting
+### Generation is Slow
+- Try a smaller model
+- Reduce Max Tokens
+- Disable web search if not needed
+- Clear chat history
+### Responses are Repetitive
+- Increase Repetition Penalty
+- Reduce Temperature slightly
+- Try a different model
+### Responses are Random/Nonsensical
+- Decrease Temperature
+- Reduce Top-P
+- Reduce Top-K
+- Try a more stable model
+### Web Search Not Working
+- Check that the timeout isn't too short
+- Verify internet connection
+- Try increasing Max Results
+- Check search query in debug panel
+### Cancel Button Doesn't Work
+- Wait a moment (might be processing)
+- Refresh the page if the problem persists
+- Check browser console for errors
+## Keyboard Shortcuts
+- **Enter**: Send message
+- **Shift+Enter**: New line in input
+- **Ctrl+C**: Copy (when text is selected)
+- **Ctrl+A**: Select all in input
+## Best Practices
+### For Beginners
+1. Start with example prompts
+2. Use default settings initially
+3. Try 2-4 different models
+4. Gradually explore advanced settings
+5. Read responses fully before replying
+### For Power Users
+1. Create custom system prompts
+2. Fine-tune parameters per task
+3. Use debug panel for prompt engineering
+4. Experiment with model combinations
+5. Utilize web search strategically
+### For Developers
+1. Study the debug output
+2. Test code generation thoroughly
+3. Use lower temperature for determinism
+4. Compare multiple models
+5. Save working configurations
+## Privacy & Safety
+- **No data collection**: Conversations are not stored permanently
+- **Model limitations**: May produce incorrect information
+- **Verify important info**: Don't rely solely on AI for critical decisions
+- **Web search**: Uses DuckDuckGo (privacy-focused)
+- **Open source**: Code is transparent and auditable
+## Support & Feedback
+Found a bug? Have a suggestion?
+- Check GitHub issues
+- Submit feature requests
+- Contribute improvements
+- Share your use cases
+---
+**Happy chatting! 🎉**