Luigi committed on
Commit
c1bc514
·
1 Parent(s): df40b1d

Add comprehensive documentation and user guide


📚 Documentation Enhancements:
- Complete rewrite of README with modern formatting
- Organized sections for features, models, and technical details
- Added comprehensive USER_GUIDE.md with tutorials
- Quick start guide for beginners
- Advanced configuration guide for power users

📖 README Updates:
- Modern layout with clear sections
- Feature highlights with emojis for easy scanning
- Model categorization by size and purpose
- Technical flow explanation
- Performance and customization info
- Contributing guidelines

📝 User Guide Includes:
- 5-minute quick start tutorial
- Detailed feature explanations
- Advanced parameter guide with use cases
- Preset configurations for common tasks
- Tips & tricks for better results
- Troubleshooting section
- Best practices for different user levels
- Keyboard shortcuts reference

🎯 Content Organization:
- Beginner-friendly introduction
- Progressive complexity
- Practical examples throughout
- Visual tables for quick reference
- Clear explanations of technical concepts

Files changed (3)
  1. README.md +177 -62
  2. README_OLD.md +80 -0
  3. USER_GUIDE.md +300 -0
README.md CHANGED
@@ -1,80 +1,195 @@
  ---
  title: ZeroGPU-LLM-Inference
  emoji: 🧠
- colorFrom: pink
  colorTo: purple
  sdk: gradio
  sdk_version: 5.49.1
  app_file: app.py
  pinned: false
  license: apache-2.0
- short_description: Streaming LLM chat with web search and debug
  ---

- This Gradio app provides **token-streaming, chat-style inference** on a wide variety of Transformer models, leveraging ZeroGPU for free GPU acceleration on HF Spaces.

- Key features:
- - **Real-time DuckDuckGo web search** (background thread, configurable timeout) with results injected into the system prompt.
- - **Prompt preview panel** for debugging and prompt-engineering insights: see exactly what's sent to the model.
- - **Thought vs. Answer streaming**: any `<think>…</think>` blocks emitted by the model are shown as a separate “💭 Thought.”
- - **Cancel button** to immediately stop generation.
- - **Dynamic system prompt**: automatically inserts today's date when you toggle web search.
- - **Extensive model selection**: over 30 LLMs (from Phi-4 mini to Qwen3-14B, SmolLM2, Taiwan-ELM, Mistral, Meta-Llama, MiMo, Gemma, DeepSeek-R1, etc.).
- - **Memory-safe design**: loads one model at a time, clears cache after each generation.
- - **Customizable generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty.
- - **Web-search settings**: max results, max chars per result, search timeout.
- - **Requirements pinned** to ensure reproducible deployment.

  ## 🔄 Supported Models

- Use the dropdown to select any of these:
-
- | Name | Repo ID |
- | ------------------------------------- | -------------------------------------------------- |
- | Taiwan-ELM-1_1B-Instruct | liswei/Taiwan-ELM-1_1B-Instruct |
- | Taiwan-ELM-270M-Instruct | liswei/Taiwan-ELM-270M-Instruct |
- | Qwen3-0.6B | Qwen/Qwen3-0.6B |
- | Qwen3-1.7B | Qwen/Qwen3-1.7B |
- | Qwen3-4B | Qwen/Qwen3-4B |
- | Qwen3-8B | Qwen/Qwen3-8B |
- | Qwen3-14B | Qwen/Qwen3-14B |
- | Gemma-3-4B-IT | unsloth/gemma-3-4b-it |
- | SmolLM2-135M-Instruct-TaiwanChat | Luigi/SmolLM2-135M-Instruct-TaiwanChat |
- | SmolLM2-135M-Instruct | HuggingFaceTB/SmolLM2-135M-Instruct |
- | SmolLM2-360M-Instruct-TaiwanChat | Luigi/SmolLM2-360M-Instruct-TaiwanChat |
- | Llama-3.2-Taiwan-3B-Instruct | lianghsun/Llama-3.2-Taiwan-3B-Instruct |
- | MiniCPM3-4B | openbmb/MiniCPM3-4B |
- | Qwen2.5-3B-Instruct | Qwen/Qwen2.5-3B-Instruct |
- | Qwen2.5-7B-Instruct | Qwen/Qwen2.5-7B-Instruct |
- | Phi-4-mini-Reasoning | microsoft/Phi-4-mini-reasoning |
- | Phi-4-mini-Instruct | microsoft/Phi-4-mini-instruct |
- | Meta-Llama-3.1-8B-Instruct | MaziyarPanahi/Meta-Llama-3.1-8B-Instruct |
- | DeepSeek-R1-Distill-Llama-8B | unsloth/DeepSeek-R1-Distill-Llama-8B |
- | Mistral-7B-Instruct-v0.3 | MaziyarPanahi/Mistral-7B-Instruct-v0.3 |
- | Qwen2.5-Coder-7B-Instruct | Qwen/Qwen2.5-Coder-7B-Instruct |
- | Qwen2.5-Omni-3B | Qwen/Qwen2.5-Omni-3B |
- | MiMo-7B-RL | XiaomiMiMo/MiMo-7B-RL |
-
- *(…and more can easily be added in `MODELS` in `app.py`.)*
-
- ## ⚙️ Generation & Search Parameters
-
- - **Max Tokens**: 64–16384
- - **Temperature**: 0.1–2.0
- - **Top-K**: 1–100
- - **Top-P**: 0.1–1.0
- - **Repetition Penalty**: 1.0–2.0
-
- - **Enable Web Search**: on/off
- - **Max Results**: integer
- - **Max Chars/Result**: integer
- - **Search Timeout (s)**: 0.0–30.0

  ## 🚀 How It Works

- 1. **User message** enters chat history.
- 2. If search is enabled, a background DuckDuckGo thread fetches snippets.
- 3. After up to *Search Timeout* seconds, snippets merge into the system prompt.
- 4. The selected model pipeline is loaded (bf16→f16→f32 fallback) on ZeroGPU.
- 5. Prompt is formatted; any `<think>…</think>` blocks will be streamed as a separate “💭 Thought.”
- 6. Tokens stream to the Chatbot UI. Press **Cancel** to stop mid-generation.
  ---
  title: ZeroGPU-LLM-Inference
  emoji: 🧠
+ colorFrom: indigo
  colorTo: purple
  sdk: gradio
  sdk_version: 5.49.1
  app_file: app.py
  pinned: false
  license: apache-2.0
+ short_description: Streaming LLM chat with web search and controls
  ---

+ # 🧠 ZeroGPU LLM Inference

+ A modern, user-friendly Gradio interface for **token-streaming, chat-style inference** across a wide variety of Transformer models, powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.
+
+ ## ✨ Key Features
+
+ ### 🎨 Modern UI/UX
+ - **Clean, intuitive interface** with organized layout and visual hierarchy
+ - **Collapsible advanced settings** for both simple and power users
+ - **Smooth animations and transitions** for better user experience
+ - **Responsive design** that works on all screen sizes
+ - **Copy-to-clipboard** functionality for easy sharing of responses
+
+ ### 🔍 Web Search Integration
+ - **Real-time DuckDuckGo search** with background threading
+ - **Configurable timeout** and result limits
+ - **Automatic context injection** into system prompts
+ - **Smart toggle** - search settings auto-hide when disabled
+
+ ### 💡 Smart Features
+ - **Thought vs. Answer streaming**: `<think>…</think>` blocks shown separately as "💭 Thought"
+ - **Working cancel button** - immediately stops generation without errors
+ - **Debug panel** for prompt engineering insights
+ - **Duration estimates** based on model size and settings
+ - **Example prompts** to help users get started
+ - **Dynamic system prompts** with automatic date insertion
+
+ ### 🎯 Model Variety
+ - **30+ LLM options** from leading providers (Qwen, Microsoft, Meta, Mistral, etc.)
+ - Models ranging from **135M to 32B+** parameters
+ - Specialized models for **reasoning, coding, and general chat**
+ - **Efficient model loading** - one at a time with automatic cache clearing
+
+ ### ⚙️ Advanced Controls
+ - **Generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty
+ - **Web search settings**: max results, chars per result, timeout
+ - **Custom system prompts** with dynamic date insertion
+ - **Organized in collapsible sections** to keep the interface clean

  ## 🔄 Supported Models

+ ### Compact Models (< 2B)
+ - **SmolLM2-135M-Instruct** - Tiny but capable
+ - **SmolLM2-360M-Instruct** - Lightweight conversation
+ - **Taiwan-ELM-270M/1.1B** - Multilingual support
+ - **Qwen3-0.6B/1.7B** - Fast inference
+
+ ### Mid-Size Models (2B-8B)
+ - **Qwen3-4B/8B** - Balanced performance
+ - **Phi-4-mini** (4.3B) - Reasoning & Instruct variants
+ - **MiniCPM3-4B** - Efficient mid-size
+ - **Gemma-3-4B-IT** - Instruction-tuned
+ - **Llama-3.2-Taiwan-3B** - Regional optimization
+ - **Mistral-7B-Instruct** - Classic performer
+ - **DeepSeek-R1-Distill-Llama-8B** - Reasoning specialist
+
+ ### Large Models (14B+)
+ - **Qwen3-14B** - Strong general purpose
+ - **Apriel-1.5-15b-Thinker** - Multimodal reasoning
+ - **gpt-oss-20b** - Open GPT-style
+ - **Qwen3-32B** - Top-tier performance

  ## 🚀 How It Works

+ 1. **Select Model** - Choose from 30+ pre-configured models
+ 2. **Configure Settings** - Adjust generation parameters or use defaults
+ 3. **Enable Web Search** (optional) - Get real-time information
+ 4. **Start Chatting** - Type your message or use example prompts
+ 5. **Stream Response** - Watch as tokens are generated in real-time
+ 6. **Cancel Anytime** - Stop generation mid-stream if needed
+
+ ### Technical Flow
+
+ 1. User message enters chat history
+ 2. If search is enabled, a background thread fetches DuckDuckGo results
+ 3. Search snippets merge into the system prompt (within the timeout limit)
+ 4. Selected model pipeline loads on ZeroGPU (bf16→f16→f32 fallback)
+ 5. Prompt is formatted with thinking-mode detection
+ 6. Tokens stream to the UI with thought/answer separation
+ 7. Cancel button available for immediate interruption
+ 8. Memory cleared after generation for the next request (see the sketch below)
+
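For illustration, here is a minimal sketch of steps 4-8: dtype-fallback loading plus threaded token streaming with `transformers`. The helper names (`load_model`, `stream_reply`) are hypothetical and not the actual `app.py` code:

```python
# Illustrative sketch of the load-and-stream loop (hypothetical helpers).
import threading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer


def load_model(repo_id: str):
    """Try bf16, then f16, then f32, mirroring the fallback described above."""
    for dtype in (torch.bfloat16, torch.float16, torch.float32):
        try:
            return AutoModelForCausalLM.from_pretrained(
                repo_id, torch_dtype=dtype, device_map="auto"
            )
        except (RuntimeError, ValueError):
            continue  # fall through to the next dtype
    raise RuntimeError(f"Could not load {repo_id} in any supported dtype")


def stream_reply(repo_id: str, prompt: str, max_new_tokens: int = 1024):
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = load_model(repo_id)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generation runs in a thread so tokens can be consumed as they arrive.
    thread = threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=max_new_tokens, streamer=streamer),
    )
    thread.start()
    for token_text in streamer:  # yields decoded text chunks during generation
        yield token_text
    thread.join()
    torch.cuda.empty_cache()  # step 8: clear memory for the next request
```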
+ ## ⚙️ Generation Parameters
+
+ | Parameter | Range | Default | Description |
+ |-----------|-------|---------|-------------|
+ | Max Tokens | 64-16384 | 1024 | Maximum response length |
+ | Temperature | 0.1-2.0 | 0.7 | Creativity vs. focus |
+ | Top-K | 1-100 | 40 | Token sampling pool size |
+ | Top-P | 0.1-1.0 | 0.9 | Nucleus sampling threshold |
+ | Repetition Penalty | 1.0-2.0 | 1.2 | Reduces repetition |
+
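As a rough illustration, the table above corresponds to these standard `transformers` sampling kwargs (a sketch with the listed defaults, not necessarily the app's exact wiring):

```python
# The parameter table expressed as generation kwargs.
generation_kwargs = dict(
    max_new_tokens=1024,     # Max Tokens
    temperature=0.7,         # Temperature
    top_k=40,                # Top-K
    top_p=0.9,               # Top-P
    repetition_penalty=1.2,  # Repetition Penalty
    do_sample=True,          # sampling must be on for the knobs above to apply
)
# usage: model.generate(**inputs, **generation_kwargs)
```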
+ ## 🌐 Web Search Settings
+
+ | Setting | Range | Default | Description |
+ |---------|-------|---------|-------------|
+ | Max Results | Integer | 4 | Number of search results |
+ | Max Chars/Result | Integer | 50 | Character limit per result |
+ | Search Timeout | 0-30 s | 5 s | Maximum wait time |
+
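A minimal sketch of how such a timeout-bounded background search can work, assuming the `duckduckgo_search` package; `search_snippets` is a hypothetical helper, not necessarily the app's implementation:

```python
# Timeout-bounded background search (illustrative sketch).
import threading

from duckduckgo_search import DDGS


def search_snippets(query: str, max_results: int = 4, max_chars: int = 50,
                    timeout_s: float = 5.0) -> list[str]:
    results: list[str] = []

    def worker():
        with DDGS() as ddgs:
            for hit in ddgs.text(query, max_results=max_results):
                results.append(hit.get("body", "")[:max_chars])

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # give up after the configured timeout
    return results  # whatever arrived in time gets merged into the system prompt
```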
+ ## 💻 Local Development
+
+ ```bash
+ # Clone the repository
+ git clone https://huggingface.co/spaces/Luigi/ZeroGPU-LLM-Inference
+ cd ZeroGPU-LLM-Inference
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the app
+ python app.py
+ ```
+
+ ## 🎨 UI Design Philosophy
+
+ The interface follows these principles:
+
+ 1. **Simplicity First** - Core features immediately visible
+ 2. **Progressive Disclosure** - Advanced options hidden but accessible
+ 3. **Visual Hierarchy** - Clear organization with groups and sections
+ 4. **Feedback** - Status indicators and helpful messages
+ 5. **Accessibility** - Responsive, keyboard-friendly, with tooltips
+
+ ## 🔧 Customization
+
+ ### Adding New Models
+
+ Edit the `MODELS` dictionary in `app.py`:
+
+ ```python
+ "Your-Model-Name": {
+     "repo_id": "org/model-name",
+     "description": "Model description",
+     "params_b": 7.0  # Size in billions
+ }
+ ```
+
+ ### Modifying the UI Theme
+
+ Adjust theme parameters in `gr.Blocks()`:
+
+ ```python
+ theme=gr.themes.Soft(
+     primary_hue="indigo",
+     secondary_hue="purple",
+     # ... more options
+ )
+ ```
+
+ ## 📊 Performance
+
+ - **Token streaming** for a responsive feel
+ - **Background search** doesn't block the UI
+ - **Efficient memory management** with cache clearing
+ - **ZeroGPU acceleration** for fast inference
+ - **Optimized loading** with dtype fallbacks
+
+ ## 🤝 Contributing
+
+ Contributions welcome! Areas for improvement:
+
+ - Additional model integrations
+ - UI/UX enhancements
+ - Performance optimizations
+ - Bug fixes and testing
+ - Documentation improvements
+
+ ## 📝 License
+
+ Apache 2.0 - See the LICENSE file for details
+
+ ## 🙏 Acknowledgments
+
+ - Built with [Gradio](https://gradio.app)
+ - Powered by [Hugging Face Transformers](https://huggingface.co/transformers)
+ - Uses [ZeroGPU](https://huggingface.co/zero-gpu-explorers) for acceleration
+ - Search via [DuckDuckGo](https://duckduckgo.com)
+
+ ---
+
+ **Made with ❤️ for the open source community**
README_OLD.md ADDED
@@ -0,0 +1,80 @@
+ ---
+ title: ZeroGPU-LLM-Inference
+ emoji: 🧠
+ colorFrom: pink
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.49.1
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ short_description: Streaming LLM chat with web search and debug
+ ---
+
+ This Gradio app provides **token-streaming, chat-style inference** on a wide variety of Transformer models, leveraging ZeroGPU for free GPU acceleration on HF Spaces.
+
+ Key features:
+ - **Real-time DuckDuckGo web search** (background thread, configurable timeout) with results injected into the system prompt.
+ - **Prompt preview panel** for debugging and prompt-engineering insights: see exactly what's sent to the model.
+ - **Thought vs. Answer streaming**: any `<think>…</think>` blocks emitted by the model are shown as a separate “💭 Thought.”
+ - **Cancel button** to immediately stop generation.
+ - **Dynamic system prompt**: automatically inserts today's date when you toggle web search.
+ - **Extensive model selection**: over 30 LLMs (from Phi-4 mini to Qwen3-14B, SmolLM2, Taiwan-ELM, Mistral, Meta-Llama, MiMo, Gemma, DeepSeek-R1, etc.).
+ - **Memory-safe design**: loads one model at a time, clears cache after each generation.
+ - **Customizable generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty.
+ - **Web-search settings**: max results, max chars per result, search timeout.
+ - **Requirements pinned** to ensure reproducible deployment.
+
+ ## 🔄 Supported Models
+
+ Use the dropdown to select any of these:
+
+ | Name | Repo ID |
+ | ------------------------------------- | -------------------------------------------------- |
+ | Taiwan-ELM-1_1B-Instruct | liswei/Taiwan-ELM-1_1B-Instruct |
+ | Taiwan-ELM-270M-Instruct | liswei/Taiwan-ELM-270M-Instruct |
+ | Qwen3-0.6B | Qwen/Qwen3-0.6B |
+ | Qwen3-1.7B | Qwen/Qwen3-1.7B |
+ | Qwen3-4B | Qwen/Qwen3-4B |
+ | Qwen3-8B | Qwen/Qwen3-8B |
+ | Qwen3-14B | Qwen/Qwen3-14B |
+ | Gemma-3-4B-IT | unsloth/gemma-3-4b-it |
+ | SmolLM2-135M-Instruct-TaiwanChat | Luigi/SmolLM2-135M-Instruct-TaiwanChat |
+ | SmolLM2-135M-Instruct | HuggingFaceTB/SmolLM2-135M-Instruct |
+ | SmolLM2-360M-Instruct-TaiwanChat | Luigi/SmolLM2-360M-Instruct-TaiwanChat |
+ | Llama-3.2-Taiwan-3B-Instruct | lianghsun/Llama-3.2-Taiwan-3B-Instruct |
+ | MiniCPM3-4B | openbmb/MiniCPM3-4B |
+ | Qwen2.5-3B-Instruct | Qwen/Qwen2.5-3B-Instruct |
+ | Qwen2.5-7B-Instruct | Qwen/Qwen2.5-7B-Instruct |
+ | Phi-4-mini-Reasoning | microsoft/Phi-4-mini-reasoning |
+ | Phi-4-mini-Instruct | microsoft/Phi-4-mini-instruct |
+ | Meta-Llama-3.1-8B-Instruct | MaziyarPanahi/Meta-Llama-3.1-8B-Instruct |
+ | DeepSeek-R1-Distill-Llama-8B | unsloth/DeepSeek-R1-Distill-Llama-8B |
+ | Mistral-7B-Instruct-v0.3 | MaziyarPanahi/Mistral-7B-Instruct-v0.3 |
+ | Qwen2.5-Coder-7B-Instruct | Qwen/Qwen2.5-Coder-7B-Instruct |
+ | Qwen2.5-Omni-3B | Qwen/Qwen2.5-Omni-3B |
+ | MiMo-7B-RL | XiaomiMiMo/MiMo-7B-RL |
+
+ *(…and more can easily be added in `MODELS` in `app.py`.)*
+
+ ## ⚙️ Generation & Search Parameters
+
+ - **Max Tokens**: 64–16384
+ - **Temperature**: 0.1–2.0
+ - **Top-K**: 1–100
+ - **Top-P**: 0.1–1.0
+ - **Repetition Penalty**: 1.0–2.0
+
+ - **Enable Web Search**: on/off
+ - **Max Results**: integer
+ - **Max Chars/Result**: integer
+ - **Search Timeout (s)**: 0.0–30.0
+
+ ## 🚀 How It Works
+
+ 1. **User message** enters chat history.
+ 2. If search is enabled, a background DuckDuckGo thread fetches snippets.
+ 3. After up to *Search Timeout* seconds, snippets merge into the system prompt.
+ 4. The selected model pipeline is loaded (bf16→f16→f32 fallback) on ZeroGPU.
+ 5. Prompt is formatted; any `<think>…</think>` blocks will be streamed as a separate “💭 Thought.”
+ 6. Tokens stream to the Chatbot UI. Press **Cancel** to stop mid-generation.
USER_GUIDE.md ADDED
@@ -0,0 +1,300 @@
+ # 📖 User Guide - ZeroGPU LLM Inference
+
+ ## Quick Start (5 Minutes)
+
+ ### 1. Choose Your Model
+ The model dropdown shows 30+ options organized by size:
+ - **Compact (<2B)**: Fast, lightweight - great for quick responses
+ - **Mid-size (2-8B)**: Best balance of speed and quality
+ - **Large (14B+)**: Highest quality, slower but more capable
+
+ **Recommendation for beginners**: Start with `Qwen3-4B-Instruct-2507`
+
+ ### 2. Try an Example Prompt
+ Click on any example below the chat box to get started:
+ - "Explain quantum computing in simple terms"
+ - "Write a Python function..."
+ - "What are the latest developments..." (requires web search)
+
+ ### 3. Start Chatting!
+ Type your message and press Enter or click "📤 Send"
+
+ ## Core Features
+
+ ### 💬 Chat Interface
+
+ The main chat area shows:
+ - Your messages on one side
+ - AI responses with a 🤖 avatar
+ - A copy button on each message
+ - Smooth streaming as tokens generate
+
+ **Tips:**
+ - Press Enter to send (Shift+Enter for a new line)
+ - Click the Copy button to save responses
+ - Scroll up to review history
+ - Use Clear Chat to start fresh
+
+ ### 🤖 Model Selection
+
+ **When to use each size:**
+
+ | Model Size | Best For | Speed | Quality |
+ |------------|----------|-------|---------|
+ | <2B | Quick questions, testing | ⚡⚡⚡ | ⭐⭐ |
+ | 2-8B | General chat, coding help | ⚡⚡ | ⭐⭐⭐ |
+ | 14B+ | Complex reasoning, long-form | ⚡ | ⭐⭐⭐⭐ |
+
+ **Specialized Models:**
+ - **Phi-4-mini-Reasoning**: Math, logic problems
+ - **Qwen2.5-Coder**: Programming tasks
+ - **DeepSeek-R1-Distill**: Step-by-step reasoning
+ - **Apriel-1.5-15b-Thinker**: Multimodal understanding
+
+ ### 🔍 Web Search
+
+ Enable this when you need:
+ - Current events and news
+ - Recent information (after the model's training cutoff)
+ - Facts that change frequently
+ - Real-time data
+
+ **How it works:**
+ 1. Toggle "🔍 Enable Web Search"
+ 2. The web search settings accordion appears
+ 3. The system prompt updates automatically
+ 4. Search runs in the background (won't block chat)
+ 5. Results are injected into the context
+
+ **Settings explained:**
+ - **Max Results**: How many search results to fetch (4 is a good default)
+ - **Max Chars/Result**: Length limit per result (50 prevents overwhelming the context)
+ - **Search Timeout**: Maximum wait time (5 s recommended)
+
+ ### 📝 System Prompt
+
+ This defines the AI's personality and behavior.
+
+ **Default prompts:**
+ - Without search: Helpful, creative assistant
+ - With search: Includes search results and the current date
+
+ **Customization ideas** (see the sketch below):
+ ```
+ You are a professional code reviewer...
+ You are a creative writing coach...
+ You are a patient tutor explaining concepts simply...
+ You are a technical documentation writer...
+ ```
+
+
90
+ ## Advanced Features
91
+
92
+ ### πŸŽ›οΈ Advanced Generation Parameters
93
+
94
+ Click the accordion to reveal these controls:
95
+
96
+ #### Max Tokens (64-16384)
97
+ - **What it does**: Sets maximum response length
98
+ - **Lower (256-512)**: Quick, concise answers
99
+ - **Medium (1024)**: Balanced (default)
100
+ - **Higher (2048+)**: Long-form content, detailed explanations
101
+
102
+ #### Temperature (0.1-2.0)
103
+ - **What it does**: Controls randomness/creativity
104
+ - **Low (0.1-0.3)**: Focused, deterministic (good for facts, code)
105
+ - **Medium (0.7)**: Balanced creativity (default)
106
+ - **High (1.2-2.0)**: Very creative, unpredictable (stories, brainstorming)
107
+
108
+ #### Top-K (1-100)
109
+ - **What it does**: Limits token choices to top K most likely
110
+ - **Lower (10-20)**: More focused
111
+ - **Medium (40)**: Balanced (default)
112
+ - **Higher (80-100)**: More varied vocabulary
113
+
114
+ #### Top-P (0.1-1.0)
115
+ - **What it does**: Nucleus sampling threshold
116
+ - **Lower (0.5-0.7)**: Conservative choices
117
+ - **Medium (0.9)**: Balanced (default)
118
+ - **Higher (0.95-1.0)**: Full vocabulary range
119
+
120
+ #### Repetition Penalty (1.0-2.0)
121
+ - **What it does**: Reduces repeated words/phrases
122
+ - **Low (1.0-1.1)**: Allows some repetition
123
+ - **Medium (1.2)**: Balanced (default)
124
+ - **High (1.5+)**: Strongly avoids repetition (may hurt coherence)
125
+
126
+ ### Preset Configurations
127
+
128
+ **For Creative Writing:**
129
+ ```
130
+ Temperature: 1.2
131
+ Top-P: 0.95
132
+ Top-K: 80
133
+ Max Tokens: 2048
134
+ ```
135
+
136
+ **For Code Generation:**
137
+ ```
138
+ Temperature: 0.3
139
+ Top-P: 0.9
140
+ Top-K: 40
141
+ Max Tokens: 1024
142
+ Repetition Penalty: 1.1
143
+ ```
144
+
145
+ **For Factual Q&A:**
146
+ ```
147
+ Temperature: 0.5
148
+ Top-P: 0.85
149
+ Top-K: 30
150
+ Max Tokens: 512
151
+ Enable Web Search: Yes
152
+ ```
153
+
154
+ **For Reasoning Tasks:**
155
+ ```
156
+ Model: Phi-4-mini-Reasoning or DeepSeek-R1
157
+ Temperature: 0.7
158
+ Max Tokens: 2048
159
+ ```
160
+
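For scripted use, the presets above can also be expressed as plain dictionaries that map onto standard `transformers` generation kwargs (a hypothetical convenience, not something the app ships):

```python
# The presets above as dictionaries keyed by task.
PRESETS = {
    "creative_writing": {"temperature": 1.2, "top_p": 0.95, "top_k": 80,
                         "max_new_tokens": 2048},
    "code_generation":  {"temperature": 0.3, "top_p": 0.9, "top_k": 40,
                         "max_new_tokens": 1024, "repetition_penalty": 1.1},
    "factual_qa":       {"temperature": 0.5, "top_p": 0.85, "top_k": 30,
                         "max_new_tokens": 512},   # pair with web search enabled
    "reasoning":        {"temperature": 0.7, "max_new_tokens": 2048},
}
```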
+ ## Tips & Tricks
+
+ ### 🎯 Getting Better Results
+
+ 1. **Be Specific**: "Write a Python function to sort a list" → "Write a Python function that sorts a list of dictionaries by a specific key"
+
+ 2. **Provide Context**: "Explain recursion" → "Explain recursion to someone learning programming for the first time, with a simple example"
+
+ 3. **Use System Prompts**: Define the role/expertise in the system prompt instead of in every message
+
+ 4. **Iterate**: Use follow-up questions to refine responses
+
+ 5. **Experiment with Models**: Try different models on the same task
+
+ ### ⚡ Performance Tips
+
+ 1. **Start Small**: Test with smaller models first
+ 2. **Adjust Max Tokens**: Don't request more than you need
+ 3. **Use Cancel**: Stop bad generations early
+ 4. **Clear Cache**: Clear the chat if you experience slowdowns
+ 5. **One Task at a Time**: Don't send multiple requests simultaneously
+
+ ### 🔍 When to Use Web Search
+
+ **✅ Good use cases:**
+ - "What happened in the latest SpaceX launch?"
+ - "Current cryptocurrency prices"
+ - "Recent AI research papers"
+ - "Today's weather in Paris"
+
+ **❌ No search needed for:**
+ - General knowledge questions
+ - Code writing/debugging
+ - Math problems
+ - Creative writing
+ - Theoretical explanations
+
+ ### 💭 Understanding Thinking Mode
+
+ Some models output `<think>...</think>` blocks:
+
+ ```
+ <think>
+ Let me break this down step by step...
+ First, I need to consider...
+ </think>
+
+ Here's the answer: ...
+ ```
+
+ **In the UI:**
+ - Thinking shows as "💭 Thought"
+ - The answer shows separately
+ - Helps you see the reasoning process
+
+ **Best for:**
+ - Complex math problems
+ - Multi-step reasoning
+ - Debugging logic
+ - Learning how the AI thinks
+
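To make the separation concrete, here is a minimal sketch of how `<think>…</think>` blocks can be split from the final answer (illustrative only; the app performs this separation live while streaming):

```python
# Illustrative splitter for <think>...</think> blocks.
import re

def split_thought(text: str) -> tuple[str, str]:
    """Return (thought, answer) extracted from a raw model response."""
    thought = "\n".join(re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL))
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thought.strip(), answer

thought, answer = split_thought("<think>Consider step by step...</think>\nHere's the answer: 42")
```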
+ ## Troubleshooting
+
+ ### Generation is Slow
+ - Try a smaller model
+ - Reduce Max Tokens
+ - Disable web search if not needed
+ - Clear the chat history
+
+ ### Responses are Repetitive
+ - Increase the Repetition Penalty
+ - Reduce the Temperature slightly
+ - Try a different model
+
+ ### Responses are Random/Nonsensical
+ - Decrease the Temperature
+ - Reduce Top-P
+ - Reduce Top-K
+ - Try a more stable model
+
+ ### Web Search Not Working
+ - Check that the timeout isn't too short
+ - Verify your internet connection
+ - Try increasing Max Results
+ - Check the search query in the debug panel
+
+ ### Cancel Button Doesn't Work
+ - Wait a moment (it might still be processing)
+ - Refresh the page if the problem persists
+ - Check the browser console for errors
+
+ ## Keyboard Shortcuts
+
+ - **Enter**: Send message
+ - **Shift+Enter**: New line in input
+ - **Ctrl+C**: Copy (when text is selected)
+ - **Ctrl+A**: Select all in input
+
+ ## Best Practices
+
+ ### For Beginners
+ 1. Start with example prompts
+ 2. Use default settings initially
+ 3. Try 2-4 different models
+ 4. Gradually explore advanced settings
+ 5. Read responses fully before replying
+
+ ### For Power Users
+ 1. Create custom system prompts
+ 2. Fine-tune parameters per task
+ 3. Use the debug panel for prompt engineering
+ 4. Experiment with model combinations
+ 5. Use web search strategically
+
+ ### For Developers
+ 1. Study the debug output
+ 2. Test generated code thoroughly
+ 3. Use lower temperature for determinism
+ 4. Compare multiple models
+ 5. Save working configurations
+
+ ## Privacy & Safety
+
+ - **No data collection**: Conversations are not stored permanently
+ - **Model limitations**: Models may produce incorrect information
+ - **Verify important info**: Don't rely solely on the AI for critical decisions
+ - **Web search**: Uses DuckDuckGo (privacy-focused)
+ - **Open source**: The code is transparent and auditable
+
+ ## Support & Feedback
+
+ Found a bug? Have a suggestion?
+ - Check the GitHub issues
+ - Submit feature requests
+ - Contribute improvements
+ - Share your use cases
+
+ ---
+
+ **Happy chatting! 🎉**