burtenshaw committed on
Commit 551ae1a · 1 Parent(s): fb723d2

switch back to gradio

Files changed (6)
  1. = +0 -0
  2. README.md +81 -453
  3. mcp_server.py → app.py +198 -182
  4. pyproject.toml +4 -3
  5. requirements.txt +2 -1
  6. uv.lock +0 -0
= ADDED
File without changes
README.md CHANGED
@@ -1,508 +1,136 @@
- # ⚡ Inference Providers FastMCP Server
-
- A **FastMCP Server** for Hugging Face Inference Providers, built with [FastMCP](https://github.com/jlowin/fastmcp) - the fast, Pythonic way to build MCP servers. This allows LLMs and AI assistants to access multiple AI providers and language models through the Model Context Protocol.
-
- ## Features
-
- - **⚡ FastMCP**: Built with FastMCP for optimal performance and simplicity
- - **🚀 UV-Powered**: Uses UV/UVX for fast, modern Python dependency management
- - **🤖 MCP Server**: Native MCP server with tools, resources, and prompts
- - **🎯 Multi-Provider Support**: Access 14+ inference providers including Cerebras, Cohere, Fal AI, Fireworks, Groq, and more
- - **💬 Chat Completion**: Interactive conversations with LLMs and Vision Language Models
- - **📊 Resources**: Access provider information and popular model recommendations
- - **🔍 Context Logging**: Rich logging and error handling through MCP context
- - **🔧 Easy Integration**: Simple configuration for Cursor, Claude Desktop, and other MCP clients
-
- ## 🚀 Supported Providers
-
- | Provider | Chat Completion | Vision Language Models |
- |----------|----------------|------------------------|
- | Cerebras | ✅ | ❌ |
- | Cohere | ✅ | ✅ |
- | Fal AI | ✅ | ✅ |
- | Featherless AI | ✅ | ✅ |
- | Fireworks | ✅ | ✅ |
- | Groq | ✅ | ❌ |
- | HF Inference | ✅ | ✅ |
- | Hyperbolic | ✅ | ✅ |
- | Nebius | ✅ | ✅ |
- | Novita | ✅ | ✅ |
- | Nscale | ✅ | ✅ |
- | Replicate | ✅ | ✅ |
- | SambaNova | ✅ | ✅ |
- | Together | ✅ | ✅ |
-
- ## 🛠️ Quick Start
-
- ### 1. Get a Hugging Face Token
-
- 1. Go to [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
- 2. Create a new token with **Inference Providers** scope
- 3. Copy the token (starts with `hf_`)
-
- ### 2. Install Dependencies
-
- ```bash
- # Clone the repository
- git clone <repository-url>
- cd inference-providers-mcp
-
- # Install dependencies
- pip install -r requirements.txt
- ```
-
- ### 3. Set Environment Variables
-
- Create a `.env` file in your project directory:
-
- ```bash
- # .env file
- HF_TOKEN=hf_your_actual_token_here
- ```
-
- Or set it globally:
-
- ```bash
- # Linux/macOS
- export HF_TOKEN=hf_your_actual_token_here
-
- # Windows
- set HF_TOKEN=hf_your_actual_token_here
- ```
-
- ### 4. Test the Server
-
- ```bash
- # Test the server works (using UV - recommended)
- uvx test_mcp.py
-
- # Or test with Python
- python test_mcp.py
-
- # Run the server manually (optional)
- uvx mcp_server.py
- # Or: python mcp_server.py
- ```
-
- ## 🎯 Cursor IDE Integration
-
- There are several ways to integrate this FastMCP server with Cursor IDE. Choose the method that works best for your setup.
-
- > **✅ Your Current Configuration is Already Optimal!**
- >
- > Looking at your `.cursor/mcp.json`, you're already using `uvx` which is the recommended approach. Your configuration with `uvx` + `mcp_server.py` is perfect for modern FastMCP development!
-
- ### Method 1: Cursor Settings UI (Recommended)
-
- This is the easiest method for beginners:
-
- 1. **Open Cursor Settings**:
-    - Go to `Settings → Cursor Settings → Features → Model Context Protocol`
-    - OR use `Cmd/Ctrl + ,` and search for "MCP"
-
- 2. **Add New MCP Server**:
-    - Click **"Add New MCP Server"**
-    - Fill in the configuration:
-
- ```
- Name: inference-providers
- Command: uvx
- Arguments: mcp_server.py
- Environment Variables:
-   HF_TOKEN: hf_your_actual_token_here
- ```
-
- **Why UV/UVX?** ✨
- - **Faster**: UV is significantly faster than pip for dependency management
- - **Auto-manages dependencies**: Automatically handles virtual environments and packages
- - **Modern**: The recommended approach for Python tooling in 2025
- - **No setup required**: Works without manual virtual environment creation
-
- 3. **Save and Test**:
-    - Click **"Add"** to save
-    - Restart Cursor
-    - Open a new chat and try: *"Use the chat completion tool to ask Groq about Python"*
-
- ### Method 2: Project-Specific Configuration (Recommended)
-
- Create a `.cursor/mcp.json` file in your project root:
-
- ```json
- {
-   "mcpServers": {
-     "inference-providers": {
-       "command": "uvx",
-       "args": ["mcp_server.py"],
-       "env": {
-         "HF_TOKEN": "hf_your_actual_token_here"
-       }
-     }
-   }
- }
- ```
-
- **Advantages**:
- - ✅ Project-specific (only available in this project)
- - ✅ Can be version controlled (but **don't commit tokens!**)
- - ✅ Automatic activation when opening the project
- - ✅ UV automatically handles dependencies from `pyproject.toml`
-
- ### Method 3: Global Configuration
-
- Create a global configuration file:
-
- **Linux/macOS**: `~/.cursor/mcp.json`
- **Windows**: `%USERPROFILE%\.cursor\mcp.json`

  ```json
  {
    "mcpServers": {
      "inference-providers": {
-       "command": "uvx",
-       "args": ["/full/path/to/your/project/mcp_server.py"],
-       "env": {
-         "HF_TOKEN": "hf_your_actual_token_here"
-       }
      }
    }
  }
  ```

- **Advantages**:
- - ✅ Available across all Cursor projects
- - ✅ Set once, use everywhere
-
- ### Method 4: Environment Variables (Most Secure)
-
- If you have `HF_TOKEN` set as a system environment variable, you can use:
-
  ```json
  {
    "mcpServers": {
      "inference-providers": {
-       "command": "uvx",
-       "args": ["mcp_server.py"]
      }
    }
  }
  ```

- The server will automatically pick up `HF_TOKEN` from your environment.
-
- ## 🔄 UV vs Python: When to Use Which?
-
- | Approach | Best For | Pros | Cons |
- |----------|----------|------|------|
- | **`uvx` (Recommended)** | Most users, development | ⚡ Fast, auto-manages deps, modern | Requires UV installation |
- | **`python`** | System restrictions, debugging | 🔧 Universal, explicit control | Manual venv management |
- | **`uv run`** | Local development | 🎯 Project-aware, consistent | Must be in project directory |
-
- ### UV Installation
-
- If you don't have UV installed:
-
- ```bash
- # macOS/Linux
- curl -LsSf https://astral.sh/uv/install.sh | sh
-
- # Windows
- powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
-
- # Alternative: pip install
- pip install uv
- ```
-
- ## 🎮 Using the Server in Cursor
-
- Once configured, you can use the server in several ways:
-
- ### 1. Let Cursor Auto-Select Tools
-
- Simply describe what you want:
-
- ```
- "Help me compare language models for code generation"
- "Get recommendations for the best chat models"
- "I need to chat with a model about Python best practices"
- ```
-
- Cursor will automatically detect and use the appropriate tools.
-
- ### 2. Explicitly Request Tools
-
- Be more specific about which tool to use:
-
- ```
- "Use the chat_completion tool with DeepSeek V3 via Novita to explain machine learning"
- "Call the inference providers chat tool to ask Groq about async programming"
- ```
-
- ### 3. Access Resources
-
- Get information about providers and models:
-
- ```
- "Show me the available inference providers"
- "What are the popular models I can use?"
- "Get the provider capabilities information"
- ```
-
- ### 4. Generate Prompts
-
- Use the prompt generation feature:
-
- ```
- "Generate a prompt to compare chat providers"
- "Create a comparison prompt for vision language models"
- ```
-
- ## 🎪 Example Conversations
-
- ### Basic Chat Example
-
- ```
- You: "Use the chat completion tool with Groq and Llama 3.1 70B to explain async/await in Python"
-
- Cursor: [Calls chat_completion tool]
- - Provider: groq
- - Model: meta-llama/Llama-3.1-70B-Instruct
- - Message: "Explain async/await in Python with examples"
-
- Response: [Detailed explanation of async/await...]
- ```
-
- ### Provider Comparison Example
-
- ```
- You: "Help me choose between Groq and Together AI for coding tasks"
-
- Cursor: [Uses provider comparison prompt and chat completion]
- Response: [Detailed comparison of providers with recommendations...]
- ```
-
- ### Model Recommendations
-
- ```
- You: "What are good models for vision tasks?"
-
- Cursor: [Accesses models/popular resource]
- Response: Here are the recommended vision models:
- - meta-llama/Llama-3.2-11B-Vision-Instruct (Together)
- - microsoft/Phi-3.5-vision-instruct (HF Inference)
- - command-r-plus-vision (Cohere)
- ```
-
- ## 🔧 Advanced Configuration
-
- ### Running as Remote Server
-
- For team usage or remote development:

  ```bash
- # Option 1: Using UV (Recommended)
- uvx mcp_server.py --transport=sse --host=0.0.0.0 --port=8000
-
- # Option 2: Using Python directly
- python -c "
- from mcp_server import mcp
- mcp.run(transport='sse', host='0.0.0.0', port=8000)
- "
- ```
-
- Then configure Cursor to connect remotely:
-
- ```json
- {
-   "mcpServers": {
-     "inference-providers": {
-       "command": "npx",
-       "args": [
-         "-y",
-         "@modelcontextprotocol/client-remote",
-         "http://your-server:8000/sse"
-       ],
-       "env": {
-         "HF_TOKEN": "hf_your_token_here"
-       }
-     }
-   }
- }
- ```
-
- ### Alternative UV Commands
-
- Different ways to run with UV:
-
- ```json
- {
-   "mcpServers": {
-     "inference-providers-uvx": {
-       "command": "uvx",
-       "args": ["mcp_server.py"],
-       "env": {"HF_TOKEN": "hf_your_token_here"}
-     },
-     "inference-providers-uv-run": {
-       "command": "uv",
-       "args": ["run", "mcp_server.py"],
-       "env": {"HF_TOKEN": "hf_your_token_here"}
-     },
-     "inference-providers-uv-tool": {
-       "command": "uv",
-       "args": ["tool", "run", "mcp_server.py"],
-       "env": {"HF_TOKEN": "hf_your_token_here"}
-     }
-   }
- }
  ```

- **Differences:**
- - **`uvx`**: Installs and runs in isolated environment (recommended)
- - **`uv run`**: Runs using project's pyproject.toml (project-aware)
- - **`uv tool run`**: Explicit tool execution (most explicit)
-
- ## 🚨 Troubleshooting
-
- ### Server Not Appearing in Cursor
-
- 1. **Check Configuration Syntax**:
-    ```bash
-    # Validate JSON syntax
-    python -c "import json; print(json.load(open('.cursor/mcp.json')))"
-    ```
-
- 2. **Verify Command Works**:
-    ```bash
-    # Test with UV (recommended)
-    uvx mcp_server.py
-
-    # Or test with Python
-    python mcp_server.py
-    ```
-
- 3. **Check UV Installation**:
-    ```bash
-    # Verify UV is installed
-    uv --version
-    uvx --version
-    ```
-
- 4. **Check Token Format**:
-    - Token should start with `hf_`
-    - No quotes in environment variables
-    - Token has "Inference Providers" scope
-
- ### Tool Not Working
-
- 1. **Check Cursor Logs**:
-    - Go to `Help → Show Logs`
-    - Look for MCP-related errors
-
- 2. **Test Server Manually**:
-    ```bash
-    # Test with UV
-    uvx test_mcp.py
-
-    # Or with Python
-    python test_mcp.py
-    ```
-
- 3. **Verify Dependencies**:
-    ```bash
-    # UV automatically handles dependencies, but you can check:
-    uv pip list
-    ```
-
- 4. **Verify Token Permissions**:
-    - Go to [HF Settings](https://huggingface.co/settings/tokens)
-    - Ensure token has "Inference Providers" access
-
- ### Common Error Messages
-
- | Error | Solution |
- |-------|----------|
- | `HF_TOKEN is required` | Set HF_TOKEN environment variable |
- | `Unknown provider: xyz` | Check provider name spelling |
- | `Import "fastmcp" could not be resolved` | Run `uv add fastmcp` or `pip install fastmcp` |
- | `Server failed to start` | Check UV/Python path and permissions |
- | `uvx: command not found` | Install UV: `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
- | `Permission denied` | Check file permissions: `chmod +x mcp_server.py` |
-
- ### Getting Help
-
- If you're still having issues:
-
- 1. **Check our test script**: `python test_mcp.py`
- 2. **Review Cursor MCP docs**: [https://docs.cursor.com/context/model-context-protocol](https://docs.cursor.com/context/model-context-protocol)
- 3. **Check FastMCP docs**: [https://github.com/jlowin/fastmcp](https://github.com/jlowin/fastmcp)
- 4. **Cursor Community**: [https://forum.cursor.com](https://forum.cursor.com)
-
- ## 🤖 Available MCP Capabilities
-
- ### 🛠️ Tools
-
- **`chat_completion`** - Generate chat completions using Hugging Face Inference Providers
-
- Parameters:
- - `provider`: Inference provider (cerebras, cohere, groq, novita, etc.)
- - `model`: Model ID from Hugging Face Hub
- - `messages`: Chat messages (JSON array or plain text)
- - `temperature`: Response randomness (0.0-2.0, default 0.7)
- - `max_tokens`: Maximum response length (1-4096, default 512)
- - `top_p`: Nucleus sampling (0.0-1.0, default 0.9)
- - `stream`: Stream response (boolean, default False)
- - `stop_sequences`: Stop sequences (comma-separated)
- - `frequency_penalty`: Frequency penalty (-2.0 to 2.0)
- - `presence_penalty`: Presence penalty (-2.0 to 2.0)
- - `hf_token`: Your Hugging Face token (optional, uses env var)
-
- ### 📊 Resources
-
- **`providers`** - Get list of available inference providers and capabilities
- **`models/popular`** - Get curated recommendations for popular models
-
- ### 💭 Prompts
-
- **`generate_provider_comparison_prompt`** - Generate prompts for comparing providers
-
- ## 🚀 FastMCP Features Used
-
- - **@mcp.tool**: Exposes the chat completion function as an MCP tool
- - **@mcp.resource**: Provides access to provider and model information
- - **@mcp.prompt**: Generates helpful prompts for provider comparison
- - **Context**: Rich logging, error handling, and progress reporting
- - **Multiple Transports**: Supports stdio, SSE, and HTTP transports
-
- ## 🎯 Popular Models to Try
-
- **Chat Models:**
- - `deepseek-ai/DeepSeek-V3-0324` (Novita)
- - `meta-llama/Llama-3.1-70B-Instruct` (Groq)
- - `mistralai/Mixtral-8x7B-Instruct-v0.1` (Together)
- - `google/gemma-2-27b-it` (HF Inference)
-
- **Vision Language Models:**
- - `meta-llama/Llama-3.2-11B-Vision-Instruct` (Together)
- - `microsoft/Phi-3.5-vision-instruct` (HF Inference)
-
- ## 📖 Technical Details
-
- This MCP server is built using:
- - **FastMCP v2+** - The fast, Pythonic way to build MCP servers
- - **Model Context Protocol (MCP)** - For standardized tool exposure
- - **Hugging Face Inference Providers** - For model access across providers
- - **Async/Await** - For efficient request handling
- - **Rich Context Logging** - For detailed operation tracking
-
- ## 🔗 Links
-
- - [FastMCP GitHub](https://github.com/jlowin/fastmcp)
- - [FastMCP Documentation](https://gofastmcp.com)
  - [Cursor MCP Docs](https://docs.cursor.com/context/model-context-protocol)
- - [Model Context Protocol](https://modelcontextprotocol.io/)
- - [Inference Providers Documentation](https://huggingface.co/docs/inference-providers)
  - [Get HF Token](https://huggingface.co/settings/tokens)

  ## 📝 License

- This project is open source and available under the MIT License.
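Note: the removed README leans on FastMCP's decorator API (`@mcp.tool`, `@mcp.resource`, `@mcp.prompt`, `mcp.run()`), which is easiest to see in a bare skeleton. The sketch below only mirrors the pattern used by the old `mcp_server.py` shown further down in this diff; the `echo` tool and `greeting` resource are illustrative placeholders, not project code.

```python
# Minimal FastMCP skeleton in the style of the removed mcp_server.py.
# The tool and resource here are placeholders for illustration only.
from fastmcp import FastMCP

mcp = FastMCP("Example FastMCP Server")


@mcp.tool()
async def echo(text: str) -> str:
    """Return the input text unchanged."""
    return text


@mcp.resource("file://greeting")
async def greeting() -> str:
    """A static resource exposed to MCP clients."""
    return "Hello from FastMCP"


if __name__ == "__main__":
    # stdio transport by default; the removed README also mentions SSE/HTTP.
    mcp.run()
```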
+ ---
+ title: Inference Providers MCP Server
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.34.2
+ app_file: app.py
+ pinned: false
+ ---
+
+ # 🤖 Inference Providers MCP Server
+
+ A streamlined **Model Context Protocol (MCP) Server** that provides LLMs with access to Hugging Face Inference Providers through a single, focused tool.
+
+ ## What is this?
+
+ This MCP server exposes a `chat_completion` tool that allows LLMs and AI assistants to chat with language models across 14+ inference providers including Cerebras, Cohere, Fireworks, Groq, and more.
+
+ **Why use this?** Instead of manually switching between different AI providers, your LLM can automatically access the best model for each task through a unified interface.
+
+ ## 🚀 Supported Providers
+
+ | Provider | Chat | Vision | Provider | Chat | Vision |
+ |----------|------|--------|----------|------|--------|
+ | Cerebras | ✅ | ❌ | Nebius | ✅ | ✅ |
+ | Cohere | ✅ | ✅ | Novita | ✅ | ✅ |
+ | Fal AI | ✅ | ✅ | Nscale | ✅ | ✅ |
+ | Featherless AI | ✅ | ✅ | Replicate | ✅ | ✅ |
+ | Fireworks | ✅ | ✅ | SambaNova | ✅ | ✅ |
+ | Groq | ✅ | ❌ | Together | ✅ | ✅ |
+ | HF Inference | ✅ | ✅ | Hyperbolic | ✅ | ✅ |
+
+ ## 🛠️ Quick Setup
+
+ ### 1. Get HF Token
+ 1. Visit [HF Settings](https://huggingface.co/settings/tokens)
+ 2. Create token with **Inference Providers** scope
+ 3. Copy the token (starts with `hf_`)
+
+ ### 2. Configure Your MCP Client
+
+ #### Cursor IDE
+ Add to `.cursor/mcp.json`:
  ```json
  {
    "mcpServers": {
      "inference-providers": {
+       "url": "YOUR_URL/gradio_api/mcp/sse"
      }
    }
  }
  ```
+
+ #### Claude Desktop
+ Add to MCP settings:
  ```json
  {
    "mcpServers": {
      "inference-providers": {
+       "command": "npx",
+       "args": ["mcp-remote", "YOUR_URL/gradio_api/mcp/sse", "--transport", "sse-only"]
      }
    }
  }
  ```
+
+ ### 3. Server URLs
+
+ **HF Spaces:** `https://username-spacename.hf.space/gradio_api/mcp/sse`
+
+ **Local:** `http://localhost:7860/gradio_api/mcp/sse`
+
+ ## 🎯 How to Use
+
+ Once configured, your LLM can use the tool:
+
+ > "Use chat completion with Groq and Llama to explain Python best practices"
+
+ > "Chat with DeepSeek V3 via Novita about machine learning concepts"
+
+ ## 🛠️ Available Tool
+
+ **`chat_completion`** - Generate responses using multiple AI providers
+
+ **Parameters:**
+ - `provider`: Provider name (novita, groq, cerebras, etc.)
+ - `model`: Model ID (e.g., `deepseek-ai/DeepSeek-V3-0324`)
+ - `messages`: Input text or JSON messages array
+ - `temperature`: Response randomness (0.0-2.0, default: 0.7)
+ - `max_tokens`: Max response length (1-4096, default: 512)
+
+ **Environment:** Requires `HF_TOKEN` environment variable
+
+ ## 🎯 Popular Models
+
+ **Text Models:**
+ - `deepseek-ai/DeepSeek-V3-0324` (Novita)
+ - `meta-llama/Llama-3.1-70B-Instruct` (Groq)
+ - `mistralai/Mixtral-8x7B-Instruct-v0.1` (Together)
+
+ **Vision Models:**
+ - `meta-llama/Llama-3.2-11B-Vision-Instruct` (Together)
+ - `microsoft/Phi-3.5-vision-instruct` (HF Inference)
+
+ ## 💻 Local Development
+
  ```bash
+ # Clone and setup
+ git clone <repository-url>
+ cd inference-providers-mcp
+ pip install -r requirements.txt
+
+ # Set token and run
+ export HF_TOKEN=hf_your_token_here
+ python app.py
  ```
+
+ ## 🔧 Technical Details
+
+ - **Built with:** Gradio + MCP support (`gradio[mcp]`)
+ - **Protocol:** Model Context Protocol (MCP) via Server-Sent Events
+ - **Security:** Environment-based token management
+ - **Compatibility:** Works with Cursor, Claude Desktop, and other MCP clients
+
+ ## 🔗 Resources
+
  - [Cursor MCP Docs](https://docs.cursor.com/context/model-context-protocol)
+ - [Gradio MCP Guide](https://huggingface.co/blog/gradio-mcp)
+ - [Inference Providers Docs](https://huggingface.co/docs/inference-providers)
  - [Get HF Token](https://huggingface.co/settings/tokens)

  ## 📝 License

+ MIT License - see the code for details.
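Note: to make the new README's tool description concrete, the request that `chat_completion` sends for a single provider looks roughly like the sketch below. `BASE_URL` is a placeholder: the real per-provider endpoints live in the `PROVIDERS` mapping in `app.py`, whose entries are elided from the diff hunk that follows.

```python
# Rough shape of the HTTP call behind the chat_completion tool.
# BASE_URL is a placeholder; app.py reads it from its PROVIDERS config.
import os

import requests

BASE_URL = "https://example-provider-endpoint"  # placeholder, see PROVIDERS in app.py

payload = {
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 512,
}
headers = {
    "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
    "Content-Type": "application/json",
}

response = requests.post(
    f"{BASE_URL}/v1/chat/completions", headers=headers, json=payload, timeout=60
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```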
mcp_server.py → app.py RENAMED
@@ -1,11 +1,8 @@
  import os
- import json
  import requests
- from typing import Dict, Any, Optional
- from fastmcp import FastMCP, Context
-
- # Initialize FastMCP server
- mcp = FastMCP("Inference Providers MCP Server")

  # Inference Providers configuration
  PROVIDERS = {
@@ -82,228 +79,247 @@ PROVIDERS = {
  }


- async def make_request(
-     provider: str,
-     endpoint: str,
-     payload: Dict[str, Any],
-     hf_token: str,
-     ctx: Optional[Context] = None,
- ) -> Dict[str, Any]:
-     """Make a request to the inference provider"""
-     if not hf_token:
-         error_msg = (
-             "HF_TOKEN is required. Please set it in the environment or provide it."
-         )
-         if ctx:
-             await ctx.error(error_msg)
-         return {"error": error_msg}
-
-     provider_config = PROVIDERS.get(provider)
-     if not provider_config:
-         error_msg = f"Unknown provider: {provider}"
-         if ctx:
-             await ctx.error(error_msg)
-         return {"error": error_msg}
-
-     url = f"{provider_config['base_url']}/{endpoint}"
-     headers = {
-         "Authorization": f"Bearer {hf_token}",
-         "Content-Type": "application/json",
-     }
-
-     if ctx:
-         await ctx.info(f"Making request to {provider} ({url})")
-
-     try:
-         response = requests.post(url, headers=headers, json=payload, timeout=60)
-         response.raise_for_status()
-
-         if ctx:
-             await ctx.info(f"Request successful to {provider}")
-
-         return response.json()
-     except requests.exceptions.RequestException as e:
-         error_msg = f"Request failed: {str(e)}"
-         if ctx:
-             await ctx.error(error_msg)
-         return {"error": error_msg}
-
-
- @mcp.tool()
- async def chat_completion(
      provider: str,
      model: str,
      messages: str,
-     ctx: Context,
      temperature: float = 0.7,
      max_tokens: int = 512,
-     top_p: float = 0.9,
-     stream: bool = False,
-     stop_sequences: str = "",
-     frequency_penalty: float = 0.0,
-     presence_penalty: float = 0.0,
-     hf_token: Optional[str] = None,
- ) -> str:
      """Generate chat completions using Hugging Face Inference Providers.

-     This tool allows you to chat with various language models through
-     different inference providers including Cerebras, Cohere, Fireworks,
-     Groq, and others.

      Args:
-         provider: The inference provider to use (cerebras, cohere, fal-ai,
-                   featherless-ai, fireworks-ai, groq, hf-inference,
-                   hyperbolic, nebius, novita, nscale, replicate, sambanova,
-                   together)
          model: The model ID from Hugging Face Hub
              (e.g., 'deepseek-ai/DeepSeek-V3-0324')
          messages: Either a JSON array of messages in OpenAI format or
              plain text for simple queries
          temperature: Controls response randomness (0.0-2.0, default 0.7)
          max_tokens: Maximum tokens in response (1-4096, default 512)
-         top_p: Nucleus sampling parameter (0.0-1.0, default 0.9)
-         stream: Whether to stream the response (default False)
-         stop_sequences: Comma-separated stop sequences (optional)
-         frequency_penalty: Penalize frequent tokens (-2.0 to 2.0)
-         presence_penalty: Penalize present tokens (-2.0 to 2.0)
-         hf_token: Your Hugging Face token with Inference Providers access
-             (falls back to HF_TOKEN environment variable)

      Returns:
          The generated text response from the language model
      """
-     # Get HF token from parameter or environment
-     token = hf_token or os.getenv("HF_TOKEN")
-     if not token:
-         await ctx.error("HF_TOKEN not provided and not found in environment")
-         return "Error: HF_TOKEN is required but not provided"

-     await ctx.info(f"Starting chat completion with {provider} provider")
-     await ctx.info(f"Model: {model}")

      try:
          # Parse messages
          if messages.strip().startswith("["):
              parsed_messages = json.loads(messages)
-             await ctx.info(f"Parsed {len(parsed_messages)} messages from JSON")
          else:
              parsed_messages = [{"role": "user", "content": messages}]
-             await ctx.info("Created single user message")

          payload = {
              "model": model,
              "messages": parsed_messages,
              "temperature": temperature,
              "max_tokens": max_tokens,
-             "top_p": top_p,
-             "stream": stream,
          }

-         # Add optional parameters
-         if stop_sequences.strip():
-             payload["stop"] = [s.strip() for s in stop_sequences.split(",")]
-             await ctx.info(f"Added stop sequences: {payload['stop']}")
-
-         if frequency_penalty != 0:
-             payload["frequency_penalty"] = frequency_penalty
-
-         if presence_penalty != 0:
-             payload["presence_penalty"] = presence_penalty
-
-         result = await make_request(
-             provider, "v1/chat/completions", payload, token, ctx
-         )

-         if "error" in result:
-             await ctx.error(f"API Error: {result['error']}")
-             return f"Error: {result['error']}"

          if "choices" in result and len(result["choices"]) > 0:
-             response_text = result["choices"][0]["message"]["content"]
-             await ctx.info(f"Generated response with {len(response_text)} characters")
-             return response_text
          else:
-             await ctx.warning("Unexpected response format")
-             return json.dumps(result, indent=2)

-     except json.JSONDecodeError as e:
-         error_msg = f"Invalid JSON format for messages: {str(e)}"
-         await ctx.error(error_msg)
-         return f"Error: {error_msg}"
      except Exception as e:
-         error_msg = f"Unexpected error: {str(e)}"
-         await ctx.error(error_msg)
-         return f"Error: {error_msg}"
-
-
- @mcp.resource("file://providers")
- async def get_providers() -> str:
-     """Get the list of available inference providers and their capabilities.
-
-     Returns JSON information about all supported providers including their
-     supported tasks and base URLs.
-     """
-     return json.dumps(PROVIDERS, indent=2)
-
-
- @mcp.resource("file://models/popular")
- async def get_popular_models() -> str:
-     """Get a list of popular models for each provider.
-
-     Returns curated recommendations for models to try with each provider.
-     """
-     popular_models = {
-         "chat_models": {
-             "cerebras": ["llama3.1-70b"],
-             "cohere": ["command-r-plus"],
-             "groq": ["meta-llama/Llama-3.1-70B-Instruct"],
-             "novita": ["deepseek-ai/DeepSeek-V3-0324"],
-             "together": ["mistralai/Mixtral-8x7B-Instruct-v0.1"],
-             "hf-inference": ["google/gemma-2-27b-it"],
-         },
-         "vision_models": {
-             "cohere": ["command-r-plus-vision"],
-             "together": ["meta-llama/Llama-3.2-11B-Vision-Instruct"],
-             "hf-inference": ["microsoft/Phi-3.5-vision-instruct"],
-         },
-     }
-     return json.dumps(popular_models, indent=2)
-
-
- @mcp.prompt()
- def generate_provider_comparison_prompt(task: str = "chat") -> str:
-     """Generate a prompt to help compare different inference providers.
-
-     Args:
-         task: The type of task to compare providers for (default: "chat")
-
-     Returns:
-         A prompt that can be used to get comparative analysis of providers
-     """
-     available_providers = [
-         name
-         for name, config in PROVIDERS.items()
-         if f"{task}-completion" in config["tasks"]
      ]

-     providers_list = ", ".join(available_providers)
-
-     return f"""Please compare the following inference providers for {task} tasks:
-
- Providers: {providers_list}
-
- Consider factors like:
- - Model selection and capabilities
- - Performance and speed
- - Pricing (if known)
- - Special features or limitations
- - Use case recommendations
-
- Provide a balanced comparison that helps choose the right provider."""


  if __name__ == "__main__":
-     # Run the MCP server
-     # Default: stdio transport for local development
-     # For production, use: mcp.run(transport="sse", host="0.0.0.0", port=8000)
-     mcp.run()
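Note: one behavior the rewrite keeps unchanged is the `messages` convention (an OpenAI-style JSON array, or plain text treated as a single user turn). The standalone snippet below reproduces that parsing logic; the `parse_messages` name is ours, not the repository's.

```python
# Reproduces the messages-parsing convention shared by both versions.
# parse_messages is a hypothetical helper name used only for this sketch.
import json


def parse_messages(messages: str) -> list[dict]:
    if messages.strip().startswith("["):
        return json.loads(messages)  # already an OpenAI-style message array
    return [{"role": "user", "content": messages}]  # plain text -> one user turn


print(parse_messages("Hello!"))
print(parse_messages('[{"role": "system", "content": "Be brief"}, {"role": "user", "content": "Hi"}]'))
```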
+ import gradio as gr
  import os
  import requests
+ import json
+ from typing import List

  # Inference Providers configuration
  PROVIDERS = {
  }


+ def chat_completion(
      provider: str,
      model: str,
      messages: str,
      temperature: float = 0.7,
      max_tokens: int = 512,
+ ):
      """Generate chat completions using Hugging Face Inference Providers.

+     This tool provides access to multiple AI providers and language models
+     through Hugging Face's unified Inference Providers API.

      Args:
+         provider: The inference provider to use. Available providers:
+             cerebras, cohere, fal-ai, featherless-ai, fireworks-ai,
+             groq, hf-inference, hyperbolic, nebius, novita, nscale,
+             replicate, sambanova, together
          model: The model ID from Hugging Face Hub
              (e.g., 'deepseek-ai/DeepSeek-V3-0324')
          messages: Either a JSON array of messages in OpenAI format or
              plain text for simple queries
          temperature: Controls response randomness (0.0-2.0, default 0.7)
          max_tokens: Maximum tokens in response (1-4096, default 512)

      Returns:
          The generated text response from the language model
      """
+     # Get HF token from environment
+     hf_token = os.getenv("HF_TOKEN")
+     if not hf_token:
+         return (
+             "Error: HF_TOKEN environment variable is required. "
+             "Please set your Hugging Face token."
+         )

+     # Validate provider
+     if provider not in PROVIDERS:
+         available = ", ".join(PROVIDERS.keys())
+         return f"Error: Unknown provider '{provider}'. Available providers: {available}"

      try:
          # Parse messages
          if messages.strip().startswith("["):
              parsed_messages = json.loads(messages)
          else:
              parsed_messages = [{"role": "user", "content": messages}]

+         # Build request payload
          payload = {
              "model": model,
              "messages": parsed_messages,
              "temperature": temperature,
              "max_tokens": max_tokens,
          }

+         # Make request to provider
+         provider_config = PROVIDERS[provider]
+         url = f"{provider_config['base_url']}/v1/chat/completions"
+         headers = {
+             "Authorization": f"Bearer {hf_token}",
+             "Content-Type": "application/json",
+         }

+         response = requests.post(url, headers=headers, json=payload, timeout=60)
+         response.raise_for_status()
+         result = response.json()

+         # Extract response
          if "choices" in result and len(result["choices"]) > 0:
+             return result["choices"][0]["message"]["content"]
          else:
+             return f"Error: Unexpected response format: {json.dumps(result, indent=2)}"

+     except json.JSONDecodeError:
+         return (
+             "Error: Invalid JSON format for messages. "
+             "Use either plain text or valid JSON array."
+         )
+     except requests.exceptions.RequestException as e:
+         return f"Error: Request failed: {str(e)}"
      except Exception as e:
+         return f"Error: {str(e)}"


+ def get_providers_for_task(task: str) -> List[str]:
+     """Get available providers for a specific task"""
+     return [
+         provider for provider, config in PROVIDERS.items() if task in config["tasks"]
      ]


+ # Create Gradio interface
+ with gr.Blocks(title="Inference Providers MCP Server", theme=gr.themes.Soft()) as app:
+     gr.Markdown("""
+     # 🤖 Inference Providers MCP Server
+
+     A streamlined Model Context Protocol (MCP) server for Hugging Face
+     Inference Providers, providing LLMs with access to multiple AI
+     providers through a simple, focused interface.
+
+     **Supported Providers:** Cerebras, Cohere, Fal AI, Featherless AI,
+     Fireworks, Groq, HF Inference, Hyperbolic, Nebius, Novita, Nscale,
+     Replicate, SambaNova, Together
+
+     **Required:** Set HF_TOKEN environment variable with your Hugging Face
+     token that has Inference Providers access.
+     """)
+
+     # Environment status
+     hf_token_status = "✅ Set" if os.getenv("HF_TOKEN") else "❌ Not Set"
+     gr.Markdown(f"**HF_TOKEN Status:** {hf_token_status}")
+
+     if not os.getenv("HF_TOKEN"):
+         gr.Markdown("""
+         **⚠️ Setup Required:**
+         1. Get token: [HF Settings](https://huggingface.co/settings/tokens)
+         2. Set environment: `export HF_TOKEN=hf_your_token_here`
+         3. Restart application
+         """)
+
+     with gr.Tabs():
+         # Chat Completion Tab
+         with gr.Tab("💬 Chat Completion", id="chat"):
+             with gr.Row():
+                 with gr.Column(scale=1):
+                     chat_provider = gr.Dropdown(
+                         choices=get_providers_for_task("chat-completion"),
+                         label="Provider",
+                         value="novita",
+                         info="Select inference provider",
+                     )
+                     chat_model = gr.Textbox(
+                         label="Model",
+                         value="deepseek-ai/DeepSeek-V3-0324",
+                         placeholder="e.g., deepseek-ai/DeepSeek-V3-0324",
+                         info="Model ID from Hugging Face Hub",
+                     )
+
+                 with gr.Column(scale=2):
+                     chat_messages = gr.Textbox(
+                         label="Messages",
+                         lines=8,
+                         placeholder=(
+                             '[{"role": "user", "content": "Hello!"}]'
+                             "\n\nOr just type directly"
+                         ),
+                         info="JSON array of messages or plain text",
+                     )
+
+             with gr.Accordion("⚙️ Parameters", open=False):
+                 with gr.Row():
+                     chat_temperature = gr.Slider(0.0, 2.0, 0.7, label="Temperature")
+                     chat_max_tokens = gr.Slider(1, 4096, 512, label="Max Tokens")
+
+             chat_submit = gr.Button("🚀 Generate", variant="primary")
+             chat_output = gr.Textbox(label="Response", lines=10)
+
+             chat_submit.click(
+                 chat_completion,
+                 inputs=[
+                     chat_provider,
+                     chat_model,
+                     chat_messages,
+                     chat_temperature,
+                     chat_max_tokens,
+                 ],
+                 outputs=chat_output,
+             )
+
+         # MCP Documentation Tab
+         with gr.Tab("🔧 MCP Setup", id="mcp"):
+             gr.Markdown("""
+             ## 🤖 MCP Server Setup
+
+             This MCP server exposes `chat_completion` tool for LLMs to access
+             Hugging Face Inference Providers.
+
+             ### 📡 Server URL
+
+             **Local:** `http://localhost:7860/gradio_api/mcp/sse`
+
+             **HF Spaces:** `https://username-spacename.hf.space/gradio_api/mcp/sse`
+
+             ### ⚙️ Client Configuration
+
+             #### Cursor IDE
+
+             Add to `.cursor/mcp.json`:
+             ```json
+             {
+               "mcpServers": {
+                 "inference-providers": {
+                   "url": "YOUR_URL/gradio_api/mcp/sse"
+                 }
+               }
+             }
+             ```
+
+             #### Claude Desktop
+
+             Add to MCP settings:
+             ```json
+             {
+               "mcpServers": {
+                 "inference-providers": {
+                   "command": "npx",
+                   "args": [
+                     "mcp-remote",
+                     "YOUR_URL/gradio_api/mcp/sse",
+                     "--transport", "sse-only"
+                   ]
+                 }
+               }
+             }
+             ```
+
+             ### 🛠️ Tool Details
+
+             **`chat_completion`** - Generate chat responses
+
+             **Parameters:**
+             - `provider`: Provider name (novita, groq, etc.)
+             - `model`: Model ID (deepseek-ai/DeepSeek-V3-0324)
+             - `messages`: Input text or JSON messages
+             - `temperature`: Randomness (0.0-2.0, default: 0.7)
+             - `max_tokens`: Max length (1-4096, default: 512)
+
+             **Environment:** Requires HF_TOKEN
+
+             ### 🎯 Usage
+
+             > "Use chat completion with Groq and Llama to explain Python"
+
+             ### 🔗 Links
+
+             - [Cursor MCP](https://docs.cursor.com/context/model-context-protocol)
+             - [Gradio MCP Guide](https://huggingface.co/blog/gradio-mcp)
+             - [Get HF Token](https://huggingface.co/settings/tokens)
+             """)


  if __name__ == "__main__":
+     # Enable MCP server functionality
+     app.launch(mcp_server=True)
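Note: because the new `chat_completion` is a plain function (the Gradio UI and the MCP endpoint both wrap it), it can be smoke-tested directly from Python. A sketch, assuming `HF_TOKEN` is exported and the command is run from the repository root:

```python
# Direct call to the tool function from the new app.py.
# Assumes HF_TOKEN is set in the environment; importing app builds the Gradio
# Blocks UI but does not launch the server (launch() is under __main__).
from app import chat_completion

reply = chat_completion(
    provider="novita",
    model="deepseek-ai/DeepSeek-V3-0324",
    messages="Explain async/await in Python in two sentences.",
    temperature=0.7,
    max_tokens=256,
)
print(reply)
```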
pyproject.toml CHANGED
@@ -1,11 +1,12 @@
  [project]
  name = "inference-providers-mcp"
  version = "0.1.0"
- description = "FastMCP Server for Hugging Face Inference Providers"
  readme = "README.md"
  requires-python = ">=3.11"
  dependencies = [
-     "fastmcp>=2.0.0",
      "requests>=2.31.0",
-     "python-dotenv>=1.0.0"
  ]

  [project]
  name = "inference-providers-mcp"
  version = "0.1.0"
+ description = "MCP Server for Hugging Face Inference Providers Chat Completion"
  readme = "README.md"
  requires-python = ">=3.11"
  dependencies = [
+     "gradio[mcp]>=5.34.0",
+     "huggingface_hub>=0.20.0",
      "requests>=2.31.0",
+     "python-dotenv>=1.0.0",
  ]
requirements.txt CHANGED
@@ -1,3 +1,4 @@
- fastmcp>=2.0.0
  requests>=2.31.0
  python-dotenv>=1.0.0

+ gradio[mcp]>=4.0.0
+ huggingface_hub>=0.20.0
  requests>=2.31.0
  python-dotenv>=1.0.0
uv.lock CHANGED
The diff for this file is too large to render. See raw diff