# CodeLlama Server: Streaming, Caching, Model Fallbacks (OpenAI + Anthropic), Prompt-tracking
Works with: Anthropic, Huggingface, Cohere, TogetherAI, Azure, OpenAI, etc.
[litellm on PyPI](https://pypi.org/project/litellm/)
[litellm 0.1.1 on PyPI](https://pypi.org/project/litellm/0.1.1/)

[Deploy on Railway](https://railway.app/template/HuDPw-?referralCode=jch2ME)
**LIVE DEMO** - https://litellm.ai/playground
## What does CodeLlama Server do?
- Uses Together AI's CodeLlama to answer coding questions, with GPT-4 + Claude-2 as backups (you can easily switch this to any model from Huggingface, Replicate, Cohere, AI21, Azure, OpenAI, etc.)
- Sets default system prompt for guardrails `system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."`
- Integrates with PromptLayer for model + prompt tracking
- Example output
<img src="imgs/code-output.png" alt="Code Output" width="600"/>
- **Consistent Input/Output Format** (see the sketch after this feature list)
- Call all models using the OpenAI format - `completion(model, messages)`
- Text responses will always be available at `['choices'][0]['message']['content']`
- Stream responses will always be available at `['choices'][0]['delta']['content']`
- **Error Handling** - Uses model fallbacks (if `CodeLlama` fails, try `GPT-4`), with cooldowns and retries
- **Prompt Logging** - Log successful completions to PromptLayer for testing + iterating on your prompts in production! (Learn more: https://litellm.readthedocs.io/en/latest/advanced/)
**Example: Logs sent to PromptLayer**
<img src="imgs/promptlayer_logging.png" alt="Prompt Logging" width="900"/>
- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model - https://docs.litellm.ai/docs/token_usage
- **Caching** - Provides in-memory cache + GPT-Cache integration for more advanced usage - https://docs.litellm.ai/docs/caching/gpt_cache
- **Streaming & Async Support** - Return generators to stream text responses - try it out at https://litellm.ai/
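
The snippet below is a minimal sketch of the consistent I/O format and streaming behavior described above, using the `litellm` Python package directly. The model names and key setup are placeholders; adapt them to your providers.

```python
# Minimal sketch (not the server code): same call/response shape for every provider.
from litellm import completion

messages = [{"role": "user", "content": "write me a function to print hello world"}]

# Non-streaming: text is always at ['choices'][0]['message']['content']
response = completion(model="togethercomputer/CodeLlama-34b-Instruct", messages=messages)
print(response['choices'][0]['message']['content'])

# Streaming: chunks expose text at ['choices'][0]['delta']['content']
for chunk in completion(model="gpt-4", messages=messages, stream=True):
    delta = chunk['choices'][0]['delta']
    if 'content' in delta:
        print(delta['content'], end="")
```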
## API Endpoints
### `/chat/completions` (POST)
This endpoint generates chat completions for 50+ supported LLM API models, e.g. Llama2, GPT-4, Claude-2.
#### Input
This API endpoint accepts a raw JSON body with the following inputs:
- `prompt` (string, required): The user's coding-related question
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
#### Example JSON body
For example:
```json
{
  "prompt": "write me a function to print hello world"
}
```
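
For reference, the same request as a curl command (assuming the server is running locally on port 4000, as in the Python example below):

```
curl -X POST http://localhost:4000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "write me a function to print hello world"}'
```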
### Making an API request to the Code-Gen Server
```python
import requests
import json

url = "http://localhost:4000/chat/completions"

payload = json.dumps({
  "prompt": "write me a function to print hello world"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
```
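
To stream the response instead, here is a hedged sketch, assuming the server honors the optional `stream` parameter listed above and writes chunks incrementally:

```python
import requests

url = "http://localhost:4000/chat/completions"
payload = {"prompt": "write me a function to print hello world", "stream": True}

# Print streamed chunks as they arrive; the exact chunk format depends on the server.
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))
```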
### Output [Response Format]
All responses from the server are returned in the following format (for all LLM models). More info on the output format here: https://litellm.readthedocs.io/en/latest/output/
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": ".\n\n```\ndef print_hello_world():\n print(\"hello world\")\n",
        "role": "assistant"
      }
    }
  ],
  "created": 1693279694.6474009,
  "model": "togethercomputer/CodeLlama-34b-Instruct",
  "usage": {
    "completion_tokens": 14,
    "prompt_tokens": 28,
    "total_tokens": 42
  }
}
```
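
For example, picking the generated code and token counts out of the response above (field names follow the sample payload):

```python
data = response.json()  # `response` from the requests example above

# The generated code lives at choices[0].message.content
print(data["choices"][0]["message"]["content"])

# Token usage, useful for cost tracking
usage = data["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])
```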
## Installation & Usage
### Running Locally
1. Clone the liteLLM CodeLlama server repository to your local machine:
```
git clone https://github.com/BerriAI/litellm-CodeLlama-server
```
2. Install the required dependencies using pip
```
pip install -r requirements.txt
```
3. Set your LLM API keys
```
os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
```
or set `OPENAI_API_KEY` in your `.env` file
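Since the server calls Together AI, OpenAI, Anthropic, and PromptLayer, you will likely need a key for each. The variable names below are litellm's defaults and are assumptions; check `main.py` for the exact names it reads:
```python
import os

# Assumed variable names - adjust to whatever main.py actually reads
os.environ["TOGETHERAI_API_KEY"] = "YOUR_TOGETHERAI_KEY"    # CodeLlama via Together AI
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"            # GPT-4 fallback
os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_KEY"      # Claude-2 fallback
os.environ["PROMPTLAYER_API_KEY"] = "YOUR_PROMPTLAYER_KEY"  # prompt logging
```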
4. Run the server:
```
python main.py
```
## Deploying
1. Quick Start: Deploy on Railway
[Deploy on Railway](https://railway.app/template/HuDPw-?referralCode=jch2ME)
2. `GCP`, `AWS`, `Azure`
This project includes a `Dockerfile`, allowing you to build and deploy a Docker image to the cloud provider of your choice (see the sketch below).
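
A hedged sketch of the Docker flow; the image name is arbitrary and the port mapping assumes the server listens on 4000, matching the examples above:
```
docker build -t codellama-server .
docker run -p 4000:4000 \
  -e TOGETHERAI_API_KEY=YOUR_TOGETHERAI_KEY \
  -e OPENAI_API_KEY=YOUR_OPENAI_KEY \
  -e ANTHROPIC_API_KEY=YOUR_ANTHROPIC_KEY \
  codellama-server
```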
# Support / Talk with founders
- [Our calendar](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord](https://discord.gg/wuPM9dRgDw)
- Our numbers: +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails: [email protected] / [email protected]
## Roadmap
- [ ] Implement user-based rate-limiting
- [ ] Spending controls per project - expose key creation endpoint
- [ ] Need to store a keys db -> mapping created keys to their alias (i.e. project name)
- [ ] Easily add new models as backups / as the entry-point (add this to the available model list)