---
license: mit
library_name: transformers
datasets:
- AI-MO/NuminaMath-CoT
- KbsdJames/Omni-MATH
- RUC-AIBOX/STILL-3-Preview-RL-Data
- hendrycks/competition_math
language:
- en
base_model: agentica-org/DeepScaleR-1.5B-Preview
tags:
- mlx
---
# bobig/DeepScaleR-1.5B-6.5bit
This works well as a draft model for speculative decoding in [LM Studio 3.10 beta](https://lmstudio.ai/docs/advanced/speculative-decoding).
Try it with [mlx-community/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-4.5bit](https://huggingface.co/mlx-community/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-4.5bit) as the main model.
You should see about 30% faster TPS (tokens per second) on math/code prompts, even with "thinking" slowing down the speculative decoding.
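
For readers new to the idea, the draft/verify loop behind speculative decoding can be sketched with toy stand-ins for the two models. This is a minimal greedy illustration only, not how LM Studio or mlx-lm implement it: `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain Python functions over integer tokens, and a real engine would verify all drafted tokens in one batched forward pass of the main model.

```python
from typing import Callable, List, Tuple

# A toy "model" maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))

        # 2. The main model checks the run. Each round is counted as one
        #    pass because a real engine would score the run in one batch.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)
            if expected != guess:
                break  # first mismatch: keep the main model's token, drop the rest
        else:
            # Every guess matched, so the same pass also yields a bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def toy_target(context: List[int]) -> int:
    # Hypothetical main model: counts upward modulo 10.
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    # Hypothetical draft model: agrees with the main model except after a 6.
    return 0 if context[-1] == 6 else (context[-1] + 1) % 10


prompt = [0]
max_new_tokens = 12

# Baseline: plain greedy decoding with the main model only.
baseline = list(prompt)
for _ in range(max_new_tokens):
    baseline.append(toy_target(baseline))

output, passes = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens)
print(output == baseline, passes)
```

The output is identical to decoding with the main model alone; the gain comes from needing fewer main-model passes when the draft guesses well, which is why a small model that closely tracks the large one makes a good draft.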
The model [bobig/DeepScaleR-1.5B-6.5bit](https://huggingface.co/bobig/DeepScaleR-1.5B-6.5bit) was
converted to MLX format from [agentica-org/DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)
using mlx-lm version **0.21.4**.
## Use with mlx
```bash
pip install mlx-lm
```
```python
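from mlx_lm import load, generate

# Download (if needed) and load the quantized model and its tokenizer.
model, tokenizer = load("bobig/DeepScaleR-1.5B-6.5bit")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is defined.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# verbose=True prints the generated text as it is produced.
response = generate(model, tokenizer, prompt=prompt, verbose=True)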
```