---
license: mit
library_name: transformers
datasets:
- AI-MO/NuminaMath-CoT
- KbsdJames/Omni-MATH
- RUC-AIBOX/STILL-3-Preview-RL-Data
- hendrycks/competition_math
language:
- en
base_model: agentica-org/DeepScaleR-1.5B-Preview
tags:
- mlx
---

# bobig/DeepScaleR-1.5B-6.5bit

This model works well as a draft model for speculative decoding in [LM Studio 3.10 beta](https://lmstudio.ai/docs/advanced/speculative-decoding).

Try it with: [mlx-community/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-4.5bit](https://huggingface.co/mlx-community/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-4.5bit)

You should see roughly 30% higher tokens per second (TPS) on math/code prompts, even though "thinking" output slows down speculative decoding.
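For readers new to the technique, here is a minimal toy sketch of greedy speculative decoding in plain Python. It is not how LM Studio or mlx-lm implements it, and every name in it (`speculative_decode`, `toy_target`, `toy_draft`) is hypothetical. The "models" are simple functions over integer tokens, chosen so that the draft usually agrees with the target but is sometimes wrong.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Toy greedy speculative decoding. Returns (tokens, target verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The small draft model proposes k tokens, one at a time.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The large target model checks the proposal. A real implementation
        #    scores all k positions in one batched forward pass, so this whole
        #    check is counted as a single pass.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft token matched; the same pass also yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new], passes


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "large" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "small" model: agrees with the target except after a 5.
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


def plain_greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    # Baseline: one target pass per generated token.
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


tokens, passes = speculative_decode(toy_target, toy_draft, [0], max_new=12, k=4)
baseline = plain_greedy(toy_target, [0], max_new=12)
print(tokens == baseline, passes)
```

The output is identical to decoding with the target alone, but the target runs fewer times than the number of generated tokens. The gain depends on how often the draft agrees with the target, which is why a draft model tuned for math pairs well with math/code prompts.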

The model [bobig/DeepScaleR-1.5B-6.5bit](https://huggingface.co/bobig/DeepScaleR-1.5B-6.5bit) was
converted to MLX format from [agentica-org/DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)
using mlx-lm version **0.21.4**.

## Use with mlx

```bash
pip install mlx-lm
```

```python
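from mlx_lm import load, generate

# Download (on first use) and load the quantized weights and tokenizer.
model, tokenizer = load("bobig/DeepScaleR-1.5B-6.5bit")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is defined.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# verbose=True prints the generated text as it is produced.
response = generate(model, tokenizer, prompt=prompt, verbose=True)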
```