---
library_name: transformers
tags:
- tokenizers
- sglang
license: other
license_name: grok-2
license_link: https://huggingface.co/xai-org/grok-2/blob/main/LICENSE
---

# Grok-2 Tokenizer

A 🤗-compatible version of the **Grok-2 tokenizer** (adapted from [xai-org/grok-2](https://huggingface.co/xai-org/grok-2)).

This means it can be used with Hugging Face libraries including [Transformers](https://github.com/huggingface/transformers),
[Tokenizers](https://github.com/huggingface/tokenizers), and [Transformers.js](https://github.com/xenova/transformers.js).

## Motivation

Grok 2.5, a.k.a. [xai-org/grok-2](https://github.com/xai-org/grok-2), was recently released on the 🤗 Hub with native [SGLang](https://github.com/sgl-project/sglang)
support. However, the checkpoints on the Hub don't ship with a Hugging Face compatible tokenizer, but rather with a `tiktoken`-based
JSON export, which is [internally read and patched in SGLang](https://github.com/sgl-project/sglang/blob/fd71b11b1d96d385b09cb79c91a36f1f01293639/python/sglang/srt/tokenizer/tiktoken_tokenizer.py#L29-L108).

This repository contains the Hugging Face compatible export, so that users can easily interact and experiment with the Grok-2 tokenizer.
It also makes the tokenizer usable from SGLang without first pulling the repository from the Hub and pointing to the local tokenizer path,
meaning Grok-2 can be deployed as:

```bash
python3 -m sglang.launch_server --model-path xai-org/grok-2 --tokenizer-path alvarobartt/grok-2-tokenizer --tp-size 8 --quantization fp8 --attention-backend triton
```

Rather than the previous two-step process:

```bash
hf download xai-org/grok-2 --local-dir /local/grok-2

python3 -m sglang.launch_server --model-path /local/grok-2 --tokenizer-path /local/grok-2/tokenizer.tok.json --tp-size 8 --quantization fp8 --attention-backend triton
```

## Example

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alvarobartt/grok-2-tokenizer")

assert tokenizer.encode("Human: What is Deep Learning?<|separator|>\n\n") == [
    35406,
    186,
    2171,
    458,
    17454,
    14803,
    191,
    1,
    417,
]

assert (
    tokenizer.apply_chat_template(
        [{"role": "user", "content": "What is the capital of France?"}], tokenize=False
    )
    == "Human: What is the capital of France?<|separator|>\n\n"
)
```
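Beyond checking exact token IDs, it can also be useful to inspect the string form of each token and to decode the IDs back into text. A minimal sketch, using the same tokenizer repository and prompt as above:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alvarobartt/grok-2-tokenizer")

text = "Human: What is Deep Learning?<|separator|>\n\n"
ids = tokenizer.encode(text)

# Inspect the string form of each token ID
print(tokenizer.convert_ids_to_tokens(ids))

# Decode back into text (special tokens such as <|separator|> are kept by default)
print(tokenizer.decode(ids))
```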

> [!NOTE]
> This repository has been inspired by earlier similar work by [Xenova](https://huggingface.co/Xenova) in [`Xenova/grok-1-tokenizer`](https://huggingface.co/Xenova/grok-1-tokenizer).