|
You need a custom version of the `tokenizers` library to use this tokenizer. |
|
|
|
To install this custom version, run:
|
```bash
pip install transformers
git clone https://github.com/huggingface/tokenizers.git
cd tokenizers
git checkout bigscience_fork
cd bindings/python
pip install setuptools_rust
pip install -e .
```
|
|
|
Then load the tokenizer with:
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v3-dedup-lines-articles")
```
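
Once loaded, it behaves like any other Hugging Face tokenizer. A minimal sketch of encoding and decoding (the sample sentence is illustrative; this assumes the `bigscience_fork` build of `tokenizers` installed above):

```python
from transformers import AutoTokenizer

# Requires the custom `tokenizers` build from the bigscience_fork branch.
tokenizer = AutoTokenizer.from_pretrained(
    "bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v3-dedup-lines-articles"
)

# Encode a sample sentence into token ids, then decode back to text.
ids = tokenizer("Hello, world!")["input_ids"]
text = tokenizer.decode(ids)
print(ids)
print(text)
```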