---
license: mit
---

# GPT-2 Tokenizer with unmerged digits

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
```

A fork of the GPT-2 tokenizer, which removes multi-digit tokens:

```python
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # original, for comparison

tokenizer('123.45')['input_ids']       # [16, 17, 18, 13, 19, 20]
gpt2_tokenizer('123.45')['input_ids']  # [10163, 13, 2231]
```
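The effect on pre-tokenization can be illustrated without loading the model. The sketch below (`split_digits` is a hypothetical helper, not part of this repo or of `transformers`) shows the intended behavior: digit runs are never merged, so each digit becomes its own piece while all other text is untouched.

```python
import re

def split_digits(text):
    # Each digit becomes its own piece; runs of non-digit characters
    # stay together. This mimics a tokenizer with unmerged digits.
    return re.findall(r'\d|[^\d]+', text)

print(split_digits('123.45'))  # ['1', '2', '3', '.', '4', '5']
```

Under the stock GPT-2 merges, `'123'` would instead be absorbed into a single multi-digit token.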

Backwards-compatible:

```python
tokenizer.decode([10163, 46387])       # '<unused123> pigeon'
gpt2_tokenizer.decode([10163, 46387])  # '123 pigeon'
```
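One plausible way to read the decode example above: the multi-digit token ids keep their positions but are renamed to `<unused...>` placeholders, so every other id decodes exactly as before. The toy sketch below illustrates that remapping on a two-entry vocabulary (the ids and placeholder naming are taken from the example; the remapping rule itself is an assumption, not this repo's actual code).

```python
# Toy subset of the GPT-2 vocabulary: a multi-digit token and a word token.
old_vocab = {10163: '123', 46387: ' pigeon'}

# Assumed remapping: multi-digit tokens become <unusedN> placeholders
# (named after their content here); all other tokens are unchanged.
new_vocab = {
    tid: (f'<unused{tok}>' if tok.strip().isdigit() and len(tok.strip()) > 1 else tok)
    for tid, tok in old_vocab.items()
}
print(new_vocab[10163])  # '<unused123>'
print(new_vocab[46387])  # ' pigeon'
```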
- This is for my investigations into the arithmetic capabilities of large language models. There is no model here, only a tokenizer.
- PaLM splits digits in the same way.
- Many models (notably, GPT-3) use the GPT-2 tokenizer, which doesn't do this.