Commit ed23458: Update README.md
Parent(s): b0e76f1
README.md CHANGED
@@ -4,19 +4,18 @@ license: mit
 
 ## GPT-2 Tokenizer with unmerged digits
 
+A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:
+
 ```python
 from transformers import AutoTokenizer
 
 tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
-```
 
-A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:
-```python
 tokenizer('123.45')      # [16, 17, 18, 13, 19, 20]
 gpt2_tokenizer('123.45') # [10163, 13, 2231]
 ```
 
-
+Backward-compatible:
 ```python
 tokenizer.decode([10163, 46387])      # '<unused123> pigeon'
 gpt2_tokenizer.decode([10163, 46387]) # '123 pigeon'
@@ -24,4 +23,5 @@ gpt2_tokenizer.decode([10163, 46387]) # '123 pigeon'
 
 - This is for my investigations into the arithmetic capabilities of large language models. There is no model here, only a tokenizer.
 - [PaLM](https://arxiv.org/abs/2204.02311) does this.
+- I think it's very reasonable.
 - Many models (illustriously, [GPT-3](https://arxiv.org/abs/2005.14165)) use the GPT-2 tokenizer, which doesn't do this.