Transformers
cyrilzhang committed · Commit ed23458 · Parent: b0e76f1

Update README.md

Files changed (1): README.md (+4, -4)
README.md CHANGED
@@ -4,19 +4,18 @@ license: mit
 
 ## GPT-2 Tokenizer with unmerged digits
 
+A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:
+
 ```python
 from transformers import AutoTokenizer
 
 tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
-```
 
-A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:
-```python
 tokenizer('123.45') # [16, 17, 18, 13, 19, 20]
 gpt2_tokenizer('123.45') # [10163, 13, 2231]
 ```
 
-Backwards-compatible:
+Backward-compatible:
 ```python
 tokenizer.decode([10163, 46387]) # '<unused123> pigeon'
 gpt2_tokenizer.decode([10163, 46387]) # '123 pigeon'
@@ -24,4 +23,5 @@ gpt2_tokenizer.decode([10163, 46387]) # '123 pigeon'
 
 - This is for my investigations into the arithmetic capabilities of large language models. There is no model here, only a tokenizer.
 - [PaLM](https://arxiv.org/abs/2204.02311) does this.
+- I think it's very reasonable.
 - Many models (illustriously, [GPT-3](https://arxiv.org/abs/2005.14165)) use the GPT-2 tokenizer, which doesn't do this.
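For anyone reproducing the comparison in the updated README: the snippets reference a `gpt2_tokenizer` that is never defined in the shown hunks. Below is a minimal sketch, assuming it is simply the stock `gpt2` tokenizer from the Hub (an assumption, not part of the commit):

```python
from transformers import AutoTokenizer

# Assumption: gpt2_tokenizer in the README is the stock GPT-2 tokenizer.
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')

# Multi-digit strings split into single-digit tokens with the fork,
# but stay merged under the stock GPT-2 tokenizer.
print(tokenizer('123.45')['input_ids'])       # expected: [16, 17, 18, 13, 19, 20]
print(gpt2_tokenizer('123.45')['input_ids'])  # expected: [10163, 13, 2231]

# Old token ids still decode (backward compatibility); removed multi-digit
# tokens decode to '<unused...>' placeholders instead of the digit string.
print(tokenizer.decode([10163, 46387]))       # expected: '<unused123> pigeon'
print(gpt2_tokenizer.decode([10163, 46387]))  # expected: '123 pigeon'
```

The decode example illustrates why the fork stays index-compatible: the multi-digit entries keep their ids in the vocabulary but are replaced by `<unused...>` placeholders, so existing token ids never shift.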