---
license: apache-2.0
---

Anthropic's client-side tokenizer.

Accuracy of the predicted token counts compared to the actual Claude 3 Haiku tokenizer (all models in the Claude 3 family use the same tokenizer):

```text
Tokenization results saved to __temp.txt.tokens
Text: Hello, world! This is a simple...
Actual tokens: 17
Predicted tokens: 10
Accuracy: 58.82%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: The quick brown fox jumps over...
Actual tokens: 19
Predicted tokens: 10
Accuracy: 52.63%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: In computer programming, a hel...
Actual tokens: 29
Predicted tokens: 21
Accuracy: 72.41%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: Artificial intelligence (AI) i...
Actual tokens: 30
Predicted tokens: 24
Accuracy: 80.00%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: The Eiffel Tower is a wrought-...
Actual tokens: 56
Predicted tokens: 48
Accuracy: 85.71%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: To be, or not to be, that is t...
Actual tokens: 60
Predicted tokens: 50
Accuracy: 83.33%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: In the beginning God created t...
Actual tokens: 38
Predicted tokens: 31
Accuracy: 81.58%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: Four score and seven years ago...
Actual tokens: 41
Predicted tokens: 34
Accuracy: 82.93%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: I have a dream that one day th...
Actual tokens: 51
Predicted tokens: 43
Accuracy: 84.31%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: That's one small step for man,...
Actual tokens: 22
Predicted tokens: 14
Accuracy: 63.64%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: Here are the key points about ...
Actual tokens: 203
Predicted tokens: 195
Accuracy: 96.06%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: This appears to be an excerpt ...
Actual tokens: 179
Predicted tokens: 180
Accuracy: 99.44%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: This is the beginning of the b...
Actual tokens: 194
Predicted tokens: 191
Accuracy: 98.45%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: That is the opening lines of t...
Actual tokens: 177
Predicted tokens: 163
Accuracy: 92.09%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: That's a powerful and inspirin...
Actual tokens: 193
Predicted tokens: 190
Accuracy: 98.45%
--------------------------------------------------
Tokenization results saved to __temp.txt.tokens
Text: That famous quote is from Neil...
Actual tokens: 131
Predicted tokens: 122
Accuracy: 93.13%
--------------------------------------------------
Average accuracy: 82.69%
```
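
The script that produced this log is not included here, so the following is only a sketch of how the figures can be reproduced. It assumes each accuracy is the smaller token count divided by the larger one, and that the average is the plain mean of the per-text values. Both assumptions are inferred from the numbers above, not confirmed by the source. The names `RESULTS` and `count_accuracy` are hypothetical.

```python
# Token counts copied from the log above, as (actual, predicted) pairs.
RESULTS = [
    (17, 10), (19, 10), (29, 21), (30, 24),
    (56, 48), (60, 50), (38, 31), (41, 34),
    (51, 43), (22, 14), (203, 195), (179, 180),
    (194, 191), (177, 163), (193, 190), (131, 122),
]


def count_accuracy(actual: int, predicted: int) -> float:
    """Return count agreement as a percentage.

    Assumed formula: smaller count divided by larger count, so that
    over-prediction and under-prediction are both penalised.
    """
    if actual <= 0 or predicted <= 0:
        raise ValueError("token counts must be positive")
    return 100.0 * min(actual, predicted) / max(actual, predicted)


# Per-text accuracy, then the unweighted mean across all texts.
scores = [count_accuracy(actual, predicted) for actual, predicted in RESULTS]
average = sum(scores) / len(scores)

for (actual, predicted), score in zip(RESULTS, scores):
    print(f"Actual: {actual:>3}  Predicted: {predicted:>3}  Accuracy: {score:.2f}%")
print(f"Average accuracy: {average:.2f}%")
```

Under these assumptions the accuracy only reflects agreement in token count. As far as the log shows, it does not check whether the predicted token boundaries match the actual ones.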