spaces commit
README.md
CHANGED
@@ -1,51 +1,114 @@
-- A custom regex pattern is defined to tokenize text in English, Hindi, and Kannada. This pattern is used to split text into tokens that are then processed by the BPE algorithm.
-- Texts from the datasets are concatenated into a single corpus, which is then saved to a file. This corpus serves as the input for training the BPE tokenizer.
-- Functions are defined to handle control characters, visualize tokens, and manage token rendering.
-- The BPE algorithm is trained on the prepared corpus. The process involves iteratively merging the most frequent pairs of tokens until the desired vocabulary size is reached.
+# Multilingual Tokenizer Comparison
+
+A web application to compare tokenization between a custom multilingual BPE tokenizer and OpenAI's GPT-4 tokenizer.
+
+## Live Demo
+
+Try it out: [Huggingface Spaces Demo](https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME)
+
+## Features
+
+- Supports multiple scripts:
+  - Latin (English)
+  - Devanagari (Hindi)
+  - Kannada
+- Shows token counts and IDs for both tokenizers
+- Interactive web interface
+- Example texts for comparison
+
+## Tokenizer Details
+
+### Overview
+
+The custom tokenizer was developed using Byte Pair Encoding (BPE) with a custom regex pattern designed specifically for multilingual text. The development process included:
+
+1. **Custom Regex for BPE Tokenization**:
+   - A specialized regex pattern that handles English, Hindi, and Kannada scripts
+   - Carefully designed to preserve linguistic units in each script
+
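The script-aware splitting described above can be illustrated with a small sketch. The pattern below is a hypothetical stand-in, not the regex shipped in the repository; it uses the stdlib `re` module with explicit Unicode ranges for Devanagari and Kannada.

```python
import re

# Hypothetical stand-in for the tokenizer's regex (the shipped pattern may
# differ): keep runs of one script together so BPE merges never straddle a
# script boundary. Devanagari is U+0900-U+097F, Kannada is U+0C80-U+0CFF.
PRETOKENIZE = re.compile(
    r"[A-Za-z]+(?:'[a-z]+)?"       # Latin words, simple contractions
    r"|[\u0900-\u097F]+"           # Devanagari (Hindi) runs
    r"|[\u0C80-\u0CFF]+"           # Kannada runs
    r"|\d+"                        # digit runs
    r"|\s+"                        # whitespace
    r"|[^\sA-Za-z\d\u0900-\u097F\u0C80-\u0CFF]+"  # punctuation, symbols
)

def pretokenize(text):
    """Split text into script-consistent chunks before byte-level BPE."""
    return PRETOKENIZE.findall(text)

pretokenize("Hello नमस्ते ನಮಸ್ಕಾರ 123!")
# → ['Hello', ' ', 'नमस्ते', ' ', 'ನಮಸ್ಕಾರ', ' ', '123', '!']
```

Splitting before training is what GPT-style tokenizers do as well: it bounds what BPE is allowed to merge, preserving word and script boundaries.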
+2. **Training Corpus Composition**:
+   - English (60%): from the `HuggingFaceFW/fineweb-edu` dataset
+   - Hindi (20%): from the `ai4bharat/sangraha` dataset (Devanagari script)
+   - Kannada (20%): from the `ai4bharat/sangraha` dataset (Kannada script)
+   - This distribution aligns with token distribution patterns observed in models like GPT-4
+
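The 60/20/20 mix above can be sketched as a character-budgeted concatenation. `take_until` and `build_corpus` are illustrative names, not functions from the repository, and the actual dataset loading (e.g. via the `datasets` library) is omitted.

```python
def take_until(docs, char_budget):
    """Greedily take whole documents until the character budget is met."""
    out, used = [], 0
    for doc in docs:
        if used >= char_budget:
            break
        out.append(doc)
        used += len(doc)
    return out

def build_corpus(english_docs, hindi_docs, kannada_docs, total_chars):
    """Assemble a corpus that is roughly 60% English, 20% Hindi, and
    20% Kannada by character count, matching the composition above."""
    parts = (
        take_until(english_docs, 0.6 * total_chars)
        + take_until(hindi_docs, 0.2 * total_chars)
        + take_until(kannada_docs, 0.2 * total_chars)
    )
    return "\n".join(parts)
```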
+3. **Vocabulary Details**:
+   - Total size: 3257 tokens
+   - Composition:
     - 256 byte-level tokens
     - 3000 merge operations
     - 1 special `<|endoftext|>` token
+   - Achieves an approximately 4.07x compression ratio
+
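The ~4.07x figure is reproducible from any (text, token ids) pair if compression is read as UTF-8 bytes per token — an interpretation assumed here, since the repository does not define the metric.

```python
def compression_ratio(text, token_ids):
    """UTF-8 bytes of the input divided by the number of tokens produced."""
    return len(text.encode("utf-8")) / len(token_ids)

# E.g. 8 ASCII bytes encoded into 2 tokens gives a 4.0x ratio.
ratio = compression_ratio("abcdefgh", [101, 202])
```

Note that Devanagari and Kannada characters take 3 UTF-8 bytes each, so byte-level ratios for those scripts start from a higher baseline than for ASCII.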
+### Technical Implementation
+
+The tokenizer implementation includes:
+- Custom regex patterns for multilingual text segmentation
+- BPE training with controlled merge operations
+- Special token handling
+- Efficient encoding/decoding mechanisms
+
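The "controlled merge operations" are the classic BPE loop: repeatedly find the most frequent adjacent pair of ids and replace it with a new id, for a fixed number of merges (3000 here). This is a generic byte-level sketch, not the repository's `tokenizer.py`.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Train on raw UTF-8 bytes; new ids start after the 256 byte tokens.

    Returns the merge table (pair -> new id) and the final id sequence."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = 256 + step
        ids = merge(ids, pair, 256 + step)
    return merges, ids
```

Encoding new text then replays `merges` in training order, which is why 256 byte tokens plus 3000 merges plus one special token yields exactly the 3257-token vocabulary described above.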
+## Installation
+
+```bash
+# Clone the repository
+git clone https://github.com/YOUR_USERNAME/REPO_NAME.git
+cd REPO_NAME
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Run the app locally
+python app.py
+```
+
+## Project Structure
+
+```
+├── app.py            # Gradio web interface
+├── tokenizer.py      # Custom tokenizer implementation
+├── bpe_tok.model     # Trained tokenizer model
+├── requirements.txt  # Project dependencies
+└── README.md         # Project documentation
+```
+
+## Development Process
+
+The tokenizer development involved several key steps:
+
+1. **Dataset Preparation**:
+   - Careful selection of multilingual datasets
+   - Balanced sampling to maintain script representation
+   - Text cleaning and preprocessing
+
+2. **Tokenizer Training**:
+   - Custom regex pattern development
+   - BPE training with controlled vocabulary growth
+   - Optimization for multilingual support
+
+3. **Performance Metrics**:
+   - Compression ratio: 4.07x
+   - Balanced token distribution across scripts
+   - Efficient handling of mixed-script text
+
+## Usage Examples
+
+The tokenizer effectively handles various text combinations:
+- Pure English text
+- Pure Hindi text
+- Pure Kannada text
+- Mixed-script text
+- Special tokens and control characters
+
+## License
+
+MIT License
+
+## Contributing
+
+1. Fork the repository
+2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+4. Push to the branch (`git push origin feature/AmazingFeature`)
+5. Open a Pull Request
app.py
CHANGED
@@ -4,12 +4,13 @@ from tokenizer import CustomTokenizer
 
 # Initialize tokenizers
 custom_tokenizer = CustomTokenizer("bpe_tok.model")
-tiktoken_encoder = tiktoken.
+tiktoken_encoder = tiktoken.encoding_for_model("gpt-4")
+
 
 def encode_text(text):
     # Get encodings from both tokenizers
     custom_tokens = custom_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
-    tiktoken_tokens = tiktoken_encoder.encode(text)
+    tiktoken_tokens = tiktoken_encoder.encode(text, allowed_special={"<|endoftext|>"})
 
     # Format output
     custom_output = f"Token count: {len(custom_tokens)}\nTokens: {custom_tokens}"
@@ -26,7 +27,7 @@ iface = gr.Interface(
         gr.Textbox(label="Tiktoken Output", lines=4)
     ],
     title="Tokenizer Comparison",
-    description="Compare custom BPE tokenizer with Tiktoken GPT-
+    description="Compare custom BPE tokenizer with Tiktoken GPT-4 tokenizer",
     examples=[
         ["आज तो बहुत थक गया हूँ, ಸ್ವಲ್ಪ विश्रಾಂತಿ ಬೇಕು।"],
         ["मौसम कितना अच्छा है! ನೀವೂ ಹೊರಗೆ ಬನ್ನಿ, let's enjoy together."],
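The `allowed_special={"<|endoftext|>"}` arguments in the diff above rest on splitting special tokens out of the text before byte-level encoding. A minimal sketch of that mechanism — the id 3256 (last slot of the 3257-token vocabulary) and the helper names are assumptions; the real `CustomTokenizer.encode` may differ:

```python
import re

ENDOFTEXT = "<|endoftext|>"
ENDOFTEXT_ID = 3256  # hypothetical: the final slot of a 3257-token vocabulary

def encode_with_special(text, encode_ordinary, allowed_special=()):
    """Split `text` on the special token, encode ordinary spans with the
    byte-level BPE encoder, and emit the reserved id for each special token."""
    if ENDOFTEXT not in allowed_special:
        return encode_ordinary(text)
    ids = []
    # The capturing group keeps the separators in re.split's output.
    for part in re.split(f"({re.escape(ENDOFTEXT)})", text):
        if part == ENDOFTEXT:
            ids.append(ENDOFTEXT_ID)
        elif part:
            ids.extend(encode_ordinary(part))
    return ids
```

Without the `allowed_special` opt-in, a literal `<|endoftext|>` in user input is encoded as ordinary text rather than as the reserved id, which prevents prompt text from injecting control tokens.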
app.yaml
ADDED
@@ -0,0 +1,3 @@
+sdk: gradio
+sdk_version: 4.19.2
+app_file: app.py
requirements.txt
CHANGED
@@ -1,3 +1,4 @@
 gradio
 tiktoken
-regex
+regex
+datasets