ace-1 committed on
Commit c53145d · 1 Parent(s): 9b87e96

spaces commit

Files changed (4)
  1. README.md +94 -31
  2. app.py +4 -3
  3. app.yaml +3 -0
  4. requirements.txt +2 -1
README.md CHANGED
@@ -1,51 +1,114 @@
- # BPE Tokenization with Custom Regex
-
- This notebook demonstrates the process of training a Byte Pair Encoding (BPE) tokenizer using a custom regex pattern. The tokenizer is designed to handle multilingual text, specifically English, Hindi, and Kannada. The notebook includes steps for tokenization, vocabulary building, and encoding/decoding of text.
-
- ## Overview
-
- The notebook is structured into several key sections:
-
- 1. **Custom Regex for BPE Tokenization**:
-    - A custom regex pattern is defined to tokenize text in English, Hindi, and Kannada. This pattern is used to split text into tokens that are then processed by the BPE algorithm.
-
- 2. **Dataset Loading**:
-    - Datasets from Hugging Face are loaded for English, Hindi (Devanagari script), and Kannada. These datasets are used to create a corpus for training the tokenizer.
-
- 3. **Corpus Preparation**:
-    - Texts from the datasets are concatenated into a single corpus, which is then saved to a file. This corpus serves as the input for training the BPE tokenizer.
-
- 4. **Utility Functions**:
-    - Functions are defined to handle control characters, visualize tokens, and manage token rendering.
-
- 5. **Training BPE**:
-    - The BPE algorithm is trained on the prepared corpus. The process involves iteratively merging the most frequent pairs of tokens until the desired vocabulary size is reached.
-
- 6. **Vocabulary and Model Saving**:
-    - The trained vocabulary and model are saved to disk for later use. The vocabulary consists of 3257 tokens, which includes:
       - 256 byte-level tokens
       - 3000 merge operations
       - 1 special `<|endoftext|>` token
-
- 7. **Encoding and Decoding**:
-    - Functions are provided to encode text into token IDs and decode token IDs back into text. Special tokens are handled as part of this process.
-
- 8. **Testing**:
-    - The tokenizer is tested on sample texts to verify its performance and compression ratio.
-
- ## Key Details
-
- - **Vocabulary Size**: The final vocabulary size is set to 3257 tokens (256 byte tokens + 3000 merges + 1 `<|endoftext|>` token).
- - **Tokenizer Training Corpus Composition**: The training corpus is constructed by combining texts from multiple datasets with the following distribution:
-   - `HuggingFaceFW/fineweb-edu` (English): 60% of the corpus, aligning with the token distribution patterns observed in advanced language models like GPT-4, where English tokens constitute a significant majority
-   - `ai4bharat/sangraha` (Hindi - Devanagari script): 20% of the corpus
-   - `ai4bharat/sangraha` (Kannada - Kannada script): 20% of the corpus
- - **Compression Ratio**: The compression ratio achieved by the BPE tokenizer is approximately 4.07x, indicating the efficiency of the tokenization process in reducing the size of the text representation.
-
- ## Usage
-
- To use the tokenizer, load the saved model and vocabulary files, and utilize the provided encoding and decoding functions to process text. The tokenizer is capable of handling multilingual text and special tokens, making it suitable for diverse applications.
-
- ## Conclusion
-
- This notebook provides a comprehensive guide to training a BPE tokenizer with custom regex patterns for multilingual text. The process includes dataset preparation, tokenization, vocabulary building, and model saving, offering a robust solution for text processing tasks.
+ # Multilingual Tokenizer Comparison
+
+ A web application to compare tokenization between a custom multilingual BPE tokenizer and OpenAI's GPT-4 tokenizer.
+
+ ## Live Demo
+
+ Try it out: [Hugging Face Spaces Demo](https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME)
+
+ ## Features
+
+ - Supports multiple scripts:
+   - Latin (English)
+   - Devanagari (Hindi)
+   - Kannada
+ - Shows token counts and IDs for both tokenizers
+ - Interactive web interface
+ - Example texts for comparison
+
+ ## Tokenizer Details
+
+ ### Overview
+
+ The custom tokenizer was developed using Byte Pair Encoding (BPE) with a custom regex pattern designed specifically for multilingual text. The development process included:
 
+ 1. **Custom Regex for BPE Tokenization**:
+    - A specialized regex pattern that handles English, Hindi, and Kannada scripts
+    - Carefully designed to preserve linguistic units in each script
+
+ 2. **Training Corpus Composition**:
+    - English (60%): From the `HuggingFaceFW/fineweb-edu` dataset
+    - Hindi (20%): From the `ai4bharat/sangraha` dataset (Devanagari script)
+    - Kannada (20%): From the `ai4bharat/sangraha` dataset (Kannada script)
+    - This distribution aligns with token distribution patterns observed in models like GPT-4
+
+ 3. **Vocabulary Details**:
+    - Total size: 3257 tokens
+    - Composition:
       - 256 byte-level tokens
       - 3000 merge operations
       - 1 special `<|endoftext|>` token
+    - Achieves approximately 4.07x compression ratio
+
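The custom pattern itself is not reproduced in this README. As a rough illustration of script-aware pre-tokenization, here is a simplified stand-in built only from explicit Unicode block ranges; the project's real pattern, its character classes, and its punctuation handling are assumptions here:

```python
import re

# Simplified stand-in for the project's custom pattern (hypothetical --
# the real pattern is not shown in this README). It keeps runs of Latin,
# Devanagari (U+0900-U+097F), and Kannada (U+0C80-U+0CFF) characters
# together, so later BPE merges never cross a script boundary.
PATTERN = re.compile(
    r"[A-Za-z]+|[\u0900-\u097F]+|[\u0C80-\u0CFF]+|\d+|\s+|[^\sA-Za-z\d]+"
)

def pre_tokenize(text: str) -> list[str]:
    """Split text into script-homogeneous chunks before BPE merging."""
    return PATTERN.findall(text)

print(pre_tokenize("hello नमस्ते"))
```

Keeping each chunk within one script is what lets the merge budget stay balanced across languages instead of being dominated by English byte sequences.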
+ ### Technical Implementation
+
+ The tokenizer implementation includes:
+ - Custom regex patterns for multilingual text segmentation
+ - BPE training with controlled merge operations
+ - Special token handling
+ - Efficient encoding/decoding mechanisms
+
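The training code itself ships with the notebook, not this Space. As a sketch of what "BPE training with controlled merge operations" means, a minimal merge loop (an illustrative toy, not the project's implementation) could look like:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict[tuple[int, int], int]:
    """Minimal BPE trainer sketch: start from the 256 byte tokens, then
    repeatedly merge the most frequent adjacent pair into a new token id."""
    ids = list(text.encode("utf-8"))
    merges: dict[tuple[int, int], int] = {}
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        new_id = 256 + i
        merges[pair] = new_id
        # Replace every occurrence of the pair with the new token id.
        out, j = [], 0
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return merges
```

Running 3000 iterations of a loop like this over the prepared corpus is what produces the 3000 merge operations in the vocabulary above.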
+ ## Installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/YOUR_USERNAME/REPO_NAME.git
+ cd REPO_NAME
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the app locally
+ python app.py
+ ```
+
+ ## Project Structure
+
+ ```
+ ├── app.py            # Gradio web interface
+ ├── tokenizer.py      # Custom tokenizer implementation
+ ├── bpe_tok.model     # Trained tokenizer model
+ ├── requirements.txt  # Project dependencies
+ └── README.md         # Project documentation
+ ```
+
+ ## Development Process
+
+ The tokenizer development involved several key steps:
+
+ 1. **Dataset Preparation**:
+    - Careful selection of multilingual datasets
+    - Balanced sampling to maintain script representation
+    - Text cleaning and preprocessing
+
+ 2. **Tokenizer Training**:
+    - Custom regex pattern development
+    - BPE training with controlled vocabulary growth
+    - Optimization for multilingual support
+
+ 3. **Performance Metrics**:
+    - Compression ratio: 4.07x
+    - Balanced token distribution across scripts
+    - Efficient handling of mixed-script text
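A common way to define the 4.07x compression ratio is UTF-8 bytes of input per output token; assuming that is the definition used here, it is computed as:

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """UTF-8 bytes of input per produced token; higher means the
    tokenizer packs more text into each token. (Assumed definition --
    the notebook's exact formula is not shown in this README.)"""
    return len(text.encode("utf-8")) / len(token_ids)
```

Note that Devanagari and Kannada characters are three UTF-8 bytes each, so a byte-level tokenizer needs multilingual merges to reach a ratio like this on Hindi or Kannada text.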
 
+ ## Usage Examples
+
+ The tokenizer effectively handles various text combinations:
+ - Pure English text
+ - Pure Hindi text
+ - Pure Kannada text
+ - Mixed-script text
+ - Special tokens and control characters
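As a sketch of why a byte-level BPE round-trips any of these combinations, here is a toy encoder/decoder pair; it assumes merges are applied in the order they were learned, which matches how the merge table above is built, and is not the project's actual `tokenizer.py`:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Byte-level BPE encode: start from raw UTF-8 bytes, then apply
    each learned merge over the whole sequence in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        out, j = [], 0
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return ids

def decode(ids: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand merged tokens back to raw bytes, then decode UTF-8."""
    inverse = {v: k for k, v in merges.items()}
    queue = list(ids)
    raw = []
    while queue:
        tid = queue.pop(0)
        if tid in inverse:
            # Expand a merged token back into the pair it replaced.
            queue = list(inverse[tid]) + queue
        else:
            raw.append(tid)
    return bytes(raw).decode("utf-8", errors="replace")
```

Because everything bottoms out at the 256 byte tokens, even text in scripts the merges never saw still encodes and decodes losslessly, just with a worse compression ratio.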
 
+ ## License
+
+ MIT License
+
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
app.py CHANGED
@@ -4,12 +4,13 @@ from tokenizer import CustomTokenizer

 # Initialize tokenizers
 custom_tokenizer = CustomTokenizer("bpe_tok.model")
- tiktoken_encoder = tiktoken.get_encoding("gpt2")
+ tiktoken_encoder = tiktoken.encoding_for_model("gpt-4")
+

 def encode_text(text):
     # Get encodings from both tokenizers
     custom_tokens = custom_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
-    tiktoken_tokens = tiktoken_encoder.encode(text)
+    tiktoken_tokens = tiktoken_encoder.encode(text, allowed_special={"<|endoftext|>"})

     # Format output
     custom_output = f"Token count: {len(custom_tokens)}\nTokens: {custom_tokens}"
@@ -26,7 +27,7 @@ iface = gr.Interface(
         gr.Textbox(label="Tiktoken Output", lines=4)
     ],
     title="Tokenizer Comparison",
-    description="Compare custom BPE tokenizer with Tiktoken GPT-2 tokenizer",
+    description="Compare custom BPE tokenizer with Tiktoken GPT-4 tokenizer",
     examples=[
         ["आज तो बहुत थक गया हूँ, ಸ್ವಲ್ಪ विश्रಾಂತಿ ಬೇಕು।"],
         ["मौसम कितना अच्छा है! ನೀವೂ ಹೊರಗೆ ಬನ್ನಿ, let's enjoy together."],
app.yaml ADDED
@@ -0,0 +1,3 @@
+ sdk: gradio
+ sdk_version: 4.19.2
+ app_file: app.py
requirements.txt CHANGED
@@ -1,3 +1,4 @@
 gradio
 tiktoken
- regex
+ regex
+ datasets