ace-1 committed on
Commit c53145d · 1 Parent(s): 9b87e96

spaces commit

Files changed (4)
  1. README.md +94 -31
  2. app.py +4 -3
  3. app.yaml +3 -0
  4. requirements.txt +2 -1
README.md CHANGED
@@ -1,51 +1,114 @@
- # BPE Tokenization with Custom Regex
-
- This notebook demonstrates the process of training a Byte Pair Encoding (BPE) tokenizer using a custom regex pattern. The tokenizer is designed to handle multilingual text, specifically English, Hindi, and Kannada. The notebook includes steps for tokenization, vocabulary building, and encoding/decoding of text.
-
- ## Overview
-
- The notebook is structured into several key sections:
-
- 1. **Custom Regex for BPE Tokenization**:
-    - A custom regex pattern is defined to tokenize text in English, Hindi, and Kannada. This pattern is used to split text into tokens that are then processed by the BPE algorithm.
-
- 2. **Dataset Loading**:
-    - Datasets from Hugging Face are loaded for English, Hindi (Devanagari script), and Kannada. These datasets are used to create a corpus for training the tokenizer.
-
- 3. **Corpus Preparation**:
-    - Texts from the datasets are concatenated into a single corpus, which is then saved to a file. This corpus serves as the input for training the BPE tokenizer.
-
- 4. **Utility Functions**:
-    - Functions are defined to handle control characters, visualize tokens, and manage token rendering.
-
- 5. **Training BPE**:
-    - The BPE algorithm is trained on the prepared corpus. The process involves iteratively merging the most frequent pairs of tokens until the desired vocabulary size is reached.
-
- 6. **Vocabulary and Model Saving**:
-    - The trained vocabulary and model are saved to disk for later use. The vocabulary consists of 3257 tokens, which includes:
       - 256 byte-level tokens
       - 3000 merge operations
       - 1 special `<|endoftext|>` token
-
- 7. **Encoding and Decoding**:
-    - Functions are provided to encode text into token IDs and decode token IDs back into text. Special tokens are handled as part of this process.
-
- 8. **Testing**:
-    - The tokenizer is tested on sample texts to verify its performance and compression ratio.
-
- ## Key Details
-
- - **Vocabulary Size**: The final vocabulary size is set to 3257 tokens (256 byte tokens + 3000 merges + 1 `<|endoftext|>` token).
- - **Tokenizer Training Corpus Composition**: The training corpus is constructed by combining texts from multiple datasets with the following distribution:
-   - `HuggingFaceFW/fineweb-edu` (English): 60% of the corpus, aligning with the token distribution patterns observed in advanced language models like GPT-4, where English tokens constitute a significant majority
-   - `ai4bharat/sangraha` (Hindi - Devanagari script): 20% of the corpus
-   - `ai4bharat/sangraha` (Kannada - Kannada script): 20% of the corpus
- - **Compression Ratio**: The compression ratio achieved by the BPE tokenizer is approximately 4.07x, indicating the efficiency of the tokenization process in reducing the size of the text representation.
-
- ## Usage
-
- To use the tokenizer, load the saved model and vocabulary files, and utilize the provided encoding and decoding functions to process text. The tokenizer is capable of handling multilingual text and special tokens, making it suitable for diverse applications.
-
- ## Conclusion
-
- This notebook provides a comprehensive guide to training a BPE tokenizer with custom regex patterns for multilingual text. The process includes dataset preparation, tokenization, vocabulary building, and model saving, offering a robust solution for text processing tasks.
+ # Multilingual Tokenizer Comparison
+
+ A web application to compare tokenization between a custom multilingual BPE tokenizer and OpenAI's GPT-4 tokenizer.
+
+ ## Live Demo
+
+ Try it out: [Hugging Face Spaces Demo](https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME)
+
+ ## Features
+
+ - Supports multiple scripts:
+   - Latin (English)
+   - Devanagari (Hindi)
+   - Kannada
+ - Shows token counts and IDs for both tokenizers
+ - Interactive web interface
+ - Example texts for comparison
+
+ ## Tokenizer Details
+
+ ### Overview
+
+ The custom tokenizer was developed using Byte Pair Encoding (BPE) with a custom regex pattern designed specifically for multilingual text. The development process included:
 
+ 1. **Custom Regex for BPE Tokenization**:
+    - A specialized regex pattern that handles English, Hindi, and Kannada scripts
+    - Carefully designed to preserve linguistic units in each script
+
+ 2. **Training Corpus Composition**:
+    - English (60%): From the `HuggingFaceFW/fineweb-edu` dataset
+    - Hindi (20%): From the `ai4bharat/sangraha` dataset (Devanagari script)
+    - Kannada (20%): From the `ai4bharat/sangraha` dataset (Kannada script)
+    - This distribution aligns with token distribution patterns observed in models like GPT-4
+
+ 3. **Vocabulary Details**:
+    - Total size: 3257 tokens
+    - Composition:
       - 256 byte-level tokens
       - 3000 merge operations
       - 1 special `<|endoftext|>` token
+    - Achieves approximately 4.07x compression ratio
+
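The custom pattern itself is not reproduced in this README. As a rough illustration of script-aware pre-tokenization, here is a simplified stand-in built only from explicit Unicode block ranges; the project's real pattern, its character classes, and its punctuation handling are assumptions here:

```python
import re

# Simplified stand-in for the project's custom pattern (hypothetical --
# the real pattern is not shown in this README). It keeps runs of Latin,
# Devanagari (U+0900-U+097F), and Kannada (U+0C80-U+0CFF) characters
# together, so later BPE merges never cross a script boundary.
PATTERN = re.compile(
    r"[A-Za-z]+|[\u0900-\u097F]+|[\u0C80-\u0CFF]+|\d+|\s+|[^\sA-Za-z\d]+"
)

def pre_tokenize(text: str) -> list[str]:
    """Split text into script-homogeneous chunks before BPE merging."""
    return PATTERN.findall(text)

print(pre_tokenize("hello नमस्ते"))
```

Keeping each chunk within one script is what lets the merge budget stay balanced across languages instead of being dominated by English byte sequences.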
+ ### Technical Implementation
+
+ The tokenizer implementation includes:
+ - Custom regex patterns for multilingual text segmentation
+ - BPE training with controlled merge operations
+ - Special token handling
+ - Efficient encoding/decoding mechanisms
+
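The training code itself ships with the notebook, not this Space. As a sketch of what "BPE training with controlled merge operations" means, a minimal merge loop (an illustrative toy, not the project's implementation) could look like:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict[tuple[int, int], int]:
    """Minimal BPE trainer sketch: start from the 256 byte tokens, then
    repeatedly merge the most frequent adjacent pair into a new token id."""
    ids = list(text.encode("utf-8"))
    merges: dict[tuple[int, int], int] = {}
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        new_id = 256 + i
        merges[pair] = new_id
        # Replace every occurrence of the pair with the new token id.
        out, j = [], 0
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return merges
```

Running 3000 iterations of a loop like this over the prepared corpus is what produces the 3000 merge operations in the vocabulary above.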
+ ## Installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/YOUR_USERNAME/REPO_NAME.git
+ cd REPO_NAME
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the app locally
+ python app.py
+ ```
+
+ ## Project Structure
+
+ ```
+ ├── app.py            # Gradio web interface
+ ├── tokenizer.py      # Custom tokenizer implementation
+ ├── bpe_tok.model     # Trained tokenizer model
+ ├── requirements.txt  # Project dependencies
+ └── README.md         # Project documentation
+ ```
+
+ ## Development Process
+
+ The tokenizer development involved several key steps:
+
+ 1. **Dataset Preparation**:
+    - Careful selection of multilingual datasets
+    - Balanced sampling to maintain script representation
+    - Text cleaning and preprocessing
+
+ 2. **Tokenizer Training**:
+    - Custom regex pattern development
+    - BPE training with controlled vocabulary growth
+    - Optimization for multilingual support
+
+ 3. **Performance Metrics**:
+    - Compression ratio: 4.07x
+    - Balanced token distribution across scripts
+    - Efficient handling of mixed-script text
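A common way to define the 4.07x compression ratio is UTF-8 bytes of input per output token; assuming that is the definition used here, it is computed as:

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """UTF-8 bytes of input per produced token; higher means the
    tokenizer packs more text into each token. (Assumed definition --
    the notebook's exact formula is not shown in this README.)"""
    return len(text.encode("utf-8")) / len(token_ids)
```

Note that Devanagari and Kannada characters are three UTF-8 bytes each, so a byte-level tokenizer needs multilingual merges to reach a ratio like this on Hindi or Kannada text.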
 
+ ## Usage Examples
+
+ The tokenizer effectively handles various text combinations:
+ - Pure English text
+ - Pure Hindi text
+ - Pure Kannada text
+ - Mixed-script text
+ - Special tokens and control characters
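As a sketch of why a byte-level BPE round-trips any of these combinations, here is a toy encoder/decoder pair; it assumes merges are applied in the order they were learned, which matches how the merge table above is built, and is not the project's actual `tokenizer.py`:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Byte-level BPE encode: start from raw UTF-8 bytes, then apply
    each learned merge over the whole sequence in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        out, j = [], 0
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return ids

def decode(ids: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand merged tokens back to raw bytes, then decode UTF-8."""
    inverse = {v: k for k, v in merges.items()}
    queue = list(ids)
    raw = []
    while queue:
        tid = queue.pop(0)
        if tid in inverse:
            # Expand a merged token back into the pair it replaced.
            queue = list(inverse[tid]) + queue
        else:
            raw.append(tid)
    return bytes(raw).decode("utf-8", errors="replace")
```

Because everything bottoms out at the 256 byte tokens, even text in scripts the merges never saw still encodes and decodes losslessly, just with a worse compression ratio.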
 
+ ## License
+
+ MIT License
+
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
app.py CHANGED
@@ -4,12 +4,13 @@ from tokenizer import CustomTokenizer

 # Initialize tokenizers
 custom_tokenizer = CustomTokenizer("bpe_tok.model")
- tiktoken_encoder = tiktoken.get_encoding("gpt2")
+ tiktoken_encoder = tiktoken.encoding_for_model("gpt-4")
+

 def encode_text(text):
     # Get encodings from both tokenizers
     custom_tokens = custom_tokenizer.encode(text, allowed_special={"<|endoftext|>"})
-    tiktoken_tokens = tiktoken_encoder.encode(text)
+    tiktoken_tokens = tiktoken_encoder.encode(text, allowed_special={"<|endoftext|>"})

     # Format output
     custom_output = f"Token count: {len(custom_tokens)}\nTokens: {custom_tokens}"
@@ -26,7 +27,7 @@ iface = gr.Interface(
         gr.Textbox(label="Tiktoken Output", lines=4)
     ],
     title="Tokenizer Comparison",
-    description="Compare custom BPE tokenizer with Tiktoken GPT-2 tokenizer",
+    description="Compare custom BPE tokenizer with Tiktoken GPT-4 tokenizer",
     examples=[
         ["आज तो बहुत थक गया हूँ, ಸ್ವಲ್ಪ विश्रಾಂತಿ ಬೇಕು।"],
         ["मौसम कितना अच्छा है! ನೀವೂ ಹೊರಗೆ ಬನ್ನಿ, let's enjoy together."],
app.yaml ADDED
@@ -0,0 +1,3 @@
+ sdk: gradio
+ sdk_version: 4.19.2
+ app_file: app.py
requirements.txt CHANGED
@@ -1,3 +1,4 @@
 gradio
 tiktoken
- regex
+ regex
+ datasets