---
title: Multilingual Tokenizer Comparison
emoji: π
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: "4.19.2"
app_file: app.py
pinned: false
---
# Multilingual Tokenizer Comparison
A web application to compare tokenization between a custom multilingual BPE tokenizer and OpenAI's GPT-4 tokenizer.
## Live Demo
Try it out: [Huggingface Spaces Demo](https://huggingface.co/spaces/ace-1/bpe_tok)
## Features
- Supports multiple scripts:
  - Latin (English)
  - Devanagari (Hindi)
  - Kannada
- Shows token counts and IDs for both tokenizers
- Interactive web interface
- Example texts for comparison
## Tokenizer Details
### Overview
The custom tokenizer was developed using Byte Pair Encoding (BPE) with a custom regex pattern designed specifically for multilingual text. The development process included:
1. **Custom Regex for BPE Tokenization**:
   - A specialized regex pattern that handles English, Hindi, and Kannada scripts
   - Carefully designed to preserve linguistic units in each script (see the sketch after this list)
2. **Training Corpus Composition**:
   - English (60%): from the `HuggingFaceFW/fineweb-edu` dataset
   - Hindi (20%): from the `ai4bharat/sangraha` dataset (Devanagari script)
   - Kannada (20%): from the `ai4bharat/sangraha` dataset (Kannada script)
   - This weighting roughly mirrors the English-heavy token distributions observed in models such as GPT-4
3. **Vocabulary Details**:
   - Total size: 3,257 tokens
   - Composition:
     - 256 byte-level tokens
     - 3,000 merge operations
     - 1 special `<|endoftext|>` token
   - Achieves a compression ratio of approximately 4.07x
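As referenced in step 1 above, a split pattern along these lines would keep each script's words intact before BPE merges are learned. This is a minimal sketch assuming GPT-style pre-tokenization; the actual pattern lives in `tokenizer.py` and may differ:

```python
import regex as re  # third-party `regex` module; supports \p{...} script properties

# Illustrative multilingual split pattern (an assumption, not the shipped one):
# contractions, per-script letter runs, digits, punctuation, and whitespace.
MULTILINGUAL_PAT = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"   # English contractions
    r"| ?\p{Latin}+"          # Latin-script (English) words
    r"| ?\p{Devanagari}+"     # Hindi (Devanagari) runs
    r"| ?\p{Kannada}+"        # Kannada runs
    r"| ?\p{N}+"              # digit runs
    r"| ?[^\s\p{L}\p{N}]+"    # punctuation and symbols
    r"|\s+"                   # whitespace
)

print(MULTILINGUAL_PAT.findall("Hello नमस्ते ನಮಸ್ಕಾರ 123!"))
# -> ['Hello', ' नमस्ते', ' ನಮಸ್ಕಾರ', ' 123', '!']
```

Splitting text into such chunks first means BPE merges never cross script or word boundaries, which preserves linguistic units in each script.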
### Technical Implementation
The tokenizer implementation includes:
- Custom regex patterns for multilingual text segmentation
- BPE training with controlled merge operations
- Special token handling
- Efficient encoding/decoding mechanisms (sketched below)
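To make the encode path concrete, here is a minimal byte-level BPE encode loop consistent with the vocabulary above (256 byte tokens plus learned merges). It is a sketch, not the shipped code: the `merges` mapping (pair of IDs to new token ID, where lower IDs were learned earlier) is an assumption about how `tokenizer.py` stores its model:

```python
def bpe_encode(chunk: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Encode one pre-tokenized chunk: start from raw UTF-8 bytes (IDs 0-255),
    then repeatedly apply the earliest-learned merge until none applies."""
    ids = list(chunk.encode("utf-8"))  # the 256 byte-level base tokens
    while len(ids) >= 2:
        # earliest-learned merge = lowest new-token ID among adjacent pairs
        pair = min(zip(ids, ids[1:]), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies to any adjacent pair
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)  # replace the pair with its merged token
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```

Decoding reverses the process: each token ID expands back to its byte sequence, and the concatenated bytes are decoded as UTF-8.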
## Installation
```bash
# Clone the repository
git clone https://github.com/MohammedYaseen97/bpe_tok_era.git
cd bpe_tok_era
# Install dependencies
pip install -r requirements.txt
# Run the app locally
python app.py
```
## Project Structure
```
├── app.py              # Gradio web interface
├── tokenizer.py        # Custom tokenizer implementation
├── bpe_tok.model       # Trained tokenizer model
├── requirements.txt    # Project dependencies
└── README.md           # Project documentation
```
## Development Process
The tokenizer development involved several key steps:
1. **Dataset Preparation** (see the sampling sketch after this list):
   - Careful selection of multilingual datasets
   - Balanced sampling to maintain script representation
   - Text cleaning and preprocessing
2. **Tokenizer Training**:
   - Custom regex pattern development
   - BPE training with controlled vocabulary growth
   - Optimization for multilingual support
3. **Performance Metrics**:
   - Compression ratio: 4.07x
   - Balanced token distribution across scripts
   - Efficient handling of mixed-script text
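A minimal sketch of what the 60/20/20 corpus sampling in step 1 could look like, using the `datasets` library in streaming mode. The `data_dir` values for `ai4bharat/sangraha` and the `take_chars` helper are assumptions; check each dataset card for the exact configuration names:

```python
from datasets import load_dataset  # pip install datasets

TARGET_CHARS = 10_000_000  # illustrative total corpus size

def take_chars(stream, n_chars, text_key="text"):
    """Pull examples from a streaming dataset until ~n_chars of text."""
    texts, total = [], 0
    for example in stream:
        texts.append(example[text_key])
        total += len(example[text_key])
        if total >= n_chars:
            break
    return texts

english = take_chars(
    load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True),
    int(0.6 * TARGET_CHARS),
)
hindi = take_chars(  # "verified/hin" is an assumed config; see the dataset card
    load_dataset("ai4bharat/sangraha", data_dir="verified/hin",
                 split="train", streaming=True),
    int(0.2 * TARGET_CHARS),
)
kannada = take_chars(  # "verified/kan" is an assumed config; see the dataset card
    load_dataset("ai4bharat/sangraha", data_dir="verified/kan",
                 split="train", streaming=True),
    int(0.2 * TARGET_CHARS),
)

training_corpus = "\n".join(english + hindi + kannada)
```

Streaming avoids downloading each dataset in full, which matters since only a small, fixed-size sample per language is needed for tokenizer training.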
## Usage Examples
The tokenizer effectively handles various text combinations (see the example after this list):
- Pure English text
- Pure Hindi text
- Pure Kannada text
- Mixed script text
- Special tokens and control characters
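For example, a side-by-side comparison like the one the app performs could look as follows. The `Tokenizer` class name and its `load`/`encode`/`decode` methods are assumptions inferred from the project structure; `tiktoken` is OpenAI's tokenizer library and provides the GPT-4 (cl100k_base) encoding:

```python
import tiktoken                  # pip install tiktoken
from tokenizer import Tokenizer  # assumed class name in tokenizer.py

custom = Tokenizer()
custom.load("bpe_tok.model")     # assumed loading API for the trained model
gpt4 = tiktoken.encoding_for_model("gpt-4")  # cl100k_base encoding

text = "Hello दुनिया! ಹಲೋ ವರ್ಲ್ಡ್"  # mixed Latin, Devanagari, and Kannada
custom_ids = custom.encode(text)
gpt4_ids = gpt4.encode(text)

print(f"custom: {len(custom_ids)} tokens {custom_ids}")
print(f"gpt-4 : {len(gpt4_ids)} tokens {gpt4_ids}")
assert custom.decode(custom_ids) == text  # round-trip sanity check
```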
## License
MIT License
## Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request |