---
title: Multilingual Tokenizer Comparison
emoji: πŸ”
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: "4.19.2"
app_file: app.py
pinned: false
---

# Multilingual Tokenizer Comparison

A web application to compare tokenization between a custom multilingual BPE tokenizer and OpenAI's GPT-4 tokenizer.

## Live Demo

Try it out: [Hugging Face Spaces Demo](https://huggingface.co/spaces/ace-1/bpe_tok)

## Features

- Supports multiple scripts:
  - Latin (English)
  - Devanagari (Hindi)
  - Kannada
- Shows token counts and IDs for both tokenizers
- Interactive web interface (a minimal Gradio sketch follows this list)
- Example texts for comparison
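
Under the hood, the app is a thin Gradio wrapper around the two tokenizers. The snippet below is a minimal sketch of how such a comparison interface can be wired up; the `MultilingualBPE` class and its `load`/`encode` methods are illustrative placeholders for the actual code in `tokenizer.py`, while the GPT-4 side uses the `tiktoken` package.

```python
# Minimal comparison-app sketch (custom-tokenizer names are assumptions, not the real API).
import gradio as gr
import tiktoken

from tokenizer import MultilingualBPE  # hypothetical class name; see tokenizer.py

custom_tok = MultilingualBPE()
custom_tok.load("bpe_tok.model")                 # assumed loading method
gpt4_tok = tiktoken.encoding_for_model("gpt-4")  # cl100k_base encoding

def compare(text: str):
    custom_ids = custom_tok.encode(text)         # assumed encode() -> list[int]
    gpt4_ids = gpt4_tok.encode(text)
    return (
        f"{len(custom_ids)} tokens\n{custom_ids}",
        f"{len(gpt4_ids)} tokens\n{gpt4_ids}",
    )

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Input text", lines=3),
    outputs=[gr.Textbox(label="Custom BPE"), gr.Textbox(label="GPT-4 (tiktoken)")],
    examples=[["Hello, world!"], ["नमस्ते दुनिया"], ["ನಮಸ್ಕಾರ ಜಗತ್ತು"]],
    title="Multilingual Tokenizer Comparison",
)

if __name__ == "__main__":
    demo.launch()
```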

## Tokenizer Details

### Overview

The custom tokenizer was developed using Byte Pair Encoding (BPE) with a custom regex pattern designed specifically for multilingual text. The development process included:

1. **Custom Regex for BPE Tokenization**:
   - A specialized regex pattern that handles English, Hindi, and Kannada scripts
   - Carefully designed to preserve linguistic units in each script (an illustrative pattern is sketched after this list)

2. **Training Corpus Composition**:
   - English (60%): from the `HuggingFaceFW/fineweb-edu` dataset
   - Hindi (20%): from the `ai4bharat/sangraha` dataset (Devanagari script)
   - Kannada (20%): from the `ai4bharat/sangraha` dataset (Kannada script)
   - This split roughly mirrors the token distributions observed in models such as GPT-4 (a corpus-loading sketch also follows the list)

3. **Vocabulary Details**:
   - Total Size: 3257 tokens
   - Composition:
     - 256 byte-level tokens
     - 3000 merge operations
     - 1 special `<|endoftext|>` token
   - Achieves a compression ratio of approximately 4.07x
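
The pattern referenced in point 1 is, in spirit, a GPT-style split regex extended so that Devanagari and Kannada words survive pre-segmentation intact (a plain `\p{L}+` class would split Indic words at their combining vowel signs). The pattern below is an illustrative reconstruction under that assumption, not the exact regex shipped in `tokenizer.py`:

```python
# Illustrative multilingual split pattern (an assumption; the shipped regex may differ).
# The third-party `regex` module is needed for \p{...} Unicode property classes.
import regex as re

SPLIT_PATTERN = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"      # common English contractions
    r"| ?[\u0900-\u097F]+"       # Devanagari runs (letters + vowel signs) for Hindi
    r"| ?[\u0C80-\u0CFF]+"       # Kannada runs (letters + vowel signs)
    r"| ?\p{Latin}+"             # Latin-script words for English
    r"| ?\p{N}+"                 # digit runs
    r"| ?[^\s\p{L}\p{M}\p{N}]+"  # punctuation and symbols
    r"|\s+(?!\S)"                # trailing whitespace
    r"|\s+"                      # any other whitespace
)

sample = "Hello दुनिया ನಮಸ್ಕಾರ 123!"
print(SPLIT_PATTERN.findall(sample))
# -> ['Hello', ' दुनिया', ' ನಮಸ್ಕಾರ', ' 123', '!']
```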

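The 60/20/20 mix from point 2 can be assembled with the Hugging Face `datasets` library. The outline below is only a sketch: the `data_dir` arguments, column names, and document budget are assumptions to be checked against each dataset card.

```python
# Sketch of a 60/20/20 training mix with the `datasets` library (arguments are assumptions).
from datasets import load_dataset, interleave_datasets

en = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
hi = load_dataset("ai4bharat/sangraha", data_dir="verified/hin", split="train", streaming=True)
kn = load_dataset("ai4bharat/sangraha", data_dir="verified/kan", split="train", streaming=True)

# Keep only the text column so the three streams can be interleaved, then mix 60/20/20.
streams = [ds.select_columns(["text"]) for ds in (en, hi, kn)]
mix = interleave_datasets(streams, probabilities=[0.6, 0.2, 0.2], seed=42)

# Take a fixed document budget for tokenizer training (10k docs is a placeholder).
corpus = "\n".join(row["text"] for _, row in zip(range(10_000), mix))
```
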
### Technical Implementation

The tokenizer implementation includes the following (the encoding path is outlined in the sketch after this list):
- Custom regex patterns for multilingual text segmentation
- BPE training with controlled merge operations
- Special token handling
- Efficient encoding/decoding mechanisms
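
In outline, encoding follows the standard byte-level BPE recipe: split the text with the multilingual regex, UTF-8 encode each chunk, then repeatedly apply the learned merges in training order. The skeleton below illustrates that flow; function names and data layout are assumptions rather than the actual interface of `tokenizer.py`, and special-token handling (mapping `<|endoftext|>` to its reserved id before the regex pass) is omitted for brevity.

```python
# Illustrative BPE encoding path (assumed structure; see tokenizer.py for the real one).
def encode_chunk(chunk: bytes, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned merges to one regex chunk, starting from raw byte values 0..255."""
    ids = list(chunk)
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # The pair merged earliest in training has the lowest assigned token id.
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no applicable merges remain
        new_id, merged, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids

def encode(text: str, merges, split_pattern) -> list[int]:
    """Regex pre-segmentation followed by per-chunk BPE merging."""
    ids = []
    for chunk in split_pattern.findall(text):
        ids.extend(encode_chunk(chunk.encode("utf-8"), merges))
    return ids
```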

## Installation

```bash
# Clone the repository
git clone https://github.com/MohammedYaseen97/bpe_tok_era.git
cd bpe_tok_era

# Install dependencies
pip install -r requirements.txt

# Run the app locally
python app.py
```

## Project Structure

```
β”œβ”€β”€ app.py              # Gradio web interface
β”œβ”€β”€ tokenizer.py        # Custom tokenizer implementation
β”œβ”€β”€ bpe_tok.model       # Trained tokenizer model
β”œβ”€β”€ requirements.txt    # Project dependencies
└── README.md           # Project documentation
```


## Development Process

The tokenizer development involved several key steps:

1. **Dataset Preparation**:
   - Careful selection of multilingual datasets
   - Balanced sampling to maintain script representation
   - Text cleaning and preprocessing

2. **Tokenizer Training**:
   - Custom regex pattern development
   - BPE training with controlled vocabulary growth
   - Optimization for multilingual support

3. **Performance Metrics** (a worked example follows this list):
   - Compression ratio: 4.07x
   - Balanced token distribution across scripts
   - Efficient handling of mixed-script text
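
The 4.07x figure is presumably measured as UTF-8 bytes in versus tokens out, the usual definition for a byte-level BPE tokenizer; the snippet below computes the ratio under that assumption.

```python
# Typical byte-level BPE compression metric (definition assumed: UTF-8 bytes per token).
def compression_ratio(text: str, encode) -> float:
    return len(text.encode("utf-8")) / len(encode(text))

# e.g. compression_ratio(held_out_sample, custom_tok.encode) should land near 4.07
```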

## Usage Examples

The tokenizer effectively handles various text combinations (a short usage sketch follows the list):
- Pure English text
- Pure Hindi text
- Pure Kannada text
- Mixed script text
- Special tokens and control characters
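
A quick way to exercise all of these cases from Python (again with assumed class and method names for the custom tokenizer):

```python
# Quick check across the categories above (class/method names are assumptions).
from tokenizer import MultilingualBPE  # hypothetical name; see tokenizer.py

tok = MultilingualBPE()
tok.load("bpe_tok.model")  # assumed loading method

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Hindi": "मुझे हिंदी में पढ़ना पसंद है।",
    "Kannada": "ನನಗೆ ಕನ್ನಡ ಓದಲು ಇಷ್ಟ.",
    "Mixed script": "Hello दुनिया, ನಮಸ್ಕಾರ world!",
    "Special token": "document one <|endoftext|> document two",
}

for name, text in samples.items():
    ids = tok.encode(text)          # special-token handling may need an explicit flag
    assert tok.decode(ids) == text  # round-trip sanity check (assumes lossless codec)
    print(f"{name:>13}: {len(ids)} tokens")
```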

## License

MIT License

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request