|
--- |
|
title: Multilingual Tokenizer Comparison |
|
emoji: 🌍
|
colorFrom: blue |
|
colorTo: blue |
|
sdk: gradio |
|
sdk_version: "4.19.2" |
|
app_file: app.py |
|
pinned: false |
|
--- |
|
|
|
# Multilingual Tokenizer Comparison |
|
|
|
A web application to compare tokenization between a custom multilingual BPE tokenizer and OpenAI's GPT-4 tokenizer. |
|
|
|
## Live Demo |
|
|
|
Try it out: [Hugging Face Spaces Demo](https://huggingface.co/spaces/ace-1/bpe_tok)
|
|
|
## Features |
|
|
|
- Supports multiple scripts: |
|
- Latin (English) |
|
- Devanagari (Hindi) |
|
- Kannada |
|
- Shows token counts and IDs for both tokenizers |
|
- Interactive web interface (a stripped-down sketch follows this list)
|
- Example texts for comparison |
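
Since the app is a Gradio interface, a stripped-down sketch of what it can look like is shown below. The GPT-4 side uses `tiktoken`'s `cl100k_base` encoding; the custom-tokenizer side is stubbed out as a hypothetical API, since the exact interface in `tokenizer.py` is not reproduced here.

```python
import gradio as gr
import tiktoken

# GPT-4 side: cl100k_base is the encoding GPT-4 uses, exposed via tiktoken.
gpt4 = tiktoken.get_encoding("cl100k_base")

def compare(text: str) -> str:
    ids = gpt4.encode(text)
    # The custom-tokenizer side would plug in here, e.g. (hypothetical API):
    # custom_ids = custom.encode(text)
    return f"GPT-4 tokens: {len(ids)}\nIDs: {ids}"

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Input text"),
    outputs=gr.Textbox(label="Tokenization"),
    title="Multilingual Tokenizer Comparison",
)

if __name__ == "__main__":
    demo.launch()
```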
|
|
|
## Tokenizer Details |
|
|
|
### Overview |
|
|
|
The custom tokenizer uses Byte Pair Encoding (BPE) with a regex pre-tokenization pattern designed for multilingual text. Its development covered:
|
|
|
1. **Custom Regex for BPE Tokenization**:

   - A specialized regex pattern that handles English, Hindi, and Kannada scripts

   - Carefully designed to preserve linguistic units in each script (see the sketch after this list)
|
|
|
2. **Training Corpus Composition**: |
|
- English (60%): From `HuggingFaceFW/fineweb-edu` dataset |
|
- Hindi (20%): From `ai4bharat/sangraha` dataset (Devanagari script) |
|
- Kannada (20%): From `ai4bharat/sangraha` dataset (Kannada script) |
|
   - This split roughly mirrors the token distributions observed in models like GPT-4
|
|
|
3. **Vocabulary Details**: |
|
- Total Size: 3257 tokens |
|
- Composition: |
|
- 256 byte-level tokens |
|
- 3000 merge operations |
|
- 1 special `<|endoftext|>` token |
|
   - Achieves a compression ratio of approximately 4.07x
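
As a rough illustration of the regex in step 1, the sketch below shows what a GPT-style split pattern extended with Devanagari and Kannada ranges can look like. It assumes the third-party `regex` package, and it is not the exact pattern used in `tokenizer.py`:

```python
import regex as re  # third-party `regex` package, needed for \p{...} classes

# A sketch of a GPT-style split pattern with dedicated Devanagari
# (U+0900-U+097F) and Kannada (U+0C80-U+0CFF) branches.
SPLIT_PATTERN = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"      # English contractions ('s, 'll, ...)
    r"| ?[\u0900-\u097F]+"       # Devanagari (Hindi) runs
    r"| ?[\u0C80-\u0CFF]+"       # Kannada runs
    r"| ?\p{L}+"                 # other letter runs (e.g. English words)
    r"| ?\p{N}+"                 # digit runs
    r"| ?[^\s\p{L}\p{N}]+"       # punctuation and symbols
    r"|\s+"                      # remaining whitespace
)

print(SPLIT_PATTERN.findall("Hello, नमस्ते दुनिया ಜಗತ್ತು 2024!"))
# ['Hello', ',', ' नमस्ते', ' दुनिया', ' ಜಗತ್ತು', ' 2024', '!']
```

Dedicated script branches keep combining vowel signs (Unicode category Mn) attached to their consonants, which a generic `\p{L}` run would otherwise split off.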
|
|
|
### Technical Implementation |
|
|
|
The tokenizer implementation includes (a minimal BPE training-loop sketch follows this list):
|
- Custom regex patterns for multilingual text segmentation |
|
- BPE training with controlled merge operations |
|
- Special token handling |
|
- Efficient encoding/decoding mechanisms |
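
To make "controlled merge operations" concrete, here is a minimal sketch of the classic byte-level BPE training loop. It illustrates the general technique rather than reproducing the code in `tokenizer.py` (in practice, training runs per regex chunk rather than over raw text):

```python
from collections import Counter

def get_stats(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges=3000):
    """Learn `num_merges` merges on top of the 256 byte-level tokens."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = stats.most_common(1)[0][0]  # most frequent adjacent pair
        new_id = 256 + step                # ids 0-255 are raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges
```

With 3000 merges on top of the 256 byte ids, plus one `<|endoftext|>` special token, this yields exactly the 3257-token vocabulary described above.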
|
|
|
## Installation |
|
|
|
```bash |
|
# Clone the repository |
|
git clone https://github.com/MohammedYaseen97/bpe_tok_era.git |
|
cd bpe_tok_era |
|
|
|
# Install dependencies |
|
pip install -r requirements.txt |
|
|
|
# Run the app locally |
|
python app.py |
|
``` |
|
|
|
## Project Structure |
|
|
|
``` |
|
├── app.py            # Gradio web interface
├── tokenizer.py      # Custom tokenizer implementation
├── bpe_tok.model     # Trained tokenizer model
├── requirements.txt  # Project dependencies
└── README.md         # Project documentation
|
``` |
|
|
|
|
|
## Development Process |
|
|
|
The tokenizer development involved several key steps: |
|
|
|
1. **Dataset Preparation**: |
|
- Careful selection of multilingual datasets |
|
- Balanced sampling to maintain script representation |
|
- Text cleaning and preprocessing |
|
|
|
2. **Tokenizer Training**: |
|
- Custom regex pattern development |
|
- BPE training with controlled vocabulary growth |
|
- Optimization for multilingual support |
|
|
|
3. **Performance Metrics**:

   - Compression ratio: 4.07x (see the measurement sketch after this list)

   - Balanced token distribution across scripts

   - Efficient handling of mixed-script text
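
The compression ratio is presumably input size divided by token count; the sketch below assumes UTF-8 bytes per token, one common convention. Here `encode` stands in for the tokenizer's encode function and `held_out_text` for an evaluation corpus (both placeholders):

```python
def compression_ratio(corpus: str, encode) -> float:
    """UTF-8 bytes per emitted token; higher means better compression."""
    ids = encode(corpus)
    return len(corpus.encode("utf-8")) / len(ids)

# e.g. compression_ratio(held_out_text, tokenizer.encode) -> ~4.07 here
```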
|
|
|
## Usage Examples |
|
|
|
The tokenizer effectively handles a variety of inputs (a usage sketch follows this list):
|
- Pure English text |
|
- Pure Hindi text |
|
- Pure Kannada text |
|
- Mixed script text |
|
- Special tokens and control characters |
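
A hypothetical usage sketch is shown below; the class and method names (`Tokenizer`, `load`, `encode`, `decode`) are assumptions modeled on common BPE tokenizer APIs rather than confirmed against `tokenizer.py`:

```python
from tokenizer import Tokenizer  # hypothetical import from this repo

tok = Tokenizer()
tok.load("bpe_tok.model")  # assumed loader for the trained model file

samples = [
    "The quick brown fox",    # pure English (Latin)
    "नमस्ते दुनिया",             # pure Hindi (Devanagari)
    "ನಮಸ್ಕಾರ ಜಗತ್ತು",           # pure Kannada
    "Hello नमस्ते ನಮಸ್ಕಾರ",     # mixed scripts
]

for text in samples:
    ids = tok.encode(text)
    assert tok.decode(ids) == text  # lossless round trip
    print(f"{len(ids):3d} tokens | {text}")
```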
|
|
|
## License |
|
|
|
MIT License |
|
|
|
## Contributing |
|
|
|
1. Fork the repository |
|
2. Create your feature branch (`git checkout -b feature/AmazingFeature`) |
|
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`) |
|
4. Push to the branch (`git push origin feature/AmazingFeature`) |
|
5. Open a Pull Request |