Assamese Tokenizer (50K Vocabulary)
Model Details
This repository contains a custom tokenizer for the Assamese language with a vocabulary size of 50,000 tokens. The tokenizer was trained on the Assamese language subset of the CC-100 multilingual dataset. This tokenizer can be used for various Natural Language Processing (NLP) tasks involving the Assamese language.
Repository Details
- Repository Name: tamang0000/assamese-tokenizer-50k
- Tokenizer Vocabulary Size: 50,000 tokens
- Training Dataset: CC-100 Multilingual Dataset (Assamese Language Subset)
- Model Type: Tokenizer
- Framework: Hugging Face Transformers
- License: MIT License
Tokenizer Usage
You can load this tokenizer with the Hugging Face transformers library and use it in your own projects.
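A minimal sketch of loading and using the tokenizer follows. It assumes the repository ships files that `AutoTokenizer.from_pretrained` can read, which the card implies but does not show. The helper names `load_tokenizer`, `roundtrip`, and `demo`, and the sample sentence, are illustrative and not part of the repository.

```python
REPO_ID = "tamang0000/assamese-tokenizer-50k"


def load_tokenizer(repo_id: str = REPO_ID):
    """Fetch the tokenizer files from the Hub (or the local cache)."""
    # Imported lazily so the helpers below can be imported without
    # transformers installed; loading itself needs network access on first use.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def roundtrip(tokenizer, text: str):
    """Return (tokens, ids, decoded_text) for a single input string."""
    tokens = tokenizer.tokenize(text)   # subword strings
    ids = tokenizer.encode(text)        # vocabulary ids, possibly with special tokens
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return tokens, ids, decoded


def demo() -> None:
    """Illustrative run on a short Assamese phrase (sample text is an assumption)."""
    tokenizer = load_tokenizer()
    tokens, ids, decoded = roundtrip(tokenizer, "অসমীয়া ভাষা")
    print("vocab size:", tokenizer.vocab_size)
    print("tokens:", tokens)
    print("ids:", ids)
    print("decoded:", decoded)
```

Calling `demo()` prints the subword tokens, their ids, and the decoded text. Because the tokenizer normalizes its input, the decoded string may not match the original character for character.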
Training Details
- Dataset: The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
- Vocabulary Size: 50,000 tokens.
- Normalization: Applies normalization steps that include lowercasing and stripping accents.
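To illustrate what "lowercasing and stripping accents" usually means, here is a standard-library sketch. It assumes the common definition of accent stripping: decompose to NFD, then drop nonspacing marks (Unicode category `Mn`). The card does not publish the tokenizer's actual normalizer configuration, so this is an approximation and not the repository's pipeline. The function names are illustrative.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(text: str) -> str:
    """Lowercase, then strip accents (assumed order; the card does not state one)."""
    return strip_accents(text.lower())


print(normalize("Café ÉCOLE"))  # -> cafe ecole
```

Under this definition, Assamese signs in category `Mn`, such as the virama (U+09CD), are removed as well. It is worth checking the tokenizer's output on real text before relying on exact round-trips.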