C1Tech/whisper_small_persian

C1Tech/whisper_small_persian is a Persian automatic speech recognition (ASR) model based on the Whisper architecture and fine-tuned on a large-scale custom Persian dataset.

With about 244 million parameters (the Whisper small size), this model achieves state-of-the-art performance on Persian ASR tasks, outperforming larger models such as OpenAI Whisper Large-v3 (1550M parameters) and Meta Wav2Vec2-XLSR (300M parameters).

Benchmark Performance

We evaluated the model on multiple Persian ASR benchmarks, including Common Voice and FLEURS. The results show that our model outperforms popular models such as Vosk, FastConformer, and OpenAI's Whisper on these benchmarks.

[Figure: Model Image 1] [Figure: Model Image 2]

The benchmark results highlight the model's efficiency and accuracy and indicate that high-quality Persian ASR is achievable even with a compact model.

For more detailed evaluation and comparison with other models, please refer to the Open Persian ASR Leaderboard.
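
The text above does not name the metric behind these benchmarks. ASR results on Common Voice and FLEURS are commonly reported as word error rate (WER), so the sketch below assumes that metric. The function name `word_error_rate` is illustrative and not part of this repository. It counts the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, then divides by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Benchmarks usually normalize both strings (punctuation, character variants) before scoring; that step is omitted here.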

Usage

Whisper small is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library.

pip install --upgrade pip
pip install --upgrade transformers

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "C1Tech/whisper_small_persian"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by specifying them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
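
When given a list, the pipeline returns one result dict per input, in the same order. A minimal sketch of pairing each file with its transcript follows. `transcribe_files` is a hypothetical helper, and its `asr` argument stands for the `pipe` object built earlier.

```python
from typing import Callable, Dict, List


def transcribe_files(asr: Callable, paths: List[str], batch_size: int = 2) -> Dict[str, str]:
    """Run an ASR pipeline on several files and map each path to its text."""
    outputs = asr(paths, batch_size=batch_size)
    # Each output is a dict with a "text" key; order matches the input list.
    return {path: out["text"].strip() for path, out in zip(paths, outputs)}


# Usage with the pipeline defined earlier (not executed here):
# transcripts = transcribe_files(pipe, ["audio_1.mp3", "audio_2.mp3"], batch_size=2)
# for path, text in transcripts.items():
#     print(path, text)
```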

Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous tokens. The following example demonstrates how to enable these heuristics:

generate_kwargs = {
    "num_beams": 3,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
    "language": "fa"
}

# `sample` is the audio input, for example the path to a local file
sample = "audio.mp3"
result = pipe(sample, generate_kwargs=generate_kwargs)
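
To make the temperature fallback heuristic concrete, here is a simplified sketch of the loop, modeled on the idea in OpenAI's reference Whisper code and not on the Transformers internals. `decode_fn` is a hypothetical callable that decodes one segment at a given temperature and returns the text with its average log-probability. One deliberate simplification: the compression ratio is computed on UTF-8 text bytes, whereas the kwargs above apply it in token space, so the 1.35 default mirrored here is not directly comparable (the reference implementation uses 2.4 for its text-space check).

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Raw byte length over zlib-compressed length; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn: Callable[[float], Tuple[str, float]],  # hypothetical decoder
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 1.35,
    logprob_threshold: float = -1.0,
) -> Tuple[str, float]:
    """Try each temperature in order and return (text, temperature) for the
    first output that passes both checks, or the last attempt if none does."""
    if not temperatures:
        raise ValueError("temperatures must not be empty")

    text, used = "", temperatures[0]
    for used in temperatures:
        text, avg_logprob = decode_fn(used)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # this output is acceptable, stop retrying
    return text, used
```

Decoding starts greedy (temperature 0.0) and only becomes more random when the output looks degenerate, which is why the temperatures are listed in increasing order.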

Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

result = pipe(sample, return_timestamps=True)
print(result["chunks"])

And for word-level timestamps:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])

For further information, contact us at info@c1tech.group

Model size: 0.2B params (Safetensors, tensor type F32)