C1Tech/whisper_small_persian
C1Tech/whisper_small_persian is a Persian ASR model based on the Whisper architecture, fine-tuned on a large-scale custom Persian dataset.
With only 74 million parameters, this model achieves state-of-the-art performance on Persian ASR tasks, outperforming larger models such as OpenAI Whisper Large-v3 (1550M parameters) and Meta Wav2Vec2-XLSR (300M parameters).
Benchmark Performance
We evaluated the model on multiple Persian ASR benchmarks, including Common Voice and FLEURS. The results show that it outperforms popular models such as Vosk, FastConformer, and OpenAI's Whisper on these benchmarks.
The benchmark results highlight the model's efficiency and accuracy, showing that high-quality Persian ASR is achievable even with a compact model.
For more detailed evaluation and comparison with other models, please refer to the Open Persian ASR Leaderboard.
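ASR benchmarks such as these are commonly reported as word error rate (WER). As a minimal sketch, assuming WER is the metric of interest (the leaderboard's exact metric and text normalization are not stated here), the function below computes WER with a word-level edit distance. The name `word_error_rate` is illustrative and not part of any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

In practice, both strings are usually normalized the same way (punctuation, character variants) before scoring, so results from this sketch are only comparable to published numbers if the normalization matches.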
Usage
Whisper small is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library.
pip install --upgrade pip
pip install --upgrade transformers
The model can be used with the pipeline class to transcribe audio of arbitrary length:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "C1Tech/whisper_small_persian"

# Load the fine-tuned model weights and move them to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
result = pipe("audio.mp3")
Multiple audio files can be transcribed in parallel by specifying them as a list and setting the batch_size parameter:
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
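For larger collections it can be convenient to gather the file paths from a folder and map each path to its transcript. The sketch below assumes the pipeline returns one dictionary with a "text" key per input file when called with a list; the helper names `find_audio_files` and `transcribe_folder` and the extension list are illustrative choices, not part of Transformers.

```python
from pathlib import Path

# Illustrative set of extensions; adjust to the formats you actually use.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a"}


def find_audio_files(folder):
    """Return the sorted paths of all audio files below `folder`."""
    return sorted(
        str(path)
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


def transcribe_folder(asr, folder, batch_size=2):
    """Transcribe every audio file in `folder` and map path -> text.

    `asr` is any callable that behaves like the pipeline: it takes a list
    of paths plus `batch_size` and returns one {"text": ...} dict per path.
    """
    paths = find_audio_files(folder)
    if not paths:
        return {}
    results = asr(paths, batch_size=batch_size)
    return {path: result["text"].strip() for path, result in zip(paths, results)}
```

With the pipeline created above, a call would look like `transcripts = transcribe_folder(pipe, "recordings", batch_size=2)`, where "recordings" is a placeholder folder name.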
Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous tokens. The following example demonstrates how to enable these heuristics:
generate_kwargs = {
    "num_beams": 3,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
    "language": "fa",
}

result = pipe("audio.mp3", generate_kwargs=generate_kwargs)
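To make the temperature fallback heuristic more concrete, here is a simplified sketch of the idea: decode at the lowest temperature first and retry at the next one whenever the output looks too repetitive (high compression ratio) or too uncertain (low average log-probability). This is not the Transformers implementation. The `decode_fn` callable is hypothetical, and the ratio here is computed on UTF-8 text bytes for simplicity, whereas the comment above notes that Transformers applies the threshold in token space.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Raw size over zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 1.35,
    logprob_threshold: float = -1.0,
) -> Tuple[str, float]:
    """Try each temperature in order and keep the first acceptable result.

    `decode_fn(temperature)` is a hypothetical callable that returns
    (text, average_log_probability). If every attempt fails the checks,
    the result of the last temperature is returned.
    """
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break
    return text, temperature
```

The appeal of this design is that greedy or beam decoding at temperature 0.0 is used whenever it succeeds, and sampling at higher temperatures only serves as a way out of repetition loops or low-confidence segments.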
Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:
result = pipe("audio.mp3", return_timestamps=True)
print(result["chunks"])
And for word-level timestamps:
result = pipe("audio.mp3", return_timestamps="word")
print(result["chunks"])
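The timestamped chunks can be turned into subtitles. The sketch below converts them to SubRip (SRT) text, assuming each chunk is a dictionary with a "text" string and a "timestamp" pair of start and end times in seconds, and that the end time may occasionally be None. The function names `format_srt_time` and `chunks_to_srt` are illustrative.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Convert timestamped chunks into SRT-formatted subtitle text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: a missing end time is treated as a zero-length cue.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)
```

Sentence-level chunks usually make more readable subtitles than word-level ones. A typical call would be `srt_text = chunks_to_srt(result["chunks"])`, followed by writing `srt_text` to a file with UTF-8 encoding.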
For further information, contact us at info@c1tech.group