fdaudens 
posted an update 11 days ago
Forget everything you know about transcription models - NVIDIA's parakeet-tdt-0.6b-v2 changed the game for me!

Just tested it with Steve Jobs' Stanford speech and was speechless (pun intended). The video isn’t sped up.

3 things that floored me:
- Transcription took just 10 seconds for a 15-min file
- Got a CSV with perfect timestamps, punctuation & capitalization
- Stunning accuracy (correctly captured "Reed College" and other specifics)
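
For anyone who wants to reproduce the timestamped CSV, here is a minimal sketch. It assumes the NeMo toolkit is installed and that the `transcribe(..., timestamps=True)` call and the `timestamp["segment"]` field behave as described on the model card. The helper names (`segments_to_csv`, `transcribe_with_timestamps`), the column names, and the sample rows are made up for illustration, not taken from the demo.

```python
import csv
import io


def segments_to_csv(segments):
    """Render segment dicts (keys: 'start', 'end', 'segment') as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["start_s", "end_s", "text"])  # hypothetical column names
    for seg in segments:
        writer.writerow([f"{seg['start']:.2f}", f"{seg['end']:.2f}", seg["segment"]])
    return buf.getvalue()


def transcribe_with_timestamps(audio_path):
    """Assumption: follows the usage pattern shown on the model card."""
    # Imported lazily so the CSV helper works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"
    )
    output = model.transcribe([audio_path], timestamps=True)
    return output[0].timestamp["segment"]


# Placeholder rows standing in for real model output.
sample_segments = [
    {"start": 0.0, "end": 1.5, "segment": "Hello there."},
    {"start": 1.5, "end": 4.25, "segment": "This is a sample, not a real transcript."},
]
csv_text = segments_to_csv(sample_segments)
print(csv_text)
```

With a real file you would pass the result of `transcribe_with_timestamps("speech.wav")` to `segments_to_csv` instead of the placeholder rows.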

NVIDIA also released a demo where you can click any transcribed segment to play it instantly.

The improvement is significant: No. 1 on the ASR Leaderboard, with a word error rate of roughly 6% (best in class), and commercial use is allowed under the CC-BY-4.0 license (attribution required).
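
The error rate on the ASR Leaderboard is a word error rate (WER): word-level substitutions, deletions, and insertions divided by the number of words in the reference transcript. Here is a rough sketch of that calculation. The function name `word_error_rate` is my own, and unlike the leaderboard this sketch does no text normalization (casing, punctuation), so treat it as an illustration of the metric rather than the leaderboard's exact scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the words of ref seen so far
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# One wrong word out of four gives a WER of 0.25.
print(word_error_rate("a b c d", "a b x d"))
```

A 6% WER therefore means about 6 word-level errors per 100 reference words.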

Time to update those Whisper pipelines! H/t @Steveeeeeeen for the finding!

Model: nvidia/parakeet-tdt-0.6b-v2
Demo: nvidia/parakeet-tdt-0.6b-v2
ASR Leaderboard: hf-audio/open_asr_leaderboard

Looks like it needs NeMo? Hope we can use it directly through Transformers, like Whisper.