Forget everything you know about transcription models - NVIDIA's parakeet-tdt-0.6b-v2 changed the game for me!
Just tested it with Steve Jobs' Stanford speech and was speechless (pun intended). The video isn’t sped up.
3 things that floored me:
- Transcription took just 10 seconds for a 15-min file
- Got a CSV with perfect timestamps, punctuation & capitalization
- Stunning accuracy (correctly captured "Reed College" and other specifics)
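For anyone wanting to reproduce the CSV part, here is a minimal sketch of turning timestamped segments into CSV rows and working out the speed-up. It assumes each segment is a dict with `start`, `end`, and `segment` keys; that shape, the function names, and the sample data are illustrative assumptions, not actual model output.

```python
import csv
import io


def segments_to_csv(segments):
    """Serialize timestamped segments to CSV text with a header row.

    Assumes each segment is a dict with 'start', 'end' (seconds) and
    'segment' (text) keys; this shape is a hypothetical stand-in for
    whatever the ASR toolkit returns.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start_s", "end_s", "text"])
    for seg in segments:
        writer.writerow([f"{seg['start']:.2f}", f"{seg['end']:.2f}", seg["segment"]])
    return buffer.getvalue()


def realtime_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are processed per second of compute."""
    return audio_seconds / processing_seconds


# Made-up example segments, for illustration only.
example_segments = [
    {"start": 0.0, "end": 1.5, "segment": "Hello, and thank you."},
    {"start": 1.5, "end": 4.25, "segment": "It is good to be here."},
]

print(segments_to_csv(example_segments))
# 15 minutes of audio in 10 seconds of processing.
print(realtime_factor(15 * 60, 10))
```

The CSV writer quotes any text field that contains a comma, so punctuated transcripts stay in a single column.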
NVIDIA also released a demo where you can click any transcribed segment to play it instantly.
The improvement is significant: number 1 on the ASR Leaderboard, with a 6% word error rate (best in class) and commercial use permitted under the CC-BY-4.0 license.
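Error rates on ASR leaderboards are usually word error rate (WER): word-level edit distance between the reference and the transcript, divided by the number of reference words. A minimal sketch of that calculation follows; the function name is my own, and it only lowercases and splits on whitespace, whereas real benchmarks apply heavier text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Simplified sketch: lowercases and splits on whitespace only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

A 6% WER means roughly 6 word-level errors per 100 reference words.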
Time to update those Whisper pipelines! H/t @Steveeeeeeen for the finding!
📄 arXiv paper: In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer (2504.20690)
🔥 Why it’s cool:
- Achieves high-quality, multi-task image editing.
- Uses only 1% of the trainable parameters and 0.1% of the training data compared to existing methods, making it extremely efficient.
- Beats several commercial models on background preservation, ID control, and consistency.
- Open-source, low-cost, faster, and stronger: think of it as the “DeepSeek of image editing” 👀
We also implemented a Gradio demo app, available directly in our GitHub repo! We made a flashy demo video too, and we're happy to send it your way!
🔥 AgenticAI: The Ultimate Multimodal AI with 16 MBTI Girlfriend Personas! 🔥
Hello AI community! Today, our team is thrilled to introduce AgenticAI, an innovative open-source AI assistant that combines deep technical capabilities with uniquely personalized interaction. 💘
- Complete MBTI Implementation: All 16 MBTI female personas modeled after iconic characters (Dana Scully, Lara Croft, etc.)
- Persona Depth: Customize age groups and thinking patterns for hyper-personalized AI interactions
- Personality Consistency: Each MBTI type demonstrates consistent problem-solving approaches, conversation patterns, and emotional expressions
🚀 Cutting-Edge Multimodal Capabilities
- Integrated File Analysis: Deep analysis and cross-referencing of images, videos, CSV, PDF, and TXT files
- Advanced Image Understanding: Interprets complex diagrams, mathematical equations, charts, and tables
- Video Processing: Extracts key frames from videos and understands contextual meaning
- Document RAG: Intelligent analysis and summarization of PDF/CSV/TXT files
💡 Deep Research & Knowledge Enhancement
- Real-time Web Search: SerpHouse API integration for latest information retrieval and citation
- Deep Reasoning Chains: Step-by-step inference process for solving complex problems
- Academic Analysis: In-depth approach to mathematical problems, scientific questions, and data analysis
- Structured Knowledge Generation: Systematic code, data analysis, and report creation
🖼️ Creative Generation Engine
- FLUX Image Generation: Custom image creation reflecting the selected MBTI persona traits
- Data Visualization: Automatic generation of code for visualizing complex datasets
- Creative Writing: Story and scenario writing matching the selected persona's style