Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Paper: arXiv:2506.13642
Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng*
For an introduction to Stream-Omni and usage instructions, see https://github.com/ictnlp/Stream-Omni.
Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features💡:
[Media omitted: Microphone Input | File Input]
Stream-Omni can produce intermediate textual results (ASR transcription and text response) during speech interaction, offering users a seamless "see-while-hear" experience.
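
To illustrate what a "see-while-hear" client loop might look like, here is a minimal sketch. It does not use Stream-Omni's actual API: `StreamEvent`, `mock_speech_interaction`, `render`, and the event labels `"asr"`, `"text"`, and `"speech"` are all hypothetical names invented for this example, and the mock generator stands in for a real model. The only idea taken from the description above is that text results (ASR transcription and text response) arrive while the speech output is still being produced, so a client can display them incrementally.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Union


@dataclass
class StreamEvent:
    """Hypothetical event type; not part of Stream-Omni."""
    kind: str      # assumed labels: "asr", "text", or "speech"
    payload: str   # a text chunk, or a placeholder name for an audio chunk


def mock_speech_interaction() -> Iterator[StreamEvent]:
    """Hypothetical stand-in for a streaming model.

    Yields events in arrival order: the transcription of the user's
    speech first, then text response chunks interleaved with audio chunks.
    """
    yield StreamEvent("asr", "What is")
    yield StreamEvent("asr", " in this image?")
    yield StreamEvent("text", "A cat")
    yield StreamEvent("speech", "<audio chunk 0>")
    yield StreamEvent("text", " on a sofa.")
    yield StreamEvent("speech", "<audio chunk 1>")


def render(events: Iterable[StreamEvent]) -> Dict[str, Union[str, List[str]]]:
    """Consume events as they arrive and accumulate what the user sees and hears."""
    transcript: List[str] = []
    response: List[str] = []
    audio: List[str] = []
    for ev in events:
        if ev.kind == "asr":
            transcript.append(ev.payload)   # show the transcription as it grows
        elif ev.kind == "text":
            response.append(ev.payload)     # show the text response as it grows
        elif ev.kind == "speech":
            audio.append(ev.payload)        # a real client would play this chunk
    return {
        "transcript": "".join(transcript),
        "response": "".join(response),
        "audio": audio,
    }


if __name__ == "__main__":
    result = render(mock_speech_interaction())
    print("You said:", result["transcript"])
    print("Reply:   ", result["response"])
    print("Audio chunks received:", len(result["audio"]))
```

In a real client, each branch of the loop would update the display or audio player immediately rather than only appending to a list; see the repository linked above for the actual usage.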