# Mini-GPT Omni-0.000001
A "crazy" transformation of a micro-model into a Unified Multimodal Foundation. This model treats text, images, and audio as a single continuous stream of tokens, enabling cross-modal reasoning in a shared latent space.
## Architecture: The "Everything is a Token" Approach
This repository implements a Unified Decoder architecture. Instead of separate pipelines per modality, all inputs are projected into a shared embedding space of width $d_{model} = 512$ (see the sketch after the list):
- Eyes (vision_tokenizer.py): Chops images into 16x16 patches, treating pixels as "visual words".
- Ears (audio_tokenizer.py): Uses a neural codec to discretize audio waves into acoustic units.
- Brain (omni_transformer.py): A 6-layer Transformer that predicts the next token, regardless of modality.
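
For intuition, here is a minimal, self-contained PyTorch sketch of that fused token stream. It is not the repository's actual code: the class name `UnifiedDecoderSketch`, the vocabulary and codebook sizes, and the head count are illustrative assumptions; only $d_{model} = 512$, the 16x16 patches, and the 6 layers come from the description above.

```python
import torch
import torch.nn as nn

D_MODEL = 512  # shared embedding width described above


class UnifiedDecoderSketch(nn.Module):
    """Toy illustration: every modality becomes tokens in the same d_model space."""

    def __init__(self, vocab_size=32000, audio_codebook=1024, patch_size=16,
                 n_layers=6, n_heads=8):
        super().__init__()
        # "Eyes": each flattened 16x16 RGB patch is linearly projected to d_model
        self.patch_proj = nn.Linear(patch_size * patch_size * 3, D_MODEL)
        # "Ears": discrete acoustic units from a neural codec get an embedding table
        self.audio_embed = nn.Embedding(audio_codebook, D_MODEL)
        # Text tokens
        self.text_embed = nn.Embedding(vocab_size, D_MODEL)
        # "Brain": a plain 6-layer Transformer over the fused sequence
        layer = nn.TransformerEncoderLayer(D_MODEL, n_heads,
                                           dim_feedforward=4 * D_MODEL,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(D_MODEL, vocab_size)

    def forward(self, text_ids, image_patches, audio_ids):
        # Each modality is embedded separately, then concatenated into one stream:
        # image and audio context first, then the text stream to be continued.
        stream = torch.cat(
            [
                self.patch_proj(image_patches),  # (B, T_patches, 512)
                self.audio_embed(audio_ids),     # (B, T_audio, 512)
                self.text_embed(text_ids),       # (B, T_text, 512)
            ],
            dim=1,
        )
        # Causal mask so the model predicts the next token regardless of modality
        mask = nn.Transformer.generate_square_subsequent_mask(stream.size(1))
        hidden = self.transformer(stream, mask=mask)
        return self.lm_head(hidden)


# Example: 4 image patches, 6 audio units, and 8 text tokens in one sequence
model = UnifiedDecoderSketch()
logits = model(
    text_ids=torch.randint(0, 32000, (1, 8)),
    image_patches=torch.randn(1, 4, 16 * 16 * 3),
    audio_ids=torch.randint(0, 1024, (1, 6)),
)
print(logits.shape)  # torch.Size([1, 18, 32000])
```

Everything after the embedding layers is modality-agnostic, which is the whole point of projecting into a single $d_{model}$.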
## Getting Started
- Install dependencies:

```bash
pip install -r requirements.txt
```
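
The card does not show the repository's actual inference entry point, so as an illustrative placeholder, here is a greedy decoding loop over the `UnifiedDecoderSketch` defined in the architecture section above. The function name `greedy_generate` and its arguments are hypothetical, not the repo's API.

```python
import torch


@torch.no_grad()
def greedy_generate(model, text_ids, image_patches, audio_ids, max_new_tokens=16):
    """Greedy next-token decoding over the fused multimodal stream."""
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(text_ids, image_patches, audio_ids)        # (B, T, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # last position
        # New tokens extend the text stream; the image/audio context stays fixed.
        text_ids = torch.cat([text_ids, next_id], dim=1)
    return text_ids


prompt = torch.randint(0, 32000, (1, 4))
patches = torch.randn(1, 4, 16 * 16 * 3)
audio = torch.randint(0, 1024, (1, 6))
out = greedy_generate(UnifiedDecoderSketch(), prompt, patches, audio)
print(out.shape)  # torch.Size([1, 20])
```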