🌌 Mini-GPT Omni-0.000001

A "crazy" transformation of a micro-model into a Unified Multimodal Foundation. This model treats text, images, and audio as a single continuous stream of tokens, enabling cross-modal reasoning in a shared latent space.

🛠️ Architecture: The "Everything is a Token" Approach

This repository implements a Unified Decoder architecture. Instead of separate per-modality pipelines, we project all modalities into a shared latent space with $d_{model} = 512$ (a minimal sketch follows the list):

  • Eyes (vision_tokenizer.py): Chops images into 16x16 patches, treating pixels as "visual words".
  • Ears (audio_tokenizer.py): Uses a neural codec to discretize audio waves into acoustic units.
  • Brain (omni_transformer.py): A 6-layer Transformer that predicts the next token, regardless of modality.

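To make the "everything is a token" idea concrete, here is a minimal, self-contained PyTorch sketch of the shared stream: image patches and audio codec indices are projected into the same $d_{model} = 512$ space as text embeddings, and the combined sequence is fed to one 6-layer decoder. This is an illustration, not the repository's code; the class name `OmniDecoder`, the vocabulary sizes, and the use of `nn.TransformerEncoder` with a causal mask are assumptions.

```python
# Minimal sketch (NOT the repo's actual code) of a unified decoder:
# every modality becomes tokens in one shared d_model = 512 space.
import torch
import torch.nn as nn

D_MODEL = 512
PATCH = 16  # 16x16 image patches, as in vision_tokenizer.py

class OmniDecoder(nn.Module):
    def __init__(self, text_vocab=32000, audio_vocab=1024, n_layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, D_MODEL)    # text "words"
        self.audio_emb = nn.Embedding(audio_vocab, D_MODEL)  # "acoustic units"
        # Each flattened 16x16 RGB patch (3*16*16 = 768 values) is one "visual word".
        self.patch_proj = nn.Linear(3 * PATCH * PATCH, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        # For brevity this head predicts text tokens only; a real unified
        # model would predict over a joint text+image+audio codebook.
        self.head = nn.Linear(D_MODEL, text_vocab)

    def forward(self, text_ids, image, audio_ids):
        # image: (B, 3, H, W) -> (B, num_patches, 3*16*16)
        p = image.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
        p = p.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)
        # One interleaved token stream in the shared latent space.
        tokens = torch.cat([self.text_emb(text_ids),
                            self.patch_proj(p),
                            self.audio_emb(audio_ids)], dim=1)
        # Causal mask: the model predicts the next token regardless of modality.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        return self.head(self.blocks(tokens, mask=mask))
```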
🚀 Getting Started

  1. Install Dependencies:
    pip install -r requirements.txt
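
To sanity-check the setup, you can run a smoke test of the hypothetical `OmniDecoder` sketch from the architecture section (shapes are illustrative: a 64x64 image yields $(64/16)^2 = 16$ visual tokens):

```python
import torch

model = OmniDecoder()                         # sketch class from above
text_ids = torch.randint(0, 32000, (1, 8))    # 8 text tokens
image = torch.randn(1, 3, 64, 64)             # 16 visual tokens
audio_ids = torch.randint(0, 1024, (1, 12))   # 12 acoustic units
logits = model(text_ids, image, audio_ids)
print(logits.shape)                           # torch.Size([1, 36, 32000])
```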
    