🌌 Mini-GPT Omni-0.000001

A "crazy" transformation of a micro-model into a Unified Multimodal Foundation. This model treats text, images, and audio as a single continuous stream of tokens, enabling cross-modal reasoning in a shared latent space.

🛠️ Architecture: The "Everything is a Token" Approach

This repository implements a Unified Decoder architecture. Instead of separate per-modality pipelines, we project all modalities into a shared latent space with $d_{model} = 512$ (a minimal sketch follows the list):

  • Eyes (vision_tokenizer.py): Chops images into 16x16 patches, treating pixels as "visual words".
  • Ears (audio_tokenizer.py): Uses a neural codec to discretize audio waves into acoustic units.
  • Brain (omni_transformer.py): A 6-layer Transformer that predicts the next token, regardless of modality.

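To make the "everything is a token" idea concrete, here is a minimal, self-contained PyTorch sketch of the shared stream: image patches and audio codec indices are projected into the same $d_{model} = 512$ space as text embeddings, and the combined sequence is fed to one 6-layer decoder. This is an illustration, not the repository's code; the class name `OmniDecoder`, the vocabulary sizes, and the use of `nn.TransformerEncoder` with a causal mask are assumptions.

```python
# Minimal sketch (NOT the repo's actual code) of a unified decoder:
# every modality becomes tokens in one shared d_model = 512 space.
import torch
import torch.nn as nn

D_MODEL = 512
PATCH = 16  # 16x16 image patches, as in vision_tokenizer.py

class OmniDecoder(nn.Module):
    def __init__(self, text_vocab=32000, audio_vocab=1024, n_layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, D_MODEL)    # text "words"
        self.audio_emb = nn.Embedding(audio_vocab, D_MODEL)  # "acoustic units"
        # Each flattened 16x16 RGB patch (3*16*16 = 768 values) is one "visual word".
        self.patch_proj = nn.Linear(3 * PATCH * PATCH, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        # For brevity this head predicts text tokens only; a real unified
        # model would predict over a joint text+image+audio codebook.
        self.head = nn.Linear(D_MODEL, text_vocab)

    def forward(self, text_ids, image, audio_ids):
        # image: (B, 3, H, W) -> (B, num_patches, 3*16*16)
        p = image.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
        p = p.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)
        # One interleaved token stream in the shared latent space.
        tokens = torch.cat([self.text_emb(text_ids),
                            self.patch_proj(p),
                            self.audio_emb(audio_ids)], dim=1)
        # Causal mask: the model predicts the next token regardless of modality.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        return self.head(self.blocks(tokens, mask=mask))
```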
🚀 Getting Started

  1. Install Dependencies:
    pip install -r requirements.txt
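
To sanity-check the setup, you can run a smoke test of the hypothetical `OmniDecoder` sketch from the architecture section (shapes are illustrative: a 64x64 image yields $(64/16)^2 = 16$ visual tokens):

```python
import torch

model = OmniDecoder()                         # sketch class from above
text_ids = torch.randint(0, 32000, (1, 8))    # 8 text tokens
image = torch.randn(1, 3, 64, 64)             # 16 visual tokens
audio_ids = torch.randint(0, 1024, (1, 12))   # 12 acoustic units
logits = model(text_ids, image, audio_ids)
print(logits.shape)                           # torch.Size([1, 36, 32000])
```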
    