# Multimodal Gemma 270M

A multimodal vision-language model that combines Google's Gemma 3 270M with CLIP ViT-L/14 for image understanding and visual question answering.
## Model Description

This model connects a frozen CLIP vision encoder to a Gemma 270M language model through a learned projection layer, enabling the model to understand and reason about images.
## Architecture

| Component | Model | Parameters |
|---|---|---|
| Language Model | Google Gemma 3 270M | 270M |
| Vision Encoder | OpenAI CLIP ViT-L/14 | 304M |
| Vision Projector | MLP (1024 → 896) | 1.7M |
| **Total** | | ~580M |
| **Trainable (LoRA + Projector)** | | ~9.3M |
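
The projector is the only newly trained bridge between the two backbones. The sketch below shows one plausible shape for it in PyTorch; the class name, the exact MLP layout, and the way visual tokens are spliced into the sequence are illustrative assumptions, not the implementation in `src/models/multimodal_gemma.py`.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps frozen CLIP ViT-L/14 patch features (1024-d) into the Gemma
    embedding space (896-d per the table above). A 1024 -> 896 -> 896
    two-layer MLP has roughly 1.7M parameters, consistent with the count
    reported here."""

    def __init__(self, clip_dim: int = 1024, lm_dim: int = 896) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1024) -> (batch, num_patches, 896)
        return self.mlp(clip_features)


# The projected visual tokens are prepended to the text token embeddings, and
# the combined sequence is fed to the LoRA-adapted Gemma decoder, e.g.:
#   inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
```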
## Training Details

- Dataset: LLaVA-Instruct-150K (50K subset)
- Hardware: NVIDIA A100 40GB
- Training Time: ~1.5 hours
- Batch Size: 12 (effective: 24 with gradient accumulation)
- Precision: BF16 mixed precision
- Optimizer: AdamW with fused kernels
- LoRA Rank: 32, Alpha: 64
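
Under these settings, the Lightning trainer and the two-group optimizer can be sketched as below; `model.lora_parameters()` and `model.projector` are illustrative names rather than this repository's actual API, and the learning rates are the ones listed in the configuration that follows.

```python
import torch
import lightning as L


def configure_optimizers(model):
    """Fused AdamW with two learning rates: LoRA adapters at 5e-4 and the
    vision projector at 2e-3 (see the configuration below)."""
    return torch.optim.AdamW(
        [
            {"params": model.lora_parameters(), "lr": 5e-4},       # illustrative accessor
            {"params": model.projector.parameters(), "lr": 2e-3},  # illustrative attribute
        ],
        fused=True,  # fused CUDA kernels
    )


trainer = L.Trainer(
    max_epochs=5,
    precision="bf16-mixed",      # BF16 mixed precision
    accumulate_grad_batches=2,   # batch size 12 x 2 steps = effective 24
    accelerator="gpu",
    devices=1,                   # single NVIDIA A100 40GB
)
```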
## Training Configuration

```yaml
training:
  batch_size: 12
  accumulate_grad_batches: 2
  max_epochs: 5  # early stopped at epoch 3
  lora_lr: 5e-4
  projector_lr: 2e-3
  precision: bf16-mixed

lora:
  r: 32
  alpha: 64
  target_modules: [q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj]
```
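
The `lora` block corresponds to a standard PEFT `LoraConfig`. Below is a hedged sketch of wrapping the Gemma backbone with it; loading the base model directly via `transformers` is an assumption about this project's setup rather than its exact code path.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base language model (the Gemma 3 270M checkpoint this model builds on)
lm = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_config)
lm.print_trainable_parameters()  # LoRA weights only; the vision projector adds ~1.7M more
```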
## Performance

| Metric | Value |
|---|---|
| Training Loss | 1.029 |
| Validation Loss | 1.203 |
## Usage

### Installation

```bash
pip install torch transformers peft lightning omegaconf huggingface_hub
```
### Download Checkpoint

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="sagar007/multimodal-gemma-270m-checkpoints",
    filename="final_model.ckpt",
)
```
### Inference

```python
import torch

from src.models.multimodal_gemma import MultimodalGemma

# Load the Lightning checkpoint (weights_only=False may be needed on
# PyTorch >= 2.6 because the checkpoint stores the full training config)
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
config = checkpoint['hyper_parameters']['config']
model = MultimodalGemma(config)
model.load_state_dict(checkpoint['state_dict'], strict=False)
model.eval()

# Generate: input_ids come from the Gemma tokenizer, pixel_values from the
# CLIP image processor (see the preprocessing sketch below)
output = model.generate(
    input_ids=input_ids,
    images=pixel_values,
    max_new_tokens=100,
    temperature=0.7,
)
```
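
The snippet above assumes `input_ids` and `pixel_values` have already been prepared. A minimal preprocessing sketch using the Gemma tokenizer and the CLIP image processor follows; the plain prompt string is an assumption, since the exact prompt or image-token template used during training may differ.

```python
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Any local image file (the path is illustrative)
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 224, 224)

prompt = "What is shown in this image?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```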
## Files

| File | Size | Description |
|---|---|---|
| `final_model.ckpt` | 1.24 GB | Final trained model |
| `last.ckpt` | 1.24 GB | Last checkpoint |
| `multimodal-gemma-epoch=XX-val/` | ~1.2 GB each | Epoch checkpoints |
## Limitations

- Trained on English data only
- Limited to 224x224 image resolution
- Best for simple visual QA tasks
- May hallucinate details not present in images
## Citation

```bibtex
@misc{multimodal-gemma-270m,
  author    = {Sagar Pallai},
  title     = {Multimodal Gemma 270M: Vision-Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/sagar007/multimodal-gemma-270m-checkpoints}
}
```
## License

Apache 2.0
## Acknowledgments

- Google for Gemma models
- OpenAI for CLIP
- LLaVA team for the instruction dataset