# Multimodal Gemma 270M

A multimodal vision-language model that combines Google's Gemma 3 270M with CLIP ViT-L/14 for image understanding and visual question answering.
## Model Description

This model connects a frozen CLIP vision encoder to a Gemma 270M language model through a learned projection layer, enabling the model to understand and reason about images.
## Architecture

| Component | Model | Parameters |
|---|---|---|
| Language Model | Google Gemma 3 270M | 270M |
| Vision Encoder | OpenAI CLIP ViT-L/14 | 304M |
| Vision Projector | MLP (1024 → 896) | 1.7M |
| **Total** | | ~580M |
| **Trainable (LoRA + Projector)** | | ~9.3M |
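
The projector is the only newly trained bridge between the two backbones. The sketch below shows one plausible shape for it in PyTorch; the class name, the exact MLP layout, and the way visual tokens are spliced into the sequence are illustrative assumptions, not the implementation in `src/models/multimodal_gemma.py`.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps frozen CLIP ViT-L/14 patch features (1024-d) into the Gemma
    embedding space (896-d per the table above). A 1024 -> 896 -> 896
    two-layer MLP has roughly 1.7M parameters, consistent with the count
    reported here."""

    def __init__(self, clip_dim: int = 1024, lm_dim: int = 896) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1024) -> (batch, num_patches, 896)
        return self.mlp(clip_features)


# The projected visual tokens are prepended to the text token embeddings, and
# the combined sequence is fed to the LoRA-adapted Gemma decoder, e.g.:
#   inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
```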
## Training Details

- Dataset: LLaVA-Instruct-150K (50K subset)
- Hardware: NVIDIA A100 40GB
- Training Time: ~1.5 hours
- Batch Size: 12 (effective: 24 with gradient accumulation)
- Precision: BF16 mixed precision
- Optimizer: AdamW with fused kernels
- LoRA Rank: 32, Alpha: 64
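
Under these settings, the Lightning trainer and the two-group optimizer can be sketched as below; `model.lora_parameters()` and `model.projector` are illustrative names rather than this repository's actual API, and the learning rates are the ones listed in the configuration that follows.

```python
import torch
import lightning as L


def configure_optimizers(model):
    """Fused AdamW with two learning rates: LoRA adapters at 5e-4 and the
    vision projector at 2e-3 (see the configuration below)."""
    return torch.optim.AdamW(
        [
            {"params": model.lora_parameters(), "lr": 5e-4},       # illustrative accessor
            {"params": model.projector.parameters(), "lr": 2e-3},  # illustrative attribute
        ],
        fused=True,  # fused CUDA kernels
    )


trainer = L.Trainer(
    max_epochs=5,
    precision="bf16-mixed",      # BF16 mixed precision
    accumulate_grad_batches=2,   # batch size 12 x 2 steps = effective 24
    accelerator="gpu",
    devices=1,                   # single NVIDIA A100 40GB
)
```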
## Training Configuration

```yaml
training:
  batch_size: 12
  accumulate_grad_batches: 2
  max_epochs: 5  # early stopped at epoch 3
  lora_lr: 5e-4
  projector_lr: 2e-3
  precision: bf16-mixed

lora:
  r: 32
  alpha: 64
  target_modules: [q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj]
```
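
The `lora` block corresponds to a standard PEFT `LoraConfig`. Below is a hedged sketch of wrapping the Gemma backbone with it; loading the base model directly via `transformers` is an assumption about this project's setup rather than its exact code path.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base language model (the Gemma 3 270M checkpoint this model builds on)
lm = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_config)
lm.print_trainable_parameters()  # LoRA weights only; the vision projector adds ~1.7M more
```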
## Performance

| Metric | Value |
|---|---|
| Training Loss | 1.029 |
| Validation Loss | 1.203 |
## Usage

### Installation

```bash
pip install torch transformers peft lightning omegaconf huggingface_hub
```
### Download Checkpoint

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="sagar007/multimodal-gemma-270m-checkpoints",
    filename="final_model.ckpt",
)
```
### Inference

```python
import torch

from src.models.multimodal_gemma import MultimodalGemma

# Load the Lightning checkpoint (weights_only=False may be needed on
# PyTorch >= 2.6 because the checkpoint stores the full training config)
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
config = checkpoint['hyper_parameters']['config']
model = MultimodalGemma(config)
model.load_state_dict(checkpoint['state_dict'], strict=False)
model.eval()

# Generate: input_ids come from the Gemma tokenizer, pixel_values from the
# CLIP image processor (see the preprocessing sketch below)
output = model.generate(
    input_ids=input_ids,
    images=pixel_values,
    max_new_tokens=100,
    temperature=0.7,
)
```
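
The snippet above assumes `input_ids` and `pixel_values` have already been prepared. A minimal preprocessing sketch using the Gemma tokenizer and the CLIP image processor follows; the plain prompt string is an assumption, since the exact prompt or image-token template used during training may differ.

```python
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Any local image file (the path is illustrative)
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 224, 224)

prompt = "What is shown in this image?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```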
## Files

| File | Size | Description |
|---|---|---|
| `final_model.ckpt` | 1.24 GB | Final trained model |
| `last.ckpt` | 1.24 GB | Last checkpoint |
| `multimodal-gemma-epoch=XX-val/` | ~1.2 GB each | Epoch checkpoints |
## Limitations

- Trained on English data only
- Limited to 224x224 image resolution
- Best for simple visual QA tasks
- May hallucinate details not present in images
## Citation

```bibtex
@misc{multimodal-gemma-270m,
  author    = {Sagar Pallai},
  title     = {Multimodal Gemma 270M: Vision-Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/sagar007/multimodal-gemma-270m-checkpoints}
}
```
## License

Apache 2.0
## Acknowledgments

- Google for Gemma models
- OpenAI for CLIP
- LLaVA team for the instruction dataset