Multimodal Gemma 270M

A vision-language model that pairs Google Gemma 3 270M with a CLIP ViT-L/14 vision encoder for image understanding and visual question answering.

Model Description

This model connects a frozen CLIP vision encoder to a Gemma 270M language model through a learned projection layer, enabling the model to understand and reason about images.

Architecture

Component                    | Model                | Parameters
Language Model               | Google Gemma 3 270M  | 270M
Vision Encoder               | OpenAI CLIP ViT-L/14 | 304M
Vision Projector             | MLP (1024 → 896)     | 1.7M
Total                        |                      | ~580M
Trainable (LoRA + Projector) |                      | ~9.3M
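
The projector is the piece that bridges the two pretrained backbones. Below is a minimal sketch of such a module, assuming a two-layer MLP with a GELU in between (the class name VisionProjector is hypothetical; the actual implementation lives in src/models/multimodal_gemma.py):

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps CLIP ViT-L/14 patch features (1024-d) into the Gemma
    embedding space (896-d), per the dimensions in the table above."""

    def __init__(self, clip_dim: int = 1024, lm_dim: int = 896):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1024) -> (batch, num_patches, 896)
        return self.mlp(clip_features)

Two linear layers of these sizes come to roughly 1.7M parameters, consistent with the projector row above; the projected patch embeddings are then combined with the text embeddings before being fed to Gemma.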

Training Details

  • Dataset: LLaVA-Instruct-150K (50K subset)
  • Hardware: NVIDIA A100 40GB
  • Training Time: ~1.5 hours
  • Batch Size: 12 (effective: 24 with gradient accumulation)
  • Precision: BF16 mixed precision
  • Optimizer: AdamW with fused kernels
  • LoRA Rank: 32, Alpha: 64

Training Configuration

training:
  batch_size: 12
  accumulate_grad_batches: 2
  max_epochs: 5  # early stopped at epoch 3
  lora_lr: 5e-4
  projector_lr: 2e-3
  precision: bf16-mixed

lora:
  r: 32
  alpha: 64
  target_modules: [q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj]
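
For reference, these two config blocks map onto standard PEFT and Lightning objects. A minimal sketch, assuming google/gemma-3-270m as the base checkpoint (the exact base-model id is not stated on this card):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
import lightning as L

# Base language model (assumed checkpoint id)
lm = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

# Mirrors the lora: block above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_config)
lm.print_trainable_parameters()  # LoRA adapters only; the projector adds ~1.7M more

# Mirrors the training: block above (dataloaders and the LightningModule omitted)
trainer = L.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16-mixed",
    max_epochs=5,               # training stopped early at epoch 3
    accumulate_grad_batches=2,  # effective batch size 12 * 2 = 24
)

The two learning rates (lora_lr and projector_lr) would be wired into separate parameter groups inside the LightningModule's configure_optimizers, which is part of the repository code and not reproduced here.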

Performance

Metric          | Value
Training Loss   | 1.029
Validation Loss | 1.203

Usage

Installation

pip install torch transformers peft lightning omegaconf huggingface_hub

Download Checkpoint

from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="sagar007/multimodal-gemma-270m-checkpoints",
    filename="final_model.ckpt"
)

Inference

from src.models.multimodal_gemma import MultimodalGemma
import torch

# Load model (weights_only=False is needed on recent PyTorch because the
# Lightning checkpoint stores the training config alongside the weights)
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
config = checkpoint['hyper_parameters']['config']
model = MultimodalGemma(config)
model.load_state_dict(checkpoint['state_dict'], strict=False)
model.eval()

# Generate (input_ids from the Gemma tokenizer, pixel_values from the CLIP
# image processor; see the preprocessing sketch below)
output = model.generate(
    input_ids=input_ids,
    images=pixel_values,
    max_new_tokens=100,
    temperature=0.7
)
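
The snippet above assumes input_ids and pixel_values have already been prepared. A minimal preprocessing sketch using the Gemma tokenizer and the CLIP image processor (the checkpoint ids and the plain-text prompt are assumptions; the exact prompt template and image-token handling depend on the repository's training code):

from transformers import AutoTokenizer, CLIPImageProcessor
from PIL import Image

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")  # assumed tokenizer id
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# CLIP ViT-L/14 expects 224x224 inputs; the processor resizes and normalizes
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 224, 224)

prompt = "What is shown in this image?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids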

Files

File                           | Size         | Description
final_model.ckpt               | 1.24 GB      | Final trained model
last.ckpt                      | 1.24 GB      | Last checkpoint
multimodal-gemma-epoch=XX-val/ | ~1.2 GB each | Per-epoch checkpoints

Limitations

  • Trained on English data only
  • Limited to 224×224 input resolution (the CLIP ViT-L/14 input size)
  • Best suited to simple visual question answering tasks
  • May hallucinate details not present in images

Citation

@misc{multimodal-gemma-270m,
  author = {Sagar Pallai},
  title = {Multimodal Gemma 270M: Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sagar007/multimodal-gemma-270m-checkpoints}
}

License

Apache 2.0

Acknowledgments

  • Google for Gemma models
  • OpenAI for CLIP
  • LLaVA team for the instruction dataset