<a href="https://colab.research.google.com/github/merveenoyan/smollm/blob/main/vision/finetuning/SmolVLM2_Video_FT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune SmolVLM2 on Video Captioning
In this notebook we will fine-tune SmolVLM2-500M-Video-Instruct on the Video Feedback dataset. It was run on a Colab A100 for full fine-tuning, but you can squeeze it onto an L4 with QLoRA.

In [None]:
%pip install -q accelerate datasets peft bitsandbytes tensorboard pyav num2words



In [None]:
%pip install -q git+https://github.com/huggingface/transformers.git

In [None]:
%pip install -q flash-attn --no-build-isolation

We will push our model to the Hub, so we need to authenticate first.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In this notebook we do full fine-tuning on the 500M variant. You can also apply LoRA or QLoRA to the 2.2B variant; QLoRA attaches an adapter to a quantized version of the model, which saves memory. For full fine-tuning, set both `USE_LORA` and `USE_QLORA` to False. For LoRA, set `USE_LORA` to True and `USE_QLORA` to False.

The small model benefits from updating more of its weights, so we suggest disabling LoRA and QLoRA when fine-tuning it.

In [None]:
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, BitsAndBytesConfig, AutoModelForImageTextToText
import os


USE_LORA = False
USE_QLORA = False
# SMOL = True

# model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct" if SMOL else "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"

processor = AutoProcessor.from_pretrained(
    model_id
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    lora_config.inference_mode = False
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )

    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        quantization_config=bnb_config if USE_QLORA else None,
        attn_implementation="flash_attention_2",
        device_map="auto"
    )
    if USE_QLORA:
        # prepare the quantized base model before attaching the adapter
        model = prepare_model_for_kbit_training(model)
    # wrap the model with the LoRA adapter exactly once
    model = get_peft_model(model, lora_config)
    print(model.get_nb_trainable_parameters())
else:
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    ).to("cuda")

    # freeze the vision encoder so only the LLM is fine-tuned; remove this loop to train everything
    for param in model.model.vision_model.parameters():
        param.requires_grad = False

peak_mem = torch.cuda.max_memory_allocated()
print(f"The model as is is holding: {peak_mem / 1024**3:.2f} of GPU RAM")



The model as is is holding: 0.97 of GPU RAM
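The way the two flags in the loading cell combine can be summarized with a small helper. This is only an illustrative sketch: `training_mode` is a hypothetical function that is not part of the notebook, and it assumes, as the loading code does, that `USE_QLORA` takes precedence when both flags are set.

```python
def training_mode(use_lora: bool, use_qlora: bool) -> str:
    """Return which fine-tuning path the two flags select (illustrative helper)."""
    if use_qlora:
        return "qlora"  # 4-bit quantized base model + LoRA adapter
    if use_lora:
        return "lora"   # unquantized base model + LoRA adapter
    return "full"       # no adapter: the unfrozen model weights are trained directly


for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, "->", training_mode(*flags))
```

With the defaults in this notebook (`USE_LORA = False`, `USE_QLORA = False`) the helper returns `"full"`.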


## Loading the dataset and Preprocessing

We will load the `real` configuration of the VideoFeedback dataset, which pairs videos with very short captions and has about 4k examples. To keep training short, we split it in half and train on one half only.

In [None]:
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/VideoFeedback", "real")

In [None]:
split_ds = ds["train"].train_test_split(test_size=0.5)
train_ds = split_ds["train"]

In [None]:
del split_ds, ds

Take a sneak peek.

In [None]:
print(f"prompt:  {train_ds[0]['text prompt']}, video: {train_ds[0]['video link']}")

prompt:  A dog inside of a dog kennel on a patio., video: https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/resolve/main/p/p110924.mp4


Let's write our data collating function. We apply the chat template so that each video and its caption end up in the same sequence, which lets the model learn to caption. The processor handles both the formatted prompt and the video, and the collator then pads every example in the batch to a common shape.

In [None]:
from torch.nn.utils.rnn import pad_sequence

image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")
]

def collate_fn(examples):
    instances = []
    for example in examples:
        prompt = example["text prompt"]

        user_content = [{"type": "text", "text": "Caption the video."}]
        user_content.append({"type": "video", "path": example["video link"]})

        messages = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": f"{prompt}"}]}
        ]

        instance = processor.apply_chat_template(messages, add_generation_prompt=False,
                                                 tokenize=True, return_dict=True, return_tensors="pt").to("cuda").to(model.dtype)
        instances.append(instance)


    input_ids = pad_sequence(
        [inst["input_ids"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=processor.tokenizer.pad_token_id
    )
    attention_mask = pad_sequence(
        [inst["attention_mask"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=0
    )
    labels = pad_sequence(
        [inst["input_ids"].squeeze(0).clone() for inst in instances],
        batch_first=True,
        padding_value=-100
    )

    labels[labels == image_token_id] = -100

    out = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }


    # Step 1: figure out maximum frames, height, width across the batch
    pvs = [inst["pixel_values"].squeeze(0) for inst in instances if "pixel_values" in inst]
    if pvs:  # there is at least one non-None pixel_values
        max_frames = max(pv.shape[0] for pv in pvs)
        max_h = max(pv.shape[-2] for pv in pvs)
        max_w = max(pv.shape[-1] for pv in pvs)
    else:
        max_h = max_w = processor.video_size['longest_edge']
        max_frames = 1

    padded_pixel_values_list = []
    for ex in instances:
        pv = ex.get("pixel_values", None)

        if pv is None:
            # text-only => fill pixel data with zeros, matching the dtype/device of the other examples
            shape_pv = (max_frames, 3, max_h, max_w)
            padded_pv = torch.zeros(shape_pv, dtype=model.dtype, device=model.device)
        else:
            pv = pv.squeeze(0)
            f, c, h, w = pv.shape
            # Prepare final storage
            padded_pv = torch.zeros(
                (max_frames, c, max_h, max_w),
                dtype=pv.dtype,
                device=pv.device
            )
            padded_pv[:f, :, :h, :w] = pv
        padded_pixel_values_list.append(padded_pv)

    out["pixel_values"] = torch.stack(padded_pixel_values_list, dim=0)
    return out
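The padding and label-masking logic in `collate_fn` can be illustrated on toy token lists. The sketch below is pure Python rather than `pad_sequence`, so it runs without a GPU or the processor; `pad_and_mask` and the token ids are hypothetical and chosen only for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def pad_and_mask(sequences, pad_token_id, image_token_id):
    """Right-pad token lists and build labels that ignore padding and image tokens."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # image placeholder tokens carry no text to predict, so they are masked out
        row = [IGNORE_INDEX if tok == image_token_id else tok for tok in seq]
        labels.append(row + [IGNORE_INDEX] * n_pad)
    return input_ids, attention_mask, labels


# Hypothetical ids: 0 = padding, 9 = image placeholder.
ids, mask, labels = pad_and_mask([[5, 9, 9, 7], [5, 8]], pad_token_id=0, image_token_id=9)
print(ids)     # [[5, 9, 9, 7], [5, 8, 0, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(labels)  # [[5, -100, -100, 7], [5, 8, -100, -100]]
```

Positions labeled `-100` do not contribute to the loss, so the model is not trained to predict padding or image placeholder tokens.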

## Training

We can now define `TrainingArguments` and pass them to `Trainer`.

Some notes:
- With 8-bit QLoRA, the setup below uses around 16.4 GB of VRAM, which fits comfortably on an L4.
- Gradient accumulation simulates a larger batch size (set `gradient_accumulation_steps` above 1 to enable it).
- Gradient checkpointing saves memory on intermediate activations (`gradient_checkpointing=True` in `TrainingArguments`).

**Disclaimer:**
The techniques here aren't a free lunch. The latter two add compute to training and slow it down a bit. For reference, on two A100s with a batch size of 16, training took 2 hrs 43 mins with 4 gradient accumulation steps; disabling accumulation brought it down to 2 hrs 35 mins.
If you want to speed things up, you can experiment with 4-bit precision and a larger batch size. Note that 4-bit quantization might result in the model learning less.

In [None]:
from transformers import TrainingArguments, Trainer

model_name = model_id.split("/")[-1]

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=250,
    save_total_limit=1,
    optim="adamw_torch", # for 8-bit, use paged_adamw_8bit instead
    bf16=True,
    output_dir=f"./{model_name}-video-feedback",
    hub_model_id=f"{model_name}-video-feedback",
    remove_unused_columns=False,
    report_to="tensorboard",
    dataloader_pin_memory=False
)
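The relationship between batch size, gradient accumulation, and the number of optimizer steps can be sketched with a little arithmetic. The helper names are hypothetical, and the count of 2,000 training examples is an assumption chosen to be consistent with a one-epoch run of 1,000 steps at a per-device batch size of 2. `Trainer` may round a final partial batch differently.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def steps_per_epoch(num_examples, per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Approximate optimizer steps per epoch (a final partial batch is rounded up)."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)


NUM_EXAMPLES = 2000  # hypothetical count, not read from the dataset

print(effective_batch_size(2, 1))           # 2
print(steps_per_epoch(NUM_EXAMPLES, 2, 1))  # 1000
print(effective_batch_size(2, 4))           # 8
print(steps_per_epoch(NUM_EXAMPLES, 2, 4))  # 250
```

Raising `gradient_accumulation_steps` from 1 to 4 keeps per-step memory the same while each optimizer update sees four times as many examples.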


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_ds,
)

In [None]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 25 | 3.3456 |
| 50 | 0.7095 |
| 75 | 0.341 |
| 100 | 0.2722 |
| 125 | 0.2506 |
| 150 | 0.2904 |
| 175 | 0.2611 |
| 200 | 0.258 |
| 225 | 0.2765 |
| 250 | 0.2659 |


TrainOutput(global_step=1000, training_loss=0.3446595501899719, metrics={'train_runtime': 1194.5916, 'train_samples_per_second': 1.674, 'train_steps_per_second': 0.837, 'total_flos': 1550232912784896.0, 'train_loss': 0.3446595501899719, 'epoch': 1.0})

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/merve/SmolVLM2-500M-Video-Instruct-video-feedback/commit/2f33b0685d991475ac091593e224f3e5e7b7cac7', commit_message='End of training', commit_description='', oid='2f33b0685d991475ac091593e224f3e5e7b7cac7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/merve/SmolVLM2-500M-Video-Instruct-video-feedback', endpoint='https://huggingface.co', repo_type='model', repo_id='merve/SmolVLM2-500M-Video-Instruct-video-feedback'), pr_revision=None, pr_num=None)

The test example is a video of a woman walking by. You can download and check it [here](https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/blob/main/p/p000304.mp4).

In [None]:
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Caption the video."},
        {"type": "video", "path": "https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/resolve/main/p/p000304.mp4"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to("cuda").to(model.dtype)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

User: Caption the video.You are provided the following series of three frames from a 0:00:03 [H:MM:SS] video.

Frame from 00:00:
Frame from 00:01:
Frame from 00:02:


Assistant: woman in white shirt walks by
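The decoded text contains the full prompt as well as the model's reply. One simple way to keep only the caption is to split on the `Assistant:` marker seen in the output above. This is a minimal sketch: `extract_assistant_reply` is a hypothetical helper, and it assumes the marker appears in the decoded text exactly as shown.

```python
def extract_assistant_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last marker, or the whole string if the marker is absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()


sample = "User: Caption the video.\n\nAssistant: woman in white shirt walks by"
print(extract_assistant_reply(sample))  # woman in white shirt walks by
```

In the notebook this could be applied to `generated_texts[0]`.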
