ProfVLM: Video-Language Model for Sports Proficiency Analysis

ProfVLM is a multimodal model that combines video understanding with language generation to analyze performance and proficiency levels in human activities.

Model Description

ProfVLM integrates:

  • Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
  • Vision Encoder: facebook/timesformer-base-finetuned-k600
  • Custom Video Adapter: AttentiveProjector with multi-head attention for view integration

Key Features

  • Multi-view support: Processes 5 camera views simultaneously
  • Temporal modeling: Analyzes 8 frames per video
  • Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
  • Sport agnostic: Trained on multiple sports (basketball, cooking, dance, bouldering, soccer, music)
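
The card states that 8 frames are analyzed per video but not how they are chosen. A common choice is uniform sampling over the clip; the sketch below assumes that strategy, and `sample_frame_indices` is a hypothetical helper, not part of ProfVLM.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Take the centre of each of `num_frames` equal-length segments.
    # Clips shorter than `num_frames` repeat indices instead of failing.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_frames))
        for i in range(num_frames)
    ]
```

For an 80-frame clip this selects frames 5, 15, ..., 75. The same indices would be applied to each of the 5 views so the views stay temporally aligned.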

Model Architecture

Video Input (B, V, T, C, H, W) → TimesFormer → AttentiveProjector → LLM → Text Analysis

Where:

  • B: Batch size
  • V: Number of views (5)
  • T: Number of frames (8)
  • C, H, W: Channel, Height, Width
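
The TimeSformer implementation in Transformers takes five-dimensional input (batch, frames, channels, height, width), so the six-dimensional multi-view tensor has to be reshaped before encoding. One plausible approach, which is an assumption about ProfVLM rather than something this card documents, is to fold the view axis into the batch axis and unfold it again before the views are fused. The shape bookkeeping can be sketched in plain Python (both function names are hypothetical):

```python
def fold_views(shape):
    """(B, V, T, C, H, W) -> (B*V, T, C, H, W): treat every view as its own clip."""
    b, v, t, c, h, w = shape
    return (b * v, t, c, h, w)


def unfold_views(shape, num_views):
    """(B*V, N, D) encoder features -> (B, V, N, D) so views can be fused per sample."""
    bv, n, d = shape
    if bv % num_views:
        raise ValueError("leading dimension is not a multiple of num_views")
    return (bv // num_views, num_views, n, d)
```

For example, `fold_views((2, 5, 8, 3, 224, 224))` gives `(10, 8, 3, 224, 224)`; the 224-pixel resolution is illustrative and not taken from this card. With PyTorch tensors the same fold is `x.flatten(0, 1)`.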

Usage

import torch
from transformers import AutoTokenizer, AutoImageProcessor
from your_module import ProfVLM, load_model  # `your_module` is a placeholder for the module that defines ProfVLM

# Load the model
model = load_model("path/to/model", device="cuda")
model.eval()

# Prepare your video data (`videos` is not defined in this snippet; load it yourself)
# videos should be a list of lists: [[view1_frames, view2_frames, ...]]
# where each view contains 8 RGB frames

messages = [
    {"role": "system", "content": "You are a visual agent for human performance analysis."},
    {"role": "user", "content": "Here are 8 frames sampled from a video: <|video_start|><|video|><|video_end|>. Given this video, analyze the proficiency level of the subject."}
]

prompt = model.processor.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
batch = model.processor(text=[prompt], videos=[videos], return_tensors="pt", padding=True)

# Generate analysis
with torch.no_grad():
    # ... (implementation details as in your generate_on_test_set function)
    pass
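
Because the processor call depends on the nested structure of `videos`, it can help to check that structure before running the model. The sketch below reads one sample as a list of 5 views, each a sequence of 8 frames; that reading and the `check_views` helper are assumptions, not part of the ProfVLM code.

```python
def check_views(views, num_views=5, num_frames=8):
    """Raise ValueError unless `views` holds `num_views` sequences of `num_frames` frames."""
    if len(views) != num_views:
        raise ValueError(f"expected {num_views} views, got {len(views)}")
    for i, frames in enumerate(views):
        if len(frames) != num_frames:
            raise ValueError(f"view {i}: expected {num_frames} frames, got {len(frames)}")
    return views
```

The frames themselves are not inspected, so the check works whether they are PIL images, NumPy arrays, or tensors.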

Training Details

Dataset

  • Multi-sport dataset with proficiency annotations
  • Sports: Basketball, Cooking, Dance, Bouldering, Soccer, Music
  • Proficiency levels: Novice, Early Expert, Intermediate Expert, Late Expert
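
Since the model produces free text, comparing against these annotations requires mapping each generated analysis back to one of the four levels. The card does not show the output format, so the sketch below assumes the level name appears verbatim somewhere in the text; `extract_proficiency` is a hypothetical helper.

```python
PROFICIENCY_LEVELS = ("Novice", "Early Expert", "Intermediate Expert", "Late Expert")


def extract_proficiency(text):
    """Return the level named earliest in `text` (case-insensitive), or None."""
    lowered = text.lower()
    hits = [(lowered.find(level.lower()), level) for level in PROFICIENCY_LEVELS]
    hits = [(pos, level) for pos, level in hits if pos >= 0]
    return min(hits)[1] if hits else None
```

For instance, `extract_proficiency("The player shows late expert control.")` returns `"Late Expert"`, and text that names no level returns `None` so it can be counted as unparseable rather than silently mislabeled.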

Training Configuration

  • LoRA: r=32, alpha=64, dropout=0.1
  • Video Processing: 8 frames per video, 5 views
  • Optimization: AdamW with cosine scheduling
  • Mixed Precision: FP16 training

Performance

The model demonstrates strong performance in:

  • Multi-view video understanding
  • Temporal feature integration
  • Cross-sport proficiency assessment
  • Human performance analysis

File Structure

model/
├── llm_lora/              # LoRA adapter weights
├── tokenizer/             # Tokenizer files
├── vision_processor/      # Vision processor config
├── video_adapter.pt       # Custom video adapter weights
├── config.json           # Model configuration
└── README.md             # This file
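
Before loading, a downloaded checkpoint directory can be checked against this layout. The entry names below come from the tree above; `missing_files` is a hypothetical helper and not part of the model code.

```python
from pathlib import Path

EXPECTED = ("llm_lora", "tokenizer", "vision_processor", "video_adapter.pt", "config.json")


def missing_files(model_dir):
    """Return the expected entries that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty list means every expected entry is present; the check does not validate the contents of the files.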

Requirements

torch>=2.0.0
transformers>=4.35.0
peft>=0.6.0
av>=10.0.0
opencv-python>=4.8.0
torchvision>=0.15.0
numpy>=1.24.0
pillow>=9.5.0

Citation

If you use this model, please cite:

Coming soon.

License

This model is released under the Apache 2.0 License.

Acknowledgments

  • Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
  • Vision Encoder: facebook/timesformer-base-finetuned-k600
  • Built with 🤗 Transformers and PyTorch