ProfVLM: Video-Language Model for Sports Proficiency Analysis

ProfVLM is a multimodal model that combines video understanding with language generation to assess human performance and proficiency levels across a range of activities.

Model Description

ProfVLM integrates:

Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
Vision Encoder: facebook/timesformer-base-finetuned-k600
Custom Video Adapter: AttentiveProjector with multi-head attention for view integration

Key Features

Multi-view support: Processes 5 camera views simultaneously
Temporal modeling: Analyzes 8 frames per video
Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
Activity agnostic: Trained on multiple activities (basketball, cooking, dance, bouldering, soccer, music)

Model Architecture

Video Input (B, V, T, C, H, W) → TimeSformer → AttentiveProjector → LLM → Text Analysis

Where:

B: Batch size
V: Number of views (5)
T: Number of frames (8)
C, H, W: Channel, Height, Width
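
The shape flow above can be traced with random tensors. The following is a minimal sketch, not the released AttentiveProjector: the class name AttentiveProjectorSketch, the learned-query design, the number of queries, the token count N, and the widths 768 and 576 are all assumptions made for illustration, and a random tensor stands in for the TimeSformer output.

```python
import torch
import torch.nn as nn


class AttentiveProjectorSketch(nn.Module):
    """Hypothetical stand-in for the AttentiveProjector, not the released code."""

    def __init__(self, vision_dim=768, llm_dim=576, num_queries=16, num_heads=8):
        super().__init__()
        # Learned queries that attend over the tokens of all views at once.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, view_tokens):
        # view_tokens: (B, V, N, D) = batch, views, tokens per view, vision width
        b, v, n, d = view_tokens.shape
        tokens = view_tokens.reshape(b, v * n, d)       # merge views into one sequence
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.attn(queries, tokens, tokens)   # (B, num_queries, D)
        return self.proj(fused)                         # (B, num_queries, llm_dim)


# Shape walk-through with random data in place of real video and a real encoder.
B, V, T, C, H, W = 2, 5, 8, 3, 224, 224
video = torch.randn(B, V, T, C, H, W)
per_view_clips = video.flatten(0, 1)          # (B*V, T, C, H, W): one clip per view
N, D = 32, 768                                # assumed token count and encoder width
encoder_out = torch.randn(B * V, N, D)        # placeholder for the encoder output
view_tokens = encoder_out.reshape(B, V, N, D)
video_embeds = AttentiveProjectorSketch()(view_tokens)
print(video_embeds.shape)
```

In this sketch the views are fused by letting a fixed set of queries attend over the concatenated tokens of all 5 views, so the LLM receives the same number of video embeddings regardless of V. The released adapter may fuse views differently.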

Usage

import torch
from transformers import AutoTokenizer, AutoImageProcessor
# your_module is a placeholder: import from the module that defines ProfVLM and load_model
from your_module import ProfVLM, load_model

# Load the model
model = load_model("path/to/model", device="cuda")
model.eval()

# Prepare your video data and assign it to `videos` before calling the processor
# videos should be a list of lists: [[view1_frames, view2_frames, ...]]
# where each view contains 8 RGB frames

messages = [
    {"role": "system", "content": "You are a visual agent for human performance analysis."},
    {"role": "user", "content": "Here are 8 frames sampled from a video: <|video_start|><|video|><|video_end|>. Given this video, analyze the proficiency level of the subject."}
]

prompt = model.processor.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
batch = model.processor(text=[prompt], videos=[videos], return_tensors="pt", padding=True)

# Generate analysis
with torch.no_grad():
    # ... (implementation details as in your generate_on_test_set function)
    pass
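
The usage snippet expects 8 RGB frames per view but does not say how they are chosen. One common option is uniform sampling over the clip; the sketch below assumes that strategy, and the helper name sample_frame_indices is hypothetical rather than part of this repository.

```python
def sample_frame_indices(num_frames_in_video, num_samples=8):
    """Return `num_samples` evenly spaced frame indices (segment midpoints)."""
    if num_frames_in_video <= 0:
        raise ValueError("video has no frames")
    segment = num_frames_in_video / num_samples
    # Take the middle of each equal-length segment; clamp for very short videos.
    return [
        min(int(segment * (i + 0.5)), num_frames_in_video - 1)
        for i in range(num_samples)
    ]


indices = sample_frame_indices(240)
print(indices)  # [15, 45, 75, 105, 135, 165, 195, 225]
```

Applying the same indices to every camera view keeps the 5 views temporally aligned. Videos shorter than 8 frames yield repeated indices instead of an error.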

Training Details

Dataset

Multi-activity dataset with proficiency annotations
Activities: Basketball, Cooking, Dance, Bouldering, Soccer, Music
Proficiency levels: Novice, Early Expert, Intermediate Expert, Late Expert
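
Because the model answers in free text, downstream evaluation usually needs to map a generated analysis back to one of the four levels. The following is a rough sketch assuming simple case-insensitive string matching; the helper extract_proficiency is hypothetical and is not part of the released code.

```python
PROFICIENCY_LEVELS = ["Novice", "Early Expert", "Intermediate Expert", "Late Expert"]


def extract_proficiency(generated_text):
    """Return the first proficiency label mentioned in the text, or None."""
    lowered = generated_text.lower()
    hits = []
    for level in PROFICIENCY_LEVELS:
        position = lowered.find(level.lower())
        if position >= 0:
            hits.append((position, level))
    # The earliest mention wins when several labels appear in the text.
    return min(hits, key=lambda hit: hit[0])[1] if hits else None


label = extract_proficiency("The subject performs as an Intermediate Expert.")
print(label)  # Intermediate Expert
```

Returning None for unmatched text makes it easy to count generations that name no level, instead of silently assigning them to a class.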

Training Configuration

LoRA: r=32, alpha=64, dropout=0.1
Video Processing: 8 frames per video, 5 views
Optimization: AdamW with cosine scheduling
Mixed Precision: FP16 training
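
The optimizer and schedule listed above can be reproduced with standard PyTorch components. This is only a sketch: the learning rate, step count, and tiny linear model are placeholders that the source does not specify, no warmup is assumed, and FP16 autocast is left out so the example also runs on CPU.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)   # tiny stand-in for the trainable parameters
total_steps = 100                # hypothetical; the step count is not given
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed learning rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

lrs = []
for step in range(total_steps):
    inputs = torch.randn(8, 16)
    loss = model(inputs).pow(2).mean()   # dummy objective for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])   # learning rate used for this step
    scheduler.step()

final_lr = scheduler.get_last_lr()[0]
print(lrs[0], lrs[total_steps // 2], final_lr)
```

For the LoRA settings, the peft library's LoraConfig accepts r, lora_alpha, and lora_dropout, so the values above map to r=32, lora_alpha=64, lora_dropout=0.1; which modules the adapters target is not stated here. On a GPU, FP16 training is typically done by wrapping the forward pass in torch.autocast and scaling the loss with a gradient scaler.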

Performance

The model demonstrates strong performance in:

Multi-view video understanding
Temporal feature integration
Cross-sport proficiency assessment
Human performance analysis

File Structure

model/
├── llm_lora/              # LoRA adapter weights
├── tokenizer/             # Tokenizer files
├── vision_processor/      # Vision processor config
├── video_adapter.pt       # Custom video adapter weights
├── config.json            # Model configuration
└── README.md              # This file
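
Before loading, it can help to confirm that a downloaded checkpoint folder contains the entries listed above. The sketch below assumes exactly that layout; the helper missing_entries is hypothetical and does not reflect how load_model validates its input.

```python
import tempfile
from pathlib import Path

EXPECTED_ENTRIES = [
    "llm_lora",
    "tokenizer",
    "vision_processor",
    "video_adapter.pt",
    "config.json",
]


def missing_entries(model_dir):
    """Return the expected files or folders that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Demonstration on a temporary folder that holds only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "tokenizer").mkdir()
    missing = missing_entries(tmp)

print(missing)  # ['llm_lora', 'vision_processor', 'video_adapter.pt']
```

An empty list means every expected entry is present; it does not verify that the weights themselves are valid.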

Requirements

torch>=2.0.0
transformers>=4.35.0
peft>=0.6.0
av>=10.0.0
opencv-python>=4.8.0
torchvision>=0.15.0
numpy>=1.24.0
pillow>=9.5.0
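
To see which of these packages are present in the current environment, the standard library can report installed versions. This is a small sketch with a hypothetical helper, installed_versions; it only lists versions and does not compare them against the minimums above.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = [
    "torch",
    "transformers",
    "peft",
    "av",
    "opencv-python",
    "torchvision",
    "numpy",
    "pillow",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIRED)
for name, installed in report.items():
    print(f"{name}: {installed or 'not installed'}")
```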

Citation

If you use this model, please cite:

Coming soon.

License

This model is released under the Apache 2.0 License.

Acknowledgments

Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
Vision Encoder: facebook/timesformer-base-finetuned-k600
Built with 🤗 Transformers and PyTorch

EdBianchi/ProfVLMv1-EgoExos-Attn