# ProfVLM: Video-Language Model for Sports Proficiency Analysis
ProfVLM is a multimodal model that combines video understanding with language generation to analyze performance and proficiency levels in human activities.
## Model Description
ProfVLM integrates:
- Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
- Vision Encoder: facebook/timesformer-base-finetuned-k600
- Custom Video Adapter: AttentiveProjector with multi-head attention for view integration
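The adapter's source is not reproduced in this card; the sketch below shows one plausible shape for it, assuming learnable query tokens that attend over the concatenated per-view TimesFormer tokens and a linear projection into the LLM embedding space. Dimensions, the query count, and layer names are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AttentiveProjector(nn.Module):
    """Hypothetical sketch of the video adapter: fuse multi-view vision tokens
    into a fixed number of LLM-space tokens via multi-head attention."""

    def __init__(self, vision_dim=768, llm_dim=576, num_queries=8, num_heads=8):
        super().__init__()
        # Learnable queries that pool information across views and time
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, view_tokens):
        # view_tokens: (B, V*S, vision_dim) — tokens from all V views concatenated
        q = self.queries.unsqueeze(0).expand(view_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, view_tokens, view_tokens)  # attend across views
        return self.proj(fused)  # (B, num_queries, llm_dim)
```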
### Key Features
- Multi-view support: Processes 5 camera views simultaneously
- Temporal modeling: Analyzes 8 frames per video
- Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
- Activity agnostic: Trained on multiple activity domains (basketball, cooking, dance, bouldering, soccer, music)
## Model Architecture

```
Video Input (B, V, T, C, H, W) → TimesFormer → AttentiveProjector → LLM → Text Analysis
```
Where:
- B: Batch size
- V: Number of views (5)
- T: Number of frames (8)
- C, H, W: Channel, Height, Width
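To make the shapes concrete, the toy trace below folds the view axis into the batch so TimesFormer sees standard clips, then unfolds it for view fusion. The token count and hidden sizes are illustrative placeholders (the real values come from the encoder config), and the encoder and projector calls are stubbed out.

```python
import torch

B, V, T, C, H, W = 2, 5, 8, 3, 224, 224
videos = torch.randn(B, V, T, C, H, W)

# Fold views into the batch: TimesFormer expects (B', T, C, H, W) clips
clips = videos.view(B * V, T, C, H, W)

# Stub for: timesformer(clips).last_hidden_state -> (B*V, S, 768)
S, vision_dim = 1569, 768  # e.g. 8 frames x 14x14 patches + [CLS]
vision_tokens = torch.randn(B * V, S, vision_dim)

# Unfold views and concatenate their tokens for the AttentiveProjector
view_tokens = vision_tokens.view(B, V * S, vision_dim)
# video_tokens = attentive_projector(view_tokens)  # (B, Q, llm_dim), spliced at <|video|>
print(view_tokens.shape)  # torch.Size([2, 7845, 768])
```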
## Usage

```python
import torch
from your_module import load_model  # `your_module` is a placeholder for the repo's code

# Load the model
model = load_model("path/to/model", device="cuda")
model.eval()

# Prepare your video data.
# `videos` is a list of views: [view1_frames, view2_frames, ...],
# where each view contains 8 RGB frames.

messages = [
    {"role": "system", "content": "You are a visual agent for human performance analysis."},
    {"role": "user", "content": "Here are 8 frames sampled from a video: <|video_start|><|video|><|video_end|>. Given this video, analyze the proficiency level of the subject."},
]
prompt = model.processor.tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
batch = model.processor(text=[prompt], videos=[videos], return_tensors="pt", padding=True)

# Generate the analysis; mirror the repo's generate_on_test_set for the exact arguments
with torch.no_grad():
    output_ids = model.generate(**batch, max_new_tokens=128)
print(model.processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
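PyAV appears in the requirements, so frame extraction presumably happens upstream of the processor. Below is a minimal sketch of uniform 8-frame sampling per view, assuming one local .mp4 file per camera; the helper name and file paths are illustrative, not part of the released code.

```python
import av
import numpy as np

def sample_frames(path, num_frames=8):
    """Decode a video and return `num_frames` uniformly spaced RGB frames."""
    with av.open(path) as container:
        frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    idx = np.linspace(0, len(frames) - 1, num_frames).round().astype(int)
    return [frames[i] for i in idx]

# One list of 8 frames per camera view, matching the expected input layout
videos = [sample_frames(f"clip_view{v}.mp4") for v in range(5)]
```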
## Training Details

### Dataset
- Multi-activity dataset with proficiency annotations
- Activities: Basketball, Cooking, Dance, Bouldering, Soccer, Music
- Proficiency levels: Novice, Early Expert, Intermediate Expert, Late Expert
### Training Configuration
- LoRA: r=32, alpha=64, dropout=0.1
- Video Processing: 8 frames per video, 5 views
- Optimization: AdamW with cosine scheduling
- Mixed Precision: FP16 training
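This configuration maps directly onto PEFT and standard PyTorch tooling. The sketch below wires up the stated hyperparameters (r=32, alpha=64, dropout=0.1, AdamW with a cosine schedule, FP16); the target modules, learning rate, and step counts are assumptions, not values from the actual training run.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

llm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
lora_cfg = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)

optimizer = torch.optim.AdamW(llm.parameters(), lr=2e-4)             # lr is illustrative
scheduler = get_cosine_schedule_with_warmup(optimizer, 100, 10_000)  # warmup/total steps assumed
scaler = torch.cuda.amp.GradScaler()                                 # FP16 mixed precision
```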
## Performance
The model demonstrates strong performance in:
- Multi-view video understanding
- Temporal feature integration
- Cross-activity proficiency assessment
- Human performance analysis
## Files Structure

```
model/
├── llm_lora/           # LoRA adapter weights
├── tokenizer/          # Tokenizer files
├── vision_processor/   # Vision processor config
├── video_adapter.pt    # Custom video adapter weights
├── config.json         # Model configuration
└── README.md           # This file
```
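The exact wiring lives in the repository's `load_model`, but given this layout the pieces could plausibly be reassembled as follows; the local paths and class choices are assumptions.

```python
import torch
from peft import PeftModel
from transformers import (AutoImageProcessor, AutoModelForCausalLM,
                          AutoTokenizer, TimesformerModel)

llm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
llm = PeftModel.from_pretrained(llm, "model/llm_lora")            # LoRA adapter weights
tokenizer = AutoTokenizer.from_pretrained("model/tokenizer")
vision_processor = AutoImageProcessor.from_pretrained("model/vision_processor")
vision_encoder = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k600")
adapter_state = torch.load("model/video_adapter.pt", map_location="cpu")  # AttentiveProjector weights
```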
## Requirements

```
torch>=2.0.0
transformers>=4.35.0
peft>=0.6.0
av>=10.0.0
opencv-python>=4.8.0
torchvision>=0.15.0
numpy>=1.24.0
pillow>=9.5.0
```
## Citation

If you use this model, please cite:

Coming soon.
## License
This model is released under the Apache 2.0 License.
## Acknowledgments
- Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
- Vision Encoder: facebook/timesformer-base-finetuned-k600
- Built with 🤗 Transformers and PyTorch