Unmasked Teacher: Towards Training-Efficient Video Foundation Models Paper • 2303.16058 • Published Mar 28, 2023
Harvest Video Foundation Models via Efficient Post-Pretraining Paper • 2310.19554 • Published Oct 30, 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark Paper • 2311.17005 • Published Nov 28, 2023
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks Paper • 2401.14159 • Published Jan 25, 2024
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning Paper • 2201.04676 • Published Jan 12, 2022
UniFormer: Unifying Convolution and Self-attention for Visual Recognition Paper • 2201.09450 • Published Jan 24, 2022
You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction Paper • 2205.14871 • Published May 30, 2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer Paper • 2211.09552 • Published Nov 17, 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning Paper • 2212.03191 • Published Dec 6, 2022
MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration Paper • 2408.10605 • Published Aug 20, 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning Paper • 2410.19702 • Published Oct 25, 2024
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling Paper • 2501.00574 • Published Dec 31, 2024
Make Your Training Flexible: Towards Deployment-Efficient Video Models Paper • 2503.14237 • Published Mar 18, 2025
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation Paper • 2504.12626 • Published 27 days ago