---
pipeline_tag: video-text-to-text
library_name: transformers
---
# TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM
<div style='display:flex; gap: 0.25rem; '>
<a href='./TimeZero_TechReport.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>
<a href='https://huggingface.co/wwwyyy/TimeZero-Charades-7B'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-blue'></a>
</div>
### Updates
- 2025-03-17: TimeZero initial release! Code and evaluation scripts are now available.
- 2025-03-17: TimeZero achieves SOTA performance on Charades-STA!
### Overview
TimeZero is a reasoning-guided Large Vision-Language Model (LVLM) for Temporal Video Grounding (TVG): given a natural language query, it localizes the corresponding temporal segment in a video. TimeZero is trained entirely with reinforcement learning, which encourages the model to reason explicitly about video-language relationships before emitting its prediction at inference time.
Key Features:
* **Reinforcement Learning Training:** TimeZero is trained *entirely* using reinforcement learning, enhancing its ability to generate accurate temporal boundaries.
* **Test-Time Reasoning:** The model exhibits emergent reasoning capabilities during inference, generating a chain of thought to justify its segment predictions.
* **SOTA Performance:** TimeZero sets a new SOTA on the Charades-STA benchmark.
This README provides an overview of TimeZero, including setup instructions, the training process, and evaluation guidelines.
**Example:**

**Training Visualization:**

## Setup
```bash
conda create -n timezero python=3.11  # or: conda env create -f environment.yml
conda activate timezero
```
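With the environment set up, the released checkpoint can be used for inference through `transformers`. The sketch below is illustrative only: the video path and query are placeholders, it assumes the standard Qwen2.5-VL helper package `qwen-vl-utils`, and the prompt shown may differ from the exact template used during TimeZero's RL training.
```python
# Minimal inference sketch for the released checkpoint (illustrative; the exact
# prompt template used in training may differ).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # standard Qwen2.5-VL preprocessing helper

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "wwwyyy/TimeZero-Charades-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("wwwyyy/TimeZero-Charades-7B")

# Placeholder video path and grounding query.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4", "fps": 1.0},
        {"type": "text", "text": "When does the person put the book on the shelf? "
                                 "Answer with the start and end time in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], videos=video_inputs, padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```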
## Training
TimeZero training involves the following steps:
1. **Data Preprocessing:**
Download the datasets: [Charades-STA](https://github.com/jiyanggao/TALL#charades-sta-anno-download) and [ActivityNet](https://cs.stanford.edu/people/ranjaykrishna/densevid/).
Before training, you need to preprocess the video data.
```bash
bash preprocess_video.sh
```
Specify the path to the Charades-STA dataset (video files, annotations, etc.).
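For reference, the raw Charades-STA annotation files linked above are plain-text lines of the form `VIDEO_ID start end##query` (e.g. `AO8RW 0.0 6.9##a person is putting a book on a shelf.`). The sketch below parses them into simple records; the JSON layout actually expected by `train.json` / `val.json` is specific to this repository and is not reproduced here.
```python
# Parse raw Charades-STA annotation lines such as:
#   AO8RW 0.0 6.9##a person is putting a book on a shelf.
# into simple records. The JSON schema consumed by the training script is
# repo-specific; this only illustrates the raw annotation format.
def parse_charades_sta(path):
    samples = []
    with open(path) as f:
        for line in f:
            header, query = line.strip().split("##", 1)
            video_id, start, end = header.split()
            samples.append({
                "video": video_id,
                "timestamps": [float(start), float(end)],
                "query": query,
            })
    return samples
```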
2. **GRPO Training:**
```bash
cd scripts
bash run_grpo_video.sh
```
**`run_grpo_video.sh`**
```bash
#!/bin/bash
# $OUTDIR (output directory) and $WANDB_NAME (wandb run name) must be exported before running.
# --dataset_name is left as a placeholder ("xxx") in the released script.
export DEBUG_MODE="false" # Set to "true" for verbose logging during training.
export LOG_PATH="./debug_log.txt"

torchrun --nproc_per_node="4" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12361" \
    src/open_r1/grpo_video.py \
    --deepspeed scripts/zero3_offload.json \
    --output_dir $OUTDIR \
    --model_name_or_path mllm/Qwen2.5-VL-7B-Instruct \
    --preprocessed_data_path ./Charades_preprocessed_data_maxpix_3584 \
    --train_data_path ./Charades/charades_annotation/train.json \
    --eval_data_path ./Charades/charades_annotation/val.json \
    --video_folder ./Charades/Charades_v1 \
    --dataset_name xxx \
    --max_prompt_length 8192 \
    --max_completion_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 2 \
    --run_name $WANDB_NAME \
    --report_to wandb \
    --save_steps 50 \
    --save_only_model true
```
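During GRPO training, each prompt is sampled `--num_generations` times and the completions are scored with rule-based rewards whose group-normalized values drive the policy update. The sketch below illustrates the two kinds of reward used in R1-style temporal grounding setups, a format check on a `<think>`/`<answer>` template and a temporal-IoU score for the predicted segment. It is a simplified illustration only; see `src/open_r1/grpo_video.py`, the training entry point above, for the actual implementation.
```python
# Simplified GRPO-style rewards for temporal video grounding (illustrative only).
import re

def temporal_iou(pred, gt):
    """IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def format_reward(completion):
    """1.0 if the completion follows a <think>...</think><answer>...</answer> template."""
    pattern = r"(?s)\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, completion) else 0.0

def iou_reward(completion, gt_span):
    """Extract two numbers from the <answer> block and score them by temporal IoU."""
    match = re.search(r"<answer>.*?([\d.]+).*?([\d.]+).*?</answer>", completion, re.S)
    if not match:
        return 0.0
    pred = (float(match.group(1)), float(match.group(2)))
    return temporal_iou(pred, gt_span)

# Example: a well-formatted completion with a close-but-not-exact span.
completion = "<think>The action starts after the person enters.</think><answer>12.1 to 24.8 seconds</answer>"
print(format_reward(completion), iou_reward(completion, (11.0, 25.0)))
```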
## Evaluation
After training, evaluate your model's performance:
```bash
bash scripts/evaluate.sh # Use evaluate.sh for evaluation.
```
**`evaluate.sh`**
```bash
python evaluate.py --model_base <path_to_your_trained_model> --dataset <charades or activitynet>
```
> The evaluation script (`evaluate.py`) needs to be implemented to load your model, process the test data, and calculate the relevant metrics ([email protected], [email protected], [email protected], etc.).
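For reference, the sketch below shows the metric computation itself: [email protected] at IoU thresholds 0.3 / 0.5 / 0.7 (plus mean IoU) over predicted and ground-truth span pairs, using the same temporal-IoU definition as in the reward sketch above. Parsing the model's generated answers into `(start, end)` spans is repo-specific and omitted.
```python
# Compute [email protected]{0.3,0.5,0.7} and mean IoU from a list of
# ((pred_start, pred_end), (gt_start, gt_end)) pairs, in seconds.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate_spans(pairs, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(pred, gt) for pred, gt in pairs]
    metrics = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    metrics["mIoU"] = 100.0 * sum(ious) / len(ious)
    return metrics

# Example with two videos.
print(evaluate_spans([((12.1, 24.8), (11.0, 25.0)), ((3.0, 9.0), (5.0, 20.0))]))
```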
## Results
- **Charades-STA (Finetuned)**
TimeZero outperforms previous state-of-the-art methods by a large margin.
| Method                | Type | [email protected] | [email protected] | [email protected] |
| --------------------- | ---- | ------- | ------- | ------- |
| EaTR (VLP SOTA)       | VLP  | -       | 68.4    | 44.9    |
| TimeSuite (LVLM SOTA) | SFT  | 79.4    | 67.1    | 43.0    |
| TimeZero (ours)       | RL   | 83.3    | 72.5    | 47.9    |
- **ActivityNet (Finetuned)**
TimeZero surpasses previous state-of-the-art LVLMs.
| Method            | Type | [email protected] | [email protected] | [email protected] |
| ----------------- | ---- | ------- | ------- | ------- |
| EaTR (VLP SOTA)   | VLP  | -       | 58.18   | 37.64   |
| TRACE (LVLM SOTA) | SFT  | 54.0    | 37.7    | 24.0    |
| TimeZero (ours)   | RL   | 68.6    | 47.3    | 26.9    |
## Acknowledgements
We thank the authors of the following projects for their contributions:
* [TRACE](https://github.com/gyxxyg/TRACE)
* [R1-V](https://github.com/Deep-Agent/R1-V)
* [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
## Citation
```bibtex
@article{wang2025timezero,
  title={TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM},
  author={Wang, Ye and Xu, Boshen and Yue, Zihao and Xiao, Zihan and Wang, Ziheng and Zhang, Liang and Yang, Dingyi and Wang, Wenxuan and Jin, Qin},
  journal={arXiv preprint},
  year={2025}
}
```