---
license: apache-2.0
---

# VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

[Zongxia Li*](https://zli12321.github.io/), [Xiyang Wu*](https://wuxiyang1996.github.io/), [Yubin Qin](https://www.linkedin.com/in/yubin-qin/), [Guangyao Shi](https://guangyaoshi.github.io/), [Hongyang Du](https://www.linkedin.com/in/hongyangdu/), [Dinesh Manocha](https://www.cs.umd.edu/people/dmanocha), [Tianyi Zhou](https://tianyizhou.github.io/), [Jordan Lee Boyd-Graber](https://users.umiacs.umd.edu/~ying/)

[[📖 Paper](https://arxiv.org/abs/2505.01481)] [[🤗 Dataset](https://huggingface.co/datasets/zli12321/VideoHalluB)] [[🌍 Website](https://smashedpython.github.io/videohallu.github.io/)]

## 👀 About VideoHallu

With the recent success of video generation models such as [Sora](https://openai.com/sora/), [Veo2](https://veo2.ai), and [Kling](https://www.klingai.com/global/), the visual quality of generated videos has reached new heights, making evaluation more challenging and pushing it beyond traditional metrics like frame consistency, resolution, and realism. However, we find that MLLMs struggle to detect abnormalities in generated videos, which is crucial for developing reliable automatic video evaluation methods.

We introduce VideoHallu, a curated dataset of videos generated by seven video generation models, paired with a question-answer set that tests MLLMs' ability to catch abnormalities in generated videos.

We also use GRPO to train [Qwen-2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) on a subset of our dataset and show improved understanding of generated videos.

## 🔥 News
- [2025/05/02] We release our datasets on Hugging Face 🤗.

## 🔍 Dataset

To facilitate GRPO training, we also randomly sample 1,000 videos from the [PhysBench](https://huggingface.co/datasets/WeiChow/PhysBench-train) training data to first improve the model's reasoning abilities on real-world videos, then train the model on part of our synthetic videos.

Our data spans the following categories:

<img src="./images/fig1.png" style="zoom:35%;" />

## Getting Started

```bash
# Install the Hugging Face Hub CLI
pip install huggingface_hub

# Download the dataset to a local directory
huggingface-cli download IntelligenceLab/VideoHallu --repo-type dataset --local-dir ./new_video_folders --local-dir-use-symlinks False
```
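
Alternatively, the same download can be done from Python via `huggingface_hub.snapshot_download`. This is a minimal sketch; the local directory name is just an example:

```python
from huggingface_hub import snapshot_download

# Download the full dataset repository to a local folder
snapshot_download(
    repo_id="IntelligenceLab/VideoHallu",
    repo_type="dataset",
    local_dir="./new_video_folders",
)
```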

## 🏅 <a name='rb'></a>Reward Model
We use [ModernBERT](https://huggingface.co/docs/transformers/en/model_doc/modernbert) as the base model and finetune it on [MOCHA](https://arxiv.org/abs/2010.03636), [Prometheus-preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection), and [Pedants](https://arxiv.org/abs/2402.11161) to evaluate free-form text generations. We use RewardBert as the reward in GRPO finetuning.

#### Method: `compute_score`
**Parameters**
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

**Returns**
- `tuple`: A tuple of the normalized and raw scores.

```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answer = "The Frog Prince"
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
rb.compute_score(reference_answer, candidate_answer)
# (0.29113227128982544, 2.1645290851593018)
```
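
Since RewardBert serves as the reward in our GRPO finetuning, one natural way to use it is to score each sampled completion against the gold answer and take the normalized score as the scalar reward. The sketch below illustrates this idea; the wrapper function and its batch interface are assumptions for illustration, not our exact training code:

```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')

def rewardbert_rewards(reference_answer, completions):
    """Hypothetical GRPO reward: return the normalized RewardBert score
    for each sampled completion, judged against the gold answer."""
    rewards = []
    for completion in completions:
        normalized_score, _raw_score = rb.compute_score(reference_answer, completion)
        rewards.append(normalized_score)
    return rewards
```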

## Acknowledgements

We sincerely appreciate the contributions of the open-source community. The related projects are as follows: [R1-V](https://github.com/Deep-Agent/R1-V), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [Video-R1](https://github.com/tulerfeng/Video-R1), [Qwen-2.5-VL](https://arxiv.org/abs/2502.13923)

## Citations

If you find our work helpful for your research, please consider citing our work.

```bibtex
@misc{li2025videohalluevaluatingmitigatingmultimodal,
      title={VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos},
      author={Zongxia Li and Xiyang Wu and Yubin Qin and Guangyao Shi and Hongyang Du and Dinesh Manocha and Tianyi Zhou and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2505.01481},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.01481},
}

@misc{li2025surveystateartlarge,
      title={A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges},
      author={Zongxia Li and Xiyang Wu and Hongyang Du and Fuxiao Liu and Huy Nghiem and Guangyao Shi},
      year={2025},
      eprint={2501.02189},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.02189},
}

@misc{guan2024hallusionbenchadvanceddiagnosticsuite,
      title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models},
      author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
      year={2024},
      eprint={2310.14566},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2310.14566},
}

@misc{wu2024autohallusionautomaticgenerationhallucination,
      title={AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models},
      author={Xiyang Wu and Tianrui Guan and Dianqi Li and Shuaiyi Huang and Xiaoyu Liu and Xijun Wang and Ruiqi Xian and Abhinav Shrivastava and Furong Huang and Jordan Lee Boyd-Graber and Tianyi Zhou and Dinesh Manocha},
      year={2024},
      eprint={2406.10900},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.10900},
}
```