OpenGVLab/InternVL3-78B

Weiyun1025 committed 2 days ago

Commit 852965e · verified · 1 Parent(s): 4f7e3ac

Update README.md

Files changed (1)

README.md +2 -2

@@ -105,7 +105,7 @@ In this work, we use the Best-of-N evaluation strategy and employ [VisualPRM-8B]
 ### Multimodal Reasoning and Mathematics
-![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/reasoning.png)
+![image/png](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K-v1.1/resolve/main/visualprm-performance.png)
 ### OCR, Chart, and Document Understanding
@@ -161,7 +161,7 @@ The evaluation results in the Figure below shows that the model with native mult
 As shown in the table below, models fine-tuned with MPO demonstrate superior reasoning performance across seven multimodal reasoning benchmarks compared to their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the performance improvements primarily stem from the training algorithm rather than the training data.
-![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ablation-mpo.png)
+![image/png](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.2/resolve/main/ablation-mpo.png)
 ### Variable Visual Position Encoding