## CodeFuse-VLM

CodeFuse-VLM is a Multimodal LLM (MLLM) framework that provides users with multiple vision encoders, multimodal alignment adapters, and LLMs. Through the CodeFuse-VLM framework, users can customize their own MLLM models to adapt to their own tasks.

As more and more models are published to the Hugging Face community, there will be more open-source vision encoders and LLMs. Each of these models has its own specialty; for example, Code Llama is good at code-related tasks but performs poorly on Chinese tasks. Therefore, we built the CodeFuse-VLM framework to support multiple vision encoders, multimodal alignment adapters, and LLMs so that it can adapt to different types of tasks.

<p align="center">
<img src="./CodeFuse-VLM-arch.png" width="50%" />
</p>

Under the CodeFuse-VLM framework, we use a cross-attention multimodal adapter, the Qwen-14B LLM, and Qwen-VL's vision encoder to train our CodeFuse-VLM-14B model. On multiple benchmarks, CodeFuse-VLM-14B shows superior performance over Qwen-VL and LLaVA-1.5.

<p align="center">
<img src="./CodeFuse-VLM-14B-performance.png" width="50%" />
</p>

Here is a table of different MLLM models' performance on these benchmarks:

| Model | MMBench | MMBench-CN | VQAv2 | GQA | TextVQA | VizWiz |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| LLaVA-1.5 | 67.7 | 63.6 | 80.0 | 63.3 | 61.3 | 53.6 |
| Qwen-VL | 60.6 | 56.7 | 78.2 | 57.5 | 63.8 | 38.9 |
| CodeFuse-VLM-14B | 75.7 | 69.8 | 79.3 | 59.4 | 63.9 | 45.3 |

## Contents
- [Install](#install)
- [Datasets](#datasets)
- [Multimodal Alignment](#multimodal-alignment)
- [Visual Instruction Tuning](#visual-instruction-tuning)
- [Evaluation](#evaluation)

## Install

Please run `sh init_env.sh` to set up the environment.
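For example (assuming you run it from the repository root):

```bash
# Set up the environment for CodeFuse-VLM
sh init_env.sh
```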
## Datasets

Here is the table of datasets we used to train CodeFuse-VLM-14B:

| Dataset | Task Type | Number of Samples |
| ------------- | ------------- | ------------- |
| synthdog-en | OCR | 800,000 |
| synthdog-zh | OCR | 800,000 |
| cc3m (downsampled) | Image Captioning | 600,000 |
| SBU | Image Captioning | 850,000 |
| Visual Genome VQA (downsampled) | Visual Question Answering (VQA) | 500,000 |
| Visual Genome Region Descriptions (downsampled) | Reference Grounding | 500,000 |
| Visual Genome Objects (downsampled) | Grounded Captioning | 500,000 |
| OCR VQA (downsampled) | OCR and VQA | 500,000 |

Please download these datasets from their official websites.

## Multimodal Alignment

Please run `sh scripts/pretrain.sh` for single-node training, or `sh scripts/pretrain_multinode.sh` for multi-node training.
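For example:

```bash
# Multimodal alignment (pre-training) on a single node
sh scripts/pretrain.sh

# Or launch the multi-node version
sh scripts/pretrain_multinode.sh
```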
## Visual Instruction Tuning

Please run `sh scripts/finetune.sh` for single-node training, or `sh scripts/finetune_multinode.sh` for multi-node training.
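For example:

```bash
# Visual instruction tuning (fine-tuning) on a single node
sh scripts/finetune.sh

# Or launch the multi-node version
sh scripts/finetune_multinode.sh
```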
## Evaluation

Please run the Python scripts in the `llava/eval/` directory.
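For example (the script name below is a placeholder; pick the evaluation script under `llava/eval/` that matches the benchmark you want to run):

```bash
# Run an evaluation script from llava/eval/;
# replace <eval_script> with the actual script for your benchmark
python llava/eval/<eval_script>.py
```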