lbourdois committed · Commit 5658c46 · verified · 1 Parent(s): 31a6193

Improve language tag

Hi! Since the model is multilingual, this PR adds languages other than English to the `language` tag to improve referencing. Note that the README announces 29 languages but explicitly lists only 13, so I was only able to add those 13.
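
To make the tag change easier to review, here is a minimal sketch that reads the `language:` list out of a model card's front matter and maps each three-letter ISO 639-3 code to its two-letter ISO 639-1 equivalent. The helper `language_tags`, the `FRONT_MATTER` sample, and the `ISO_639_3_TO_1` table are hypothetical names introduced only for illustration; they are not part of the Open-Qwen2VL repository or of any Hub tooling. The parser handles only the flat list layout used in this card, not general YAML.

```python
# Hypothetical review helper: not part of the Open-Qwen2VL repo or Hub tooling.
# It understands only a flat top-level `language:` list, not general YAML.

# Front matter as proposed in this PR (base_model/datasets keys omitted for brevity).
FRONT_MATTER = """\
---
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
license: cc
pipeline_tag: image-text-to-text
---
"""

# ISO 639-3 (three-letter) to ISO 639-1 (two-letter) codes for the 13 tags.
ISO_639_3_TO_1 = {
    "zho": "zh", "eng": "en", "fra": "fr", "spa": "es", "por": "pt",
    "deu": "de", "ita": "it", "rus": "ru", "jpn": "ja", "kor": "ko",
    "vie": "vi", "tha": "th", "ara": "ar",
}


def language_tags(card_text: str) -> list[str]:
    """Return the entries of the top-level `language:` list in the front matter."""
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter block at the top of the card
    tags = []
    in_language = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter of the front matter
            break
        if in_language and stripped.startswith("- "):
            tags.append(stripped[2:].strip())
        else:
            # Any other key ends the language list; `language:` starts it.
            in_language = stripped == "language:"
    return tags


tags = language_tags(FRONT_MATTER)
print(len(tags), tags)
print([ISO_639_3_TO_1[tag] for tag in tags])
```

The previous card declared only `en`; the two-letter column makes it easy to confirm that English is still present (as `eng`) among the 13 tags.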

Files changed (1)
  1. README.md +89 -77
README.md CHANGED
@@ -1,77 +1,89 @@
- ---
- base_model:
- - Qwen/Qwen2.5-1.5B-Instruct
- - google/siglip-so400m-patch14-384
- datasets:
- - weizhiwang/Open-Qwen2VL-Data
- - MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
- language:
- - en
- license: cc
- pipeline_tag: image-text-to-text
- ---
-
- # Model Card for Open-Qwen2VL
-
- Open-Qwen2VL is a multimodal model that takes images and text as input and produces text as output. This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595). The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL).
-
- ## Updates
- - [4/1/2025] The codebase, model, data, and paper are released.
-
- <!-- ## Model Details -->
-
- ## How to Use
-
- Please firstly install Open-Qwen2VL via
- ```
- pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
- ```
-
- You can load the model and perform inference as follows:
- ```python
- import requests
- import torch
- from PIL import Image
- from prismatic import load
-
- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
-
- # Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
- vlm = load("Open-Qwen2VL")
- vlm.to(device, dtype=torch.bfloat16)
-
- # Download an image and specify a prompt
- image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
- # image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
- image = [vlm.vision_backbone.image_transform(Image.open(requests.get(image_url, stream=True).raw).convert("RGB")).unsqueeze(0)]
- user_prompt = "<image>\nDescribe the image."
-
- # Generate!
- generated_text = vlm.generate_batch(
-     image,
-     [user_prompt],
-     do_sample=False,
-     max_new_tokens=512,
-     min_length=1,
- )
- print(generated_text[0])
- ```
- The image caption results look like:
- ```
- The image depicts a blue and orange bus parked on the side of a street. ...
- ```
-
-
- ## Acknowledgement
- This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487
-
- ## Citation
- ```bibtex
- @article{Open-Qwen2VL,
-   title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
-   author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
-   journal={arXiv preprint arXiv:2504.00595},
-   year={2025}
- }
- ```
-
+ ---
+ base_model:
+ - Qwen/Qwen2.5-1.5B-Instruct
+ - google/siglip-so400m-patch14-384
+ datasets:
+ - weizhiwang/Open-Qwen2VL-Data
+ - MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ license: cc
+ pipeline_tag: image-text-to-text
+ ---
+
+ # Model Card for Open-Qwen2VL
+
+ Open-Qwen2VL is a multimodal model that takes images and text as input and produces text as output. This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595). The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL).
+
+ ## Updates
+ - [4/1/2025] The codebase, model, data, and paper are released.
+
+ <!-- ## Model Details -->
+
+ ## How to Use
+
+ Please firstly install Open-Qwen2VL via
+ ```
+ pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
+ ```
+
+ You can load the model and perform inference as follows:
+ ```python
+ import requests
+ import torch
+ from PIL import Image
+ from prismatic import load
+
+ device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+
+ # Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
+ vlm = load("Open-Qwen2VL")
+ vlm.to(device, dtype=torch.bfloat16)
+
+ # Download an image and specify a prompt
+ image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
+ # image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+ image = [vlm.vision_backbone.image_transform(Image.open(requests.get(image_url, stream=True).raw).convert("RGB")).unsqueeze(0)]
+ user_prompt = "<image>\nDescribe the image."
+
+ # Generate!
+ generated_text = vlm.generate_batch(
+     image,
+     [user_prompt],
+     do_sample=False,
+     max_new_tokens=512,
+     min_length=1,
+ )
+ print(generated_text[0])
+ ```
+ The image caption results look like:
+ ```
+ The image depicts a blue and orange bus parked on the side of a street. ...
+ ```
+
+
+ ## Acknowledgement
+ This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487
+
+ ## Citation
+ ```bibtex
+ @article{Open-Qwen2VL,
+   title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
+   author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
+   journal={arXiv preprint arXiv:2504.00595},
+   year={2025}
+ }
+ ```
+