weizhiwang/Open-Qwen2VL

Improve language tag

by lbourdois - opened Apr 28

base: refs/heads/main ← from: refs/pr/4

README.md +89 -77

@@ -1,77 +1,89 @@
----
-base_model:
-- Qwen/Qwen2.5-1.5B-Instruct
-- google/siglip-so400m-patch14-384
-datasets:
-- weizhiwang/Open-Qwen2VL-Data
-- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
-language:
-- en
-license: cc
-pipeline_tag: image-text-to-text
----
-# Model Card for Open-Qwen2VL
-Open-Qwen2VL is a multimodal model that takes images and text as input and produces text as output.  This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595).  The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL).
-## Updates
-- [4/1/2025] The codebase, model, data, and paper are released.
-<!-- ## Model Details -->
-## How to Use
-Please firstly install Open-Qwen2VL via
-```
-pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
-```
-You can load the model and perform inference as follows:
-```python
-import requests
-import torch
-from PIL import Image
-from prismatic import load
-device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
-# Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
-vlm = load("Open-Qwen2VL")
-vlm.to(device, dtype=torch.bfloat16)
-# Download an image and specify a prompt
-image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
-# image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
-image = [vlm.vision_backbone.image_transform(Image.open(requests.get(image_url, stream=True).raw).convert("RGB")).unsqueeze(0)]
-user_prompt = "<image>\nDescribe the image."
-# Generate!
-generated_text = vlm.generate_batch(
-    image,
-    [user_prompt],
-    do_sample=False,
-    max_new_tokens=512,
-    min_length=1,
-)
-print(generated_text[0])
-```
-The image caption results look like:
-```
-The image depicts a blue and orange bus parked on the side of a street. ...
-```
-## Acknowledgement
-This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487
-## Citation
-```bibtex
-@article{Open-Qwen2VL,
-    title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
-    author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
-    journal={arXiv preprint arXiv:2504.00595},
-    year={2025}
-  }
-```

+---
+base_model:
+- Qwen/Qwen2.5-1.5B-Instruct
+- google/siglip-so400m-patch14-384
+datasets:
+- weizhiwang/Open-Qwen2VL-Data
+- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
+language:
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
+license: cc
+pipeline_tag: image-text-to-text
+---
+# Model Card for Open-Qwen2VL
+Open-Qwen2VL is a multimodal model that takes images and text as input and produces text as output.  This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595).  The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL).
+## Updates
+- [4/1/2025] The codebase, model, data, and paper are released.
+<!-- ## Model Details -->
+## How to Use
+Please firstly install Open-Qwen2VL via
+```
+pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
+```
+You can load the model and perform inference as follows:
+```python
+import requests
+import torch
+from PIL import Image
+from prismatic import load
+device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+# Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
+vlm = load("Open-Qwen2VL")
+vlm.to(device, dtype=torch.bfloat16)
+# Download an image and specify a prompt
+image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
+# image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+image = [vlm.vision_backbone.image_transform(Image.open(requests.get(image_url, stream=True).raw).convert("RGB")).unsqueeze(0)]
+user_prompt = "<image>\nDescribe the image."
+# Generate!
+generated_text = vlm.generate_batch(
+    image,
+    [user_prompt],
+    do_sample=False,
+    max_new_tokens=512,
+    min_length=1,
+)
+print(generated_text[0])
+```
+The image caption results look like:
+```
+The image depicts a blue and orange bus parked on the side of a street. ...
+```
+## Acknowledgement
+This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487
+## Citation
+```bibtex
+@article{Open-Qwen2VL,
+    title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
+    author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
+    journal={arXiv preprint arXiv:2504.00595},
+    year={2025}
+  }
+```
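
The visible difference between the two sides of this diff is the `language` list in the front matter, which goes from `en` to thirteen three-letter codes. The sketch below is one way to confirm that by extracting the list from each version of the card. The helper name `front_matter_list` is hypothetical, and the parser is deliberately simplified: it only handles flat, top-level lists like the ones in this README and is not a general YAML parser. A real check would more likely use a YAML library.

```python
def front_matter_list(card_text: str, key: str) -> list[str]:
    """Return the items of a top-level list `key` in a model card's front matter.

    Simplified on purpose: expects the card to start with '---' and the list
    items to be written as unindented '- value' lines, as in this README.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter block
    items, in_key = [], False
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter of the front matter
            break
        if in_key and line.startswith("- "):
            items.append(line[2:].strip())
        else:
            # Any other line either opens the list we want or ends it.
            in_key = line.strip() == f"{key}:"
    return items


# Front matter copied from the removed (-) and added (+) sides of the diff,
# trimmed to the keys around `language`.
OLD_CARD = """---
language:
- en
license: cc
pipeline_tag: image-text-to-text
---
"""

NEW_CARD = """---
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
license: cc
pipeline_tag: image-text-to-text
---
"""

old_langs = front_matter_list(OLD_CARD, "language")
new_langs = front_matter_list(NEW_CARD, "language")
print("before:", old_langs)
print("after: ", new_langs)
print("net added entries:", len(new_langs) - len(old_langs))
```

The net count of added entries matches the gap between the `+89` and `-77` line totals reported for README.md.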