sibthinon committed
Commit 1c7f7cc · 1 Parent(s): 1527d73

Add visual_bge

Files changed (30)
  1. visual_bge/README.md +0 -181
  2. visual_bge/__init__.py +0 -1
  3. visual_bge/{visual_bge/eva_clip → eva_clip}/__init__.py +0 -0
  4. visual_bge/{visual_bge/eva_clip → eva_clip}/bpe_simple_vocab_16e6.txt.gz +0 -0
  5. visual_bge/{visual_bge/eva_clip → eva_clip}/constants.py +0 -0
  6. visual_bge/{visual_bge/eva_clip → eva_clip}/eva_vit_model.py +0 -0
  7. visual_bge/{visual_bge/eva_clip → eva_clip}/factory.py +0 -0
  8. visual_bge/{visual_bge/eva_clip → eva_clip}/hf_configs.py +0 -0
  9. visual_bge/{visual_bge/eva_clip → eva_clip}/hf_model.py +0 -0
  10. visual_bge/{visual_bge/eva_clip → eva_clip}/loss.py +0 -0
  11. visual_bge/{visual_bge/eva_clip → eva_clip}/model.py +0 -0
  12. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA01-CLIP-B-16.json +0 -0
  13. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA01-CLIP-g-14-plus.json +0 -0
  14. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA01-CLIP-g-14.json +0 -0
  15. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-B-16.json +0 -0
  16. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-L-14-336.json +0 -0
  17. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-L-14.json +0 -0
  18. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-bigE-14-plus.json +0 -0
  19. visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-bigE-14.json +0 -0
  20. visual_bge/{visual_bge/eva_clip → eva_clip}/modified_resnet.py +0 -0
  21. visual_bge/{visual_bge/eva_clip → eva_clip}/openai.py +0 -0
  22. visual_bge/{visual_bge/eva_clip → eva_clip}/pretrained.py +0 -0
  23. visual_bge/{visual_bge/eva_clip → eva_clip}/rope.py +0 -0
  24. visual_bge/{visual_bge/eva_clip → eva_clip}/timm_model.py +0 -0
  25. visual_bge/{visual_bge/eva_clip → eva_clip}/tokenizer.py +0 -0
  26. visual_bge/{visual_bge/eva_clip → eva_clip}/transform.py +0 -0
  27. visual_bge/{visual_bge/eva_clip → eva_clip}/transformer.py +0 -0
  28. visual_bge/{visual_bge/eva_clip → eva_clip}/utils.py +0 -0
  29. visual_bge/{visual_bge/modeling.py → modeling.py} +0 -0
  30. visual_bge/setup.py +0 -18
visual_bge/README.md DELETED
@@ -1,181 +0,0 @@
- <h1 align="center">Visualized BGE</h1>
-
- <p align="center">
- <a href="https://arxiv.org/abs/2406.04292">
- <img alt="Build" src="http://img.shields.io/badge/cs.CV-arXiv%3A2406.04292-B31B1B.svg">
- </a>
- <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/research/visual_bge">
- <img alt="Build" src="https://img.shields.io/badge/Github-VISTA Code-blue">
- </a>
- <a href="https://huggingface.co/BAAI/bge-visualized">
- <img alt="Build" src="https://img.shields.io/badge/🤗 Model-VISTA Model-yellow">
- </a>
- </p>
-
- <p align="center">
- <a href="https://huggingface.co/datasets/JUNJIE99/VISTA_S2">
- <img alt="Build" src="https://img.shields.io/badge/🤗 Dataset-VISTA S2 Training Dataset-yellow">
- </a>
- <a href="https://huggingface.co/datasets/JUNJIE99/VISTA_Evaluation">
- <img alt="Build" src="https://img.shields.io/badge/🤗 Dataset-Zero_Shot Multimodal Retrieval Dataset-yellow">
- </a>
- </p>
-
- ## 🔔 News
- **[2024.8.27] The core code for the evaluation and fine-tuning of VISTA can be obtained from [this link](https://github.com/JUNJIE99/VISTA_Evaluation_FineTuning). This includes Stage-2 training, downstream task fine-tuning, and the datasets we used for evaluation.**
-
- **[2024.6.13] We have released the [VISTA-S2 dataset](https://huggingface.co/datasets/JUNJIE99/VISTA_S2), a hybrid multi-modal dataset consisting of over 500,000 instances for multi-modal training (Stage-2 training in our paper).**
-
- **[2024.6.7] We have released our paper. [arXiv link](https://arxiv.org/abs/2406.04292)**
-
- **[2024.3.18] We have released our code and model.**
-
- ## Introduction
- In this project, we introduce Visualized-BGE, a universal multi-modal embedding model. By incorporating image token embeddings into the BGE text embedding framework, Visualized-BGE gains the flexibility to process multi-modal data beyond text alone. Visualized-BGE is mainly used for hybrid modal retrieval tasks, including but not limited to:
-
- - Multi-Modal Knowledge Retrieval (query: text; candidate: image-text pairs, text, or image), e.g. [WebQA](https://github.com/WebQnA/WebQA)
- - Composed Image Retrieval (query: image-text pair; candidate: images), e.g. [CIRR](https://github.com/Cuberick-Orion/CIRR), [FashionIQ](https://github.com/XiaoxiaoGuo/fashion-iq)
- - Knowledge Retrieval with Multi-Modal Queries (query: image-text pair; candidate: texts), e.g. [ReMuQ](https://github.com/luomancs/ReMuQ)
-
- Moreover, Visualized BGE fully preserves the strong text embedding capabilities of the original BGE model. :)
-
- ## Specs
- ### Model
- | **Model Name** | **Dimension** | **Text Embedding Model** | **Language** | **Weight** |
- | --- | --- | --- | --- | --- |
- | BAAI/bge-visualized-base-en-v1.5 | 768 | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [🤗 HF link](https://huggingface.co/BAAI/bge-visualized/blob/main/Visualized_base_en_v1.5.pth) |
- | BAAI/bge-visualized-m3 | 1024 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | [🤗 HF link](https://huggingface.co/BAAI/bge-visualized/blob/main/Visualized_m3.pth) |
-
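As a quick check of the dimensions listed above, the minimal sketch below loads both released checkpoints and prints the shape of a text-only embedding; the weight paths are placeholders for your local copies of the `.pth` files:

```python
import torch
from visual_bge.modeling import Visualized_BGE

# Placeholder paths: point these at the downloaded Visualized_base_en_v1.5.pth / Visualized_m3.pth files.
model_en = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5", model_weight="Visualized_base_en_v1.5.pth")
model_m3 = Visualized_BGE(model_name_bge="BAAI/bge-m3", model_weight="Visualized_m3.pth")
model_en.eval()
model_m3.eval()

with torch.no_grad():
    emb_en = model_en.encode(text="hello world")
    emb_m3 = model_m3.encode(text="hello world")

print(emb_en.shape)  # expected: torch.Size([1, 768])
print(emb_m3.shape)  # expected: torch.Size([1, 1024])
```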
- ### Data
- We have generated a hybrid multi-modal dataset consisting of over 500,000 instances for multi-modal training (Stage-2 training in our paper). You can download our dataset from this [🤗 HF link](https://huggingface.co/datasets/JUNJIE99/VISTA_S2).
- Unpack the compressed image archive with the following commands:
-
- ```bash
- cat images.tar.part* > images.tar
- tar -xvf images.tar
- ```
- If you obtain the following directory structure, you can then use the annotation information (JSON files) for your own training:
- ```
- images
- |__coco
- |__edit_image
- ```
-
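After extraction, a short sketch like the one below can confirm the layout before training; the annotation file name is illustrative only, so substitute the JSON files shipped with the dataset:

```python
import json
from pathlib import Path

# Count the extracted images in each sub-directory.
for sub in ("coco", "edit_image"):
    n_files = sum(1 for _ in (Path("images") / sub).iterdir())
    print(f"images/{sub}: {n_files} files")

# Illustrative only: load one of the dataset's annotation JSON files (replace with the real file name).
with open("vista_s2_annotations.json", "r", encoding="utf-8") as f:
    annotations = json.load(f)
print(f"Loaded {len(annotations)} annotation entries")
```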
- ## Usage
- ### Installation
- #### Install FlagEmbedding:
- ```
- git clone https://github.com/FlagOpen/FlagEmbedding.git
- cd FlagEmbedding/research/visual_bge
- pip install -e .
- ```
- #### Other Core Packages:
- ```
- pip install torchvision timm einops ftfy
- ```
- You don't need to install `xformers` and `apex`. They are not essential for inference and can often cause issues.
-
- ### Generate Embeddings for Multi-Modal Data
- Visualized-BGE can encode multi-modal data in a variety of formats, whether purely text, solely image-based, or a combination of both.
-
- > **Note:** Please download the model weight file ([bge-visualized-base-en-v1.5](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth?download=true), [bge-visualized-m3](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_m3.pth?download=true)) in advance and pass its path to the `model_weight` parameter.
-
- - Composed Image Retrieval
- ``` python
- ####### Use Visualized BGE for composed image retrieval
- import torch
- from visual_bge.modeling import Visualized_BGE
-
- model = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5", model_weight="path: Visualized_base_en_v1.5.pth")  # path to the downloaded weight file
- model.eval()
- with torch.no_grad():
-     query_emb = model.encode(image="./imgs/cir_query.png", text="Make the background dark, as if the camera has taken the photo at night")
-     candi_emb_1 = model.encode(image="./imgs/cir_candi_1.png")
-     candi_emb_2 = model.encode(image="./imgs/cir_candi_2.png")
-
- sim_1 = query_emb @ candi_emb_1.T
- sim_2 = query_emb @ candi_emb_2.T
- print(sim_1, sim_2)  # tensor([[0.8750]]) tensor([[0.7816]])
- ```
-
- - Multi-Modal Knowledge Retrieval
- ``` python
- ####### Use Visualized BGE for multi-modal knowledge retrieval
- import torch
- from visual_bge.modeling import Visualized_BGE
-
- model = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5", model_weight="path: Visualized_base_en_v1.5.pth")  # path to the downloaded weight file
- model.eval()
- with torch.no_grad():
-     query_emb = model.encode(text="Are there sidewalks on both sides of the Mid-Hudson Bridge?")
-     candi_emb_1 = model.encode(text="The Mid-Hudson Bridge, spanning the Hudson River between Poughkeepsie and Highland.", image="./imgs/wiki_candi_1.jpg")
-     candi_emb_2 = model.encode(text="Golden_Gate_Bridge", image="./imgs/wiki_candi_2.jpg")
-     candi_emb_3 = model.encode(text="The Mid-Hudson Bridge was designated as a New York State Historic Civil Engineering Landmark by the American Society of Civil Engineers in 1983. The bridge was renamed the \"Franklin Delano Roosevelt Mid-Hudson Bridge\" in 1994.")
-
- sim_1 = query_emb @ candi_emb_1.T
- sim_2 = query_emb @ candi_emb_2.T
- sim_3 = query_emb @ candi_emb_3.T
- print(sim_1, sim_2, sim_3)  # tensor([[0.6932]]) tensor([[0.4441]]) tensor([[0.6415]])
- ```
-
- - Multilingual Multi-Modal Retrieval
- ``` python
- ##### Use the M3 variant for multilingual multi-modal retrieval
- import torch
- from visual_bge.modeling import Visualized_BGE
-
- model = Visualized_BGE(model_name_bge="BAAI/bge-m3", model_weight="path: Visualized_m3.pth")  # path to the downloaded weight file
- model.eval()
- with torch.no_grad():
-     query_emb = model.encode(image="./imgs/cir_query.png", text="一匹马牵着这辆车")  # "A horse is pulling this cart"
-     candi_emb_1 = model.encode(image="./imgs/cir_candi_1.png")
-     candi_emb_2 = model.encode(image="./imgs/cir_candi_2.png")
-
- sim_1 = query_emb @ candi_emb_1.T
- sim_2 = query_emb @ candi_emb_2.T
- print(sim_1, sim_2)  # tensor([[0.7026]]) tensor([[0.8075]])
- ```
- ## Downstream Application Cases
- - [HuixiangDou](https://github.com/InternLM/HuixiangDou): Using Visualized BGE for a group chat assistant.
-
- ## Evaluation Results
- Visualized BGE delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning on hybrid modal retrieval tasks.
- #### Zero-shot Performance
- - Statistical information of the zero-shot multi-modal retrieval benchmark datasets. During the zero-shot evaluation, we use the queries from the validation or test set of each dataset to perform retrieval over the entire corpus of the respective dataset.
- ![Statistical information for the zero-shot multi-modal retrieval benchmark datasets.](./imgs/zs-benchmark.png)
-
- - Zero-shot evaluation results (Recall@5) on various hybrid multi-modal retrieval benchmarks. The -MM suffix indicates baseline models that have undergone multi-modal training on our generated data.
- ![Zero-shot evaluation results with Recall@5 on various hybrid multi-modal retrieval benchmarks.](./imgs/zs-performance.png)
-
- #### Fine-tuning on Downstream Tasks
- - Supervised fine-tuning performance on the WebQA dataset. All retrievals are performed over the entire deduplicated corpus.
- ![Supervised fine-tuning performance on the WebQA dataset.](./imgs/SFT-WebQA.png)
- - Supervised fine-tuning performance on the CIRR test set.
- ![Supervised fine-tuning performance on the CIRR test set.](./imgs/SFT-CIRR.png)
- - Supervised fine-tuning performance on the ReMuQ test set.
- ![Supervised fine-tuning performance on the ReMuQ test set.](./imgs/SFT-ReMuQ.png)
-
- ## FAQ
-
- **Q1: Can Visualized BGE be used for cross-modal retrieval (text to image)?**
-
- A1: While it is technically possible, this is not the recommended use case. Our model focuses on augmenting hybrid modal retrieval tasks with visual capabilities.
-
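For reference, a text-to-image query uses the same `encode` API as the examples above; the query text below is illustrative and the candidate images are reused from the composed-image-retrieval example:

```python
import torch
from visual_bge.modeling import Visualized_BGE

model = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5", model_weight="path: Visualized_base_en_v1.5.pth")  # path to the downloaded weight file
model.eval()
with torch.no_grad():
    query_emb = model.encode(text="a photo taken at night")      # text-only query (illustrative)
    candi_emb_1 = model.encode(image="./imgs/cir_candi_1.png")   # image-only candidates
    candi_emb_2 = model.encode(image="./imgs/cir_candi_2.png")

print(query_emb @ candi_emb_1.T, query_emb @ candi_emb_2.T)
```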
- ## Acknowledgement
- The image token embedding model in this project is built upon the foundations laid by [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP).
-
- ## Citation
- If you find this repository useful, please consider giving it a star ⭐ and a citation:
- ```
- @article{zhou2024vista,
-   title={VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval},
-   author={Zhou, Junjie and Liu, Zheng and Xiao, Shitao and Zhao, Bo and Xiong, Yongping},
-   journal={arXiv preprint arXiv:2406.04292},
-   year={2024}
- }
- ```
visual_bge/__init__.py DELETED
@@ -1 +0,0 @@
- from .modeling import Visualized_BGE
 
 
visual_bge/{visual_bge/eva_clip → eva_clip}/__init__.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/bpe_simple_vocab_16e6.txt.gz RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/constants.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/eva_vit_model.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/factory.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/hf_configs.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/hf_model.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/loss.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA01-CLIP-B-16.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA01-CLIP-g-14-plus.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA01-CLIP-g-14.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-B-16.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-L-14-336.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-L-14.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-bigE-14-plus.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/model_configs/EVA02-CLIP-bigE-14.json RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/modified_resnet.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/openai.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/pretrained.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/rope.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/timm_model.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/tokenizer.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/transform.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/transformer.py RENAMED
File without changes
visual_bge/{visual_bge/eva_clip → eva_clip}/utils.py RENAMED
File without changes
visual_bge/{visual_bge/modeling.py → modeling.py} RENAMED
File without changes
visual_bge/setup.py DELETED
@@ -1,18 +0,0 @@
- from setuptools import setup, find_packages
-
- setup(
-     name="visual_bge",
-     version="0.1.0",
-     description='visual_bge',
-     long_description="./README.md",
-     long_description_content_type="text/markdown",
-     url='https://github.com/FlagOpen/FlagEmbedding/tree/master/research/visual_bge',
-     packages=find_packages(),
-     install_requires=[
-         'torchvision',
-         'timm',
-         'einops',
-         'ftfy'
-     ],
-     python_requires='>=3.6',
- )