# GLIGEN: Open-Set Grounded Text-to-Image Generation

These scripts contain the code to prepare the grounding data and train the GLIGEN model on the COCO dataset.
### Install the requirements

```bash
conda create -n diffusers python=3.10
conda activate diffusers
pip install -r requirements.txt
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```

Or, for a default Accelerate configuration without answering questions about your environment:

```bash
accelerate config default
```

Or, if your environment doesn't support an interactive shell (e.g., a notebook):

```python
from accelerate.utils import write_basic_config
write_basic_config()
```
### Prepare the training data

If you want to make your own grounding data, a few extra dependencies are required.
I used [RAM](https://github.com/xinyu1205/recognize-anything) to tag
images, [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) to detect objects,
and [BLIP2](https://huggingface.co/docs/transformers/en/model_doc/blip-2) to caption instances.
Only RAM needs to be installed manually:

```bash
pip install git+https://github.com/xinyu1205/recognize-anything.git --no-deps
```
Download the pre-trained models:

```bash
huggingface-cli download --resume-download xinyu1205/recognize_anything_model ram_swin_large_14m.pth
huggingface-cli download --resume-download IDEA-Research/grounding-dino-base
huggingface-cli download --resume-download Salesforce/blip2-flan-t5-xxl
huggingface-cli download --resume-download openai/clip-vit-large-patch14
huggingface-cli download --resume-download masterful/gligen-1-4-generation-text-box
```
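Roughly, `make_datasets.py` chains the three models as in the sketch below. This is not the exact script: the model IDs are the ones downloaded above, but the thresholds and the Grounding DINO prompt format are illustrative assumptions.

```python
import torch
from PIL import Image
from ram import get_transform, inference_ram
from ram.models import ram
from transformers import (AutoModelForZeroShotObjectDetection, AutoProcessor,
                          Blip2ForConditionalGeneration, Blip2Processor)

device = "cuda"
image = Image.open("example.jpg").convert("RGB")

# 1. RAM tags the whole image with open-set labels.
ram_model = ram(pretrained="ram_swin_large_14m.pth",
                image_size=384, vit="swin_l").eval().to(device)
image_tensor = get_transform(image_size=384)(image).unsqueeze(0).to(device)
tags, _ = inference_ram(image_tensor, ram_model)  # e.g. "car | truck | sky"

# 2. Grounding DINO localizes each tag (it expects lowercase,
#    period-separated phrases as the text prompt).
dino_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
dino = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base").to(device)
text = ". ".join(t.strip() for t in tags.split("|")) + "."
inputs = dino_processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = dino(**inputs)
detections = dino_processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.4, text_threshold=0.3,
    target_sizes=[image.size[::-1]])[0]

# 3. BLIP-2 captions each detected instance crop.
blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
blip = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16).to(device)
for box in detections["boxes"]:  # absolute xyxy; normalize by image size for GLIGEN
    crop = image.crop(tuple(box.tolist()))
    blip_inputs = blip_processor(images=crop, return_tensors="pt").to(device, torch.float16)
    caption = blip_processor.decode(blip.generate(**blip_inputs)[0],
                                    skip_special_tokens=True)
```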
Make the training data on 8 GPUs:

```bash
torchrun --master_port 17673 --nproc_per_node=8 make_datasets.py \
    --data_root /mnt/workspace/workgroup/zhizhonghuang/dataset/COCO/train2017 \
    --save_root /root/gligen_data \
    --ram_checkpoint /root/.cache/huggingface/hub/models--xinyu1205--recognize_anything_model/snapshots/ebc52dc741e86466202a5ab8ab22eae6e7d48bf1/ram_swin_large_14m.pth
```
Alternatively, you can download the prepared COCO training data:

```bash
huggingface-cli download --resume-download Hzzone/GLIGEN_COCO coco_train2017.pth
```
It has the following format:

```json
[
    ...
    {
        'file_path': Path,
        'annos': [
            {
                'caption': Instance Caption,
                'bbox': bbox in xyxy,
                'text_embeddings_before_projection': CLIP text embedding before linear projection
            }
        ]
    }
    ...
]
```
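A minimal sketch of loading and inspecting this file follows; reproducing `text_embeddings_before_projection` from CLIP's pooled output (taken before the `text_projection` layer) is my assumption based on the key name.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Load the prepared annotations and look at one record.
data = torch.load("coco_train2017.pth")
sample = data[0]
print(sample["file_path"], len(sample["annos"]))
anno = sample["annos"][0]
print(anno["caption"], anno["bbox"])

# Assumption: the stored embedding is the pooled CLIP text output,
# i.e. before the final linear text_projection layer.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
with torch.no_grad():
    tokens = tokenizer(anno["caption"], padding="max_length", truncation=True,
                       return_tensors="pt")
    embedding = text_encoder(**tokens).pooler_output  # shape: (1, 768)
```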
### Training commands

The training script is heavily based on
https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py

```bash
accelerate launch train_gligen_text.py \
    --data_path /root/data/zhizhonghuang/coco_train2017.pth \
    --image_path /mnt/workspace/workgroup/zhizhonghuang/dataset/COCO/train2017 \
    --train_batch_size 8 \
    --max_train_steps 100000 \
    --checkpointing_steps 1000 \
    --checkpoints_total_limit 10 \
    --learning_rate 5e-5 \
    --dataloader_num_workers 16 \
    --mixed_precision fp16 \
    --report_to wandb \
    --tracker_project_name gligen \
    --output_dir /root/data/zhizhonghuang/ckpt/GLIGEN_Text_Retrain_COCO
```
I trained the model on 8 A100 GPUs (at least 24GB of GPU memory is required) for about 11 hours. The generated images
start to follow the layout at around 50k iterations.

Note that although the pre-trained GLIGEN model is loaded, the parameters of `fuser` and `position_net` are reset (see line 420 in `train_gligen_text.py`).
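A minimal sketch of what that reset amounts to (it mirrors the idea, not the exact code in `train_gligen_text.py`): after loading the pre-trained GLIGEN UNet, re-initialize the grounding-specific modules.

```python
import torch.nn as nn
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", subfolder="unet")

def reset_parameters(module: nn.Module):
    # Re-initialize every submodule that defines its own reset_parameters().
    for m in module.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()

# position_net embeds the grounding inputs (phrase embedding + box coordinates);
# the fuser layers are the gated self-attention blocks GLIGEN inserts between
# the UNet's self- and cross-attention.
reset_parameters(unet.position_net)
for name, module in unet.named_modules():
    if name.endswith("fuser"):
        reset_parameters(module)
```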
The trained model can be downloaded with:

```bash
huggingface-cli download --resume-download Hzzone/GLIGEN_COCO config.json diffusion_pytorch_model.safetensors
```
You can run `demo.ipynb` to visualize the generated images.

Example prompts:
```python
prompt = 'A realistic image of landscape scene depicting a green car parking on the left of a blue truck, with a red air balloon and a bird in the sky'
boxes = [[0.041015625, 0.548828125, 0.453125, 0.859375],
         [0.525390625, 0.552734375, 0.93359375, 0.865234375],
         [0.12890625, 0.015625, 0.412109375, 0.279296875],
         [0.578125, 0.08203125, 0.857421875, 0.27734375]]
gligen_phrases = ['a green car', 'a blue truck', 'a red air balloon', 'a bird']
```
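These inputs can be fed straight into the `StableDiffusionGLIGENPipeline` from diffusers; a minimal sketch using the pre-trained text-box checkpoint downloaded above (swap in the retrained weights to evaluate them):

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt=prompt,
    gligen_phrases=gligen_phrases,
    gligen_boxes=boxes,                  # normalized xyxy boxes in [0, 1]
    gligen_scheduled_sampling_beta=1.0,  # fraction of steps with grounding active
    num_inference_steps=50,
).images[0]
image.save("gligen_example.png")
```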
Example images can be reproduced with `demo.ipynb`.
### Citation

```
@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}
```
