Instructions to use microsoft/udop-large-512 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use microsoft/udop-large-512 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="microsoft/udop-large-512")

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("microsoft/udop-large-512")
model = AutoModelForImageTextToText.from_pretrained("microsoft/udop-large-512")

Local Apps

vLLM

How to use microsoft/udop-large-512 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/udop-large-512"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/udop-large-512",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/microsoft/udop-large-512

SGLang

How to use microsoft/udop-large-512 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/udop-large-512" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/udop-large-512",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/udop-large-512" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/udop-large-512",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

	---
	license: mit
	tags:
	- vision
	inference: false
	pipeline_tag: image-text-to-text
	---

	# UDOP model

	The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.

	## Model description

	UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks like document image classification, document parsing and document visual question answering.

	## Intended uses & limitations

	You can use the model for document image classification, document parsing and document visual question answering (DocVQA).

	### How to use

	Here's how to use the model for document visual question answering (DocVQA):

	```python
	from transformers import AutoProcessor, UdopForConditionalGeneration
	from datasets import load_dataset

	# load model and processor
	# in this case, we already have performed OCR ourselves
	# so we initialize the processor with `apply_ocr=False`
	processor = AutoProcessor.from_pretrained("microsoft/udop-large-512", apply_ocr=False)
	model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large-512")

	# load an example image, along with the words and coordinates
	# which were extracted using an OCR engine
	dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
	example = dataset[0]
	image = example["image"]
	words = example["tokens"]
	boxes = example["bboxes"]
	question = "Question answering. What is the date on the form?"

	# prepare everything for the model
	encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")

	# autoregressive generation
	predicted_ids = model.generate(**encoding)
	print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
	# 9/30/92
	```
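
When `apply_ocr=False`, the words and boxes come from your own OCR step, so they have to be prepared before they reach the processor. The sketch below assumes the convention described in the Transformers documentation for layout-aware models such as UDOP: boxes in `(x0, y0, x1, y1)` order, scaled to a 0-1000 range relative to the page size. The helper names `normalize_bbox` and `check_ocr_inputs`, the sample words, and the page dimensions are illustrative and not part of the model card. The FUNSD example above appears to ship boxes already in this range, so this step would matter mainly for raw pixel coordinates from an OCR engine.

```python
from typing import List, Sequence


def normalize_bbox(bbox: Sequence[float], width: int, height: int) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def check_ocr_inputs(words: Sequence[str], boxes: Sequence[Sequence[int]]) -> None:
    """Raise ValueError if words and boxes are misaligned or out of range."""
    if len(words) != len(boxes):
        raise ValueError(f"{len(words)} words but {len(boxes)} boxes")
    for box in boxes:
        if len(box) != 4 or not all(0 <= value <= 1000 for value in box):
            raise ValueError(f"box {list(box)} is not four values in 0-1000")


# Hypothetical OCR output for a page that is 500 px wide and 1000 px tall.
words = ["Date:", "9/30/92"]
pixel_boxes = [[50, 100, 200, 150], [210, 100, 400, 150]]

boxes = [normalize_bbox(box, width=500, height=1000) for box in pixel_boxes]
check_ocr_inputs(words, boxes)
print(boxes)
```

The resulting `words` and `boxes` lists can then be passed to the processor in the same positions as in the example above. Checking the lengths up front is a design choice: a mismatch between words and boxes is easier to diagnose here than inside the tokenizer.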

	Refer to the [demo notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP) for fine-tuning/inference.

	### BibTeX entry and citation info

	```bibtex
	@misc{tang2023unifying,
	title={Unifying Vision, Text, and Layout for Universal Document Processing},
	author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
	year={2023},
	eprint={2212.02623},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
	}
	```

	### Data Summary

	See the [data summary card](https://huggingface.co/microsoft/udop-large-512/blob/main/data_summary_card.md).