---
license: mit
tags:
- vision
inference: false
pipeline_tag: image-text-to-text
---
# UDOP model

The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal.

## Model description

UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks such as document image classification, document parsing, and document visual question answering.

## Intended uses & limitations

You can use the model for document image classification, document parsing, and document visual question answering (DocVQA).

### How to use

Here's how to use the model for document visual question answering:

```python
from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset

# Load the model and processor.
# OCR has already been performed in this example, so the processor
# is initialized with `apply_ocr=False`.
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

# Load an example image, along with the words and bounding boxes
# that were extracted using an OCR engine.
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
question = "Question answering. What is the date on the form?"

# Prepare the inputs for the model.
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")

# Autoregressive generation.
predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
# 9/30/92
```

Refer to the [demo notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP) for fine-tuning and inference examples.
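
When you run your own OCR engine and pass `apply_ocr=False`, the boxes typically arrive in pixel coordinates. The sketch below assumes the 0-1000 coordinate scale used by LayoutLM-style processors in Transformers and shows one way to rescale a pixel box before passing it as `boxes`. `normalize_box` is a hypothetical helper written for illustration; it is not part of the library, and you should confirm the expected scale against the UDOP documentation for your Transformers version.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Rescale a pixel-space (x0, y0, x1, y1) box to an assumed 0-1000 scale.

    Hypothetical helper: `width` and `height` are the image size in pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
    # Clamp so OCR boxes that spill past the image edge stay in range.
    return [min(1000, max(0, value)) for value in scaled]


if __name__ == "__main__":
    # A box on a 1000x500 pixel page image.
    print(normalize_box((50, 100, 200, 300), width=1000, height=500))
```

With boxes rescaled this way, `boxes = [normalize_box(b, image.width, image.height) for b in pixel_boxes]` could replace the dataset-provided `boxes`, where `pixel_boxes` stands for whatever your OCR engine returned.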

### BibTeX entry and citation info

```bibtex
@misc{tang2023unifying,
      title={Unifying Vision, Text, and Layout for Universal Document Processing},
      author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
      year={2023},
      eprint={2212.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

### Data Summary

https://huggingface.co/microsoft/udop-large-512/blob/main/data_summary_card.md