---
license: apache-2.0
library_name: PaddleOCR
language:
- en
- zh
pipeline_tag: image-to-text
tags:
- OCR
- PaddlePaddle
- PaddleOCR
---

# PP-DocBee2-3B

## Introduction

The PaddleOCR team has developed PP-DocBee2-3B, a multimodal large model that significantly enhances Chinese document understanding. Building on the original PP-DocBee, this iteration introduces an improved data optimization scheme that raises data quality. Using a relatively small set of 470,000 synthetic data points, generated through a proprietary data synthesis strategy, PP-DocBee2 achieves superior performance on Chinese document understanding tasks. On internal Chinese business-scenario metrics, PP-DocBee2 improves on its predecessor, PP-DocBee, by 11.4%, and it also outperforms other popular open-source and closed-source models of comparable scale on key accuracy metrics. The key accuracy metrics are as follows:

| Model | Model Storage Size (GB) | Total Score |
|-------|-------------------------|-------------|
| PP-DocBee-2B | 4.2 | 765 |
| PP-DocBee-7B | 15.8 | - |
| **PP-DocBee2-3B** | 7.6 | 852 |


**Note**: The total scores above are test results from an internal evaluation set of 1,196 entries, where all images have a resolution (height, width) of (1680, 1204). The set covers scenarios such as financial reports, laws and regulations, scientific and technical papers, manuals, humanities papers, contracts, and research reports. There are no plans to release this evaluation set publicly at the moment.
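As a sanity check, the 11.4% improvement quoted in the introduction is consistent with the internal total scores in the table above (a quick back-of-the-envelope calculation, not an official evaluation script):

```python
# Internal evaluation total scores from the table above
pp_docbee_2b_score = 765
pp_docbee2_3b_score = 852

# Relative improvement of PP-DocBee2-3B over PP-DocBee-2B
improvement = (pp_docbee2_3b_score - pp_docbee_2b_score) / pp_docbee_2b_score
print(f"{improvement:.1%}")  # 11.4%
```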

## Quick Start

### Installation

1. PaddlePaddle

Please refer to the following commands to install PaddlePaddle using pip:

```bash
# for CUDA11.8
python -m pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

# for CUDA12.6
python -m pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# for CPU
python -m pip install paddlepaddle==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```

For details about PaddlePaddle installation, please refer to the [PaddlePaddle official website](https://www.paddlepaddle.org.cn/en/install/quick).

2. PaddleOCR

Install the latest version of the PaddleOCR inference package from PyPI:

```bash
python -m pip install paddleocr
```

### Model Usage

You can quickly experience the functionality with a single command:

```bash
paddleocr doc_vlm \
    --model_name PP-DocBee2-3B \
    -i "{'image': 'https://cdn-uploads.huggingface.co/production/uploads/684acf07de103b2d44c85531/l5xpHbfLn75dKInhQZ84I.png', 'query': 'Recognize the content of this table and output it in markdown format.'}"
```

You can also integrate the model inference of the document visual-language module into your project. The sample code below passes the image URL directly; alternatively, you can download the sample image to your local machine and pass a local path instead.

```python
from paddleocr import DocVLM
model = DocVLM(model_name="PP-DocBee2-3B")
results = model.predict(
    input={
        "image": "https://cdn-uploads.huggingface.co/production/uploads/684acf07de103b2d44c85531/l5xpHbfLn75dKInhQZ84I.png", 
        "query": "Recognize the content of this table and output it in markdown format."
    },
    batch_size=1
)
for res in results:
    res.print()
    res.save_to_json(f"./output/res.json")
```

After running, the obtained result is as follows:

```bash
{'res': {'image': 'medal_table_en.png', 'query': 'Recognize the content of this table and output it in markdown format', 'result': '| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n|---|---|---|---|---|---|\n| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n| 2 | United States (USA) | 36 | 39 | 37 | 112 |\n| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |\n| 4 | Great Britain (GBR) | 19 | 13 | 19 | 51 |\n| 5 | Germany (GER) | 16 | 11 | 14 | 41 |\n| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |\n| 7 | South Korea (KOR) | 13 | 11 | 8 | 32 |\n| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |\n| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |\n| 10 | France (FRA) | 7 | 16 | 20 | 43 |\n| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |\n| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |\n| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |\n| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |\n| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |\n'}}
```

The visualized result is as follows:

```bash
| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |
|---|---|---|---|---|---|
| 1 | China (CHN) | 48 | 22 | 30 | 100 |
| 2 | United States (USA) | 36 | 39 | 37 | 112 |
| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |
| 4 | Great Britain (GBR) | 19 | 13 | 19 | 51 |
| 5 | Germany (GER) | 16 | 11 | 14 | 41 |
| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |
| 7 | South Korea (KOR) | 13 | 11 | 8 | 32 |
| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |
| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |
| 10 | France (FRA) | 7 | 16 | 20 | 43 |
| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |
| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |
| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |
| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |
| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |
```
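The `result` field is returned as a plain markdown string. If your application needs structured data downstream, a small stdlib-only helper can parse such a table back into rows (a minimal sketch; the function name and the sample string below are illustrative, not part of the PaddleOCR API):

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in md.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

# Abbreviated sample in the same shape as the model's table output
sample = (
    "| Rank | Country/Region | Gold |\n"
    "|---|---|---|\n"
    "| 1 | China (CHN) | 48 |\n"
    "| 2 | United States (USA) | 36 |\n"
)
rows = parse_markdown_table(sample)
print(rows[0]["Country/Region"])  # China (CHN)
```

In practice you would pass `res["result"]` (or the `result` field of the saved JSON) to the helper instead of the sample string.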

For details about the usage command and parameter descriptions, please refer to the [Document](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/module_usage/doc_vlm.html#iii-quick-start).

### Pipeline Usage

The capability of a single model is limited, but a pipeline composed of several models provides more capacity to solve difficult problems in real-world scenarios.

#### doc_understanding

The document understanding pipeline is an advanced document processing technology based on vision-language models (VLMs), designed to overcome the limitations of traditional document processing. The pipeline contains a single module:
* Document Visual Language Module

Run a single command to quickly experience the document understanding pipeline:

```bash
paddleocr doc_understanding -i "{'image': 'https://cdn-uploads.huggingface.co/production/uploads/684acf07de103b2d44c85531/l5xpHbfLn75dKInhQZ84I.png', 'query': 'Recognize the content of this table and output it in markdown format.'}"
```

Results are printed to the terminal:

```bash
{'res': {'image': 'medal_table_en.png', 'query': 'Recognize the content of this table and output it in markdown format', 'result': '| Rank | Country/Region | Gold | Silver | Bronze | Total Medals |\n|---|---|---|---|---|---|\n| 1 | China (CHN) | 48 | 22 | 30 | 100 |\n| 2 | United States (USA) | 36 | 39 | 37 | 112 |\n| 3 | Russia (RUS) | 24 | 13 | 23 | 60 |\n| 4 | Great Britain (GBR) | 19 | 13 | 19 | 51 |\n| 5 | Germany (GER) | 16 | 11 | 14 | 41 |\n| 6 | Australia (AUS) | 14 | 15 | 17 | 46 |\n| 7 | South Korea (KOR) | 13 | 11 | 8 | 32 |\n| 8 | Japan (JPN) | 9 | 8 | 8 | 25 |\n| 9 | Italy (ITA) | 8 | 9 | 10 | 27 |\n| 10 | France (FRA) | 7 | 16 | 20 | 43 |\n| 11 | Netherlands (NED) | 7 | 5 | 4 | 16 |\n| 12 | Ukraine (UKR) | 7 | 4 | 11 | 22 |\n| 13 | Kenya (KEN) | 6 | 4 | 6 | 16 |\n| 14 | Spain (ESP) | 5 | 11 | 3 | 19 |\n| 15 | Jamaica (JAM) | 5 | 4 | 2 | 11 |\n'}}
```

If `save_path` is specified, the visualization results will be saved under `save_path`. The visualization output is shown below:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/684acf07de103b2d44c85531/kFGo9nlHuHs2uyN1voSTg.png)

The command-line method is for a quick experience. For project integration, only a few lines of code are needed:

```python
from paddleocr import DocUnderstanding

pipeline = DocUnderstanding(
    doc_understanding_model_name="PP-DocBee2-3B"
)
output = pipeline.predict(
    {
        "image": "https://cdn-uploads.huggingface.co/production/uploads/684acf07de103b2d44c85531/l5xpHbfLn75dKInhQZ84I.png",
        "query": "Recognize the content of this table and output it in markdown format."
    }
)
for res in output:
    res.print() ## Print the structured output of the prediction
    res.save_to_json("./output/")
```

The default model used in the pipeline is `PP-DocBee2-3B`, so you do not have to pass `PP-DocBee2-3B` to the `doc_understanding_model_name` argument; you can also use a local model directory via the `doc_understanding_model_dir` argument. For details about the usage command and parameter descriptions, please refer to the [Document](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/pipeline_usage/doc_understanding.html#2-quick-start).

## Links

[PaddleOCR Repo](https://github.com/paddlepaddle/paddleocr)

[PaddleOCR Documentation](https://paddlepaddle.github.io/PaddleOCR/latest/en/index.html)