levalencia committed on
Commit 1c43e67 · 1 Parent(s): a06ecbd

Add detailed documentation for Deep-Research PDF Field Extractor, including usage instructions, features, and support resources in README.md

Files changed (3)
  1. ARCHITECTURE.md +143 -0
  2. DEVELOPER.md +291 -0
  3. README.md +42 -0
ARCHITECTURE.md ADDED
@@ -0,0 +1,143 @@
# Architecture Overview

## System Design

The application is built on a multi-agent architecture with the following components:

### Core Components

1. **Planner (`orchestrator/planner.py`)**
   - Generates execution plans using Azure OpenAI
   - Determines the sequence of operations
   - Manages task dependencies

2. **Executor (`orchestrator/executor.py`)**
   - Executes the generated plan
   - Manages agent execution flow
   - Handles context and result management
   - Coordinates parallel agent execution

3. **Agents**
   - `TableAgent`: extracts both text and tables from PDFs using Azure Document Intelligence
   - `FieldMapper`: maps fields to values using the extracted content
   - `ForEachField`: control-flow construct for iterating over fields

## System Flow

```mermaid
graph TD
    A[User Input] --> B[Planner]
    B --> C[Execution Plan]
    C --> D[Executor]
    D --> E[TableAgent]
    E -->|Text & Tables| F[FieldMapper]
    F --> G[Results]

    subgraph "Document Intelligence"
        E
    end
```

### Document Processing Pipeline

1. **Document Processing**

   The document is processed in two stages:
   1. Text and table extraction (`TableAgent`), using Azure Document Intelligence for comprehensive extraction
   2. Field mapping (`FieldMapper`), using the extracted content to identify field values

2. **Field Extraction Process**
   - Document type inference
   - User profile determination
   - Content processing:
     - Text content analysis
     - Table structure analysis
     - Value extraction and validation

3. **Context Building**
   - Document metadata
   - Field descriptions
   - User context
   - Execution history
   - Combined text and table content

## Key Features

### Document Type Inference
The system automatically infers the document type and user profile, for example:
```text
Document type: Analytical report
User profile: Data analysts or researchers working with document analysis
```

### Field Mapping
The `FieldMapper` agent proceeds in four steps:
1. Document context analysis
2. Page-by-page scanning
3. Value extraction using the LLM
4. Result validation
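Steps 2 and 3 amount to scanning pages until the extractor produces a value. A minimal sketch of that loop (the function names here are illustrative assumptions, not the actual implementation):

```python
from typing import Callable, List, Optional

# Hypothetical page-scanning loop: ask the extractor for a value on each
# page and stop at the first page that yields one.
def scan_pages(pages: List[str], extract: Callable[[str], Optional[str]]) -> Optional[str]:
    for page_text in pages:
        value = extract(page_text)
        if value is not None:
            return value
    return None

pages = ["Intro page", "Date: 2024-01-01", "Appendix"]

def find_date(text: str) -> Optional[str]:
    # Toy extractor standing in for the LLM call.
    return text.split("Date: ")[1] if "Date: " in text else None
```

In the real system the `extract` callable would be an LLM prompt per page; the early return keeps later pages from being processed unnecessarily.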

### Execution Traces
The system maintains detailed execution traces:
- Tool execution history
- Success/failure status
- Detailed logs
- Result storage
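A trace entry of the kind listed above might be modeled as follows; this is illustrative only, and the field names are assumptions rather than the codebase's actual schema:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

# Illustrative trace record for one tool execution.
@dataclass
class TraceEntry:
    tool: str                                    # which agent ran
    success: bool                                # success/failure status
    logs: List[str] = field(default_factory=list)  # detailed logs
    result: Optional[Any] = None                 # stored result, if any

trace: List[TraceEntry] = [
    TraceEntry(tool="TableAgent", success=True, logs=["extracted 2 tables"]),
    TraceEntry(tool="FieldMapper", success=True, result="2024-01-01"),
]
```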

## Technical Setup

1. **Dependencies**
   - `streamlit`
   - `pandas`
   - Azure OpenAI
   - Azure Document Intelligence

2. **Configuration**
   - Environment variables for API keys
   - Prompt templates in `config/prompts.yaml`
   - Settings in `config/settings.py`

3. **Logging System**
   - `LogCaptureHandler` for UI display
   - Structured logging format
   - Execution history storage
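The environment-variable configuration in item 2 can be sketched as below. The variable names mirror those referenced elsewhere in this documentation, but the actual shape of `config/settings.py` is an assumption:

```python
import os
from typing import Mapping, Optional

# Hypothetical sketch of config/settings.py; the real attribute names
# and defaults may differ.
class Settings:
    def __init__(self, env: Optional[Mapping[str, str]] = None):
        env = os.environ if env is None else env
        self.AZURE_OPENAI_API_KEY = env.get("AZURE_OPENAI_API_KEY", "")
        self.AZURE_OPENAI_ENDPOINT = env.get("AZURE_OPENAI_ENDPOINT", "")
        self.AZURE_OPENAI_API_VERSION = env.get("AZURE_OPENAI_API_VERSION", "")
        self.AZURE_OPENAI_DEPLOYMENT = env.get("AZURE_OPENAI_DEPLOYMENT", "")

settings = Settings({"AZURE_OPENAI_DEPLOYMENT": "my-deployment"})
```

Keeping secrets in environment variables (rather than in the repository) is what makes the API-key management mentioned under Deployment workable.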

## Development Guidelines

1. **Adding New Agents**
   - Inherit from the base agent class
   - Implement the required methods
   - Add the agent to the planner configuration

2. **Modifying Extraction Logic**
   - Update prompt templates
   - Modify field mapping logic
   - Adjust validation rules

3. **Extending Functionality**
   - Add new field types
   - Implement custom validators
   - Create new output formats
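The "Adding New Agents" guideline above might look like this in practice; the base-class name and method signature are assumptions, since the actual interface lives in the codebase:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

# Hypothetical base agent interface; the real one may use different
# names and signatures.
class BaseAgent(ABC):
    @abstractmethod
    def run(self, ctx: Dict[str, Any]) -> Dict[str, Any]:
        """Execute the agent against the shared context and return updates."""

class WordCountAgent(BaseAgent):
    """Toy example agent: records how many words the extracted text contains."""

    def run(self, ctx: Dict[str, Any]) -> Dict[str, Any]:
        text = ctx.get("text", "")
        return {"word_count": len(text.split())}
```

After implementing the class, the remaining step is registering it with the planner configuration so generated plans can reference it by name.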

## Testing
- Unit tests for agents
- Integration tests for the pipeline
- End-to-end testing with sample PDFs

## Deployment
- Streamlit app deployment
- Environment configuration
- API key management
- Logging setup

For detailed technical implementation and AI-specific details, see [DEVELOPER.md](DEVELOPER.md).
DEVELOPER.md ADDED
@@ -0,0 +1,291 @@
# Technical Deep Dive: AI Implementation

## System Architecture

The system is built around several main components that work together to process documents and extract fields:

### Main Processing Flow
```
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|   User Input     |---->|     Planner      |---->|  Execution Plan  |
|  (PDF + Fields)  |     |  (Azure OpenAI)  |     |      (JSON)      |
+------------------+     +------------------+     +------------------+
                                                           |
                                                           v
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|     Results      |<----|   FieldMapper    |<----|    TableAgent    |
|   (DataFrame)    |     |     (Field       |     |  (Text + Tables) |
+------------------+     |    Extraction)   |     +------------------+
                         +------------------+
```

### Supporting Components
```
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|   Azure OpenAI   |     |     Azure DI     |     |     Context      |
|  (LLM Service)   |     |  (Document AI)   |     |     (State)      |
+------------------+     +------------------+     +------------------+
         ^                        ^                        ^
         |                        |                        |
         |                        |                        |
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|     Planner      |     |    TableAgent    |     |     Executor     |
|    (Planning)    |     |   (Extraction)   |     |  (Orchestration) |
+------------------+     +------------------+     +------------------+
```

The system follows this flow:
1. The user provides a PDF and the field requirements
2. The Planner generates an execution plan using Azure OpenAI
3. The TableAgent extracts text and tables using Azure Document Intelligence
4. The FieldMapper processes the extracted content to find field values
5. Results are returned as a structured DataFrame

The Executor orchestrates this process while maintaining state in the Context, and Azure Document Intelligence provides the document processing capabilities.

## Core Components

### State Management
State is managed through a shared context dictionary in the `Executor` class:

```python
class Executor:
    def run(self, plan: Dict[str, Any], pdf_file) -> tuple[pd.DataFrame, List[Dict[str, Any]]]:
        ctx: Dict[str, Any] = {
            "pdf_file": pdf_file,
            "fields": fields,
            "results": [],
            "conf": 1.0,
            "pdf_meta": plan.get("pdf_meta", {}),
        }
```

This context dictionary maintains:
- **pdf_file**: the input PDF file
- **fields**: the list of fields to extract
- **results**: accumulated extraction results
- **conf**: a confidence score for extractions
- **pdf_meta**: PDF metadata and processing information

### Planning System
The Planner uses Azure OpenAI to generate execution plans based on the document content and user requirements:

```python
class Planner:
    def __init__(self) -> None:
        self.prompt_template = self._load_prompt("planner")
        self.llm = LLMClient(settings)

    def build_plan(
        self,
        pdf_meta: Dict[str, Any],
        fields: List[str],
        doc_preview: str | None = None,
        field_descs: Dict | None = None,
    ) -> Dict[str, Any]:
        """Generate an execution plan using Azure OpenAI."""
```

The generated plan follows a strict JSON schema:
```json
{
  "fields": ["field1", "field2", ...],
  "steps": [
    {"tool": "TableAgent", "args": {}},
    {
      "tool": "ForEachField",
      "loop": [
        {"tool": "FieldMapper", "args": {"field": "$field"}}
      ]
    }
  ]
}
```
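A plan in this shape can be executed by walking the steps recursively, expanding `ForEachField` loops and substituting the `$field` placeholder. The sketch below is a simplified, hypothetical walker, not the real Executor (which also manages context, logging, and error handling):

```python
from typing import Any, Dict, List, Optional

# Hypothetical, simplified plan walker: flattens a plan into a list of
# concrete tool invocations.
def walk_plan(plan: Dict[str, Any]) -> List[Dict[str, Any]]:
    calls: List[Dict[str, Any]] = []

    def run_steps(steps: List[Dict[str, Any]], field: Optional[str]) -> None:
        for step in steps:
            if step["tool"] == "ForEachField":
                # Expand the loop body once per declared field.
                for f in plan["fields"]:
                    run_steps(step["loop"], f)
            else:
                # Substitute the $field placeholder inside loop bodies.
                args = {k: (field if v == "$field" else v)
                        for k, v in step.get("args", {}).items()}
                calls.append({"tool": step["tool"], "args": args})

    run_steps(plan["steps"], None)
    return calls

plan = {
    "fields": ["Date", "Name"],
    "steps": [
        {"tool": "TableAgent", "args": {}},
        {"tool": "ForEachField",
         "loop": [{"tool": "FieldMapper", "args": {"field": "$field"}}]},
    ],
}
```

For the two-field plan above, the walker yields one `TableAgent` call followed by one `FieldMapper` call per field.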

### Document Intelligence
The core document processing is handled by Azure Document Intelligence:

```python
class AzureDIService:
    def __init__(self, endpoint: str, key: str):
        self.client = DocumentIntelligenceClient(
            endpoint=endpoint,
            credential=AzureKeyCredential(key)
        )

    def extract_content(self, pdf_bytes: bytes):
        # Analyze the document using Azure DI
        poller = self.client.begin_analyze_document(
            "prebuilt-layout",
            body=pdf_bytes
        )
        result = poller.result()

        # Extract both text and tables
        text_content = result.content if hasattr(result, "content") else ""
        tables = self._extract_tables(result) if hasattr(result, "tables") else []

        return {
            "text": text_content,
            "tables": tables
        }
```

### Field Mapping
The field mapping process is implemented through a dedicated class:

```python
class FieldMapper:
    def __init__(self):
        self.llm = LLMClient()
        self.embedding_client = EmbeddingClient()

    def extract_field(self, field: str, content: Dict[str, Any]):
        # Combine text and tables for context
        context = self._build_context(content)

        # Extract the field value using the LLM
        value = self._extract_value(field, context)

        return value
```

## API Implementation

The system uses Azure OpenAI's Responses API:

```python
class LLMClient:
    """Thin wrapper around openai.responses using Azure endpoints."""

    def __init__(self, settings):
        # Configure the global client for Azure
        openai.api_type = "azure"
        openai.api_key = settings.OPENAI_API_KEY or settings.AZURE_OPENAI_API_KEY
        openai.api_base = settings.AZURE_OPENAI_ENDPOINT
        openai.api_version = settings.AZURE_OPENAI_API_VERSION
        self._deployment = settings.AZURE_OPENAI_DEPLOYMENT

    def responses(self, prompt: str, tools: List[dict] | None = None, **kwargs: Any) -> str:
        """Call the Responses API and return the assistant content as a string."""
        resp = openai.responses.create(
            input=prompt,
            model=self._deployment,
            tools=tools or [],
            **kwargs,
        )
        # Extract the text content from the response
        if hasattr(resp, "output") and isinstance(resp.output, list):
            for message in resp.output:
                if hasattr(message, "content") and isinstance(message.content, list):
                    for content in message.content:
                        if hasattr(content, "text"):
                            return content.text
        return str(resp)
```

Key features of our implementation:
1. **Responses API**: uses Azure OpenAI's Responses API for structured interactions
2. **Tool support**: optional `tools` parameter for function calling
3. **Flexible response handling**: multiple fallback methods for response extraction
4. **Azure integration**: configured for Azure OpenAI endpoints

The choice of the Responses API provides:
- Structured output capabilities
- Built-in tool support
- A consistent response format
- Azure-specific optimizations

## Error Handling

The system implements basic error handling through try/except blocks and logging:

1. **Azure Document Intelligence errors**
   ```python
   try:
       # Document analysis
       result = self.client.begin_analyze_document(...)
   except HttpResponseError as e:
       self.logger.error(f"Azure Document Intelligence API error: {str(e)}")
       # Log detailed error information if available
       if hasattr(e, 'response') and hasattr(e.response, 'json'):
           try:
               error_details = e.response.json()
               self.logger.error(f"Error details: {error_details}")
           except Exception:
               pass
       raise
   except Exception as e:
       self.logger.error(f"Unexpected error during document analysis: {str(e)}")
       self.logger.exception("Full traceback:")
       raise
   ```

2. **Field mapping errors**
   ```python
   try:
       value = self.llm.responses(prompt, temperature=0.0)
       # Process and validate the value
   except Exception as e:
       self.logger.error(f"Error extracting field value: {str(e)}", exc_info=True)
       return None
   ```

3. **Execution errors**
   ```python
   try:
       for step in plan["steps"]:
           self._execute_step(step, ctx, depth=0)
   except Exception as e:
       self.logger.error(f"Error during execution: {str(e)}")
       self.logger.error(traceback.format_exc())
       # Don't re-raise; let the UI show the partial results
   ```

## Performance

Currently, the system processes documents without caching. Each request is processed independently, which ensures:
- Fresh results for each extraction
- No stale-data issues
- A simple, straightforward implementation
- Predictable resource usage
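If caching is added later (see Future Improvements below), one minimal approach is to key cached extraction results on a hash of the document bytes. This is an illustrative sketch under that assumption, not part of the current codebase:

```python
import hashlib
from typing import Any, Callable, Dict

# Hypothetical content-hash cache for document extraction results.
class ExtractionCache:
    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get_or_compute(self, pdf_bytes: bytes, compute: Callable[[bytes], Any]) -> Any:
        # Identical bytes hash to the same key, so re-uploads hit the cache.
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = compute(pdf_bytes)
        return self._store[key]

cache = ExtractionCache()
calls = []

def fake_extract(b: bytes):
    calls.append(b)
    return {"text": b.decode()}

first = cache.get_or_compute(b"doc", fake_extract)
second = cache.get_or_compute(b"doc", fake_extract)  # served from cache
```

The trade-off is exactly the one noted above: a cache risks stale data if the extraction logic or prompts change, so cache keys would also need to incorporate a version of the pipeline.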

## Future Improvements

1. **Advanced Field Mapping**
   - Validation rules
   - Multi-field extraction optimization
   - Cross-field validation rules
   - Context-aware mapping improvements
   - Better handling of ambiguous cases

2. **Performance Enhancements**
   - Implementation of a caching system for:
     - Document content
     - Field extraction results
     - Context data
   - Batch processing capabilities
   - Resource usage optimization

3. **Testing and Debugging Infrastructure**
   - Comprehensive test suite:
     - Unit tests for each agent and service
     - Integration tests for the complete pipeline
     - End-to-end tests with sample documents
     - Performance benchmarks
   - Debugging tools:
     - Real-time execution monitoring
     - Detailed logging and tracing
     - Breakpoint management
     - Error tracking and analysis

4. **Error Handling Improvements**
   - Custom error classes for different error types
   - More sophisticated recovery strategies
   - Retry mechanisms for transient failures
   - Better error reporting to users
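The retry mechanism mentioned in item 4 could start as a simple exponential-backoff wrapper; this helper is hypothetical and not in the codebase:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical retry helper for transient failures (e.g. throttled API calls).
def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.0) -> T:
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("unreachable")

calls = {"n": 0}

def flaky():
    # Simulates a call that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"
```

A production version would retry only on error types known to be transient (for example, HTTP 429/503 responses) rather than on every `Exception`.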
README.md CHANGED
@@ -17,3 +17,45 @@ Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :hear

If you have any questions, check out our [documentation](https://docs.streamlit.io) and [community forums](https://discuss.streamlit.io).

# Deep-Research PDF Field Extractor

A powerful tool for extracting structured data from PDF documents, designed to handle various document types and extract specific fields of interest.

## Overview
The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, and locations. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.

## How to Use

1. **Upload Your PDF**
   - Click the "Upload PDF" button
   - Select your PDF file from your computer

2. **Specify Fields to Extract**
   - Enter the fields you want to extract, separated by commas
   - Example: `Date, Name, Value, Location, Page, FileName`

3. **Optional: Add Field Descriptions**
   - You can provide additional context about the fields
   - This helps the system better understand what to look for

4. **Run Extraction**
   - Click the "Run extraction" button
   - Wait for the process to complete
   - View your results in a table format

5. **Download Results**
   - Download your extracted data as a CSV file
   - View execution traces and logs if needed

## Features
- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs

## Support
For technical documentation and architecture details, please refer to:
- [Architecture Overview](ARCHITECTURE.md)
- [Developer Documentation](DEVELOPER.md)
+ - [Developer Documentation](DEVELOPER.md)