levalencia committed on
Commit 1c43e67 · 1 Parent(s): a06ecbd

Add detailed documentation for Deep-Research PDF Field Extractor, including usage instructions, features, and support resources in README.md

Files changed (3)
  1. ARCHITECTURE.md +143 -0
  2. DEVELOPER.md +291 -0
  3. README.md +42 -0
ARCHITECTURE.md ADDED
@@ -0,0 +1,143 @@
# Architecture Overview

## System Design

The application is built on a multi-agent architecture with the following components:

### Core Components

1. **Planner (`orchestrator/planner.py`)**
   - Generates execution plans using Azure OpenAI
   - Determines the sequence of operations
   - Manages task dependencies

2. **Executor (`orchestrator/executor.py`)**
   - Executes the generated plan
   - Manages agent execution flow
   - Handles context and result management
   - Coordinates parallel agent execution

3. **Agents**
   - `TableAgent`: extracts both text and tables from PDFs using Azure Document Intelligence
   - `FieldMapper`: maps fields to values using the extracted content
   - `ForEachField`: control-flow construct for iterating over fields

## System Flow

```mermaid
graph TD
    A[User Input] --> B[Planner]
    B --> C[Execution Plan]
    C --> D[Executor]
    D --> E[TableAgent]
    E -->|Text & Tables| F[FieldMapper]
    F --> G[Results]

    subgraph "Document Intelligence"
        E
    end
```

### Document Processing Pipeline

1. **Document Processing**

   The document is processed in two stages:
   1. Text and table extraction (`TableAgent`), using Azure Document Intelligence for comprehensive extraction
   2. Field mapping (`FieldMapper`), using the extracted content to identify field values

2. **Field Extraction Process**
   - Document type inference
   - User profile determination
   - Content processing:
     - Text content analysis
     - Table structure analysis
     - Value extraction and validation

3. **Context Building**
   - Document metadata
   - Field descriptions
   - User context
   - Execution history
   - Combined text and table content

## Key Features

### Document Type Inference
The system automatically infers the document type and user profile, for example:
```text
Document type: Analytical report
User profile: Data analysts or researchers working with document analysis
```

### Field Mapping
The `FieldMapper` agent proceeds in four steps:
1. Document context analysis
2. Page-by-page scanning
3. Value extraction using the LLM
4. Result validation
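Steps 2 and 3 amount to scanning pages until the extractor produces a value. A minimal sketch of that loop (the function names here are illustrative assumptions, not the actual implementation):

```python
from typing import Callable, List, Optional

# Hypothetical page-scanning loop: ask the extractor for a value on each
# page and stop at the first page that yields one.
def scan_pages(pages: List[str], extract: Callable[[str], Optional[str]]) -> Optional[str]:
    for page_text in pages:
        value = extract(page_text)
        if value is not None:
            return value
    return None

pages = ["Intro page", "Date: 2024-01-01", "Appendix"]

def find_date(text: str) -> Optional[str]:
    # Toy extractor standing in for the LLM call.
    return text.split("Date: ")[1] if "Date: " in text else None
```

In the real system the `extract` callable would be an LLM prompt per page; the early return keeps later pages from being processed unnecessarily.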

### Execution Traces
The system maintains detailed execution traces:
- Tool execution history
- Success/failure status
- Detailed logs
- Result storage
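A trace entry of the kind listed above might be modeled as follows; this is illustrative only, and the field names are assumptions rather than the codebase's actual schema:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

# Illustrative trace record for one tool execution.
@dataclass
class TraceEntry:
    tool: str                                    # which agent ran
    success: bool                                # success/failure status
    logs: List[str] = field(default_factory=list)  # detailed logs
    result: Optional[Any] = None                 # stored result, if any

trace: List[TraceEntry] = [
    TraceEntry(tool="TableAgent", success=True, logs=["extracted 2 tables"]),
    TraceEntry(tool="FieldMapper", success=True, result="2024-01-01"),
]
```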

## Technical Setup

1. **Dependencies**
   - `streamlit`
   - `pandas`
   - Azure OpenAI
   - Azure Document Intelligence

2. **Configuration**
   - Environment variables for API keys
   - Prompt templates in `config/prompts.yaml`
   - Settings in `config/settings.py`

3. **Logging System**
   - `LogCaptureHandler` for UI display
   - Structured logging format
   - Execution history storage
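The environment-variable configuration in item 2 can be sketched as below. The variable names mirror those referenced elsewhere in this documentation, but the actual shape of `config/settings.py` is an assumption:

```python
import os
from typing import Mapping, Optional

# Hypothetical sketch of config/settings.py; the real attribute names
# and defaults may differ.
class Settings:
    def __init__(self, env: Optional[Mapping[str, str]] = None):
        env = os.environ if env is None else env
        self.AZURE_OPENAI_API_KEY = env.get("AZURE_OPENAI_API_KEY", "")
        self.AZURE_OPENAI_ENDPOINT = env.get("AZURE_OPENAI_ENDPOINT", "")
        self.AZURE_OPENAI_API_VERSION = env.get("AZURE_OPENAI_API_VERSION", "")
        self.AZURE_OPENAI_DEPLOYMENT = env.get("AZURE_OPENAI_DEPLOYMENT", "")

settings = Settings({"AZURE_OPENAI_DEPLOYMENT": "my-deployment"})
```

Keeping secrets in environment variables (rather than in the repository) is what makes the API-key management mentioned under Deployment workable.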

## Development Guidelines

1. **Adding New Agents**
   - Inherit from the base agent class
   - Implement the required methods
   - Add the agent to the planner configuration

2. **Modifying Extraction Logic**
   - Update prompt templates
   - Modify field mapping logic
   - Adjust validation rules

3. **Extending Functionality**
   - Add new field types
   - Implement custom validators
   - Create new output formats
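The "Adding New Agents" guideline above might look like this in practice; the base-class name and method signature are assumptions, since the actual interface lives in the codebase:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

# Hypothetical base agent interface; the real one may use different
# names and signatures.
class BaseAgent(ABC):
    @abstractmethod
    def run(self, ctx: Dict[str, Any]) -> Dict[str, Any]:
        """Execute the agent against the shared context and return updates."""

class WordCountAgent(BaseAgent):
    """Toy example agent: records how many words the extracted text contains."""

    def run(self, ctx: Dict[str, Any]) -> Dict[str, Any]:
        text = ctx.get("text", "")
        return {"word_count": len(text.split())}
```

After implementing the class, the remaining step is registering it with the planner configuration so generated plans can reference it by name.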

## Testing
- Unit tests for agents
- Integration tests for the pipeline
- End-to-end testing with sample PDFs

## Deployment
- Streamlit app deployment
- Environment configuration
- API key management
- Logging setup

For detailed technical implementation and AI-specific details, see [DEVELOPER.md](DEVELOPER.md).
DEVELOPER.md ADDED
@@ -0,0 +1,291 @@
# Technical Deep Dive: AI Implementation

## System Architecture

The system is built around several main components that work together to process documents and extract fields:

### Main Processing Flow
```
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|   User Input     |---->|     Planner      |---->|  Execution Plan  |
|  (PDF + Fields)  |     |  (Azure OpenAI)  |     |      (JSON)      |
+------------------+     +------------------+     +------------------+
                                                           |
                                                           v
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|     Results      |<----|   FieldMapper    |<----|    TableAgent    |
|   (DataFrame)    |     |     (Field       |     |  (Text + Tables) |
+------------------+     |    Extraction)   |     +------------------+
                         +------------------+
```

### Supporting Components
```
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|   Azure OpenAI   |     |     Azure DI     |     |     Context      |
|  (LLM Service)   |     |  (Document AI)   |     |     (State)      |
+------------------+     +------------------+     +------------------+
         ^                        ^                        ^
         |                        |                        |
         |                        |                        |
+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|     Planner      |     |    TableAgent    |     |     Executor     |
|    (Planning)    |     |   (Extraction)   |     |  (Orchestration) |
+------------------+     +------------------+     +------------------+
```

The system follows this flow:
1. The user provides a PDF and the field requirements
2. The Planner generates an execution plan using Azure OpenAI
3. The TableAgent extracts text and tables using Azure Document Intelligence
4. The FieldMapper processes the extracted content to find field values
5. Results are returned as a structured DataFrame

The Executor orchestrates this process while maintaining state in the Context, and Azure Document Intelligence provides the document processing capabilities.

## Core Components

### State Management
State is managed through a shared context dictionary in the `Executor` class:

```python
class Executor:
    def run(self, plan: Dict[str, Any], pdf_file) -> tuple[pd.DataFrame, List[Dict[str, Any]]]:
        ctx: Dict[str, Any] = {
            "pdf_file": pdf_file,
            "fields": fields,
            "results": [],
            "conf": 1.0,
            "pdf_meta": plan.get("pdf_meta", {}),
        }
```

This context dictionary maintains:
- **pdf_file**: the input PDF file
- **fields**: the list of fields to extract
- **results**: accumulated extraction results
- **conf**: a confidence score for extractions
- **pdf_meta**: PDF metadata and processing information

### Planning System
The Planner uses Azure OpenAI to generate execution plans based on the document content and user requirements:

```python
class Planner:
    def __init__(self) -> None:
        self.prompt_template = self._load_prompt("planner")
        self.llm = LLMClient(settings)

    def build_plan(
        self,
        pdf_meta: Dict[str, Any],
        fields: List[str],
        doc_preview: str | None = None,
        field_descs: Dict | None = None,
    ) -> Dict[str, Any]:
        """Generate an execution plan using Azure OpenAI."""
```

The generated plan follows a strict JSON schema:
```json
{
  "fields": ["field1", "field2", ...],
  "steps": [
    {"tool": "TableAgent", "args": {}},
    {
      "tool": "ForEachField",
      "loop": [
        {"tool": "FieldMapper", "args": {"field": "$field"}}
      ]
    }
  ]
}
```
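A plan in this shape can be executed by walking the steps recursively, expanding `ForEachField` loops and substituting the `$field` placeholder. The sketch below is a simplified, hypothetical walker, not the real Executor (which also manages context, logging, and error handling):

```python
from typing import Any, Dict, List, Optional

# Hypothetical, simplified plan walker: flattens a plan into a list of
# concrete tool invocations.
def walk_plan(plan: Dict[str, Any]) -> List[Dict[str, Any]]:
    calls: List[Dict[str, Any]] = []

    def run_steps(steps: List[Dict[str, Any]], field: Optional[str]) -> None:
        for step in steps:
            if step["tool"] == "ForEachField":
                # Expand the loop body once per declared field.
                for f in plan["fields"]:
                    run_steps(step["loop"], f)
            else:
                # Substitute the $field placeholder inside loop bodies.
                args = {k: (field if v == "$field" else v)
                        for k, v in step.get("args", {}).items()}
                calls.append({"tool": step["tool"], "args": args})

    run_steps(plan["steps"], None)
    return calls

plan = {
    "fields": ["Date", "Name"],
    "steps": [
        {"tool": "TableAgent", "args": {}},
        {"tool": "ForEachField",
         "loop": [{"tool": "FieldMapper", "args": {"field": "$field"}}]},
    ],
}
```

For the two-field plan above, the walker yields one `TableAgent` call followed by one `FieldMapper` call per field.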

### Document Intelligence
The core document processing is handled by Azure Document Intelligence:

```python
class AzureDIService:
    def __init__(self, endpoint: str, key: str):
        self.client = DocumentIntelligenceClient(
            endpoint=endpoint,
            credential=AzureKeyCredential(key)
        )

    def extract_content(self, pdf_bytes: bytes):
        # Analyze the document using Azure DI
        poller = self.client.begin_analyze_document(
            "prebuilt-layout",
            body=pdf_bytes
        )
        result = poller.result()

        # Extract both text and tables
        text_content = result.content if hasattr(result, "content") else ""
        tables = self._extract_tables(result) if hasattr(result, "tables") else []

        return {
            "text": text_content,
            "tables": tables
        }
```

### Field Mapping
The field mapping process is implemented through a dedicated class:

```python
class FieldMapper:
    def __init__(self):
        self.llm = LLMClient()
        self.embedding_client = EmbeddingClient()

    def extract_field(self, field: str, content: Dict[str, Any]):
        # Combine text and tables for context
        context = self._build_context(content)

        # Extract the field value using the LLM
        value = self._extract_value(field, context)

        return value
```

## API Implementation

The system uses Azure OpenAI's Responses API:

```python
class LLMClient:
    """Thin wrapper around openai.responses using Azure endpoints."""

    def __init__(self, settings):
        # Configure the global client for Azure
        openai.api_type = "azure"
        openai.api_key = settings.OPENAI_API_KEY or settings.AZURE_OPENAI_API_KEY
        openai.api_base = settings.AZURE_OPENAI_ENDPOINT
        openai.api_version = settings.AZURE_OPENAI_API_VERSION
        self._deployment = settings.AZURE_OPENAI_DEPLOYMENT

    def responses(self, prompt: str, tools: List[dict] | None = None, **kwargs: Any) -> str:
        """Call the Responses API and return the assistant content as a string."""
        resp = openai.responses.create(
            input=prompt,
            model=self._deployment,
            tools=tools or [],
            **kwargs,
        )
        # Extract the text content from the response
        if hasattr(resp, "output") and isinstance(resp.output, list):
            for message in resp.output:
                if hasattr(message, "content") and isinstance(message.content, list):
                    for content in message.content:
                        if hasattr(content, "text"):
                            return content.text
        return str(resp)
```

Key features of our implementation:
1. **Responses API**: uses Azure OpenAI's Responses API for structured interactions
2. **Tool support**: optional `tools` parameter for function calling
3. **Flexible response handling**: multiple fallback methods for response extraction
4. **Azure integration**: configured for Azure OpenAI endpoints

The choice of the Responses API provides:
- Structured output capabilities
- Built-in tool support
- A consistent response format
- Azure-specific optimizations

## Error Handling

The system implements basic error handling through try/except blocks and logging:

1. **Azure Document Intelligence errors**
   ```python
   try:
       # Document analysis
       result = self.client.begin_analyze_document(...)
   except HttpResponseError as e:
       self.logger.error(f"Azure Document Intelligence API error: {str(e)}")
       # Log detailed error information if available
       if hasattr(e, 'response') and hasattr(e.response, 'json'):
           try:
               error_details = e.response.json()
               self.logger.error(f"Error details: {error_details}")
           except Exception:
               pass
       raise
   except Exception as e:
       self.logger.error(f"Unexpected error during document analysis: {str(e)}")
       self.logger.exception("Full traceback:")
       raise
   ```

2. **Field mapping errors**
   ```python
   try:
       value = self.llm.responses(prompt, temperature=0.0)
       # Process and validate the value
   except Exception as e:
       self.logger.error(f"Error extracting field value: {str(e)}", exc_info=True)
       return None
   ```

3. **Execution errors**
   ```python
   try:
       for step in plan["steps"]:
           self._execute_step(step, ctx, depth=0)
   except Exception as e:
       self.logger.error(f"Error during execution: {str(e)}")
       self.logger.error(traceback.format_exc())
       # Don't re-raise; let the UI show the partial results
   ```

## Performance

Currently, the system processes documents without caching. Each request is processed independently, which ensures:
- Fresh results for each extraction
- No stale-data issues
- A simple, straightforward implementation
- Predictable resource usage
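If caching is added later (see Future Improvements below), one minimal approach is to key cached extraction results on a hash of the document bytes. This is an illustrative sketch under that assumption, not part of the current codebase:

```python
import hashlib
from typing import Any, Callable, Dict

# Hypothetical content-hash cache for document extraction results.
class ExtractionCache:
    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get_or_compute(self, pdf_bytes: bytes, compute: Callable[[bytes], Any]) -> Any:
        # Identical bytes hash to the same key, so re-uploads hit the cache.
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = compute(pdf_bytes)
        return self._store[key]

cache = ExtractionCache()
calls = []

def fake_extract(b: bytes):
    calls.append(b)
    return {"text": b.decode()}

first = cache.get_or_compute(b"doc", fake_extract)
second = cache.get_or_compute(b"doc", fake_extract)  # served from cache
```

The trade-off is exactly the one noted above: a cache risks stale data if the extraction logic or prompts change, so cache keys would also need to incorporate a version of the pipeline.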

## Future Improvements

1. **Advanced Field Mapping**
   - Validation rules
   - Multi-field extraction optimization
   - Cross-field validation rules
   - Context-aware mapping improvements
   - Better handling of ambiguous cases

2. **Performance Enhancements**
   - Implementation of a caching system for:
     - Document content
     - Field extraction results
     - Context data
   - Batch processing capabilities
   - Resource usage optimization

3. **Testing and Debugging Infrastructure**
   - Comprehensive test suite:
     - Unit tests for each agent and service
     - Integration tests for the complete pipeline
     - End-to-end tests with sample documents
     - Performance benchmarks
   - Debugging tools:
     - Real-time execution monitoring
     - Detailed logging and tracing
     - Breakpoint management
     - Error tracking and analysis

4. **Error Handling Improvements**
   - Custom error classes for different error types
   - More sophisticated recovery strategies
   - Retry mechanisms for transient failures
   - Better error reporting to users
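The retry mechanism mentioned in item 4 could start as a simple exponential-backoff wrapper; this helper is hypothetical and not in the codebase:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical retry helper for transient failures (e.g. throttled API calls).
def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.0) -> T:
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("unreachable")

calls = {"n": 0}

def flaky():
    # Simulates a call that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"
```

A production version would retry only on error types known to be transient (for example, HTTP 429/503 responses) rather than on every `Exception`.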
README.md CHANGED
@@ -17,3 +17,45 @@ Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :hear

If you have any questions, check out our [documentation](https://docs.streamlit.io) and [community forums](https://discuss.streamlit.io).

# Deep-Research PDF Field Extractor

A powerful tool for extracting structured data from PDF documents, designed to handle various document types and extract specific fields of interest.

## Overview
The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, and locations. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.

## How to Use

1. **Upload Your PDF**
   - Click the "Upload PDF" button
   - Select your PDF file from your computer

2. **Specify Fields to Extract**
   - Enter the fields you want to extract, separated by commas
   - Example: `Date, Name, Value, Location, Page, FileName`

3. **Optional: Add Field Descriptions**
   - You can provide additional context about the fields
   - This helps the system better understand what to look for

4. **Run Extraction**
   - Click the "Run extraction" button
   - Wait for the process to complete
   - View your results in a table format

5. **Download Results**
   - Download your extracted data as a CSV file
   - View execution traces and logs if needed

## Features
- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs

## Support
For technical documentation and architecture details, please refer to:
- [Architecture Overview](ARCHITECTURE.md)
- [Developer Documentation](DEVELOPER.md)
+ - [Developer Documentation](DEVELOPER.md)