Architecture Documentation
System Overview
The Deep-Research PDF Field Extractor is a multi-agent system designed to extract structured data from biotech-related PDFs. The system uses Azure Document Intelligence for document processing and Azure OpenAI for intelligent field extraction.
Core Architecture
Multi-Agent Design
The system follows a multi-agent architecture where each agent has a specific responsibility:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    PDFAgent     │     │   TableAgent    │     │   IndexAgent    │
│                 │     │                 │     │                 │
│ • PDF Text      │────▶│ • Table         │────▶│ • Semantic      │
│   Extraction    │     │   Processing    │     │   Indexing      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ UniqueIndices   │     │ UniqueIndices   │     │ FieldMapper     │
│ Combinator      │     │ LoopAgent       │     │ Agent           │
│                 │     │                 │     │                 │
│ • Extract       │────▶│ • Loop through  │     │ • Extract       │
│   combinations  │     │   combinations  │     │   individual    │
│                 │     │ • Add fields    │     │   fields        │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
Execution Flow
Original Strategy Flow
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. IndexAgent → Create semantic search index
4. ForEachField → Iterate through fields
5. FieldMapperAgent → Extract each field value
Unique Indices Strategy Flow
1. PDFAgent → Extract text from PDF
2. TableAgent → Process tables with Azure DI
3. UniqueIndicesCombinator → Extract unique combinations
4. UniqueIndicesLoopAgent → Extract additional fields for each combination
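Both flows share the same orchestration pattern: each agent reads from and writes to a shared context dictionary (see Data Flow below). A minimal sketch of that pattern, using toy callable agents rather than the repository's actual classes:

```python
from typing import Callable, Dict, List

Agent = Callable[[Dict], Dict]

def run_pipeline(ctx: Dict, agents: List[Agent]) -> Dict:
    """Run each agent in order; every agent enriches the shared context."""
    for agent in agents:
        ctx = agent(ctx)
    return ctx

# Toy stand-ins for PDFAgent and TableAgent:
def pdf_agent(ctx: Dict) -> Dict:
    ctx["text"] = "raw text"  # placeholder for PyMuPDF output
    return ctx

def table_agent(ctx: Dict) -> Dict:
    ctx["tables"] = []  # placeholder for Azure DI output
    return ctx

result = run_pipeline({"pdf_file": "doc.pdf"}, [pdf_agent, table_agent])
```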
Agent Details
PDFAgent
- Purpose: Extract text content from PDF files
- Technology: PyMuPDF (fitz)
- Output: Raw text content
- Error Handling: Graceful handling of corrupted PDFs
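A minimal sketch of this step with PyMuPDF, assuming the agent receives raw PDF bytes; `extract_text` is an illustrative helper, not necessarily the agent's real method name:

```python
import fitz  # PyMuPDF

def extract_text(pdf_bytes: bytes) -> str:
    """Concatenate plain text from every page, tolerating corrupt files."""
    try:
        with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
            return "\n".join(page.get_text() for page in doc)
    except Exception:
        # Graceful handling: a corrupted PDF yields empty text, not a crash.
        # (A real agent would log the error and record it in the context.)
        return ""
```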
TableAgent
- Purpose: Process tables using Azure Document Intelligence
- Technology: Azure DI Layout Analysis
- Features:
- Table structure preservation
- Rowspan/colspan handling
- HTML table generation for debugging
- Output: Processed table data
UniqueIndicesCombinator
- Purpose: Extract unique combinations of specified indices
- Input: Document text, unique indices descriptions
- LLM Prompt: Structured prompt for combination extraction
- Output: JSON array of unique combinations
- Cost Tracking: Tracks input/output tokens
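A sketch of how the combinator's prompt assembly and response handling might look; the prompt wording here is illustrative, not the prompt the system actually sends:

```python
import json

def build_combination_prompt(text: str, index_descriptions: dict) -> str:
    """Ask the LLM for every unique combination of the given indices."""
    described = "\n".join(f"- {k}: {v}" for k, v in index_descriptions.items())
    return (
        "Extract every unique combination of these indices from the document.\n"
        "Respond with a JSON array of objects, one object per combination.\n"
        f"Indices:\n{described}\n\nDocument:\n{text}"
    )

def parse_combinations(raw_response: str) -> list:
    """Parse the LLM response, failing soft on malformed JSON."""
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError:
        return []
```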
UniqueIndicesLoopAgent
- Purpose: Extract additional fields for each unique combination
- Input: Unique combinations, field descriptions
- Process: Loops through each combination
- LLM Calls: One call per combination
- Error Handling: Continues with partial failures
- Output: Complete data with all fields
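The loop's error-handling contract (one call per combination, nulls on failure) can be sketched as follows; `call_llm` stands in for the real LLM invocation:

```python
def extract_for_combinations(combinations, field_names, call_llm):
    """One LLM call per combination; a failure yields nulls, not an abort."""
    results = []
    for combo in combinations:
        row = dict(combo)
        try:
            row.update(call_llm(combo, field_names))
        except Exception as exc:
            # Continue with partial results rather than failing the whole run.
            row.update({name: None for name in field_names})
            row["error"] = str(exc)
        results.append(row)
    return results
```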
FieldMapperAgent
- Purpose: Extract individual field values
- Strategies:
- Page-by-page analysis
- Semantic search fallback
- Unique indices strategy
- Features: Context-aware extraction
- Output: Field values with confidence scores
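The ordering of the first two strategies amounts to a fallback chain; in this sketch, `try_extract` and `semantic_search` are hypothetical stand-ins for the LLM call and the IndexAgent's index:

```python
def try_extract(field: str, text: str):
    """Toy stand-in for an LLM field-extraction call."""
    return field if field.lower() in text.lower() else None

def extract_field(field: str, pages: list, semantic_search):
    # Strategy 1: page-by-page analysis.
    for page_text in pages:
        value = try_extract(field, page_text)
        if value is not None:
            return value
    # Strategy 2: semantic search fallback over the indexed chunks.
    chunks = semantic_search(field, top_k=3)
    return try_extract(field, "\n".join(chunks))
```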
IndexAgent
- Purpose: Create semantic search indices
- Technology: Azure OpenAI Embeddings
- Features: Chunk-based indexing
- Output: Searchable document index
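A minimal sketch of chunk-based indexing with Azure OpenAI embeddings, assuming an `AzureOpenAI` client and an embeddings deployment are already configured:

```python
from openai import AzureOpenAI  # client construction omitted for brevity

def build_index(chunks: list, client: AzureOpenAI, deployment: str):
    """Embed each chunk once; queries are later matched by cosine similarity."""
    response = client.embeddings.create(model=deployment, input=chunks)
    return [(chunk, item.embedding) for chunk, item in zip(chunks, response.data)]

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm
```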
Services
LLMClient
```python
class LLMClient:
    def __init__(self, settings):
        # Azure OpenAI configuration
        self._deployment = settings.AZURE_OPENAI_DEPLOYMENT
        self._max_retries = settings.LLM_MAX_RETRIES
        self._base_delay = settings.LLM_BASE_DELAY

    def responses(self, prompt, **kwargs):
        # Retry logic with exponential backoff
        # Cost tracking integration
        # Error handling
        ...
```
Key Features:
- Retry logic with exponential backoff
- Cost tracking integration
- Error classification (retryable vs non-retryable)
- Jitter to prevent thundering herd
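A sketch of the retry loop these features imply, with defaults mirroring the retry settings shown under Configuration Management; the real client wraps Azure OpenAI calls rather than an arbitrary callable, and would first consult an error classifier like the `_should_retry` helper shown under Error Handling Strategy:

```python
import random
import time

def with_retries(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a callable with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter spreads retries out and avoids a thundering herd.
            time.sleep(random.uniform(0, delay))
```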
CostTracker
```python
from typing import List

class CostTracker:
    def __init__(self):
        self.llm_calls: List[LLMCall] = []  # LLMCall: per-call record defined elsewhere
        self.current_file_costs = {}
        self.total_costs = {}

    def add_llm_tokens(self, input_tokens, output_tokens, description):
        # Track individual LLM calls
        # Calculate costs
        # Store detailed information
        ...
```
Key Features:
- Individual call tracking
- Cost calculation based on Azure pricing
- Detailed breakdown by operation
- Session and total cost tracking
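Cost calculation itself reduces to tokens times per-token rates. The rates below are placeholders for illustration only, not actual Azure pricing; use the current rates for your model deployment:

```python
PRICE_PER_1K_TOKENS = {"input": 0.005, "output": 0.015}  # placeholder rates

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call, given per-1K-token rates."""
    return ((input_tokens / 1000) * PRICE_PER_1K_TOKENS["input"]
            + (output_tokens / 1000) * PRICE_PER_1K_TOKENS["output"])
```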
AzureDIService
```python
class AzureDIService:
    def extract_tables(self, pdf_bytes):
        # Azure DI Layout Analysis
        # Table structure preservation
        # HTML debugging output
        ...
```
Key Features:
- Layout analysis for complex documents
- Table structure preservation
- Debug output generation
- Error handling for DI operations
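A sketch of the layout-analysis call with the `azure-ai-documentintelligence` SDK; note that the exact `begin_analyze_document` signature varies between SDK versions, so treat this as an approximation rather than the service's actual code:

```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

def analyze_layout(endpoint: str, key: str, pdf_bytes: bytes):
    client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
    poller = client.begin_analyze_document(
        "prebuilt-layout", pdf_bytes, content_type="application/octet-stream"
    )
    result = poller.result()
    # Each table cell carries row_index/column_index plus row_span/column_span,
    # which is what makes rowspan/colspan handling possible downstream.
    return result.tables
```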
Data Flow
Context Management
The system uses a context dictionary to pass data between agents:
```python
ctx = {
    "pdf_file": pdf_file,
    "text": extracted_text,
    "fields": field_list,
    "unique_indices": unique_indices,
    "field_descriptions": field_descriptions,
    "cost_tracker": cost_tracker,
    "results": [],
    "strategy": strategy,
}
```
Result Processing
Results are processed through multiple stages:
- Raw Extraction: LLM responses in JSON format
- Validation: JSON parsing and structure validation
- Flattening: Convert to tabular format
- DataFrame: Final structured output
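As a concrete illustration of the last two stages, validated rows convert directly into a DataFrame; the field names and values below are invented examples, not real output:

```python
import pandas as pd

# Toy rows standing in for validated LLM output, one dict per combination.
rows = [
    {"analyte": "IL-6", "timepoint": "Day 1", "value": 12.4},
    {"analyte": "IL-6", "timepoint": "Day 7", "value": None},  # failed extraction
]
df = pd.DataFrame(rows)  # final structured output
```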
Error Handling Strategy
Retry Logic
```python
def _should_retry(self, exception) -> bool:
    # Retry on 5xx errors
    if hasattr(exception, 'status_code'):
        return exception.status_code >= 500
    # Retry on connection errors
    return any(error in str(exception) for error in ['Timeout', 'Connection'])
```
Graceful Degradation
- Continue processing with partial failures
- Return null values for failed extractions
- Log detailed error information
- Maintain cost tracking during failures
Error Classification
- Retryable: 503, 500, connection timeouts
- Non-retryable: 400, 401, validation errors
- Fatal: Configuration errors, missing dependencies
Performance Considerations
Optimization Strategies
- Parallel Processing: Independent field extraction
- Caching: Session state for field descriptions
- Batching: Group similar operations
- Early Termination: Stop on critical failures
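As one example, the Parallel Processing strategy above maps naturally onto a thread pool, since field extractions are independent and I/O-bound; this is a sketch of the pattern, not necessarily how the repository implements it:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_all(fields: list, extract_one, max_workers: int = 4) -> list:
    """Extract independent fields concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, fields))
```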
Resource Management
- Memory: Efficient text processing
- API Limits: Respect Azure rate limits
- Cost Control: Detailed tracking and alerts
- Timeout Handling: Configurable timeouts
Security
Data Protection
- No persistent storage of sensitive data
- Secure API key management
- Session-based data handling
- Log sanitization
Access Control
- Environment variable configuration
- API key validation
- Error message sanitization
Monitoring and Observability
Logging Strategy
```python
# Structured logging with levels
logger.info(f"Processing {len(combinations)} combinations")
logger.debug(f"LLM response: {response[:200]}...")
logger.error(f"Failed to extract field: {field}")
```
Metrics Collection
- LLM call counts and durations
- Token usage and costs
- Success/failure rates
- Processing times
Debug Information
- Detailed execution traces
- Cost breakdown tables
- Error context and stack traces
- Performance metrics
Configuration Management
Settings Structure
```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Azure OpenAI
    AZURE_OPENAI_ENDPOINT: str
    AZURE_OPENAI_API_KEY: str
    AZURE_OPENAI_DEPLOYMENT: str

    # Azure Document Intelligence
    AZURE_DI_ENDPOINT: str
    AZURE_DI_KEY: str

    # Retry Configuration
    LLM_MAX_RETRIES: int = 5
    LLM_BASE_DELAY: float = 1.0
    LLM_MAX_DELAY: float = 60.0
```
Environment Variables
- `.env` file support
- Environment variable override
- Validation and defaults
- Secure key management
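All four behaviors come from `pydantic-settings` itself. A minimal sketch using the v2-style API; the `model_config` line is an assumption about how the project wires up `.env` support:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class ExampleSettings(BaseSettings):
    # Real environment variables override values read from .env.
    model_config = SettingsConfigDict(env_file=".env")

    AZURE_OPENAI_ENDPOINT: str   # required: validation fails if missing
    LLM_MAX_RETRIES: int = 5     # default applied when unset

settings = ExampleSettings()  # raises a validation error on bad/missing config
```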
Testing Strategy
Unit Tests
- Individual agent testing
- Service layer testing
- Mock external dependencies
- Cost tracking validation
Integration Tests
- End-to-end workflows
- Error scenario testing
- Performance benchmarking
- Cost accuracy validation
Test Coverage
- Core functionality: 90%+
- Error handling: 100%
- Cost tracking: 100%
- Retry logic: 100%
Deployment
Requirements
- Python 3.9+
- Azure OpenAI access
- Azure Document Intelligence access
- Streamlit for UI
Dependencies
```
azure-ai-documentintelligence
openai
streamlit
pandas
pymupdf
pydantic-settings
```
Environment Setup
- Install dependencies
- Configure environment variables
- Set up Azure resources
- Test connectivity
- Deploy application
Future Enhancements
Planned Features
- Batch Processing: Multiple document processing
- Custom Models: Domain-specific extraction
- Advanced Caching: Redis-based caching
- API Endpoints: REST API for integration
- Real-time Processing: Streaming document processing
Scalability Improvements
- Microservices: Agent separation
- Queue System: Asynchronous processing
- Load Balancing: Multiple instances
- Database Integration: Persistent storage
Performance Optimizations
- Vector Search: Enhanced semantic search
- Model Optimization: Smaller, faster models
- Parallel Processing: Multi-threaded extraction
- Memory Optimization: Efficient data structures