---
title: Doctorecord
emoji: π
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Multi-agent PDF field extraction with Azure DI and OpenAI
---
# Deep-Research PDF Field Extractor
A multi-agent system for extracting structured data from biotech-related PDFs using Azure Document Intelligence and Azure OpenAI.
## Features
- **Multi-Agent Architecture**: Uses specialized agents for different extraction tasks
- **Azure Integration**: Leverages Azure Document Intelligence and Azure OpenAI
- **Flexible Extraction Strategies**: Supports both original and unique indices strategies
- **Robust Error Handling**: Implements retry logic with exponential backoff
- **Comprehensive Cost Tracking**: Monitors API usage and costs for all LLM calls
- **Streamlit UI**: User-friendly interface for document processing
- **Graceful Degradation**: Continues processing even with partial failures
## Installation
1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables (see Configuration section)
4. Run the application: `streamlit run src/app.py`
## Configuration
### Environment Variables
Create a `.env` file with the following variables:
```env
# Azure OpenAI
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
AZURE_OPENAI_API_VERSION=2025-03-01-preview
# Azure Document Intelligence
AZURE_DI_ENDPOINT=your_di_endpoint
AZURE_DI_KEY=your_di_key
# Retry Configuration (Optional)
LLM_MAX_RETRIES=5
LLM_BASE_DELAY=1.0
LLM_MAX_DELAY=60.0
```
### Retry Configuration
The system implements robust retry logic to handle transient service errors:
- **LLM_MAX_RETRIES**: Maximum number of retry attempts (default: 5)
- **LLM_BASE_DELAY**: Base delay in seconds for exponential backoff (default: 1.0)
- **LLM_MAX_DELAY**: Maximum delay in seconds (default: 60.0)
The retry logic automatically handles:
- 503 Service Unavailable errors
- 500 Internal Server Error
- Connection timeouts
- Network errors
Retries use exponential backoff with jitter to prevent thundering herd problems.
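A minimal sketch of that backoff pattern, reading the variables above (the function and error names are illustrative, not the project's actual API):
```python
import os
import random
import time

MAX_RETRIES = int(os.getenv("LLM_MAX_RETRIES", "5"))
BASE_DELAY = float(os.getenv("LLM_BASE_DELAY", "1.0"))
MAX_DELAY = float(os.getenv("LLM_MAX_DELAY", "60.0"))

class TransientServiceError(Exception):
    """Placeholder for retryable failures (503, 500, timeouts, network errors)."""

def call_with_retries(call):
    """Retry `call` on transient errors with exponential backoff and jitter."""
    for attempt in range(MAX_RETRIES):
        try:
            return call()
        except TransientServiceError:
            if attempt == MAX_RETRIES - 1:
                raise
            # Exponential backoff capped at MAX_DELAY, plus jitter so many
            # workers don't retry in lockstep (thundering herd).
            delay = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```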
## Usage
### Original Strategy
Processes documents page by page, extracting fields individually using semantic search and LLM-based extraction.
**Workflow:**
```
PDFAgent → TableAgent → ForEachField → FieldMapperAgent
```
### Unique Indices Strategy
Extracts data based on unique combinations of specified indices, then loops through each combination to extract additional fields.
**Workflow:**
```
PDFAgent → TableAgent → UniqueIndicesCombinator → UniqueIndicesLoopAgent
```
**Step-by-step process:**
1. **PDFAgent**: Extracts text from PDF files
2. **TableAgent**: Processes tables using Azure Document Intelligence
3. **UniqueIndicesCombinator**: Extracts unique combinations of specified indices (e.g., Protein Lot, Peptide, Timepoint, Modification)
4. **UniqueIndicesLoopAgent**: Loops through each combination to extract additional fields (e.g., Chain, Percentage, Seq Loc)
**Example Output:**
```json
[
{
"Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
"Peptide": "PLTFGAGTK",
"Timepoint": "0w",
"Modification": "Clipping",
"Chain": "Heavy",
"Percentage": "90.0",
"Seq Loc": "HC(1-31)"
},
{
"Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
"Peptide": "PLTFGAGTK",
"Timepoint": "4w",
"Modification": "Clipping",
"Chain": "Heavy",
"Percentage": "85.0",
"Seq Loc": "HC(1-31)"
}
]
```
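A simplified sketch of the two-stage flow that produces output like the above. The agent interfaces are assumptions for illustration; the real agents receive richer context:
```python
def run_unique_indices(document_text, index_fields, extra_fields,
                       combinator, loop_agent):
    """combinator: returns a list of dicts, one per unique index combination.
    loop_agent: fills the remaining fields for one combination, or returns
    None if extraction fails for that combination."""
    results = []
    for combo in combinator(document_text, index_fields):
        extra = loop_agent(document_text, combo, extra_fields)
        # Graceful degradation: keep the row, with nulls for failed fields.
        results.append({**combo, **(extra or {f: None for f in extra_fields})})
    return results
```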
## Architecture
### Agents
- **PDFAgent**: Extracts text from PDF files using PyMuPDF
- **TableAgent**: Processes tables using Azure Document Intelligence with layout analysis
- **UniqueIndicesCombinator**: Extracts unique combinations of specified indices from documents
- **UniqueIndicesLoopAgent**: Loops through combinations to extract additional field values
- **FieldMapperAgent**: Maps individual fields to values using LLM-based extraction
- **IndexAgent**: Creates semantic search indices for improved field extraction
### Services
- **LLMClient**: Azure OpenAI wrapper with retry logic and cost tracking
- **AzureDIService**: Azure Document Intelligence integration with table processing
- **CostTracker**: Comprehensive API usage and cost monitoring
- **EmbeddingClient**: Semantic search capabilities
### Data Flow
1. **Document Processing**: PDF text and table extraction
2. **Strategy Selection**: Choose between original or unique indices approach
3. **Field Extraction**: LLM-based extraction with detailed field descriptions
4. **Cost Tracking**: Monitor all API usage and calculate costs
5. **Result Processing**: Convert to structured format (DataFrame/CSV)
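For the last step, result processing amounts to something like this (assuming extraction returns a list of dicts, as in the example output above):
```python
import pandas as pd

rows = [
    {"Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB", "Peptide": "PLTFGAGTK",
     "Timepoint": "0w", "Modification": "Clipping", "Chain": "Heavy",
     "Percentage": "90.0", "Seq Loc": "HC(1-31)"},
]
df = pd.DataFrame(rows)                              # structured, analyzable form
csv_bytes = df.to_csv(index=False).encode("utf-8")   # ready for download
```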
## Cost Tracking
The system provides comprehensive cost tracking for all operations:
### LLM Costs
- **Input Tokens**: Tracked for each LLM call with descriptions
- **Output Tokens**: Tracked for each LLM call with descriptions
- **Cost Calculation**: Based on Azure OpenAI pricing
- **Detailed Breakdown**: Individual call costs in the UI
### Document Intelligence Costs
- **Pages Processed**: Tracked per operation
- **Operation Types**: Layout analysis, custom models, etc.
- **Cost Calculation**: Based on Azure DI pricing
### Cost Display
- **Real-time Updates**: Costs shown during execution
- **Detailed Table**: Breakdown of all LLM calls
- **Total Summary**: Combined costs for the entire operation
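A minimal sketch of the token accounting involved (the per-token rates below are placeholders, not Azure's actual pricing):
```python
from dataclasses import dataclass, field

# Placeholder rates in USD per 1K tokens; substitute your deployment's pricing.
INPUT_RATE = 0.005
OUTPUT_RATE = 0.015

@dataclass
class CostTracker:
    calls: list = field(default_factory=list)

    def record_llm_call(self, description, input_tokens, output_tokens):
        cost = (input_tokens / 1000) * INPUT_RATE \
             + (output_tokens / 1000) * OUTPUT_RATE
        self.calls.append({"description": description,
                           "input_tokens": input_tokens,
                           "output_tokens": output_tokens,
                           "cost": cost})

    @property
    def total_cost(self):
        return sum(c["cost"] for c in self.calls)
```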
## Error Handling
The system implements comprehensive error handling:
1. **Retry Logic**: Automatic retries for transient errors with exponential backoff
2. **Graceful Degradation**: Continues processing even if some combinations fail
3. **Partial Results**: Returns data for successful extractions with null values for failures
4. **Detailed Logging**: Comprehensive logging for debugging and monitoring
5. **Cost Tracking**: Monitors API usage even during failures
### Error Types Handled
- ✅ **503 Service Unavailable** (Azure service overload)
- ✅ **500 Internal Server Error** (server-side issues)
- ✅ **Connection timeouts** (network issues)
- ✅ **Network errors** (infrastructure problems)
- ❌ **400 Bad Request** (client errors, not retried)
- ❌ **401 Unauthorized** (authentication errors, not retried)
## Field Descriptions
The system supports detailed field descriptions to improve extraction accuracy:
### Field Description Format
```json
{
"field_name": {
"description": "Detailed description of the field",
"format": "Expected format (String, Float, etc.)",
"examples": "Example values",
"possible_values": "Comma-separated list of possible values"
}
}
```
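One way such an entry might be folded into an extraction prompt (a hypothetical sketch; the project's actual prompt templates may differ):
```python
def describe_field(name, spec):
    """Render one field-description entry into prompt text for the LLM."""
    lines = [f"Field: {name}",
             f"Description: {spec['description']}",
             f"Format: {spec['format']}",
             f"Examples: {spec['examples']}"]
    if spec.get("possible_values"):
        lines.append(f"Possible values: {spec['possible_values']}")
    return "\n".join(lines)
```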
### UI Support
- **Editable Tables**: Add, edit, and remove field descriptions
- **Session State**: Persists descriptions during the session
- **Validation**: Ensures proper format and structure
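In Streamlit terms, session persistence amounts to something like this (a sketch; the widget layout and keys are illustrative):
```python
import streamlit as st

# Keep field descriptions alive across reruns within one session.
if "field_descriptions" not in st.session_state:
    st.session_state["field_descriptions"] = {}

name = st.text_input("Field name")
desc = st.text_input("Description")
if st.button("Add field") and name:
    st.session_state["field_descriptions"][name] = {"description": desc}
```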
## Testing
The system includes comprehensive test suites:
### Test Scripts
- **test_retry.py**: Verifies retry logic with simulated failures
- **test_cost_tracking.py**: Validates cost tracking functionality
### Running Tests
```bash
python test_retry.py
python test_cost_tracking.py
```
## Performance
### Optimization Features
- **Retry Logic**: Handles transient failures automatically
- **Cost Optimization**: Detailed tracking to monitor usage
- **Graceful Degradation**: Continues with partial results
- **Caching**: Session state for field descriptions
### Expected Performance
- **Small Documents**: 30-60 seconds
- **Large Documents**: 2-5 minutes
- **Cost Efficiency**: ~$0.01-0.10 per document (depending on size)
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## Troubleshooting
### Common Issues
**503 Service Unavailable Errors**
- The system automatically retries with exponential backoff
- Check Azure service status if persistent
- Adjust retry configuration if needed
**Cost Tracking Shows Zero**
- Ensure cost tracker is properly initialized
- Check that agents are passing context correctly
- Verify LLM calls are being made
**Partial Results**
- Some combinations may fail due to document structure
- Check execution logs for specific failures
- Results include null values for failed extractions
### Debug Mode
Enable detailed logging by setting log level to DEBUG in the application.
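For example, near the top of `src/app.py` (assuming the app uses the standard `logging` module):
```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(name)s %(levelname)s %(message)s")
```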
## License
[Add your license information here]
## Overview
The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.
## How to Use
1. **Upload Your PDF**
- Click the "Upload PDF" button
- Select your PDF file from your computer
2. **Specify Fields to Extract**
- Enter the fields you want to extract, separated by commas
- Example: `Date, Name, Value, Location, Page, FileName`
3. **Optional: Add Field Descriptions**
- You can provide additional context about the fields
- This helps the system better understand what to look for
4. **Run Extraction**
- Click the "Run extraction" button
- Wait for the process to complete
- View your results in a table format
5. **Download Results**
- Download your extracted data as a CSV file
- View execution traces and logs if needed
## Features
- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs
## Support
For technical documentation and architecture details, please refer to:
- [Architecture Overview](ARCHITECTURE.md)
- [Developer Documentation](DEVELOPER.md)