# Jupyter Notebook Usage
This document shows how to call `process_document_with_redaction` from a Jupyter notebook so that it can be integrated into larger processing pipelines.
## Simple Usage
```python
from processing.document_processor import process_document_with_redaction

# Process a single document
result = process_document_with_redaction(
    file_path="path/to/your/document.pdf",
    endpoint="your-azure-openai-endpoint",
    api_key="your-azure-openai-key",
    api_version="2024-02-15-preview",
    deployment="o3-mini",  # or "o4-mini", "o3", "o4"
)

# Access the results
original_md = result.original_document_md
redacted_md = result.redacted_document_md
input_tokens = result.input_tokens
output_tokens = result.output_tokens
cost = result.cost

print("Processing complete!")
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,}")
print(f"Total cost: ${cost:.4f}")
```
## Batch Processing
```python
import os

from processing.document_processor import process_document_with_redaction

# Configuration
AZURE_OPENAI_ENDPOINT = "your-azure-openai-endpoint"
AZURE_OPENAI_KEY = "your-azure-openai-key"
AZURE_OPENAI_VERSION = "2024-02-15-preview"
AZURE_OPENAI_DEPLOYMENT = "o3-mini"

# Process multiple documents
pdf_directory = "path/to/pdf/files"
results = []

for filename in os.listdir(pdf_directory):
    if filename.endswith('.pdf'):
        file_path = os.path.join(pdf_directory, filename)
        print(f"Processing {filename}...")
        try:
            result = process_document_with_redaction(
                file_path=file_path,
                endpoint=AZURE_OPENAI_ENDPOINT,
                api_key=AZURE_OPENAI_KEY,
                api_version=AZURE_OPENAI_VERSION,
                deployment=AZURE_OPENAI_DEPLOYMENT
            )
            results.append({
                'filename': filename,
                'original_md': result.original_document_md,
                'redacted_md': result.redacted_document_md,
                'input_tokens': result.input_tokens,
                'output_tokens': result.output_tokens,
                'cost': result.cost
            })
            print(f"  ✓ Completed - Cost: ${result.cost:.4f}")
        except Exception as e:
            # A failed document is reported and skipped; the batch continues
            print(f"  ✗ Error processing {filename}: {e}")

# Summary (only successfully processed documents are counted)
total_cost = sum(r['cost'] for r in results)
total_input_tokens = sum(r['input_tokens'] for r in results)
total_output_tokens = sum(r['output_tokens'] for r in results)

print("\nBatch processing complete!")
print(f"Documents processed: {len(results)}")
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```
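
To keep a record of a batch run, the per-document dictionaries collected above can be written to a CSV file. The following is a minimal sketch using only the standard library; `write_batch_summary` and `SUMMARY_FIELDS` are hypothetical names that are not part of the `processing` package, and the Markdown bodies are deliberately left out of the file.

```python
import csv
from pathlib import Path

# Columns kept in the summary; the Markdown bodies are omitted on purpose.
SUMMARY_FIELDS = ["filename", "input_tokens", "output_tokens", "cost"]


def write_batch_summary(results, csv_path):
    """Write one CSV row per processed document and return the row count.

    `results` is a list of dicts shaped like the ones built in the batch loop.
    Hypothetical helper, not part of the processing package.
    """
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys such as 'original_md' and 'redacted_md'
        writer = csv.DictWriter(
            handle, fieldnames=SUMMARY_FIELDS, extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(results)
    return len(results)
```

In a notebook this would be called as `write_batch_summary(results, "batch_summary.csv")` once the loop has finished.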
## Environment Variables
You can also use environment variables for configuration:
```python
import os

from dotenv import load_dotenv
from processing.document_processor import process_document_with_redaction

# Load environment variables from a .env file, if one exists
load_dotenv()

# Get configuration from environment
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_KEY = os.getenv("AZURE_OPENAI_KEY")
AZURE_OPENAI_VERSION = os.getenv("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")

# Process document
result = process_document_with_redaction(
    file_path="document.pdf",
    endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_KEY,
    api_version=AZURE_OPENAI_VERSION,
    deployment=AZURE_OPENAI_DEPLOYMENT
)
```
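
Because `os.getenv` returns `None` for a variable that is not set, a missing value would otherwise only surface later, inside the API call. One possible way to fail early is sketched below; `load_azure_config` and `REQUIRED_VARS` are hypothetical names, and the returned keys are chosen to match the keyword arguments used in the examples in this document.

```python
import os

# The four variables read in the example above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
)


def load_azure_config(environ=None):
    """Return the settings as keyword arguments, or raise if any is missing.

    Hypothetical helper; `environ` defaults to os.environ and can be replaced
    by a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    # Treat unset and empty values the same way
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "endpoint": env["AZURE_OPENAI_ENDPOINT"],
        "api_key": env["AZURE_OPENAI_KEY"],
        "api_version": env["AZURE_OPENAI_VERSION"],
        "deployment": env["AZURE_OPENAI_DEPLOYMENT"],
    }
```

After `load_dotenv()`, the call could then be written as `process_document_with_redaction(file_path="document.pdf", **load_azure_config())`.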
## Return Value
The function returns a `ProcessingResult` object with the following attributes:
- `original_document_md`: Markdown version of the original document
- `redacted_document_md`: Markdown version with medication sections removed
- `input_tokens`: Number of input tokens used
- `output_tokens`: Number of output tokens generated
- `cost`: Total cost in USD
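
As an illustration of working with these attributes, the sketch below writes both Markdown versions to disk. `save_markdown_outputs` is a hypothetical helper, and the `.original.md` / `.redacted.md` naming scheme is an assumption made for the example; only the `original_document_md` and `redacted_document_md` attributes come from the list above.

```python
from pathlib import Path


def save_markdown_outputs(result, source_path, output_dir):
    """Write both Markdown versions to output_dir and return their paths.

    `result` is any object exposing `original_document_md` and
    `redacted_document_md`. Hypothetical helper, not part of the package.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Reuse the source file name without its extension, e.g. report.pdf -> report
    stem = Path(source_path).stem
    original_path = out / f"{stem}.original.md"
    redacted_path = out / f"{stem}.redacted.md"

    original_path.write_text(result.original_document_md, encoding="utf-8")
    redacted_path.write_text(result.redacted_document_md, encoding="utf-8")
    return original_path, redacted_path
```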
## Supported Models
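The function accepts the following values for `deployment`. The model labels in parentheses are the ones this project assigns; Azure OpenAI deployment names are chosen when a deployment is created, so confirm which model each name serves in your resource: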
- `o3-mini` (GPT-4o Mini) - Cheapest option
- `o4-mini` (GPT-4o Mini) - Same as o3-mini
- `o3` (GPT-3.5 Turbo) - Medium cost
- `o4` (GPT-4o) - Most expensive but most capable
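
A pipeline can reject an unexpected deployment name before any request is sent. The check below is a small sketch built only from the four names listed above; `SUPPORTED_DEPLOYMENTS` and `check_deployment` are hypothetical names, and whether the function itself performs a similar check is not stated here.

```python
# Deployment names taken from the list above.
SUPPORTED_DEPLOYMENTS = ("o3-mini", "o4-mini", "o3", "o4")


def check_deployment(name):
    """Return `name` unchanged if it is listed, otherwise raise ValueError."""
    if name not in SUPPORTED_DEPLOYMENTS:
        allowed = ", ".join(SUPPORTED_DEPLOYMENTS)
        raise ValueError(f"Unsupported deployment {name!r}; expected one of: {allowed}")
    return name
```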
## Error Handling
The function will raise exceptions for:
- File not found
- Invalid Azure OpenAI credentials
- API rate limits
- Network errors
Handle these in your pipeline, for example by wrapping each call in `try`/`except` as in the batch processing example above.
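
Rate limits and network errors are often transient, so a pipeline may want to retry a failed call after a pause. The sketch below shows one way to do this with exponential backoff. `call_with_retries` is a hypothetical helper. It assumes a missing file surfaces as `FileNotFoundError`, which is not confirmed here, and because the concrete exception types for the other cases are not documented, it retries every other exception, including credential errors that a retry will not fix.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying failures with exponential backoff.

    Hypothetical helper. Waits base_delay, then 2 * base_delay, and so on
    between attempts; the last failure is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except FileNotFoundError:
            # Assumption: a missing file raises this type; retrying cannot help
            raise
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A call would then look like `call_with_retries(process_document_with_redaction, file_path="document.pdf", endpoint=..., api_key=..., api_version=..., deployment=...)`, with the placeholders filled in as in the earlier examples.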