# Jupyter Notebook Usage

This document shows how to call `process_document_with_redaction` from a Jupyter notebook so that it can be integrated into larger processing pipelines.

## Simple Usage

```python
from processing.document_processor import process_document_with_redaction

# Process a single document
result = process_document_with_redaction(
    file_path="path/to/your/document.pdf",
    endpoint="your-azure-openai-endpoint",
    api_key="your-azure-openai-key",
    api_version="2024-02-15-preview",
    deployment="o3-mini"  # or "o4-mini", "o3", "o4"
)

# Access the results
original_md = result.original_document_md
redacted_md = result.redacted_document_md
input_tokens = result.input_tokens
output_tokens = result.output_tokens
cost = result.cost

print("Processing complete!")
print(f"Input tokens: {input_tokens:,}")
print(f"Output tokens: {output_tokens:,}")
print(f"Total cost: ${cost:.4f}")
```

## Batch Processing

```python
import os
from processing.document_processor import process_document_with_redaction

# Configuration
AZURE_OPENAI_ENDPOINT = "your-azure-openai-endpoint"
AZURE_OPENAI_KEY = "your-azure-openai-key"
AZURE_OPENAI_VERSION = "2024-02-15-preview"
AZURE_OPENAI_DEPLOYMENT = "o3-mini"

# Process multiple documents
pdf_directory = "path/to/pdf/files"
results = []

# Sort for a stable processing order; match the extension case-insensitively
for filename in sorted(os.listdir(pdf_directory)):
    if filename.lower().endswith('.pdf'):
        file_path = os.path.join(pdf_directory, filename)
        
        print(f"Processing {filename}...")
        
        try:
            result = process_document_with_redaction(
                file_path=file_path,
                endpoint=AZURE_OPENAI_ENDPOINT,
                api_key=AZURE_OPENAI_KEY,
                api_version=AZURE_OPENAI_VERSION,
                deployment=AZURE_OPENAI_DEPLOYMENT
            )
            
            results.append({
                'filename': filename,
                'original_md': result.original_document_md,
                'redacted_md': result.redacted_document_md,
                'input_tokens': result.input_tokens,
                'output_tokens': result.output_tokens,
                'cost': result.cost
            })
            
            print(f"  ✓ Completed - Cost: ${result.cost:.4f}")

        except Exception as e:
            print(f"  ✗ Error processing {filename}: {e}")

# Summary
total_cost = sum(r['cost'] for r in results)
total_input_tokens = sum(r['input_tokens'] for r in results)
total_output_tokens = sum(r['output_tokens'] for r in results)

print("\nBatch processing complete!")
print(f"Documents processed: {len(results)}")
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```
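
When the batch feeds a later pipeline step, it can help to persist the per-document metrics. The following is a minimal sketch using only the standard library; the helper name `write_batch_summary` and the column selection are assumptions for illustration, not part of the `processing` package. It expects the list of dicts built in the loop above.

```python
import csv

# Columns to keep; the Markdown fields in each dict are deliberately left out.
SUMMARY_FIELDS = ["filename", "input_tokens", "output_tokens", "cost"]


def write_batch_summary(results, csv_path):
    """Write one row of usage metrics per processed document; return the row count."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" skips keys that are not listed in SUMMARY_FIELDS.
        writer = csv.DictWriter(handle, fieldnames=SUMMARY_FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(results)
    return len(results)
```

Calling `write_batch_summary(results, "batch_summary.csv")` after the loop would produce a file that can be loaded again in a later notebook cell.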

## Environment Variables

You can also use environment variables for configuration:

```python
import os
from dotenv import load_dotenv
from processing.document_processor import process_document_with_redaction

# Load environment variables
load_dotenv()

# Get configuration from environment
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_KEY = os.getenv("AZURE_OPENAI_KEY")
AZURE_OPENAI_VERSION = os.getenv("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")

# Process document
result = process_document_with_redaction(
    file_path="document.pdf",
    endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_KEY,
    api_version=AZURE_OPENAI_VERSION,
    deployment=AZURE_OPENAI_DEPLOYMENT
)
```
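
Note that `os.getenv` returns `None` for an unset variable, so a missing setting would otherwise only surface once the API call is made. A minimal sketch of failing early is shown here; the helper `load_azure_config` is hypothetical and not part of the `processing` package, and the variable names are the ones used in the example above.

```python
import os

# Variable names match the ones read in the example above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
)


def load_azure_config(environ=None):
    """Return the required settings as a dict, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Call it after `load_dotenv()` and pass the returned values to `process_document_with_redaction` in place of the individual `os.getenv` lookups.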

## Return Value

The function returns a `ProcessingResult` object with the following attributes:

- `original_document_md`: Markdown version of the original document
- `redacted_document_md`: Markdown version with medication sections removed
- `input_tokens`: Number of input tokens used
- `output_tokens`: Number of output tokens generated
- `cost`: Total cost in USD
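
Because both document versions are plain Markdown strings, they can be written to disk for later pipeline stages. The sketch below assumes only the two attribute names listed above; the helper `save_result_markdown` and the output file naming are hypothetical choices, not something the package provides.

```python
from pathlib import Path


def save_result_markdown(result, output_dir, stem):
    """Write both Markdown versions of a result to output_dir and return their paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    original_path = out / f"{stem}.original.md"
    redacted_path = out / f"{stem}.redacted.md"
    # Explicit UTF-8 keeps non-ASCII characters intact regardless of platform defaults.
    original_path.write_text(result.original_document_md, encoding="utf-8")
    redacted_path.write_text(result.redacted_document_md, encoding="utf-8")
    return original_path, redacted_path
```

For example, `save_result_markdown(result, "output", "document")` would create `output/document.original.md` and `output/document.redacted.md`.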

## Supported Models

The function supports the following Azure OpenAI deployment names:
- `o3-mini` (GPT-4o Mini) - Cheapest option
- `o4-mini` (GPT-4o Mini) - Same as o3-mini
- `o3` (GPT-3.5 Turbo) - Medium cost
- `o4` (GPT-4o) - Most expensive but most capable

## Error Handling

The function will raise exceptions for:
- File not found
- Invalid Azure OpenAI credentials
- API rate limits
- Network errors

Make sure to handle these appropriately in your pipeline.
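
One way to handle the transient cases (rate limits, network errors) is to retry with exponential backoff while letting permanent failures propagate. This is a minimal sketch under stated assumptions: the wrapper `call_with_retries` is hypothetical, the concrete exception types raised by `process_document_with_redaction` are not documented here, and a missing file is assumed to surface as the built-in `FileNotFoundError`.

```python
import random
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func, retrying on the given exception types with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except FileNotFoundError:
            # A missing file will not appear on retry, so fail immediately.
            raise
        except retry_on:
            if attempt == max_attempts:
                raise
            # The delay doubles on each attempt; a little jitter spreads out retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

In a pipeline this would be used as `call_with_retries(process_document_with_redaction, file_path=..., endpoint=..., ...)`. Narrowing `retry_on` to the specific rate-limit and network exception types, once they are known, avoids retrying errors such as invalid credentials that will not succeed on a second attempt.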