---
title: Unstructured to Structured JSON Converter
emoji: 🔄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: mit
---

# Unstructured to Structured JSON Converter

A production-ready system that extracts structured data from unstructured text according to complex JSON schemas.

## Key Features

- **Schema Agnostic**: Handles large, deeply nested schemas (6+ levels, 250+ fields, 500+ enums)
- **Large Document Support**: Processes 50+ page documents and 10MB+ files
- **Dynamic Resource Allocation**: Processing cost scales from $0.01 to $5.00 based on schema complexity
- **Confidence-Based Review**: Automatic quality assessment with human review routing
- **Multi-Stage Processing**: Hierarchical extraction for complex schemas

## Performance Metrics

| Complexity Tier | Max Depth | Fields | Cost | Time | Accuracy |
|-----------------|-----------|--------|------|------|----------|
| **Tier 1** (Simple) | ≤2 levels | ≤20 | $0.01-0.05 | 5-15s | 95-98% |
| **Tier 2** (Medium) | ≤4 levels | ≤100 | $0.08-0.25 | 15-45s | 90-95% |
| **Tier 3** (Complex) | >4 levels | >100 | $0.30-2.00 | 45-120s | 85-90% |
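
The depth and field thresholds in the table can be read as a simple classification rule. The sketch below is an illustration only: `classify_tier` is a hypothetical name, and it assumes that a schema exceeding either threshold of a tier moves up to the next one, which the table does not state explicitly.

```python
def classify_tier(max_depth: int, field_count: int) -> int:
    """Return 1, 2, or 3 using the depth and field thresholds from the table.

    Assumption: exceeding either the depth or the field limit of a tier
    pushes the schema into the next tier.
    """
    if max_depth <= 2 and field_count <= 20:
        return 1
    if max_depth <= 4 and field_count <= 100:
        return 2
    return 3


if __name__ == "__main__":
    # A 4-level schema with 22 fields falls within the Tier 2 limits.
    print(classify_tier(4, 22))
```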

## How to Use

1. **Paste your unstructured content** (documents, emails, contracts, etc.)
2. **Define your target JSON schema** (or use the provided examples)
3. **Click "Extract Structured Data"** to process
4. **Review the results** with confidence scores and quality assessment

## Example Use Cases

### GitHub Actions Metadata
Extract action configuration from documentation:
- Inputs, outputs, steps, branding
- **Complexity**: Medium (4 levels, 22 fields)
- **Time**: ~25 seconds, **Cost**: ~$0.15

### Resume/CV Processing
Structure personal profiles:
- Work experience, education, skills
- **Complexity**: Complex (5 levels, 85+ fields)
- **Time**: ~45 seconds, **Cost**: ~$0.35

### Email Chain Analysis
Extract requirements from stakeholder communications:
- Participants, decisions, timelines
- **Complexity**: Medium (4 levels, 50+ fields)
- **Time**: ~30 seconds, **Cost**: ~$0.25

### Legal Contract Processing
Structure contract terms and conditions:
- Parties, terms, deliverables, timelines
- **Complexity**: Complex (4 levels, 60+ fields)
- **Time**: ~35 seconds, **Cost**: ~$0.30

## How It Works

### 1. Schema Analysis
- Analyzes JSON schema complexity (depth, fields, objects, enums)
- Creates optimal extraction strategy
- Estimates cost and processing time
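
As a rough illustration of the analysis step, the following sketch walks a JSON schema and counts depth, leaf fields, objects, and enum values. The function name `schema_metrics`, the sample `example_schema`, and the counting convention (object properties add one level, array items do not) are assumptions made for this example, not necessarily how the app measures complexity.

```python
from typing import Any


def schema_metrics(schema: dict[str, Any], depth: int = 0) -> dict[str, int]:
    """Recursively count max depth, leaf fields, objects, and enum values."""
    metrics = {"max_depth": depth, "fields": 0, "objects": 0, "enums": 0}
    metrics["enums"] += len(schema.get("enum", []))

    if "properties" in schema:
        # An object: its properties sit one level deeper.
        metrics["objects"] += 1
        children = [(child, depth + 1) for child in schema["properties"].values()]
    elif isinstance(schema.get("items"), dict):
        # An array: its item schema is analysed at the same depth (assumption).
        children = [(schema["items"], depth)]
    else:
        # Anything else is treated as a leaf field.
        metrics["fields"] += 1
        children = []

    for child, child_depth in children:
        sub = schema_metrics(child, child_depth)
        metrics["max_depth"] = max(metrics["max_depth"], sub["max_depth"])
        for key in ("fields", "objects", "enums"):
            metrics[key] += sub[key]
    return metrics


example_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "status": {"type": "string", "enum": ["open", "closed"]},
        "tags": {"type": "array", "items": {"type": "string"}},
        "owner": {"type": "object", "properties": {"email": {"type": "string"}}},
    },
}

if __name__ == "__main__":
    print(schema_metrics(example_schema))
```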

### 2. Document Processing
- Handles large documents with semantic chunking
- Preserves context across chunk boundaries
- Supports multiple input formats
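
One simple way to keep context across chunk boundaries is to repeat the last paragraph of each chunk at the start of the next. The sketch below uses blank-line paragraph splitting as a stand-in for semantic chunking; `chunk_paragraphs`, `max_chars`, and `overlap` are hypothetical names, and the app's actual chunking strategy may differ.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of a chunk are repeated at the start of
    the next one. A chunk may exceed max_chars if a single paragraph (plus
    the carried-over overlap) is longer than the limit.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry the overlap forward.
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "\n\n".join(["A" * 10, "B" * 10, "C" * 10])
    for chunk in chunk_paragraphs(sample, max_chars=25, overlap=1):
        print(repr(chunk))
```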

### 3. Multi-Stage Extraction
- **Stage 1**: Simple fields (strings, numbers, booleans)
- **Stage 2**: Enums and choice fields
- **Stage 3**: Arrays and lists
- **Stage 4**: Complex nested objects
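
The four stages above can be sketched as a grouping of top-level schema properties by kind. This is a minimal illustration: `assign_stage` and `plan_stages` are hypothetical helpers, and the rule that an `enum` keyword takes precedence over the declared type is an assumption.

```python
def assign_stage(field_schema: dict) -> int:
    """Map one property schema to an extraction stage (1-4)."""
    if "enum" in field_schema:
        return 2  # enums and choice fields
    field_type = field_schema.get("type")
    if field_type == "array":
        return 3  # arrays and lists
    if field_type == "object":
        return 4  # complex nested objects
    return 1  # simple fields: strings, numbers, booleans


def plan_stages(schema: dict) -> dict[int, list[str]]:
    """Group top-level property names by the stage that would extract them."""
    stages: dict[int, list[str]] = {1: [], 2: [], 3: [], 4: []}
    for name, field_schema in schema.get("properties", {}).items():
        stages[assign_stage(field_schema)].append(name)
    return stages


if __name__ == "__main__":
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "closed"]},
            "tags": {"type": "array", "items": {"type": "string"}},
            "owner": {"type": "object", "properties": {"email": {"type": "string"}}},
        },
    }
    print(plan_stages(schema))
```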

### 4. Quality Assessment
- Field-level confidence scoring
- Schema compliance validation
- Human review routing for uncertain extractions
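
To make the compliance check concrete, here is a deliberately small sketch that verifies required fields, primitive types, and enum membership for top-level properties only. `compliance_errors` and `JSON_TYPES` are hypothetical names; a real deployment would more likely rely on a full JSON Schema validator, and nothing here is taken from the app's code.

```python
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def compliance_errors(data: dict, schema: dict) -> list[str]:
    """Return human-readable problems for top-level fields; empty if none."""
    errors = []
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required field: {name}")

    for name, field_schema in schema.get("properties", {}).items():
        if name not in data:
            continue
        value = data[name]
        type_name = field_schema.get("type")
        # Only single string types are checked; union types are skipped.
        expected = JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if expected is not None and (not isinstance(value, expected) or bool_as_number):
            errors.append(f"{name}: expected {type_name}")
        elif "enum" in field_schema and value not in field_schema["enum"]:
            errors.append(f"{name}: value not in enum")
    return errors
```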

## Technical Innovation

### Schema-Agnostic Processing
Unlike traditional systems that impose rigid constraints, this system:
- **Analyzes** any schema complexity dynamically
- **Decomposes** complex schemas into manageable stages
- **Allocates** resources based on actual complexity
- **Scales** from simple forms to research papers

### Confidence-Based Review Routing
- **High Confidence** (>90%): No review needed
- **Medium Confidence** (70-90%): Quick validation
- **Low Confidence** (<70%): Detailed human review
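
The thresholds above translate directly into a routing function. In this sketch, confidence is assumed to be a float between 0 and 1, the route labels are hypothetical, and routing a whole record by its least confident field is an assumption rather than documented behavior.

```python
def review_route(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a review route."""
    if confidence > 0.90:
        return "no_review"
    if confidence >= 0.70:
        return "quick_validation"
    return "detailed_review"


def route_record(field_confidences: dict[str, float]) -> str:
    """Route a record by its least confident field (assumption)."""
    if not field_confidences:
        return "detailed_review"
    return review_route(min(field_confidences.values()))


if __name__ == "__main__":
    print(route_record({"name": 0.98, "status": 0.82}))
```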

### Dynamic Model Selection
- **GPT-4o-mini**: Simple fields, cost-effective
- **GPT-4o**: Complex structures, high quality
- **Adaptive routing**: Based on field complexity
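
A minimal sketch of such routing might pick the model from the field's type. The rule used here (nested objects and arrays go to the larger model, everything else to the smaller one) is an assumption for illustration, and `select_model` is a hypothetical helper.

```python
def select_model(field_schema: dict) -> str:
    """Pick a model name based on how complex the field looks.

    Assumption: objects and arrays count as complex structures; all other
    fields are treated as simple.
    """
    is_complex = field_schema.get("type") in ("object", "array")
    return "gpt-4o" if is_complex else "gpt-4o-mini"


if __name__ == "__main__":
    print(select_model({"type": "string"}))
    print(select_model({"type": "object", "properties": {}}))
```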

## Configuration

This Space requires an OpenAI API key. Add the key to the Space secrets as `OPENAI_API_KEY`.
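
Space secrets are exposed to the running app as environment variables, so the key can be read at startup and a clear error raised if it is missing. The helper below is a sketch; `load_api_key` is a hypothetical name and not necessarily what `app.py` does.

```python
import os
from collections.abc import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Read OPENAI_API_KEY from the environment or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to the Space secrets."
        )
    return key
```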