---
title: Unstructured to Structured JSON Converter
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: mit
---

# Unstructured to Structured JSON Converter

A production-ready system that extracts structured data from unstructured text and returns it in the shape of a target JSON schema, including deeply nested ones.

## Key Features

- **Schema Agnostic**: Handles arbitrarily complex schemas (6+ levels, 250+ fields, 500+ enums)
- **Large Document Support**: Processes 50+ page documents and 10MB+ files
- **Dynamic Resource Allocation**: Cost scales from $0.01 to $5.00 depending on complexity
- **Confidence-Based Review**: Automatic quality assessment with routing to human review
- **Multi-Stage Processing**: Hierarchical extraction for complex schemas

## Performance Metrics

| Complexity Tier | Max Depth | Fields | Cost | Time | Accuracy |
|-----------------|-----------|--------|------|------|----------|
| **Tier 1** (Simple) | ≤2 levels | ≤20 | $0.01-0.05 | 5-15s | 95-98% |
| **Tier 2** (Medium) | ≤4 levels | ≤100 | $0.08-0.25 | 15-45s | 90-95% |
| **Tier 3** (Complex) | >4 levels | >100 | $0.30-2.00 | 45-120s | 85-90% |

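
The tier boundaries in the table can be read as a simple lookup. The sketch below is one possible reading, not the app's actual code: the names `Tier` and `classify_tier` are hypothetical, and it assumes a schema must satisfy both the depth and the field bound to stay in a tier, so exceeding either one moves it up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost_usd: tuple[float, float]  # (min, max) from the table
    time_s: tuple[int, int]        # (min, max) seconds from the table


TIER_1 = Tier("Tier 1 (Simple)", (0.01, 0.05), (5, 15))
TIER_2 = Tier("Tier 2 (Medium)", (0.08, 0.25), (15, 45))
TIER_3 = Tier("Tier 3 (Complex)", (0.30, 2.00), (45, 120))


def classify_tier(max_depth: int, field_count: int) -> Tier:
    """Map schema depth and field count to a complexity tier.

    Assumption: both bounds must hold; the table does not say how
    mixed cases (shallow but very wide schemas) are handled.
    """
    if max_depth <= 2 and field_count <= 20:
        return TIER_1
    if max_depth <= 4 and field_count <= 100:
        return TIER_2
    return TIER_3


if __name__ == "__main__":
    tier = classify_tier(max_depth=4, field_count=22)
    print(tier.name, tier.cost_usd, tier.time_s)
```

Under this reading, a schema with 4 levels and 22 fields falls into Tier 2, while a shallow schema with more than 100 fields is treated as Tier 3.
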
## How to Use

1. **Paste your unstructured content** (documents, emails, contracts, etc.)
2. **Define your target JSON schema** (or use the provided examples)
3. **Click "Extract Structured Data"** to process
4. **Review the results** with confidence scores and quality assessment

## Example Use Cases

### GitHub Actions Metadata

Extract action configuration from documentation:

- Inputs, outputs, steps, branding
- **Complexity**: Medium (4 levels, 22 fields)
- **Time**: ~25 seconds, **Cost**: ~$0.15

### Resume/CV Processing

Structure personal profiles:

- Work experience, education, skills
- **Complexity**: Complex (5 levels, 85+ fields)
- **Time**: ~45 seconds, **Cost**: ~$0.35

### Email Chain Analysis

Extract requirements from stakeholder communications:

- Participants, decisions, timelines
- **Complexity**: Complex (4 levels, 50+ fields)
- **Time**: ~30 seconds, **Cost**: ~$0.25

### Legal Contract Processing

Structure contract terms and conditions:

- Parties, terms, deliverables, timelines
- **Complexity**: Complex (4 levels, 60+ fields)
- **Time**: ~35 seconds, **Cost**: ~$0.30

## How It Works

### 1. Schema Analysis

- Analyzes JSON schema complexity (depth, fields, objects, enums)
- Creates optimal extraction strategy
- Estimates cost and processing time

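
As an illustration of the complexity metrics named above, the following sketch walks a JSON Schema and counts depth, fields, objects, and enum values. It is a minimal example under stated assumptions, not the app's analyzer: `analyze_schema` is a hypothetical name, depth is counted as levels of nested `properties` (array `items` do not add a level), and keywords such as `$ref`, `anyOf`, or `allOf` are ignored.

```python
from typing import Any


def analyze_schema(schema: dict[str, Any], depth: int = 0) -> dict[str, int]:
    """Recursively collect simple complexity statistics for a JSON Schema."""
    stats = {
        "max_depth": depth,
        "fields": 0,
        "objects": 0,
        "enum_values": len(schema.get("enum", [])),
    }
    children: list[tuple[dict[str, Any], int]] = []

    props = schema.get("properties")
    if isinstance(props, dict):
        stats["objects"] = 1
        stats["fields"] = len(props)
        # Each property lives one level below the object that declares it.
        children += [(sub, depth + 1) for sub in props.values()]

    items = schema.get("items")
    if isinstance(items, dict):
        # Assumption: an array wrapper does not count as an extra level.
        children.append((items, depth))

    for sub, sub_depth in children:
        child = analyze_schema(sub, sub_depth)
        stats["max_depth"] = max(stats["max_depth"], child["max_depth"])
        for key in ("fields", "objects", "enum_values"):
            stats[key] += child[key]
    return stats


if __name__ == "__main__":
    example = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "closed"]},
            "tags": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {"label": {"type": "string"}},
                },
            },
        },
    }
    print(analyze_schema(example))
    # {'max_depth': 2, 'fields': 4, 'objects': 2, 'enum_values': 2}
```
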
### 2. Document Processing

- Handles large documents with semantic chunking
- Preserves context across chunk boundaries
- Supports multiple input formats

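
To make the idea of chunking with preserved context concrete, here is a deliberately simple stand-in. It splits on paragraph boundaries rather than doing true semantic chunking, and it repeats the last paragraph of each chunk at the start of the next one. The function name `chunk_document`, the character budget, and the one-paragraph overlap are all assumptions for illustration.

```python
def chunk_document(
    text: str, max_chars: int = 4000, overlap_paragraphs: int = 1
) -> list[str]:
    """Split text into paragraph-aligned chunks with overlapping context.

    A single paragraph longer than max_chars is kept whole, so a chunk
    can exceed the budget in that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as context.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
    for chunk in chunk_document(doc, max_chars=25):
        print(repr(chunk))
    # 'AAAAAAAAAA\n\nBBBBBBBBBB'
    # 'BBBBBBBBBB\n\nCCCCCCCCCC'
```

The overlap trades some duplicated tokens for continuity: a field whose evidence straddles a boundary is still visible in full in at least one chunk.
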
### 3. Multi-Stage Extraction

- **Stage 1**: Simple fields (strings, numbers, booleans)
- **Stage 2**: Enums and choice fields
- **Stage 3**: Arrays and lists
- **Stage 4**: Complex nested objects

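
One way to picture the four stages is as a partition of the schema's fields by type. The sketch below groups top-level properties only and uses hypothetical helper names (`assign_stage`, `plan_stages`); how the app actually decomposes nested schemas is not specified here.

```python
def assign_stage(field_schema: dict) -> int:
    """Return the extraction stage (1-4) for a single field schema."""
    if "enum" in field_schema:
        return 2  # enums and choice fields
    field_type = field_schema.get("type")
    if field_type == "array":
        return 3  # arrays and lists
    if field_type == "object":
        return 4  # complex nested objects
    return 1      # simple fields: strings, numbers, booleans


def plan_stages(schema: dict) -> dict[int, list[str]]:
    """Group the top-level properties of a JSON Schema by stage."""
    stages: dict[int, list[str]] = {1: [], 2: [], 3: [], 4: []}
    for name, field_schema in schema.get("properties", {}).items():
        stages[assign_stage(field_schema)].append(name)
    return stages


if __name__ == "__main__":
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "closed"]},
            "tags": {"type": "array", "items": {"type": "string"}},
            "address": {"type": "object", "properties": {}},
        },
    }
    print(plan_stages(schema))
    # {1: ['name'], 2: ['status'], 3: ['tags'], 4: ['address']}
```
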
### 4. Quality Assessment

- Field-level confidence scoring
- Schema compliance validation
- Human review routing for uncertain extractions

## Technical Innovation

### Schema-Agnostic Processing

Unlike traditional systems that impose rigid constraints, this system:

- **Analyzes** any schema complexity dynamically
- **Decomposes** complex schemas into manageable stages
- **Allocates** resources based on actual complexity
- **Scales** from simple forms to research papers

### Confidence-Based Review Routing

- **High Confidence** (>90%): No review needed
- **Medium Confidence** (70-90%): Quick validation
- **Low Confidence** (<70%): Detailed human review

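
The thresholds above translate directly into a small routing function. This is a sketch: the name `route_review` and the returned labels are hypothetical, confidence is assumed to be a fraction between 0 and 1, and the boundary values 0.90 and 0.70 are assigned to the medium band, following the ranges as written.

```python
def route_review(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a review action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence > 0.90:
        return "no_review"         # high confidence
    if confidence >= 0.70:
        return "quick_validation"  # medium confidence
    return "detailed_review"       # low confidence


if __name__ == "__main__":
    # Hypothetical field-level scores for one extraction.
    scores = {"name": 0.97, "start_date": 0.82, "salary": 0.41}
    print({field: route_review(score) for field, score in scores.items()})
    # {'name': 'no_review', 'start_date': 'quick_validation', 'salary': 'detailed_review'}
```
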
### Dynamic Model Selection

- **GPT-4o-mini**: Simple fields, cost-effective
- **GPT-4o**: Complex structures, high quality
- **Adaptive routing**: Based on field complexity

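
Adaptive routing could be as small as the rule sketched below. The model identifiers are the ones listed above; the rule itself (objects and arrays of objects go to GPT-4o, everything else to GPT-4o-mini) and the function name `select_model` are assumptions, since the actual complexity criterion is not spelled out.

```python
def select_model(field_schema: dict) -> str:
    """Pick a model name for one field based on its structural complexity."""
    field_type = field_schema.get("type")
    items = field_schema.get("items")
    is_array_of_objects = (
        field_type == "array"
        and isinstance(items, dict)
        and items.get("type") == "object"
    )
    if field_type == "object" or is_array_of_objects:
        return "gpt-4o"       # complex structures
    return "gpt-4o-mini"      # simple fields, cheaper


if __name__ == "__main__":
    print(select_model({"type": "string"}))
    print(select_model({"type": "array", "items": {"type": "object"}}))
```
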
## Configuration

This Space requires an OpenAI API key. Add the key to the Space secrets as `OPENAI_API_KEY`.
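
Space secrets are exposed to the running app as environment variables, so the key can be read at startup. The helper below is a minimal sketch; the function name `get_openai_api_key` and the error message are hypothetical and not taken from `app.py`.

```python
import os


def get_openai_api_key() -> str:
    """Read the OpenAI API key from the environment, failing early if absent."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to the Space secrets."
        )
    return key
```

Failing at startup with a clear message is usually easier to diagnose than an authentication error on the first extraction request.
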