---
title: Unstructured to Structured JSON Converter
emoji: 🔄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: mit
---
# Unstructured to Structured JSON Converter
A production-ready system that extracts structured data from unstructured text according to complex JSON schemas.
## Key Features
- **Schema Agnostic**: Handles schemas of arbitrary complexity (6+ levels, 250+ fields, 500+ enums)
- **Large Document Support**: Processes 50+ page documents and 10MB+ files
- **Dynamic Resource Allocation**: Per-extraction cost scales from $0.01 to $5.00 based on complexity
- **Confidence-Based Review**: Automatic quality assessment with human review routing
- **Multi-Stage Processing**: Hierarchical extraction for complex schemas
## Performance Metrics
| Complexity Tier | Max Depth | Fields | Cost | Time | Accuracy |
|-----------------|-----------|--------|------|------|----------|
| **Tier 1** (Simple) | ≤2 levels | ≤20 | $0.01-0.05 | 5-15s | 95-98% |
| **Tier 2** (Medium) | ≤4 levels | ≤100 | $0.08-0.25 | 15-45s | 90-95% |
| **Tier 3** (Complex) | >4 levels | >100 | $0.30-2.00 | 45-120s | 85-90% |
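
The tier thresholds in the table can be read as a simple classification rule. The sketch below is illustrative only: the function name `complexity_tier` is hypothetical, and it assumes a schema must satisfy both the depth limit and the field limit to stay in a lower tier, which the table does not state explicitly.

```python
def complexity_tier(max_depth: int, field_count: int) -> int:
    """Return 1, 2, or 3 using the depth and field limits from the table.

    Assumption: exceeding either limit of a tier pushes the schema
    into the next tier up.
    """
    if max_depth <= 2 and field_count <= 20:
        return 1  # Simple
    if max_depth <= 4 and field_count <= 100:
        return 2  # Medium
    return 3  # Complex


# A flat contact form versus a deeply nested profile schema.
print(complexity_tier(max_depth=2, field_count=12))
print(complexity_tier(max_depth=5, field_count=85))
```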
## How to Use
1. **Paste your unstructured content** (documents, emails, contracts, etc.)
2. **Define your target JSON schema** (or use the provided examples)
3. **Click "Extract Structured Data"** to process
4. **Review the results** with confidence scores and quality assessment
## Example Use Cases
### GitHub Actions Metadata
Extract action configuration from documentation:
- Inputs, outputs, steps, branding
- **Complexity**: Medium (4 levels, 22 fields)
- **Time**: ~25 seconds, **Cost**: ~$0.15
### Resume/CV Processing
Structure personal profiles:
- Work experience, education, skills
- **Complexity**: Complex (5 levels, 85+ fields)
- **Time**: ~45 seconds, **Cost**: ~$0.35
### Email Chain Analysis
Extract requirements from stakeholder communications:
- Participants, decisions, timelines
- **Complexity**: Complex (4 levels, 50+ fields)
- **Time**: ~30 seconds, **Cost**: ~$0.25
### Legal Contract Processing
Structure contract terms and conditions:
- Parties, terms, deliverables, timelines
- **Complexity**: Complex (4 levels, 60+ fields)
- **Time**: ~35 seconds, **Cost**: ~$0.30
## How It Works
### 1. Schema Analysis
- Analyzes JSON schema complexity (depth, fields, objects, enums)
- Creates optimal extraction strategy
- Estimates cost and processing time
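
As a rough illustration of the analysis step, the metrics listed above (depth, fields, objects, enums) could be collected with a recursive walk over the schema. This is a minimal sketch, not the Space's actual code: `analyze_schema` and `example_schema` are hypothetical names, and it assumes a schema that uses only `type`, `properties`, `items` (as a single schema object), and `enum`.

```python
from typing import Any


def analyze_schema(schema: dict[str, Any], depth: int = 1) -> dict[str, int]:
    """Count nesting depth, leaf fields, nested objects, and enum values."""
    stats = {"max_depth": depth, "fields": 0, "objects": 0, "enums": 0}
    for prop in schema.get("properties", {}).values():
        # Arrays are analyzed through their item schema.
        node = prop.get("items", prop) if prop.get("type") == "array" else prop
        if node.get("type") == "object":
            stats["objects"] += 1
            child = analyze_schema(node, depth + 1)
            stats["max_depth"] = max(stats["max_depth"], child["max_depth"])
            for key in ("fields", "objects", "enums"):
                stats[key] += child[key]
        else:
            stats["fields"] += 1
            stats["enums"] += len(node.get("enum", []))
    return stats


example_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "status": {"type": "string", "enum": ["open", "closed"]},
        "owner": {"type": "object", "properties": {"email": {"type": "string"}}},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

print(analyze_schema(example_schema))
# {'max_depth': 2, 'fields': 4, 'objects': 1, 'enums': 2}
```

The resulting counts are the kind of input a cost and time estimate could be based on.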
### 2. Document Processing
- Handles large documents with semantic chunking
- Preserves context across chunk boundaries
- Supports multiple input formats
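
The README does not describe how chunking is implemented. As a stand-in for semantic chunking, the sketch below splits on paragraph boundaries and repeats the last paragraph of each chunk at the start of the next, which is one simple way to keep context across boundaries. The function name `chunk_paragraphs` and its parameters are hypothetical.

```python
def chunk_paragraphs(text: str, max_chars: int = 4000, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of a chunk are repeated at the start of
    the next one. A single oversized paragraph is kept whole, so a chunk
    can exceed max_chars.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward so context spans the boundary.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


document = "\n\n".join(["A" * 10, "B" * 10, "C" * 10])
for chunk in chunk_paragraphs(document, max_chars=25):
    print(repr(chunk))
```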
### 3. Multi-Stage Extraction
- **Stage 1**: Simple fields (strings, numbers, booleans)
- **Stage 2**: Enums and choice fields
- **Stage 3**: Arrays and lists
- **Stage 4**: Complex nested objects
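
One way to picture the four stages is as a partition of the top-level schema properties by type. The sketch below is an assumption about how fields might be assigned, with hypothetical names `extraction_stage` and `plan_stages`; it treats any property with an `enum` keyword as a Stage 2 field regardless of its declared type.

```python
def extraction_stage(field_schema: dict) -> int:
    """Map one property schema to a stage number from the list above."""
    if "enum" in field_schema:
        return 2  # enums and choice fields
    field_type = field_schema.get("type")
    if field_type == "array":
        return 3  # arrays and lists
    if field_type == "object":
        return 4  # complex nested objects
    return 1  # strings, numbers, booleans


def plan_stages(schema: dict) -> dict[int, list[str]]:
    """Group top-level property names by the stage that would extract them."""
    stages: dict[int, list[str]] = {1: [], 2: [], 3: [], 4: []}
    for name, field_schema in schema.get("properties", {}).items():
        stages[extraction_stage(field_schema)].append(name)
    return stages


schema = {
    "properties": {
        "name": {"type": "string"},
        "status": {"type": "string", "enum": ["open", "closed"]},
        "tags": {"type": "array", "items": {"type": "string"}},
        "owner": {"type": "object", "properties": {}},
    }
}
print(plan_stages(schema))
# {1: ['name'], 2: ['status'], 3: ['tags'], 4: ['owner']}
```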
### 4. Quality Assessment
- Field-level confidence scoring
- Schema compliance validation
- Human review routing for uncertain extractions
## Technical Innovation
### Schema-Agnostic Processing
Unlike traditional extraction systems that impose fixed limits on schema shape, this system:
- **Analyzes** any schema complexity dynamically
- **Decomposes** complex schemas into manageable stages
- **Allocates** resources based on actual complexity
- **Scales** from simple forms to research papers
### Confidence-Based Review Routing
- **High Confidence** (>90%): No review needed
- **Medium Confidence** (70-90%): Quick validation
- **Low Confidence** (<70%): Detailed human review
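
The thresholds above translate directly into a routing function. In this sketch the function name and the returned labels are hypothetical, and confidence is assumed to be a fraction between 0 and 1; a score of exactly 0.90 or 0.70 falls into the medium band, matching the ">90%" and "<70%" wording.

```python
def review_route(confidence: float) -> str:
    """Pick a review path from a confidence score in the range 0.0 to 1.0."""
    if confidence > 0.90:
        return "no_review"
    if confidence >= 0.70:
        return "quick_validation"
    return "detailed_review"


# Hypothetical field-level scores for one extraction.
field_scores = {"invoice_number": 0.97, "due_date": 0.82, "payment_terms": 0.55}
for field, score in field_scores.items():
    print(field, review_route(score))
```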
### Dynamic Model Selection
- **GPT-4o-mini**: Simple fields, cost-effective
- **GPT-4o**: Complex structures, high quality
- **Adaptive routing**: Based on field complexity
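
The README does not spell out the routing rule, so the following is only a sketch of what type-based routing could look like. The function name `select_model` is hypothetical, and the rule (objects and arrays go to the larger model, everything else to the smaller one) is an assumption.

```python
def select_model(field_schema: dict) -> str:
    """Choose a model name from the declared type of a property schema.

    Assumption: nested structures count as "complex" and go to gpt-4o;
    scalar fields go to gpt-4o-mini.
    """
    complex_types = {"object", "array"}
    if field_schema.get("type") in complex_types:
        return "gpt-4o"
    return "gpt-4o-mini"


print(select_model({"type": "string"}))
print(select_model({"type": "object", "properties": {}}))
```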
## Configuration
This Space requires an OpenAI API key. Add the key to the Space secrets as `OPENAI_API_KEY`.
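
Space secrets are exposed to the running app as environment variables. A minimal sketch of reading the key and failing early when it is missing might look like this; the helper name `get_api_key` is hypothetical and is not taken from `app.py`.

```python
import os


def get_api_key() -> str:
    """Read OPENAI_API_KEY from the environment or raise a clear error."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to the Space secrets."
        )
    return key
```

Checking for the key at startup gives a readable error before any extraction request is made.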