metadata
title: Unstructured to Structured JSON Converter
emoji: 🔄
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: mit

Unstructured to Structured JSON Converter

A production-ready system that extracts structured data from unstructured text according to a complex JSON schema.

Key Features

  • Schema Agnostic: Handles highly complex schemas (6+ levels, 250+ fields, 500+ enums)
  • Large Document Support: Processes 50+ page documents and 10MB+ files
  • Dynamic Resource Allocation: Cost scales from $0.01 to $5.00 based on complexity
  • Confidence-Based Review: Automatic quality assessment with human review routing
  • Multi-Stage Processing: Hierarchical extraction for complex schemas

Performance Metrics

| Complexity Tier | Max Depth | Fields | Cost | Time | Accuracy |
|---|---|---|---|---|---|
| Tier 1 (Simple) | ≤2 levels | ≤20 | $0.01-0.05 | 5-15s | 95-98% |
| Tier 2 (Medium) | ≤4 levels | ≤100 | $0.08-0.25 | 15-45s | 90-95% |
| Tier 3 (Complex) | >4 levels | >100 | $0.30-2.00 | 45-120s | 85-90% |
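
The tier thresholds can be read as a simple classification rule. The sketch below is illustrative only: the function name `classify_tier` is hypothetical, and it assumes a schema must satisfy both the depth and the field limit of a tier to qualify for it; the actual implementation may combine the two criteria differently.

```python
def classify_tier(max_depth: int, field_count: int) -> str:
    """Map schema depth and field count to a tier label.

    Assumption: both limits of a tier must hold; otherwise the schema
    falls through to the next, more expensive tier.
    """
    if max_depth <= 2 and field_count <= 20:
        return "Tier 1 (Simple)"
    if max_depth <= 4 and field_count <= 100:
        return "Tier 2 (Medium)"
    return "Tier 3 (Complex)"


if __name__ == "__main__":
    print(classify_tier(4, 22))  # Tier 2 (Medium)
```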

How to Use

  1. Paste your unstructured content (documents, emails, contracts, etc.)
  2. Define your target JSON schema (or use the provided examples)
  3. Click "Extract Structured Data" to process
  4. Review the results with confidence scores and quality assessment

Example Use Cases

GitHub Actions Metadata

Extract action configuration from documentation:

  • Inputs, outputs, steps, branding
  • Complexity: Medium (4 levels, 22 fields)
  • Time: ~25 seconds, Cost: ~$0.15

Resume/CV Processing

Structure personal profiles:

  • Work experience, education, skills
  • Complexity: Complex (5 levels, 85+ fields)
  • Time: ~45 seconds, Cost: ~$0.35

Email Chain Analysis

Extract requirements from stakeholder communications:

  • Participants, decisions, timelines
  • Complexity: Medium (4 levels, 50+ fields)
  • Time: ~30 seconds, Cost: ~$0.25

Legal Contract Processing

Structure contract terms and conditions:

  • Parties, terms, deliverables, timelines
  • Complexity: Complex (4 levels, 60+ fields)
  • Time: ~35 seconds, Cost: ~$0.30

How It Works

1. Schema Analysis

  • Analyzes JSON schema complexity (depth, fields, objects, enums)
  • Creates optimal extraction strategy
  • Estimates cost and processing time
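
As a rough illustration of the analysis step, the sketch below walks a JSON schema and collects the four quantities listed above. The function name `analyze_schema` and the counting conventions (the root object is depth 1, arrays are analyzed through their `items` schema, "enums" counts individual enum values) are assumptions made for this example, not the Space's actual code.

```python
from typing import Any


def analyze_schema(schema: dict[str, Any], depth: int = 1) -> dict[str, int]:
    """Recursively count leaf fields, objects, enum values, and max depth."""
    stats = {"max_depth": depth, "fields": 0, "objects": 1, "enums": 0}
    for prop in schema.get("properties", {}).values():
        # Arrays are analyzed through their item schema.
        node = prop.get("items", prop) if prop.get("type") == "array" else prop
        if node.get("type") == "object":
            child = analyze_schema(node, depth + 1)
            stats["max_depth"] = max(stats["max_depth"], child["max_depth"])
            for key in ("fields", "objects", "enums"):
                stats[key] += child[key]
        else:
            # Anything that is not an object counts as one leaf field.
            stats["fields"] += 1
            stats["enums"] += len(node.get("enum", []))
    return stats


if __name__ == "__main__":
    example = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "color": {"type": "string", "enum": ["red", "blue"]},
            "inputs": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {"required": {"type": "boolean"}},
                },
            },
        },
    }
    print(analyze_schema(example))
```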

2. Document Processing

  • Handles large documents with semantic chunking
  • Preserves context across chunk boundaries
  • Supports multiple input formats
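
One simple way to keep context across chunk boundaries is to repeat the tail of each chunk at the start of the next. The sketch below splits on blank lines and carries over the last paragraph; it is a paragraph-based stand-in for semantic chunking, and the name `chunk_paragraphs` and its parameters are hypothetical.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of a chunk are repeated at the start of
    the next one. A single oversized paragraph still becomes its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward as shared context.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = "\n\n".join(["A" * 10, "B" * 10, "C" * 10])
    for chunk in chunk_paragraphs(doc, max_chars=25):
        print(repr(chunk))
```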

3. Multi-Stage Extraction

  • Stage 1: Simple fields (strings, numbers, booleans)
  • Stage 2: Enums and choice fields
  • Stage 3: Arrays and lists
  • Stage 4: Complex nested objects
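
The four stages suggest a partition of the schema's top-level properties by type. The following is a minimal sketch of such a partition; `assign_stage` and `plan_stages` are hypothetical names, and the rule that a field's JSON Schema `type` and `enum` keywords alone decide its stage is an assumption.

```python
def assign_stage(field_schema: dict) -> int:
    """Return the extraction stage (1-4) for a single property schema."""
    if field_schema.get("type") == "object":
        return 4  # complex nested objects
    if field_schema.get("type") == "array":
        return 3  # arrays and lists
    if "enum" in field_schema:
        return 2  # enums and choice fields
    return 1  # simple strings, numbers, booleans


def plan_stages(schema: dict) -> dict[int, list[str]]:
    """Group top-level property names by the stage that would extract them."""
    stages: dict[int, list[str]] = {1: [], 2: [], 3: [], 4: []}
    for name, field_schema in schema.get("properties", {}).items():
        stages[assign_stage(field_schema)].append(name)
    return stages


if __name__ == "__main__":
    action_schema = {
        "properties": {
            "name": {"type": "string"},
            "using": {"type": "string", "enum": ["node20", "docker"]},
            "inputs": {"type": "array", "items": {"type": "object"}},
            "branding": {"type": "object", "properties": {}},
        }
    }
    print(plan_stages(action_schema))
```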

4. Quality Assessment

  • Field-level confidence scoring
  • Schema compliance validation
  • Human review routing for uncertain extractions
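
To make the schema compliance check concrete, here is a deliberately small sketch that verifies required fields, basic types, and enum membership for top-level properties. It is not a full JSON Schema validator, and `compliance_errors` and `TYPE_MAP` are names invented for this example.

```python
# Python types accepted for each JSON Schema primitive type.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def compliance_errors(data: dict, schema: dict) -> list[str]:
    """Return human-readable compliance problems; an empty list means none found."""
    errors = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in data
    ]
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue
        value = data[name]
        py_type = TYPE_MAP.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("number", "integer")
        if py_type and (not isinstance(value, py_type) or bool_as_number):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: value not in enum")
    return errors


if __name__ == "__main__":
    schema = {
        "required": ["name"],
        "properties": {
            "name": {"type": "string"},
            "using": {"type": "string", "enum": ["node20", "docker"]},
            "count": {"type": "integer"},
        },
    }
    print(compliance_errors({"using": "python", "count": "3"}, schema))
```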

Technical Innovation

Schema-Agnostic Processing

Unlike traditional systems that impose rigid constraints, this system:

  • Analyzes any schema complexity dynamically
  • Decomposes complex schemas into manageable stages
  • Allocates resources based on actual complexity
  • Scales from simple forms to research papers

Confidence-Based Review Routing

  • High Confidence (>90%): No review needed
  • Medium Confidence (70-90%): Quick validation
  • Low Confidence (<70%): Detailed human review
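
The three bands translate directly into a threshold function. In the sketch below, confidence is assumed to be a float in [0, 1], the boundary values 0.70 and 0.90 are assumed to fall in the medium band, and the name `route_review` and its return labels are hypothetical.

```python
def route_review(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a review action."""
    if confidence > 0.90:
        return "no_review"
    if confidence >= 0.70:
        return "quick_validation"
    return "detailed_review"


if __name__ == "__main__":
    for score in (0.95, 0.80, 0.55):
        print(score, route_review(score))
```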

Dynamic Model Selection

  • GPT-4o-mini: Simple fields, cost-effective
  • GPT-4o: Complex structures, high quality
  • Adaptive routing: Based on field complexity
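
A minimal sketch of adaptive routing, assuming that "complex structures" means object and array fields and that everything else counts as simple. The function name `select_model` is hypothetical; the returned strings are the OpenAI model identifiers for the two models named above.

```python
COMPLEX_TYPES = {"object", "array"}


def select_model(field_schema: dict) -> str:
    """Pick a model name based on the JSON Schema type of a field."""
    if field_schema.get("type") in COMPLEX_TYPES:
        return "gpt-4o"  # complex structures, higher quality
    return "gpt-4o-mini"  # simple fields, lower cost


if __name__ == "__main__":
    print(select_model({"type": "string"}))
    print(select_model({"type": "array", "items": {"type": "object"}}))
```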

Configuration

This Space requires an OpenAI API key. Add the key to the Space secrets as OPENAI_API_KEY.
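
Space secrets are exposed to the running app as environment variables, so the key can be read at startup. The helper below is a sketch of a fail-fast check; `get_api_key` is a hypothetical name and is not necessarily how app.py loads the key.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to the Space secrets.")
    return key
```

Checking for the key once at startup surfaces a missing secret immediately rather than on the first extraction request.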