metadata

license: mit
title: Generate Knowledge Graphs
sdk: streamlit
emoji: 📉
colorFrom: indigo
colorTo: pink
short_description: Use LLM to generate a knowledge graph from your input data.

🕸️ Knowledge Graph Extraction App

A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.

🚀 Features

Multi-format Document Support: PDF, TXT, DOCX, JSON files up to 10MB
LLM-powered Extraction: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
Smart Entity Detection: Automatically identifies people, organizations, locations, concepts, events, and objects
Importance Scoring: LLM evaluates entity importance from 0.0 to 1.0
Interactive Visualization: Multiple graph layout algorithms with filtering options
Batch Processing: Optional processing of multiple documents together
Export Capabilities: JSON, GraphML, and GEXF formats
Real-time Statistics: Graph metrics and centrality analysis

📁 Project Structure

knowledge-graphs/
├── app.py                    # Main Gradio application (legacy)
├── app_streamlit.py          # Main Streamlit application (recommended)
├── run_streamlit.py          # Simple launcher script
├── requirements.txt          # Python dependencies
├── README.md                # Project documentation
├── .env.example             # Environment variables template
├── config/
│   └── settings.py          # Configuration management
└── src/
    ├── document_processor.py # Document loading and chunking
    ├── llm_extractor.py      # LLM-based entity extraction
    ├── graph_builder.py      # NetworkX graph construction
    └── visualizer.py         # Graph visualization and export

🔧 Installation & Setup

Option 1: Streamlit Version (Recommended)

The Streamlit version is more stable and has better file handling.

Quick Start:

python run_streamlit.py

Manual Setup:

Install dependencies:

pip install -r requirements.txt

Run the Streamlit app:

streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501

The app will be available at http://localhost:8501

Option 2: Gradio Version (Legacy)

The Gradio version may have some file caching issues but is provided for compatibility.

Install dependencies:

pip install -r requirements.txt

Set up environment variables (optional):

cp .env.example .env
# Edit .env and add your OpenRouter API key

Run the application:

python app.py

The app will be available at http://localhost:7860

HuggingFace Spaces Deployment

For Streamlit deployment:

Create a new Space on HuggingFace Spaces
Choose "Streamlit" as the SDK
Upload app_streamlit.py as app.py (HF Spaces expects this name)
Upload all other project files maintaining directory structure

For Gradio deployment:

Create a new Space with "Gradio" as the SDK
Upload app.py and all other files
Note: May experience file handling issues

🔑 API Configuration

Getting OpenRouter API Key

Visit OpenRouter.ai
Sign up for a free account
Navigate to API Keys section
Generate a new API key
Copy the key and use it in the application

Free Models Used

Primary: google/gemma-2-9b-it:free
Backup: meta-llama/llama-3.1-8b-instruct:free

These models are specifically chosen to minimize API costs while maintaining quality.

📖 Usage Guide

Basic Workflow

Upload Documents:
- Select one or more files (PDF, TXT, DOCX, JSON)
- Toggle batch mode for multiple document processing
Configure API:
- Enter your OpenRouter API key
- Key is stored temporarily for the session
Customize Settings:
- Choose graph layout algorithm
- Toggle label visibility options
- Set minimum importance threshold
- Select entity types to include
Extract Knowledge Graph:
- Click "Extract Knowledge Graph" button
- Monitor progress through the status updates
- View results in multiple tabs
Explore Results:
- Graph Visualization: Interactive graph with colored nodes by entity type
- Statistics: Detailed metrics about the graph structure
- Entities: Complete list of extracted entities with details
- Central Nodes: Most important entities based on centrality measures
Export Data:
- Choose export format (JSON, GraphML, GEXF)
- Download structured graph data

Advanced Features

Entity Types

PERSON: Individuals mentioned in the text
ORGANIZATION: Companies, institutions, groups
LOCATION: Places, addresses, geographical entities
CONCEPT: Abstract ideas, theories, methodologies
EVENT: Specific occurrences, meetings, incidents
OBJECT: Physical items, products, artifacts

Relationship Types

works_at: Employment relationships
located_in: Geographical associations
part_of: Hierarchical relationships
causes: Causal relationships
related_to: General associations

Filtering Options

Importance Threshold: Show only entities above specified importance score
Entity Types: Filter by specific entity categories
Layout Algorithms: Spring, circular, shell, Kamada-Kawai, random

🛠️ Technical Details

Architecture Components

Document Processing:
- Multi-format file parsing
- Intelligent text chunking with overlap
- File size validation
LLM Integration:
- OpenRouter API integration
- Structured prompt engineering
- Error handling and fallback models
Graph Processing:
- NetworkX-based graph construction
- Entity deduplication and standardization
- Relationship validation
Visualization:
- Matplotlib-based static graphs
- Interactive HTML visualizations
- Multiple export formats

Configuration Options

All settings can be modified in config/settings.py:

Chunk Size: Default 2000 characters
Chunk Overlap: Default 200 characters
Max File Size: Default 10MB
Max Entities: Default 100 per extraction
Max Relationships: Default 200 per extraction
Importance Threshold: Default 0.3

Differences Between Versions

Streamlit Version Advantages:

More reliable file handling
Better progress indicators
Cleaner UI with sidebar configuration
More stable caching system
Built-in download functionality

Gradio Version Advantages:

Simpler deployment to HF Spaces
More compact interface
Familiar for ML practitioners

🔒 Security & Privacy

API keys are not stored permanently
Files are processed temporarily and discarded
No data is retained between sessions
All processing happens server-side

🐛 Troubleshooting

Common Issues

"OpenRouter API key is required":
- Ensure you've entered a valid API key
- Check the key has sufficient credits
"No entities extracted":
- Document may be too short or unstructured
- Try lowering the importance threshold
- Check if the document contains meaningful text
File upload issues (Gradio version):
- Known issue with Gradio's file caching system
- Try the Streamlit version instead
- Ensure files are valid and not corrupted
Segmentation fault (local development):
- Usually related to matplotlib backend
- Try setting MPLBACKEND=Agg environment variable
- Install GUI toolkit if running locally with display
Module import errors:
- Ensure all requirements are installed: pip install -r requirements.txt
- Check Python version compatibility (3.8+)

Performance Tips

Use batch mode for related documents
Adjust chunk size for very long documents
Lower importance threshold for sparse documents
Use simpler layout algorithms for large graphs

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Test with both Streamlit and Gradio versions if applicable
Add tests if applicable
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

OpenRouter for LLM API access
Streamlit for the modern web interface framework
Gradio for the ML-focused web interface
NetworkX for graph processing
HuggingFace Spaces for hosting