license: mit
title: Generate Knowledge Graphs
sdk: streamlit
emoji: π
colorFrom: indigo
colorTo: pink
short_description: Use LLM to generate a knowledge graph from your input data.
πΈοΈ Knowledge Graph Extraction App
A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.
π Features
- Multi-format Document Support: PDF, TXT, DOCX, JSON files up to 10MB
- LLM-powered Extraction: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
- Smart Entity Detection: Automatically identifies people, organizations, locations, concepts, events, and objects
- Importance Scoring: LLM evaluates entity importance from 0.0 to 1.0
- Interactive Visualization: Multiple graph layout algorithms with filtering options
- Batch Processing: Optional processing of multiple documents together
- Export Capabilities: JSON, GraphML, and GEXF formats
- Real-time Statistics: Graph metrics and centrality analysis
π Project Structure
knowledge-graphs/
βββ app.py # Main Gradio application (legacy)
βββ app_streamlit.py # Main Streamlit application (recommended)
βββ run_streamlit.py # Simple launcher script
βββ requirements.txt # Python dependencies
βββ README.md # Project documentation
βββ .env.example # Environment variables template
βββ config/
β βββ settings.py # Configuration management
βββ src/
βββ document_processor.py # Document loading and chunking
βββ llm_extractor.py # LLM-based entity extraction
βββ graph_builder.py # NetworkX graph construction
βββ visualizer.py # Graph visualization and export
π§ Installation & Setup
Option 1: Streamlit Version (Recommended)
The Streamlit version is more stable and has better file handling.
Quick Start:
python run_streamlit.py
Manual Setup:
- Install dependencies:
pip install -r requirements.txt
- Run the Streamlit app:
streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501
The app will be available at http://localhost:8501
Option 2: Gradio Version (Legacy)
The Gradio version may have some file caching issues but is provided for compatibility.
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables (optional):
cp .env.example .env
# Edit .env and add your OpenRouter API key
- Run the application:
python app.py
The app will be available at http://localhost:7860
HuggingFace Spaces Deployment
For Streamlit deployment:
- Create a new Space on HuggingFace Spaces
- Choose "Streamlit" as the SDK
- Upload
app_streamlit.py
asapp.py
(HF Spaces expects this name) - Upload all other project files maintaining directory structure
For Gradio deployment:
- Create a new Space with "Gradio" as the SDK
- Upload
app.py
and all other files - Note: May experience file handling issues
π API Configuration
Getting OpenRouter API Key
- Visit OpenRouter.ai
- Sign up for a free account
- Navigate to API Keys section
- Generate a new API key
- Copy the key and use it in the application
Free Models Used
- Primary:
google/gemma-2-9b-it:free
- Backup:
meta-llama/llama-3.1-8b-instruct:free
These models are specifically chosen to minimize API costs while maintaining quality.
π Usage Guide
Basic Workflow
Upload Documents:
- Select one or more files (PDF, TXT, DOCX, JSON)
- Toggle batch mode for multiple document processing
Configure API:
- Enter your OpenRouter API key
- Key is stored temporarily for the session
Customize Settings:
- Choose graph layout algorithm
- Toggle label visibility options
- Set minimum importance threshold
- Select entity types to include
Extract Knowledge Graph:
- Click "Extract Knowledge Graph" button
- Monitor progress through the status updates
- View results in multiple tabs
Explore Results:
- Graph Visualization: Interactive graph with colored nodes by entity type
- Statistics: Detailed metrics about the graph structure
- Entities: Complete list of extracted entities with details
- Central Nodes: Most important entities based on centrality measures
Export Data:
- Choose export format (JSON, GraphML, GEXF)
- Download structured graph data
Advanced Features
Entity Types
- PERSON: Individuals mentioned in the text
- ORGANIZATION: Companies, institutions, groups
- LOCATION: Places, addresses, geographical entities
- CONCEPT: Abstract ideas, theories, methodologies
- EVENT: Specific occurrences, meetings, incidents
- OBJECT: Physical items, products, artifacts
Relationship Types
- works_at: Employment relationships
- located_in: Geographical associations
- part_of: Hierarchical relationships
- causes: Causal relationships
- related_to: General associations
Filtering Options
- Importance Threshold: Show only entities above specified importance score
- Entity Types: Filter by specific entity categories
- Layout Algorithms: Spring, circular, shell, Kamada-Kawai, random
π οΈ Technical Details
Architecture Components
Document Processing:
- Multi-format file parsing
- Intelligent text chunking with overlap
- File size validation
LLM Integration:
- OpenRouter API integration
- Structured prompt engineering
- Error handling and fallback models
Graph Processing:
- NetworkX-based graph construction
- Entity deduplication and standardization
- Relationship validation
Visualization:
- Matplotlib-based static graphs
- Interactive HTML visualizations
- Multiple export formats
Configuration Options
All settings can be modified in config/settings.py
:
- Chunk Size: Default 2000 characters
- Chunk Overlap: Default 200 characters
- Max File Size: Default 10MB
- Max Entities: Default 100 per extraction
- Max Relationships: Default 200 per extraction
- Importance Threshold: Default 0.3
Differences Between Versions
Streamlit Version Advantages:
- More reliable file handling
- Better progress indicators
- Cleaner UI with sidebar configuration
- More stable caching system
- Built-in download functionality
Gradio Version Advantages:
- Simpler deployment to HF Spaces
- More compact interface
- Familiar for ML practitioners
π Security & Privacy
- API keys are not stored permanently
- Files are processed temporarily and discarded
- No data is retained between sessions
- All processing happens server-side
π Troubleshooting
Common Issues
"OpenRouter API key is required":
- Ensure you've entered a valid API key
- Check the key has sufficient credits
"No entities extracted":
- Document may be too short or unstructured
- Try lowering the importance threshold
- Check if the document contains meaningful text
File upload issues (Gradio version):
- Known issue with Gradio's file caching system
- Try the Streamlit version instead
- Ensure files are valid and not corrupted
Segmentation fault (local development):
- Usually related to matplotlib backend
- Try setting
MPLBACKEND=Agg
environment variable - Install GUI toolkit if running locally with display
Module import errors:
- Ensure all requirements are installed:
pip install -r requirements.txt
- Check Python version compatibility (3.8+)
- Ensure all requirements are installed:
Performance Tips
- Use batch mode for related documents
- Adjust chunk size for very long documents
- Lower importance threshold for sparse documents
- Use simpler layout algorithms for large graphs
π€ Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Test with both Streamlit and Gradio versions if applicable
- Add tests if applicable
- Submit a pull request
π License
This project is licensed under the MIT License - see the LICENSE file for details.
π Acknowledgments
- OpenRouter for LLM API access
- Streamlit for the modern web interface framework
- Gradio for the ML-focused web interface
- NetworkX for graph processing
- HuggingFace Spaces for hosting