CultriX's picture
First commit
e86199a
---
license: mit
title: Generate Knowledge Graphs
sdk: streamlit
emoji: πŸ“‰
colorFrom: indigo
colorTo: pink
short_description: Use LLM to generate a knowledge graph from your input data.
---
# πŸ•ΈοΈ Knowledge Graph Extraction App
A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.
## πŸš€ Features
- **Multi-format Document Support**: PDF, TXT, DOCX, JSON files up to 10MB
- **LLM-powered Extraction**: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
- **Smart Entity Detection**: Automatically identifies people, organizations, locations, concepts, events, and objects
- **Importance Scoring**: LLM evaluates entity importance from 0.0 to 1.0
- **Interactive Visualization**: Multiple graph layout algorithms with filtering options
- **Batch Processing**: Optional processing of multiple documents together
- **Export Capabilities**: JSON, GraphML, and GEXF formats
- **Real-time Statistics**: Graph metrics and centrality analysis
## πŸ“ Project Structure
```
knowledge-graphs/
β”œβ”€β”€ app.py # Main Gradio application (legacy)
β”œβ”€β”€ app_streamlit.py # Main Streamlit application (recommended)
β”œβ”€β”€ run_streamlit.py # Simple launcher script
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ README.md # Project documentation
β”œβ”€β”€ .env.example # Environment variables template
β”œβ”€β”€ config/
β”‚ └── settings.py # Configuration management
└── src/
β”œβ”€β”€ document_processor.py # Document loading and chunking
β”œβ”€β”€ llm_extractor.py # LLM-based entity extraction
β”œβ”€β”€ graph_builder.py # NetworkX graph construction
└── visualizer.py # Graph visualization and export
```
## πŸ”§ Installation & Setup
### Option 1: Streamlit Version (Recommended)
The Streamlit version is more stable and has better file handling.
**Quick Start:**
```bash
python run_streamlit.py
```
**Manual Setup:**
1. **Install dependencies**:
```bash
pip install -r requirements.txt
```
2. **Run the Streamlit app**:
```bash
streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501
```
The app will be available at `http://localhost:8501`
### Option 2: Gradio Version (Legacy)
The Gradio version may have some file caching issues but is provided for compatibility.
1. **Install dependencies**:
```bash
pip install -r requirements.txt
```
2. **Set up environment variables** (optional):
```bash
cp .env.example .env
# Edit .env and add your OpenRouter API key
```
3. **Run the application**:
```bash
python app.py
```
The app will be available at `http://localhost:7860`
### HuggingFace Spaces Deployment
For **Streamlit deployment**:
1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces)
2. Choose "Streamlit" as the SDK
3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name)
4. Upload all other project files maintaining directory structure
For **Gradio deployment**:
1. Create a new Space with "Gradio" as the SDK
2. Upload `app.py` and all other files
3. Note: May experience file handling issues
## πŸ”‘ API Configuration
### Getting OpenRouter API Key
1. Visit [OpenRouter.ai](https://openrouter.ai)
2. Sign up for a free account
3. Navigate to API Keys section
4. Generate a new API key
5. Copy the key and use it in the application
### Free Models Used
- **Primary**: `google/gemma-2-9b-it:free`
- **Backup**: `meta-llama/llama-3.1-8b-instruct:free`
These models are specifically chosen to minimize API costs while maintaining quality.
## πŸ“– Usage Guide
### Basic Workflow
1. **Upload Documents**:
- Select one or more files (PDF, TXT, DOCX, JSON)
- Toggle batch mode for multiple document processing
2. **Configure API**:
- Enter your OpenRouter API key
- Key is stored temporarily for the session
3. **Customize Settings**:
- Choose graph layout algorithm
- Toggle label visibility options
- Set minimum importance threshold
- Select entity types to include
4. **Extract Knowledge Graph**:
- Click "Extract Knowledge Graph" button
- Monitor progress through the status updates
- View results in multiple tabs
5. **Explore Results**:
- **Graph Visualization**: Interactive graph with colored nodes by entity type
- **Statistics**: Detailed metrics about the graph structure
- **Entities**: Complete list of extracted entities with details
- **Central Nodes**: Most important entities based on centrality measures
6. **Export Data**:
- Choose export format (JSON, GraphML, GEXF)
- Download structured graph data
### Advanced Features
#### Entity Types
- **PERSON**: Individuals mentioned in the text
- **ORGANIZATION**: Companies, institutions, groups
- **LOCATION**: Places, addresses, geographical entities
- **CONCEPT**: Abstract ideas, theories, methodologies
- **EVENT**: Specific occurrences, meetings, incidents
- **OBJECT**: Physical items, products, artifacts
#### Relationship Types
- **works_at**: Employment relationships
- **located_in**: Geographical associations
- **part_of**: Hierarchical relationships
- **causes**: Causal relationships
- **related_to**: General associations
#### Filtering Options
- **Importance Threshold**: Show only entities above specified importance score
- **Entity Types**: Filter by specific entity categories
- **Layout Algorithms**: Spring, circular, shell, Kamada-Kawai, random
## πŸ› οΈ Technical Details
### Architecture Components
1. **Document Processing**:
- Multi-format file parsing
- Intelligent text chunking with overlap
- File size validation
2. **LLM Integration**:
- OpenRouter API integration
- Structured prompt engineering
- Error handling and fallback models
3. **Graph Processing**:
- NetworkX-based graph construction
- Entity deduplication and standardization
- Relationship validation
4. **Visualization**:
- Matplotlib-based static graphs
- Interactive HTML visualizations
- Multiple export formats
### Configuration Options
All settings can be modified in `config/settings.py`:
- **Chunk Size**: Default 2000 characters
- **Chunk Overlap**: Default 200 characters
- **Max File Size**: Default 10MB
- **Max Entities**: Default 100 per extraction
- **Max Relationships**: Default 200 per extraction
- **Importance Threshold**: Default 0.3
### Differences Between Versions
**Streamlit Version Advantages:**
- More reliable file handling
- Better progress indicators
- Cleaner UI with sidebar configuration
- More stable caching system
- Built-in download functionality
**Gradio Version Advantages:**
- Simpler deployment to HF Spaces
- More compact interface
- Familiar for ML practitioners
## πŸ”’ Security & Privacy
- API keys are not stored permanently
- Files are processed temporarily and discarded
- No data is retained between sessions
- All processing happens server-side
## πŸ› Troubleshooting
### Common Issues
1. **"OpenRouter API key is required"**:
- Ensure you've entered a valid API key
- Check the key has sufficient credits
2. **"No entities extracted"**:
- Document may be too short or unstructured
- Try lowering the importance threshold
- Check if the document contains meaningful text
3. **File upload issues (Gradio version)**:
- Known issue with Gradio's file caching system
- Try the Streamlit version instead
- Ensure files are valid and not corrupted
4. **Segmentation fault (local development)**:
- Usually related to matplotlib backend
- Try setting `MPLBACKEND=Agg` environment variable
- Install GUI toolkit if running locally with display
5. **Module import errors**:
- Ensure all requirements are installed: `pip install -r requirements.txt`
- Check Python version compatibility (3.8+)
### Performance Tips
- Use batch mode for related documents
- Adjust chunk size for very long documents
- Lower importance threshold for sparse documents
- Use simpler layout algorithms for large graphs
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test with both Streamlit and Gradio versions if applicable
5. Add tests if applicable
6. Submit a pull request
## πŸ“„ License
This project is licensed under the MIT License - see the LICENSE file for details.
## πŸ™ Acknowledgments
- [OpenRouter](https://openrouter.ai) for LLM API access
- [Streamlit](https://streamlit.io) for the modern web interface framework
- [Gradio](https://gradio.app) for the ML-focused web interface
- [NetworkX](https://networkx.org) for graph processing
- [HuggingFace Spaces](https://huggingface.co/spaces) for hosting