|
--- |
|
license: mit |
|
title: Generate Knowledge Graphs |
|
sdk: streamlit |
|
emoji: π |
|
colorFrom: indigo |
|
colorTo: pink |
|
short_description: Use LLM to generate a knowledge graph from your input data. |
|
--- |
|
# πΈοΈ Knowledge Graph Extraction App |
|
|
|
A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions. |
|
|
|
## π Features |
|
|
|
- **Multi-format Document Support**: PDF, TXT, DOCX, JSON files up to 10MB |
|
- **LLM-powered Extraction**: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B) |
|
- **Smart Entity Detection**: Automatically identifies people, organizations, locations, concepts, events, and objects |
|
- **Importance Scoring**: LLM evaluates entity importance from 0.0 to 1.0 |
|
- **Interactive Visualization**: Multiple graph layout algorithms with filtering options |
|
- **Batch Processing**: Optional processing of multiple documents together |
|
- **Export Capabilities**: JSON, GraphML, and GEXF formats |
|
- **Real-time Statistics**: Graph metrics and centrality analysis |
|
|
|
## π Project Structure |
|
|
|
``` |
|
knowledge-graphs/ |
|
βββ app.py # Main Gradio application (legacy) |
|
βββ app_streamlit.py # Main Streamlit application (recommended) |
|
βββ run_streamlit.py # Simple launcher script |
|
βββ requirements.txt # Python dependencies |
|
βββ README.md # Project documentation |
|
βββ .env.example # Environment variables template |
|
βββ config/ |
|
β βββ settings.py # Configuration management |
|
βββ src/ |
|
βββ document_processor.py # Document loading and chunking |
|
βββ llm_extractor.py # LLM-based entity extraction |
|
βββ graph_builder.py # NetworkX graph construction |
|
βββ visualizer.py # Graph visualization and export |
|
``` |
|
|
|
## π§ Installation & Setup |
|
|
|
### Option 1: Streamlit Version (Recommended) |
|
|
|
The Streamlit version is more stable and has better file handling. |
|
|
|
**Quick Start:** |
|
```bash |
|
python run_streamlit.py |
|
``` |
|
|
|
**Manual Setup:** |
|
1. **Install dependencies**: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
2. **Run the Streamlit app**: |
|
```bash |
|
streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501 |
|
``` |
|
|
|
The app will be available at `http://localhost:8501` |
|
|
|
### Option 2: Gradio Version (Legacy) |
|
|
|
The Gradio version may have some file caching issues but is provided for compatibility. |
|
|
|
1. **Install dependencies**: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
2. **Set up environment variables** (optional): |
|
```bash |
|
cp .env.example .env |
|
# Edit .env and add your OpenRouter API key |
|
``` |
|
|
|
3. **Run the application**: |
|
```bash |
|
python app.py |
|
``` |
|
|
|
The app will be available at `http://localhost:7860` |
|
|
|
### HuggingFace Spaces Deployment |
|
|
|
For **Streamlit deployment**: |
|
1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces) |
|
2. Choose "Streamlit" as the SDK |
|
3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name) |
|
4. Upload all other project files maintaining directory structure |
|
|
|
For **Gradio deployment**: |
|
1. Create a new Space with "Gradio" as the SDK |
|
2. Upload `app.py` and all other files |
|
3. Note: May experience file handling issues |
|
|
|
## π API Configuration |
|
|
|
### Getting OpenRouter API Key |
|
|
|
1. Visit [OpenRouter.ai](https://openrouter.ai) |
|
2. Sign up for a free account |
|
3. Navigate to API Keys section |
|
4. Generate a new API key |
|
5. Copy the key and use it in the application |
|
|
|
### Free Models Used |
|
|
|
- **Primary**: `google/gemma-2-9b-it:free` |
|
- **Backup**: `meta-llama/llama-3.1-8b-instruct:free` |
|
|
|
These models are specifically chosen to minimize API costs while maintaining quality. |
|
|
|
## π Usage Guide |
|
|
|
### Basic Workflow |
|
|
|
1. **Upload Documents**: |
|
- Select one or more files (PDF, TXT, DOCX, JSON) |
|
- Toggle batch mode for multiple document processing |
|
|
|
2. **Configure API**: |
|
- Enter your OpenRouter API key |
|
- Key is stored temporarily for the session |
|
|
|
3. **Customize Settings**: |
|
- Choose graph layout algorithm |
|
- Toggle label visibility options |
|
- Set minimum importance threshold |
|
- Select entity types to include |
|
|
|
4. **Extract Knowledge Graph**: |
|
- Click "Extract Knowledge Graph" button |
|
- Monitor progress through the status updates |
|
- View results in multiple tabs |
|
|
|
5. **Explore Results**: |
|
- **Graph Visualization**: Interactive graph with colored nodes by entity type |
|
- **Statistics**: Detailed metrics about the graph structure |
|
- **Entities**: Complete list of extracted entities with details |
|
- **Central Nodes**: Most important entities based on centrality measures |
|
|
|
6. **Export Data**: |
|
- Choose export format (JSON, GraphML, GEXF) |
|
- Download structured graph data |
|
|
|
### Advanced Features |
|
|
|
#### Entity Types |
|
- **PERSON**: Individuals mentioned in the text |
|
- **ORGANIZATION**: Companies, institutions, groups |
|
- **LOCATION**: Places, addresses, geographical entities |
|
- **CONCEPT**: Abstract ideas, theories, methodologies |
|
- **EVENT**: Specific occurrences, meetings, incidents |
|
- **OBJECT**: Physical items, products, artifacts |
|
|
|
#### Relationship Types |
|
- **works_at**: Employment relationships |
|
- **located_in**: Geographical associations |
|
- **part_of**: Hierarchical relationships |
|
- **causes**: Causal relationships |
|
- **related_to**: General associations |
|
|
|
#### Filtering Options |
|
- **Importance Threshold**: Show only entities above specified importance score |
|
- **Entity Types**: Filter by specific entity categories |
|
- **Layout Algorithms**: Spring, circular, shell, Kamada-Kawai, random |
|
|
|
## π οΈ Technical Details |
|
|
|
### Architecture Components |
|
|
|
1. **Document Processing**: |
|
- Multi-format file parsing |
|
- Intelligent text chunking with overlap |
|
- File size validation |
|
|
|
2. **LLM Integration**: |
|
- OpenRouter API integration |
|
- Structured prompt engineering |
|
- Error handling and fallback models |
|
|
|
3. **Graph Processing**: |
|
- NetworkX-based graph construction |
|
- Entity deduplication and standardization |
|
- Relationship validation |
|
|
|
4. **Visualization**: |
|
- Matplotlib-based static graphs |
|
- Interactive HTML visualizations |
|
- Multiple export formats |
|
|
|
### Configuration Options |
|
|
|
All settings can be modified in `config/settings.py`: |
|
|
|
- **Chunk Size**: Default 2000 characters |
|
- **Chunk Overlap**: Default 200 characters |
|
- **Max File Size**: Default 10MB |
|
- **Max Entities**: Default 100 per extraction |
|
- **Max Relationships**: Default 200 per extraction |
|
- **Importance Threshold**: Default 0.3 |
|
|
|
### Differences Between Versions |
|
|
|
**Streamlit Version Advantages:** |
|
- More reliable file handling |
|
- Better progress indicators |
|
- Cleaner UI with sidebar configuration |
|
- More stable caching system |
|
- Built-in download functionality |
|
|
|
**Gradio Version Advantages:** |
|
- Simpler deployment to HF Spaces |
|
- More compact interface |
|
- Familiar for ML practitioners |
|
|
|
## π Security & Privacy |
|
|
|
- API keys are not stored permanently |
|
- Files are processed temporarily and discarded |
|
- No data is retained between sessions |
|
- All processing happens server-side |
|
|
|
## π Troubleshooting |
|
|
|
### Common Issues |
|
|
|
1. **"OpenRouter API key is required"**: |
|
- Ensure you've entered a valid API key |
|
- Check the key has sufficient credits |
|
|
|
2. **"No entities extracted"**: |
|
- Document may be too short or unstructured |
|
- Try lowering the importance threshold |
|
- Check if the document contains meaningful text |
|
|
|
3. **File upload issues (Gradio version)**: |
|
- Known issue with Gradio's file caching system |
|
- Try the Streamlit version instead |
|
- Ensure files are valid and not corrupted |
|
|
|
4. **Segmentation fault (local development)**: |
|
- Usually related to matplotlib backend |
|
- Try setting `MPLBACKEND=Agg` environment variable |
|
- Install GUI toolkit if running locally with display |
|
|
|
5. **Module import errors**: |
|
- Ensure all requirements are installed: `pip install -r requirements.txt` |
|
- Check Python version compatibility (3.8+) |
|
|
|
### Performance Tips |
|
|
|
- Use batch mode for related documents |
|
- Adjust chunk size for very long documents |
|
- Lower importance threshold for sparse documents |
|
- Use simpler layout algorithms for large graphs |
|
|
|
## π€ Contributing |
|
|
|
1. Fork the repository |
|
2. Create a feature branch |
|
3. Make your changes |
|
4. Test with both Streamlit and Gradio versions if applicable |
|
5. Add tests if applicable |
|
6. Submit a pull request |
|
|
|
## π License |
|
|
|
This project is licensed under the MIT License - see the LICENSE file for details. |
|
|
|
## π Acknowledgments |
|
|
|
- [OpenRouter](https://openrouter.ai) for LLM API access |
|
- [Streamlit](https://streamlit.io) for the modern web interface framework |
|
- [Gradio](https://gradio.app) for the ML-focused web interface |
|
- [NetworkX](https://networkx.org) for graph processing |
|
- [HuggingFace Spaces](https://huggingface.co/spaces) for hosting |