File size: 8,740 Bytes
c99d6eb be24bab e86199a c99d6eb e86199a c99d6eb e86199a c99d6eb e86199a c99d6eb e86199a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 |
---
license: mit
title: Generate Knowledge Graphs
sdk: streamlit
emoji: π
colorFrom: indigo
colorTo: pink
short_description: Use LLM to generate a knowledge graph from your input data.
---
# πΈοΈ Knowledge Graph Extraction App
A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.
## π Features
- **Multi-format Document Support**: PDF, TXT, DOCX, JSON files up to 10MB
- **LLM-powered Extraction**: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
- **Smart Entity Detection**: Automatically identifies people, organizations, locations, concepts, events, and objects
- **Importance Scoring**: LLM evaluates entity importance from 0.0 to 1.0
- **Interactive Visualization**: Multiple graph layout algorithms with filtering options
- **Batch Processing**: Optional processing of multiple documents together
- **Export Capabilities**: JSON, GraphML, and GEXF formats
- **Real-time Statistics**: Graph metrics and centrality analysis
## π Project Structure
```
knowledge-graphs/
βββ app.py # Main Gradio application (legacy)
βββ app_streamlit.py # Main Streamlit application (recommended)
βββ run_streamlit.py # Simple launcher script
βββ requirements.txt # Python dependencies
βββ README.md # Project documentation
βββ .env.example # Environment variables template
βββ config/
β βββ settings.py # Configuration management
βββ src/
βββ document_processor.py # Document loading and chunking
βββ llm_extractor.py # LLM-based entity extraction
βββ graph_builder.py # NetworkX graph construction
βββ visualizer.py # Graph visualization and export
```
## π§ Installation & Setup
### Option 1: Streamlit Version (Recommended)
The Streamlit version is more stable and has better file handling.
**Quick Start:**
```bash
python run_streamlit.py
```
**Manual Setup:**
1. **Install dependencies**:
```bash
pip install -r requirements.txt
```
2. **Run the Streamlit app**:
```bash
streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501
```
The app will be available at `http://localhost:8501`
### Option 2: Gradio Version (Legacy)
The Gradio version may have some file caching issues but is provided for compatibility.
1. **Install dependencies**:
```bash
pip install -r requirements.txt
```
2. **Set up environment variables** (optional):
```bash
cp .env.example .env
# Edit .env and add your OpenRouter API key
```
3. **Run the application**:
```bash
python app.py
```
The app will be available at `http://localhost:7860`
### HuggingFace Spaces Deployment
For **Streamlit deployment**:
1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces)
2. Choose "Streamlit" as the SDK
3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name)
4. Upload all other project files maintaining directory structure
For **Gradio deployment**:
1. Create a new Space with "Gradio" as the SDK
2. Upload `app.py` and all other files
3. Note: May experience file handling issues
## π API Configuration
### Getting OpenRouter API Key
1. Visit [OpenRouter.ai](https://openrouter.ai)
2. Sign up for a free account
3. Navigate to API Keys section
4. Generate a new API key
5. Copy the key and use it in the application
### Free Models Used
- **Primary**: `google/gemma-2-9b-it:free`
- **Backup**: `meta-llama/llama-3.1-8b-instruct:free`
These models are specifically chosen to minimize API costs while maintaining quality.
## π Usage Guide
### Basic Workflow
1. **Upload Documents**:
- Select one or more files (PDF, TXT, DOCX, JSON)
- Toggle batch mode for multiple document processing
2. **Configure API**:
- Enter your OpenRouter API key
- Key is stored temporarily for the session
3. **Customize Settings**:
- Choose graph layout algorithm
- Toggle label visibility options
- Set minimum importance threshold
- Select entity types to include
4. **Extract Knowledge Graph**:
- Click "Extract Knowledge Graph" button
- Monitor progress through the status updates
- View results in multiple tabs
5. **Explore Results**:
- **Graph Visualization**: Interactive graph with colored nodes by entity type
- **Statistics**: Detailed metrics about the graph structure
- **Entities**: Complete list of extracted entities with details
- **Central Nodes**: Most important entities based on centrality measures
6. **Export Data**:
- Choose export format (JSON, GraphML, GEXF)
- Download structured graph data
### Advanced Features
#### Entity Types
- **PERSON**: Individuals mentioned in the text
- **ORGANIZATION**: Companies, institutions, groups
- **LOCATION**: Places, addresses, geographical entities
- **CONCEPT**: Abstract ideas, theories, methodologies
- **EVENT**: Specific occurrences, meetings, incidents
- **OBJECT**: Physical items, products, artifacts
#### Relationship Types
- **works_at**: Employment relationships
- **located_in**: Geographical associations
- **part_of**: Hierarchical relationships
- **causes**: Causal relationships
- **related_to**: General associations
#### Filtering Options
- **Importance Threshold**: Show only entities above specified importance score
- **Entity Types**: Filter by specific entity categories
- **Layout Algorithms**: Spring, circular, shell, Kamada-Kawai, random
## π οΈ Technical Details
### Architecture Components
1. **Document Processing**:
- Multi-format file parsing
- Intelligent text chunking with overlap
- File size validation
2. **LLM Integration**:
- OpenRouter API integration
- Structured prompt engineering
- Error handling and fallback models
3. **Graph Processing**:
- NetworkX-based graph construction
- Entity deduplication and standardization
- Relationship validation
4. **Visualization**:
- Matplotlib-based static graphs
- Interactive HTML visualizations
- Multiple export formats
### Configuration Options
All settings can be modified in `config/settings.py`:
- **Chunk Size**: Default 2000 characters
- **Chunk Overlap**: Default 200 characters
- **Max File Size**: Default 10MB
- **Max Entities**: Default 100 per extraction
- **Max Relationships**: Default 200 per extraction
- **Importance Threshold**: Default 0.3
### Differences Between Versions
**Streamlit Version Advantages:**
- More reliable file handling
- Better progress indicators
- Cleaner UI with sidebar configuration
- More stable caching system
- Built-in download functionality
**Gradio Version Advantages:**
- Simpler deployment to HF Spaces
- More compact interface
- Familiar for ML practitioners
## π Security & Privacy
- API keys are not stored permanently
- Files are processed temporarily and discarded
- No data is retained between sessions
- All processing happens server-side
## π Troubleshooting
### Common Issues
1. **"OpenRouter API key is required"**:
- Ensure you've entered a valid API key
- Check the key has sufficient credits
2. **"No entities extracted"**:
- Document may be too short or unstructured
- Try lowering the importance threshold
- Check if the document contains meaningful text
3. **File upload issues (Gradio version)**:
- Known issue with Gradio's file caching system
- Try the Streamlit version instead
- Ensure files are valid and not corrupted
4. **Segmentation fault (local development)**:
- Usually related to matplotlib backend
- Try setting `MPLBACKEND=Agg` environment variable
- Install GUI toolkit if running locally with display
5. **Module import errors**:
- Ensure all requirements are installed: `pip install -r requirements.txt`
- Check Python version compatibility (3.8+)
### Performance Tips
- Use batch mode for related documents
- Adjust chunk size for very long documents
- Lower importance threshold for sparse documents
- Use simpler layout algorithms for large graphs
## π€ Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test with both Streamlit and Gradio versions if applicable
5. Add tests if applicable
6. Submit a pull request
## π License
This project is licensed under the MIT License - see the LICENSE file for details.
## π Acknowledgments
- [OpenRouter](https://openrouter.ai) for LLM API access
- [Streamlit](https://streamlit.io) for the modern web interface framework
- [Gradio](https://gradio.app) for the ML-focused web interface
- [NetworkX](https://networkx.org) for graph processing
- [HuggingFace Spaces](https://huggingface.co/spaces) for hosting |