---
license: mit
title: Generate Knowledge Graphs
sdk: streamlit
emoji: πŸ“‰
colorFrom: indigo
colorTo: pink
short_description: Use LLM to generate a knowledge graph from your input data.
---
# πŸ•ΈοΈ Knowledge Graph Extraction App

A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.

## πŸš€ Features

- **Multi-format Document Support**: PDF, TXT, DOCX, JSON files up to 10MB
- **LLM-powered Extraction**: Uses the OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
- **Smart Entity Detection**: Automatically identifies people, organizations, locations, concepts, events, and objects
- **Importance Scoring**: LLM evaluates entity importance from 0.0 to 1.0
- **Interactive Visualization**: Multiple graph layout algorithms with filtering options
- **Batch Processing**: Optional processing of multiple documents together
- **Export Capabilities**: JSON, GraphML, and GEXF formats
- **Real-time Statistics**: Graph metrics and centrality analysis

## πŸ“ Project Structure

```
knowledge-graphs/
β”œβ”€β”€ app.py                    # Main Gradio application (legacy)
β”œβ”€β”€ app_streamlit.py          # Main Streamlit application (recommended)
β”œβ”€β”€ run_streamlit.py          # Simple launcher script
β”œβ”€β”€ requirements.txt          # Python dependencies
β”œβ”€β”€ README.md                # Project documentation
β”œβ”€β”€ .env.example             # Environment variables template
β”œβ”€β”€ config/
β”‚   └── settings.py          # Configuration management
└── src/
    β”œβ”€β”€ document_processor.py # Document loading and chunking
    β”œβ”€β”€ llm_extractor.py      # LLM-based entity extraction
    β”œβ”€β”€ graph_builder.py      # NetworkX graph construction
    └── visualizer.py         # Graph visualization and export
```

## πŸ”§ Installation & Setup

### Option 1: Streamlit Version (Recommended)

The Streamlit version is more stable and has better file handling.

**Quick Start:**
```bash
python run_streamlit.py
```

**Manual Setup:**
1. **Install dependencies**:
```bash
pip install -r requirements.txt
```

2. **Run the Streamlit app**:
```bash
streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501
```

The app will be available at `http://localhost:8501`.

### Option 2: Gradio Version (Legacy)

The Gradio version may have some file caching issues but is provided for compatibility.

1. **Install dependencies**:
```bash
pip install -r requirements.txt
```

2. **Set up environment variables** (optional):
```bash
cp .env.example .env
# Edit .env and add your OpenRouter API key
```

3. **Run the application**:
```bash
python app.py
```

The app will be available at `http://localhost:7860`.

### HuggingFace Spaces Deployment

For **Streamlit deployment**:
1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces)
2. Choose "Streamlit" as the SDK
3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name)
4. Upload all other project files maintaining directory structure

For **Gradio deployment**:
1. Create a new Space with "Gradio" as the SDK
2. Upload `app.py` and all other files
3. Note: this version may experience file handling issues

## πŸ”‘ API Configuration

### Getting OpenRouter API Key

1. Visit [OpenRouter.ai](https://openrouter.ai)
2. Sign up for a free account
3. Navigate to API Keys section
4. Generate a new API key
5. Copy the key and use it in the application

### Free Models Used

- **Primary**: `google/gemma-2-9b-it:free`
- **Backup**: `meta-llama/llama-3.1-8b-instruct:free`

These models are specifically chosen to minimize API costs while maintaining quality.
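
A minimal sketch of an OpenRouter call with one of these models (the function name and prompt here are illustrative; the app's actual prompting and fallback logic live in `src/llm_extractor.py`):

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def ask_openrouter(text: str, api_key: str,
                   model: str = "google/gemma-2-9b-it:free") -> str:
    """Send one text chunk to OpenRouter and return the raw model reply."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,
            "messages": [
                {"role": "system",
                 "content": "Extract entities and relationships from the text as JSON."},
                {"role": "user", "content": text},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    key = os.environ["OPENROUTER_API_KEY"]  # or paste the key into the app's UI
    print(ask_openrouter("Alice Chen works at Acme Corp in Berlin.", key))
```

If the primary model is rate-limited, switching the `model` argument to the backup ID listed above is usually enough.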

## πŸ“– Usage Guide

### Basic Workflow

1. **Upload Documents**: 
   - Select one or more files (PDF, TXT, DOCX, JSON)
   - Toggle batch mode for multiple document processing

2. **Configure API**:
   - Enter your OpenRouter API key
   - The key is kept only for the current session

3. **Customize Settings**:
   - Choose graph layout algorithm
   - Toggle label visibility options
   - Set minimum importance threshold
   - Select entity types to include

4. **Extract Knowledge Graph**:
   - Click "Extract Knowledge Graph" button
   - Monitor progress through the status updates
   - View results in multiple tabs

5. **Explore Results**:
   - **Graph Visualization**: Interactive graph with colored nodes by entity type
   - **Statistics**: Detailed metrics about the graph structure
   - **Entities**: Complete list of extracted entities with details
   - **Central Nodes**: Most important entities based on centrality measures

6. **Export Data**:
   - Choose export format (JSON, GraphML, GEXF)
   - Download structured graph data
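
These exports correspond to standard NetworkX writers; a minimal sketch (the helper name `export_graph` is illustrative, the real code lives in `src/visualizer.py`):

```python
import json
import networkx as nx

def export_graph(G: nx.Graph, basename: str = "knowledge_graph") -> None:
    """Write the graph in the three formats offered by the app."""
    nx.write_graphml(G, f"{basename}.graphml")    # GraphML, opens in Gephi/yEd
    nx.write_gexf(G, f"{basename}.gexf")          # GEXF, Gephi's native format
    with open(f"{basename}.json", "w") as fh:     # node-link JSON for custom tooling
        json.dump(nx.node_link_data(G), fh, indent=2)
```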

### Advanced Features

#### Entity Types
- **PERSON**: Individuals mentioned in the text
- **ORGANIZATION**: Companies, institutions, groups
- **LOCATION**: Places, addresses, geographical entities
- **CONCEPT**: Abstract ideas, theories, methodologies
- **EVENT**: Specific occurrences, meetings, incidents
- **OBJECT**: Physical items, products, artifacts

#### Relationship Types
- **works_at**: Employment relationships
- **located_in**: Geographical associations
- **part_of**: Hierarchical relationships
- **causes**: Causal relationships
- **related_to**: General associations
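
Internally, each extraction is expected to come back as structured data shaped roughly like this (an illustrative example, not the exact schema used by `src/llm_extractor.py`):

```python
# Illustrative shape of one extraction result
extraction = {
    "entities": [
        {"name": "Alice Chen", "type": "PERSON",       "importance": 0.9},
        {"name": "Acme Corp",  "type": "ORGANIZATION", "importance": 0.7},
        {"name": "Berlin",     "type": "LOCATION",     "importance": 0.4},
    ],
    "relationships": [
        {"source": "Alice Chen", "target": "Acme Corp", "type": "works_at"},
        {"source": "Acme Corp",  "target": "Berlin",    "type": "located_in"},
    ],
}
```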

#### Filtering Options
- **Importance Threshold**: Show only entities above specified importance score
- **Entity Types**: Filter by specific entity categories
- **Layout Algorithms**: Spring, circular, shell, Kamada-Kawai, random
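
Both filters and the layout choice map onto standard NetworkX calls; a rough sketch, assuming nodes carry the `type` and `importance` attributes described above:

```python
import matplotlib.pyplot as plt
import networkx as nx

def draw_filtered(G, min_importance=0.3, keep_types=("PERSON", "ORGANIZATION")):
    """Drop low-importance / unwanted-type nodes, then draw with a spring layout."""
    keep = [n for n, d in G.nodes(data=True)
            if d.get("importance", 0.0) >= min_importance
            and d.get("type") in keep_types]
    H = G.subgraph(keep)
    pos = nx.spring_layout(H)  # alternatives: nx.circular_layout, nx.shell_layout,
                               # nx.kamada_kawai_layout, nx.random_layout
    nx.draw_networkx(H, pos, with_labels=True, node_size=300)
    plt.show()
```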

## πŸ› οΈ Technical Details

### Architecture Components

1. **Document Processing**: 
   - Multi-format file parsing
   - Intelligent text chunking with overlap
   - File size validation

2. **LLM Integration**:
   - OpenRouter API integration
   - Structured prompt engineering
   - Error handling and fallback models

3. **Graph Processing**:
   - NetworkX-based graph construction
   - Entity deduplication and standardization
   - Relationship validation

4. **Visualization**:
   - Matplotlib-based static graphs
   - Interactive HTML visualizations
   - Multiple export formats
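
As a simplified sketch of the graph-processing step (the real `src/graph_builder.py` also deduplicates entities and validates relationships more carefully):

```python
import networkx as nx

def build_graph(extraction: dict) -> nx.DiGraph:
    """Turn one extraction result (see the schema sketch above) into a directed graph."""
    G = nx.DiGraph()
    for ent in extraction["entities"]:
        name = ent["name"].strip()  # naive normalization stands in for deduplication
        G.add_node(name, type=ent["type"], importance=ent["importance"])
    for rel in extraction["relationships"]:
        # keep only edges whose endpoints were actually extracted
        if rel["source"] in G and rel["target"] in G:
            G.add_edge(rel["source"], rel["target"], type=rel["type"])
    return G
```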

### Configuration Options

All settings can be modified in `config/settings.py`:

- **Chunk Size**: Default 2000 characters
- **Chunk Overlap**: Default 200 characters  
- **Max File Size**: Default 10MB
- **Max Entities**: Default 100 per extraction
- **Max Relationships**: Default 200 per extraction
- **Importance Threshold**: Default 0.3
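
As a rough sketch, the corresponding constants in `config/settings.py` might look like this (the names are illustrative; check the file itself for the real identifiers):

```python
# config/settings.py (illustrative constant names)
CHUNK_SIZE = 2000                  # characters per chunk sent to the LLM
CHUNK_OVERLAP = 200                # characters shared between consecutive chunks
MAX_FILE_SIZE = 10 * 1024 * 1024   # 10 MB upload limit
MAX_ENTITIES = 100                 # cap per extraction
MAX_RELATIONSHIPS = 200            # cap per extraction
IMPORTANCE_THRESHOLD = 0.3         # default visualization filter
```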

### Differences Between Versions

**Streamlit Version Advantages:**
- More reliable file handling
- Better progress indicators
- Cleaner UI with sidebar configuration
- More stable caching system
- Built-in download functionality

**Gradio Version Advantages:**
- Simpler deployment to HF Spaces
- More compact interface
- Familiar for ML practitioners

## πŸ”’ Security & Privacy

- API keys are not stored permanently
- Files are processed temporarily and discarded
- No data is retained between sessions
- All processing happens server-side

## πŸ› Troubleshooting

### Common Issues

1. **"OpenRouter API key is required"**:
   - Ensure you've entered a valid API key
   - Check the key has sufficient credits

2. **"No entities extracted"**:
   - Document may be too short or unstructured
   - Try lowering the importance threshold
   - Check if the document contains meaningful text

3. **File upload issues (Gradio version)**:
   - Known issue with Gradio's file caching system
   - Try the Streamlit version instead
   - Ensure files are valid and not corrupted

4. **Segmentation fault (local development)**:
   - Usually related to the Matplotlib backend
   - Try setting the `MPLBACKEND=Agg` environment variable (see the snippet after this list)
   - Install a GUI toolkit if running locally with a display

5. **Module import errors**:
   - Ensure all requirements are installed: `pip install -r requirements.txt`
   - Check Python version compatibility (3.8+)
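
For the segmentation-fault case, forcing the non-interactive Matplotlib backend before `pyplot` is imported usually helps, for example:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend, no GUI toolkit required
import matplotlib.pyplot as plt  # must come after the backend is set
```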

### Performance Tips

- Use batch mode for related documents
- Adjust chunk size for very long documents
- Lower importance threshold for sparse documents
- Use simpler layout algorithms for large graphs

## 🀝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test with both Streamlit and Gradio versions if applicable
5. Add tests if applicable
6. Submit a pull request

## πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

## πŸ™ Acknowledgments

- [OpenRouter](https://openrouter.ai) for LLM API access
- [Streamlit](https://streamlit.io) for the modern web interface framework
- [Gradio](https://gradio.app) for the ML-focused web interface
- [NetworkX](https://networkx.org) for graph processing
- [HuggingFace Spaces](https://huggingface.co/spaces) for hosting