Spaces:

CultriX
/

Generate-Knowledge-Graphs

Running

App Files Files Community

Generate-Knowledge-Graphs / README.md

CultriX

First commit

e86199a 4 days ago

preview code

raw

history blame contribute delete

8.74 kB

	---
	license: mit
	title: Generate Knowledge Graphs
	sdk: streamlit
	emoji: 📉
	colorFrom: indigo
	colorTo: pink
	short_description: Use LLM to generate a knowledge graph from your input data.
	---
	# 🕸️ Knowledge Graph Extraction App

	A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.

	## 🚀 Features

	- Multi-format Document Support: PDF, TXT, DOCX, JSON files up to 10MB
	- LLM-powered Extraction: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
	- Smart Entity Detection: Automatically identifies people, organizations, locations, concepts, events, and objects
	- Importance Scoring: LLM evaluates entity importance from 0.0 to 1.0
	- Interactive Visualization: Multiple graph layout algorithms with filtering options
	- Batch Processing: Optional processing of multiple documents together
	- Export Capabilities: JSON, GraphML, and GEXF formats
	- Real-time Statistics: Graph metrics and centrality analysis

	## 📁 Project Structure

	```
	knowledge-graphs/
	├── app.py # Main Gradio application (legacy)
	├── app_streamlit.py # Main Streamlit application (recommended)
	├── run_streamlit.py # Simple launcher script
	├── requirements.txt # Python dependencies
	├── README.md # Project documentation
	├── .env.example # Environment variables template
	├── config/
	│ └── settings.py # Configuration management
	└── src/
	├── document_processor.py # Document loading and chunking
	├── llm_extractor.py # LLM-based entity extraction
	├── graph_builder.py # NetworkX graph construction
	└── visualizer.py # Graph visualization and export
	```

	## 🔧 Installation & Setup

	### Option 1: Streamlit Version (Recommended)

	The Streamlit version is more stable and has better file handling.

	Quick Start:
	```bash
	python run_streamlit.py
	```

	Manual Setup:
	1. Install dependencies:
	```bash
	pip install -r requirements.txt
	```

	2. Run the Streamlit app:
	```bash
	streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501
	```

	The app will be available at `http://localhost:8501`

	### Option 2: Gradio Version (Legacy)

	The Gradio version may have some file caching issues but is provided for compatibility.

	1. Install dependencies:
	```bash
	pip install -r requirements.txt
	```

	2. Set up environment variables (optional):
	```bash
	cp .env.example .env
	# Edit .env and add your OpenRouter API key
	```

	3. Run the application:
	```bash
	python app.py
	```

	The app will be available at `http://localhost:7860`

	### HuggingFace Spaces Deployment

	For Streamlit deployment:
	1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces)
	2. Choose "Streamlit" as the SDK
	3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name)
	4. Upload all other project files maintaining directory structure

	For Gradio deployment:
	1. Create a new Space with "Gradio" as the SDK
	2. Upload `app.py` and all other files
	3. Note: May experience file handling issues

	## 🔑 API Configuration

	### Getting OpenRouter API Key

	1. Visit [OpenRouter.ai](https://openrouter.ai)
	2. Sign up for a free account
	3. Navigate to API Keys section
	4. Generate a new API key
	5. Copy the key and use it in the application

	### Free Models Used

	- Primary: `google/gemma-2-9b-it:free`
	- Backup: `meta-llama/llama-3.1-8b-instruct:free`

	These models are specifically chosen to minimize API costs while maintaining quality.

	## 📖 Usage Guide

	### Basic Workflow

	1. Upload Documents:
	- Select one or more files (PDF, TXT, DOCX, JSON)
	- Toggle batch mode for multiple document processing

	2. Configure API:
	- Enter your OpenRouter API key
	- Key is stored temporarily for the session

	3. Customize Settings:
	- Choose graph layout algorithm
	- Toggle label visibility options
	- Set minimum importance threshold
	- Select entity types to include

	4. Extract Knowledge Graph:
	- Click "Extract Knowledge Graph" button
	- Monitor progress through the status updates
	- View results in multiple tabs

	5. Explore Results:
	- Graph Visualization: Interactive graph with colored nodes by entity type
	- Statistics: Detailed metrics about the graph structure
	- Entities: Complete list of extracted entities with details
	- Central Nodes: Most important entities based on centrality measures

	6. Export Data:
	- Choose export format (JSON, GraphML, GEXF)
	- Download structured graph data

	### Advanced Features

	#### Entity Types
	- PERSON: Individuals mentioned in the text
	- ORGANIZATION: Companies, institutions, groups
	- LOCATION: Places, addresses, geographical entities
	- CONCEPT: Abstract ideas, theories, methodologies
	- EVENT: Specific occurrences, meetings, incidents
	- OBJECT: Physical items, products, artifacts

	#### Relationship Types
	- works_at: Employment relationships
	- located_in: Geographical associations
	- part_of: Hierarchical relationships
	- causes: Causal relationships
	- related_to: General associations

	#### Filtering Options
	- Importance Threshold: Show only entities above specified importance score
	- Entity Types: Filter by specific entity categories
	- Layout Algorithms: Spring, circular, shell, Kamada-Kawai, random

	## 🛠️ Technical Details

	### Architecture Components

	1. Document Processing:
	- Multi-format file parsing
	- Intelligent text chunking with overlap
	- File size validation

	2. LLM Integration:
	- OpenRouter API integration
	- Structured prompt engineering
	- Error handling and fallback models

	3. Graph Processing:
	- NetworkX-based graph construction
	- Entity deduplication and standardization
	- Relationship validation

	4. Visualization:
	- Matplotlib-based static graphs
	- Interactive HTML visualizations
	- Multiple export formats

	### Configuration Options

	All settings can be modified in `config/settings.py`:

	- Chunk Size: Default 2000 characters
	- Chunk Overlap: Default 200 characters
	- Max File Size: Default 10MB
	- Max Entities: Default 100 per extraction
	- Max Relationships: Default 200 per extraction
	- Importance Threshold: Default 0.3

	### Differences Between Versions

	Streamlit Version Advantages:
	- More reliable file handling
	- Better progress indicators
	- Cleaner UI with sidebar configuration
	- More stable caching system
	- Built-in download functionality

	Gradio Version Advantages:
	- Simpler deployment to HF Spaces
	- More compact interface
	- Familiar for ML practitioners

	## 🔒 Security & Privacy

	- API keys are not stored permanently
	- Files are processed temporarily and discarded
	- No data is retained between sessions
	- All processing happens server-side

	## 🐛 Troubleshooting

	### Common Issues

	1. "OpenRouter API key is required":
	- Ensure you've entered a valid API key
	- Check the key has sufficient credits

	2. "No entities extracted":
	- Document may be too short or unstructured
	- Try lowering the importance threshold
	- Check if the document contains meaningful text

	3. File upload issues (Gradio version):
	- Known issue with Gradio's file caching system
	- Try the Streamlit version instead
	- Ensure files are valid and not corrupted

	4. Segmentation fault (local development):
	- Usually related to matplotlib backend
	- Try setting `MPLBACKEND=Agg` environment variable
	- Install GUI toolkit if running locally with display

	5. Module import errors:
	- Ensure all requirements are installed: `pip install -r requirements.txt`
	- Check Python version compatibility (3.8+)

	### Performance Tips

	- Use batch mode for related documents
	- Adjust chunk size for very long documents
	- Lower importance threshold for sparse documents
	- Use simpler layout algorithms for large graphs

	## 🤝 Contributing

	1. Fork the repository
	2. Create a feature branch
	3. Make your changes
	4. Test with both Streamlit and Gradio versions if applicable
	5. Add tests if applicable
	6. Submit a pull request

	## 📄 License

	This project is licensed under the MIT License - see the LICENSE file for details.

	## 🙏 Acknowledgments

	- [OpenRouter](https://openrouter.ai) for LLM API access
	- [Streamlit](https://streamlit.io) for the modern web interface framework
	- [Gradio](https://gradio.app) for the ML-focused web interface
	- [NetworkX](https://networkx.org) for graph processing
	- [HuggingFace Spaces](https://huggingface.co/spaces) for hosting