---
license: mit
title: Generate Knowledge Graphs
sdk: streamlit
emoji: 📉
colorFrom: indigo
colorTo: pink
short_description: Use LLM to generate a knowledge graph from your input data.
---

# 🕸️ Knowledge Graph Extraction App

A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.

## 🚀 Features

- **Multi-format Document Support**: PDF, TXT, DOCX, JSON files up to 10MB
- **LLM-powered Extraction**: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
- **Smart Entity Detection**: Automatically identifies people, organizations, locations, concepts, events, and objects
- **Importance Scoring**: LLM evaluates entity importance from 0.0 to 1.0
- **Interactive Visualization**: Multiple graph layout algorithms with filtering options
- **Batch Processing**: Optional processing of multiple documents together
- **Export Capabilities**: JSON, GraphML, and GEXF formats
- **Real-time Statistics**: Graph metrics and centrality analysis

## 📁 Project Structure

```
knowledge-graphs/
├── app.py                    # Main Gradio application (legacy)
├── app_streamlit.py          # Main Streamlit application (recommended)
├── run_streamlit.py          # Simple launcher script
├── requirements.txt          # Python dependencies
├── README.md                 # Project documentation
├── .env.example              # Environment variables template
├── config/
│   └── settings.py           # Configuration management
└── src/
    ├── document_processor.py # Document loading and chunking
    ├── llm_extractor.py      # LLM-based entity extraction
    ├── graph_builder.py      # NetworkX graph construction
    └── visualizer.py         # Graph visualization and export
```

## 🔧 Installation & Setup

### Option 1: Streamlit Version (Recommended)

The Streamlit version is more stable and has better file handling.

**Quick Start:**

```bash
python run_streamlit.py
```

**Manual Setup:**

1. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

2. **Run the Streamlit app**:
   ```bash
   streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501
   ```

The app will be available at `http://localhost:8501`.

### Option 2: Gradio Version (Legacy)

The Gradio version may have some file caching issues but is provided for compatibility.

1. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

2. **Set up environment variables** (optional):
   ```bash
   cp .env.example .env
   # Edit .env and add your OpenRouter API key
   ```

3. **Run the application**:
   ```bash
   python app.py
   ```

The app will be available at `http://localhost:7860`.

### HuggingFace Spaces Deployment

For **Streamlit deployment**:

1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces)
2. Choose "Streamlit" as the SDK
3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name)
4. Upload all other project files maintaining directory structure

For **Gradio deployment**:

1. Create a new Space with "Gradio" as the SDK
2. Upload `app.py` and all other files
3. Note: May experience file handling issues

## 🔑 API Configuration

### Getting OpenRouter API Key

1. Visit [OpenRouter.ai](https://openrouter.ai)
2. Sign up for a free account
3. Navigate to API Keys section
4. Generate a new API key
5. Copy the key and use it in the application

### Free Models Used

- **Primary**: `google/gemma-2-9b-it:free`
- **Backup**: `meta-llama/llama-3.1-8b-instruct:free`

These models are specifically chosen to minimize API costs while maintaining quality.

## 📖 Usage Guide

### Basic Workflow

1. **Upload Documents**:
   - Select one or more files (PDF, TXT, DOCX, JSON)
   - Toggle batch mode for multiple document processing

2. **Configure API**:
   - Enter your OpenRouter API key
   - Key is stored temporarily for the session

3. **Customize Settings**:
   - Choose graph layout algorithm
   - Toggle label visibility options
   - Set minimum importance threshold
   - Select entity types to include

4. **Extract Knowledge Graph**:
   - Click "Extract Knowledge Graph" button
   - Monitor progress through the status updates
   - View results in multiple tabs

5. **Explore Results**:
   - **Graph Visualization**: Interactive graph with colored nodes by entity type
   - **Statistics**: Detailed metrics about the graph structure
   - **Entities**: Complete list of extracted entities with details
   - **Central Nodes**: Most important entities based on centrality measures

6. **Export Data**:
   - Choose export format (JSON, GraphML, GEXF)
   - Download structured graph data

### Advanced Features

#### Entity Types

- **PERSON**: Individuals mentioned in the text
- **ORGANIZATION**: Companies, institutions, groups
- **LOCATION**: Places, addresses, geographical entities
- **CONCEPT**: Abstract ideas, theories, methodologies
- **EVENT**: Specific occurrences, meetings, incidents
- **OBJECT**: Physical items, products, artifacts

#### Relationship Types

- **works_at**: Employment relationships
- **located_in**: Geographical associations
- **part_of**: Hierarchical relationships
- **causes**: Causal relationships
- **related_to**: General associations

#### Filtering Options

- **Importance Threshold**: Show only entities above specified importance score
- **Entity Types**: Filter by specific entity categories
- **Layout Algorithms**: Spring, circular, shell, Kamada-Kawai, random

## 🛠️ Technical Details

### Architecture Components

1. **Document Processing**:
   - Multi-format file parsing
   - Intelligent text chunking with overlap
   - File size validation

2. **LLM Integration**:
   - OpenRouter API integration
   - Structured prompt engineering
   - Error handling and fallback models

3. **Graph Processing**:
   - NetworkX-based graph construction
   - Entity deduplication and standardization
   - Relationship validation

4. **Visualization**:
   - Matplotlib-based static graphs
   - Interactive HTML visualizations
   - Multiple export formats

### Configuration Options

All settings can be modified in `config/settings.py`:

- **Chunk Size**: Default 2000 characters
- **Chunk Overlap**: Default 200 characters
- **Max File Size**: Default 10MB
- **Max Entities**: Default 100 per extraction
- **Max Relationships**: Default 200 per extraction
- **Importance Threshold**: Default 0.3

### Differences Between Versions

**Streamlit Version Advantages:**

- More reliable file handling
- Better progress indicators
- Cleaner UI with sidebar configuration
- More stable caching system
- Built-in download functionality

**Gradio Version Advantages:**

- Simpler deployment to HF Spaces
- More compact interface
- Familiar for ML practitioners

## 🔒 Security & Privacy

- API keys are not stored permanently
- Files are processed temporarily and discarded
- No data is retained between sessions
- All processing happens server-side

## 🐛 Troubleshooting

### Common Issues

1. **"OpenRouter API key is required"**:
   - Ensure you've entered a valid API key
   - Check the key has sufficient credits

2. **"No entities extracted"**:
   - Document may be too short or unstructured
   - Try lowering the importance threshold
   - Check if the document contains meaningful text

3. **File upload issues (Gradio version)**:
   - Known issue with Gradio's file caching system
   - Try the Streamlit version instead
   - Ensure files are valid and not corrupted

4. **Segmentation fault (local development)**:
   - Usually related to matplotlib backend
   - Try setting `MPLBACKEND=Agg` environment variable
   - Install GUI toolkit if running locally with display

5. **Module import errors**:
   - Ensure all requirements are installed: `pip install -r requirements.txt`
   - Check Python version compatibility (3.8+)

### Performance Tips

- Use batch mode for related documents
- Adjust chunk size for very long documents
- Lower importance threshold for sparse documents
- Use simpler layout algorithms for large graphs

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test with both Streamlit and Gradio versions if applicable
5. Add tests if applicable
6. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- [OpenRouter](https://openrouter.ai) for LLM API access
- [Streamlit](https://streamlit.io) for the modern web interface framework
- [Gradio](https://gradio.app) for the ML-focused web interface
- [NetworkX](https://networkx.org) for graph processing
- [HuggingFace Spaces](https://huggingface.co/spaces) for hosting
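
## 🧩 Example: Chunking with Overlap

The chunking defaults listed under Configuration Options (2000-character chunks with a 200-character overlap) can be illustrated with a minimal sketch. The function name `chunk_text` and its parameters are hypothetical and are not taken from `src/document_processor.py`; the project's actual chunker may split on sentence or paragraph boundaries rather than at fixed character offsets.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper for illustration only; the defaults mirror the
    values documented for config/settings.py.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    # Each new chunk starts `step` characters after the previous one,
    # so consecutive chunks share exactly `overlap` characters.
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(5000))
    for index, chunk in enumerate(chunk_text(sample)):
        print(index, len(chunk))
```

The overlap means an entity mention that straddles a chunk boundary still appears whole in at least one chunk, at the cost of sending some text to the LLM twice. Increasing `chunk_size` reduces the number of API calls for long documents.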