CultriX's picture
First commit
e86199a
|
raw
history blame
8.74 kB
metadata
license: mit
title: Generate Knowledge Graphs
sdk: streamlit
emoji: πŸ“‰
colorFrom: indigo
colorTo: pink
short_description: Use LLM to generate a knowledge graph from your input data.

πŸ•ΈοΈ Knowledge Graph Extraction App

A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions.

πŸš€ Features

  • Multi-format Document Support: PDF, TXT, DOCX, JSON files up to 10MB
  • LLM-powered Extraction: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B)
  • Smart Entity Detection: Automatically identifies people, organizations, locations, concepts, events, and objects
  • Importance Scoring: LLM evaluates entity importance from 0.0 to 1.0
  • Interactive Visualization: Multiple graph layout algorithms with filtering options
  • Batch Processing: Optional processing of multiple documents together
  • Export Capabilities: JSON, GraphML, and GEXF formats
  • Real-time Statistics: Graph metrics and centrality analysis

πŸ“ Project Structure

knowledge-graphs/
β”œβ”€β”€ app.py                    # Main Gradio application (legacy)
β”œβ”€β”€ app_streamlit.py          # Main Streamlit application (recommended)
β”œβ”€β”€ run_streamlit.py          # Simple launcher script
β”œβ”€β”€ requirements.txt          # Python dependencies
β”œβ”€β”€ README.md                # Project documentation
β”œβ”€β”€ .env.example             # Environment variables template
β”œβ”€β”€ config/
β”‚   └── settings.py          # Configuration management
└── src/
    β”œβ”€β”€ document_processor.py # Document loading and chunking
    β”œβ”€β”€ llm_extractor.py      # LLM-based entity extraction
    β”œβ”€β”€ graph_builder.py      # NetworkX graph construction
    └── visualizer.py         # Graph visualization and export

πŸ”§ Installation & Setup

Option 1: Streamlit Version (Recommended)

The Streamlit version is more stable and has better file handling.

Quick Start:

python run_streamlit.py

Manual Setup:

  1. Install dependencies:
pip install -r requirements.txt
  1. Run the Streamlit app:
streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501

The app will be available at http://localhost:8501

Option 2: Gradio Version (Legacy)

The Gradio version may have some file caching issues but is provided for compatibility.

  1. Install dependencies:
pip install -r requirements.txt
  1. Set up environment variables (optional):
cp .env.example .env
# Edit .env and add your OpenRouter API key
  1. Run the application:
python app.py

The app will be available at http://localhost:7860

HuggingFace Spaces Deployment

For Streamlit deployment:

  1. Create a new Space on HuggingFace Spaces
  2. Choose "Streamlit" as the SDK
  3. Upload app_streamlit.py as app.py (HF Spaces expects this name)
  4. Upload all other project files maintaining directory structure

For Gradio deployment:

  1. Create a new Space with "Gradio" as the SDK
  2. Upload app.py and all other files
  3. Note: May experience file handling issues

πŸ”‘ API Configuration

Getting OpenRouter API Key

  1. Visit OpenRouter.ai
  2. Sign up for a free account
  3. Navigate to API Keys section
  4. Generate a new API key
  5. Copy the key and use it in the application

Free Models Used

  • Primary: google/gemma-2-9b-it:free
  • Backup: meta-llama/llama-3.1-8b-instruct:free

These models are specifically chosen to minimize API costs while maintaining quality.

πŸ“– Usage Guide

Basic Workflow

  1. Upload Documents:

    • Select one or more files (PDF, TXT, DOCX, JSON)
    • Toggle batch mode for multiple document processing
  2. Configure API:

    • Enter your OpenRouter API key
    • Key is stored temporarily for the session
  3. Customize Settings:

    • Choose graph layout algorithm
    • Toggle label visibility options
    • Set minimum importance threshold
    • Select entity types to include
  4. Extract Knowledge Graph:

    • Click "Extract Knowledge Graph" button
    • Monitor progress through the status updates
    • View results in multiple tabs
  5. Explore Results:

    • Graph Visualization: Interactive graph with colored nodes by entity type
    • Statistics: Detailed metrics about the graph structure
    • Entities: Complete list of extracted entities with details
    • Central Nodes: Most important entities based on centrality measures
  6. Export Data:

    • Choose export format (JSON, GraphML, GEXF)
    • Download structured graph data

Advanced Features

Entity Types

  • PERSON: Individuals mentioned in the text
  • ORGANIZATION: Companies, institutions, groups
  • LOCATION: Places, addresses, geographical entities
  • CONCEPT: Abstract ideas, theories, methodologies
  • EVENT: Specific occurrences, meetings, incidents
  • OBJECT: Physical items, products, artifacts

Relationship Types

  • works_at: Employment relationships
  • located_in: Geographical associations
  • part_of: Hierarchical relationships
  • causes: Causal relationships
  • related_to: General associations

Filtering Options

  • Importance Threshold: Show only entities above specified importance score
  • Entity Types: Filter by specific entity categories
  • Layout Algorithms: Spring, circular, shell, Kamada-Kawai, random

πŸ› οΈ Technical Details

Architecture Components

  1. Document Processing:

    • Multi-format file parsing
    • Intelligent text chunking with overlap
    • File size validation
  2. LLM Integration:

    • OpenRouter API integration
    • Structured prompt engineering
    • Error handling and fallback models
  3. Graph Processing:

    • NetworkX-based graph construction
    • Entity deduplication and standardization
    • Relationship validation
  4. Visualization:

    • Matplotlib-based static graphs
    • Interactive HTML visualizations
    • Multiple export formats

Configuration Options

All settings can be modified in config/settings.py:

  • Chunk Size: Default 2000 characters
  • Chunk Overlap: Default 200 characters
  • Max File Size: Default 10MB
  • Max Entities: Default 100 per extraction
  • Max Relationships: Default 200 per extraction
  • Importance Threshold: Default 0.3

Differences Between Versions

Streamlit Version Advantages:

  • More reliable file handling
  • Better progress indicators
  • Cleaner UI with sidebar configuration
  • More stable caching system
  • Built-in download functionality

Gradio Version Advantages:

  • Simpler deployment to HF Spaces
  • More compact interface
  • Familiar for ML practitioners

πŸ”’ Security & Privacy

  • API keys are not stored permanently
  • Files are processed temporarily and discarded
  • No data is retained between sessions
  • All processing happens server-side

πŸ› Troubleshooting

Common Issues

  1. "OpenRouter API key is required":

    • Ensure you've entered a valid API key
    • Check the key has sufficient credits
  2. "No entities extracted":

    • Document may be too short or unstructured
    • Try lowering the importance threshold
    • Check if the document contains meaningful text
  3. File upload issues (Gradio version):

    • Known issue with Gradio's file caching system
    • Try the Streamlit version instead
    • Ensure files are valid and not corrupted
  4. Segmentation fault (local development):

    • Usually related to matplotlib backend
    • Try setting MPLBACKEND=Agg environment variable
    • Install GUI toolkit if running locally with display
  5. Module import errors:

    • Ensure all requirements are installed: pip install -r requirements.txt
    • Check Python version compatibility (3.8+)

Performance Tips

  • Use batch mode for related documents
  • Adjust chunk size for very long documents
  • Lower importance threshold for sparse documents
  • Use simpler layout algorithms for large graphs

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test with both Streamlit and Gradio versions if applicable
  5. Add tests if applicable
  6. Submit a pull request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments