	# KnowledgeBridge - System Architecture & Flow

	## 🎯 Overview

	This document describes the KnowledgeBridge system architecture, its data flow, and its AI processing pipeline.

	## 📊 Main Data Flow

	```
	User Query → AI Enhancement → Multi-Source Search → URL Validation → Results Display
	```

	## 🔄 Detailed Process Flow

	### Stage 1: Input Processing & Enhancement
	Components:
	- Enhanced Search Interface (React/TypeScript)
	- Input validation and sanitization
	- Rate limiting middleware

	Technical Details:
	- React captures user input with real-time validation
	- Optional AI query enhancement using Nebius DeepSeek models
	- Express.js endpoint with comprehensive security middleware
	- Request body size limits and input sanitization
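
A minimal sketch of the sanitization step, assuming the query arrives as an untyped JSON field. The function name `sanitizeQuery` and the 500-character cap are illustrative assumptions, not values taken from the KnowledgeBridge source.

```typescript
// Hypothetical upper bound on query length; the real limit is not stated in this document.
const MAX_QUERY_LENGTH = 500;

// Returns a cleaned query string, or null when the input should be rejected.
function sanitizeQuery(raw: unknown): string | null {
  if (typeof raw !== "string") return null;
  const cleaned = raw
    .replace(/[\u0000-\u001f\u007f]/g, " ") // control characters become spaces
    .replace(/\s+/g, " ")                   // collapse runs of whitespace
    .trim();
  if (cleaned.length === 0 || cleaned.length > MAX_QUERY_LENGTH) return null;
  return cleaned;
}
```

Returning `null` instead of throwing lets the route handler decide which HTTP status to send for a rejected query.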

	### Stage 2: AI-Powered Query Enhancement
	Components:
	- Nebius AI client with DeepSeek-R1-0528 model
	- Smart query analysis and improvement
	- Intent recognition and keyword extraction

	Technical Details:
	- Nebius API call: `deepseek-ai/DeepSeek-R1-0528`
	- Analyzes user intent and suggests improvements
	- Provides enhanced query, keywords, and alternative suggestions
	- Fallback to original query if enhancement fails
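
The fallback behaviour described above can be sketched as a wrapper around whatever function calls the model. The `EnhancedQuery` shape and the injected `Enhancer` type are assumptions made for illustration; no real Nebius SDK call is shown.

```typescript
// Hypothetical result shape; field names are assumptions, not the project's actual types.
interface EnhancedQuery {
  query: string;
  keywords: string[];
  suggestions: string[];
  enhanced: boolean; // false when the fallback path was taken
}

// The model call is injected, so this sketch does not depend on any particular client library.
type Enhancer = (query: string) => Promise<Omit<EnhancedQuery, "enhanced">>;

async function enhanceWithFallback(query: string, enhancer: Enhancer): Promise<EnhancedQuery> {
  try {
    const result = await enhancer(query);
    // Treat an empty enhanced query as a failure as well.
    if (result.query.trim().length === 0) {
      throw new Error("empty enhancement");
    }
    return { ...result, enhanced: true };
  } catch {
    // Fall back to the user's original query, unchanged.
    return { query, keywords: [], suggestions: [], enhanced: false };
  }
}
```

The `enhanced` flag lets the UI show whether the search ran on the rewritten query or on the original one.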

	### Stage 3: Embedding Generation
	Components:
	- Nebius embedding service
	- BAAI/bge-en-icl model for vector generation
	- Mock embedding fallback for reliability

	Technical Details:
	- Primary model: `BAAI/bge-en-icl`
	- Generates high-dimensional vector representations
	- Fallback to deterministic mock embeddings for demos
	- Semantic meaning captured in numerical vectors
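
One way to build a deterministic mock embedding is to hash the text and use the hash to seed a small pseudo-random generator. The sketch below is an assumption about how such a fallback could look; the 8-dimensional default and the hashing scheme are illustrative, and the resulting vectors carry no semantic meaning.

```typescript
// Deterministic mock embedding: the same text always yields the same unit-length vector.
function mockEmbedding(text: string, dimensions = 8): number[] {
  // FNV-1a-style 32-bit hash over UTF-16 code units seeds the generator.
  let seed = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    seed ^= text.charCodeAt(i);
    seed = Math.imul(seed, 0x01000193) >>> 0;
  }
  const vector: number[] = [];
  for (let i = 0; i < dimensions; i++) {
    // Linear congruential step, then map the 32-bit value into [-1, 1).
    seed = (Math.imul(seed, 1664525) + 1013904223) >>> 0;
    vector.push(seed / 0x80000000 - 1);
  }
  // Normalize to unit length (guarding against an all-zero vector).
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0)) || 1;
  return vector.map((v) => v / norm);
}

// For unit-length inputs of equal dimension the dot product is the cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}
```

Because the output depends only on the input text, demos and tests stay reproducible when the embedding API is unreachable.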

	### Stage 4: Multi-Source Search
	Components:
	- Local storage search (in-memory with sample data)
	- GitHub repository search with advanced filtering
	- Wikipedia API integration
	- ArXiv academic paper search

	Technical Details:
	- Local Search: Keyword matching with relevance scoring
	- GitHub API: Enhanced with author filtering and fallback strategies
	- Wikipedia API: 3-second timeout with content validation
	- ArXiv API: Format validation and paper existence verification
	- Parallel Processing: Concurrent search across all sources
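
The concurrent fan-out could look roughly like the following. This is a sketch under assumptions: the `SearchResult` shape and function names are hypothetical, and the 3-second timeout that the text gives for Wikipedia is applied to every source here for simplicity.

```typescript
interface SearchResult {
  title: string;
  url: string;
  source: string;
}
type SearchFn = (query: string) => Promise<SearchResult[]>;

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Query every source concurrently; a failed or slow source contributes no results.
async function searchAllSources(
  query: string,
  sources: Record<string, SearchFn>,
  timeoutMs = 3000,
): Promise<SearchResult[]> {
  const settled = await Promise.allSettled(
    Object.values(sources).map((search) =>
      // Promise.resolve().then(...) also converts a synchronous throw into a rejection.
      withTimeout(Promise.resolve().then(() => search(query)), timeoutMs),
    ),
  );
  return settled.flatMap((outcome) => (outcome.status === "fulfilled" ? outcome.value : []));
}
```

`Promise.allSettled` is used rather than `Promise.all` so that one failing source cannot discard the results of the others.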

	### Stage 5: URL Validation & Content Verification
	Components:
	- Smart URL validation system
	- ArXiv paper ID format checking
	- Content-based error detection
	- Concurrent processing with rate limits

	Technical Details:
	- ArXiv Validation: Checks paper ID format (new-style `YYMM.NNNNN`, e.g. `2401.12345`; old-style `archive/YYMMNNN`, e.g. `cs.AI/0701001`)
	- Content Verification: Detects error pages that return 200 status
	- Rate Limiting: Configurable concurrency to prevent API abuse
	- Trusted Domains: Fast-path for reliable sources (GitHub, Wikipedia)
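
A sketch of the two cheapest checks in this stage: arXiv ID format validation and the trusted-domain fast path. The regular expressions follow arXiv's published identifier scheme; the function names and the exact allowlist are assumptions for illustration.

```typescript
// New-style IDs: YYMM.NNNN (2007 to 2014) or YYMM.NNNNN (2015 onward), optional version suffix.
const NEW_STYLE_ARXIV_ID = /^\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?$/;
// Old-style IDs: archive[.SUBJECT]/YYMMNNN, e.g. hep-th/9901001 or cs.AI/0701001.
const OLD_STYLE_ARXIV_ID = /^[a-z-]+(\.[A-Z]{2})?\/\d{2}(0[1-9]|1[0-2])\d{3}(v\d+)?$/;

function isValidArxivId(id: string): boolean {
  return NEW_STYLE_ARXIV_ID.test(id) || OLD_STYLE_ARXIV_ID.test(id);
}

// Hypothetical allowlist; the text names GitHub and Wikipedia as trusted sources.
const TRUSTED_DOMAINS = ["github.com", "wikipedia.org"];

// True when the URL's host is a trusted domain or one of its subdomains.
function isTrustedUrl(rawUrl: string): boolean {
  let host: string;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  return TRUSTED_DOMAINS.some((domain) => host === domain || host.endsWith("." + domain));
}
```

Matching on the parsed hostname rather than on a substring keeps lookalike hosts such as `github.com.evil.example` off the fast path.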

	### Stage 6: Document Analysis (Optional)
	Components:
	- Nebius AI with configurable output formatting
	- DeepSeek-R1 model for comprehensive analysis
	- Content cleanup and markdown processing

	Technical Details:
	- Analysis types: summary, classification, key_points, quality_score
	- Configurable markdown vs plain text output
	- DeepSeek-R1 thinking-tag cleanup, so the model's reasoning trace does not leak into results
	- Custom prompts optimized for each analysis type
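
DeepSeek-R1 wraps its reasoning in `<think>` tags ahead of the final answer, so the cleanup step can be approximated with a few regular expressions. The function names below are hypothetical, and the markdown-to-text conversion is deliberately crude.

```typescript
// Remove <think>...</think> reasoning blocks from a model response.
function stripThinking(raw: string): string {
  return raw
    .replace(/<think>[\s\S]*?<\/think>/gi, "") // complete blocks
    .replace(/^[\s\S]*?<\/think>/i, "")        // orphan closing tag: drop everything before it
    .replace(/<think>[\s\S]*$/i, "")           // unterminated block, e.g. truncated output
    .trim();
}

// Rough plain-text conversion for the "plain text" output option.
function toPlainText(markdown: string): string {
  return markdown
    .replace(/^#{1,6}[ \t]+/gm, "") // heading markers
    .replace(/(\*\*|__|`)/g, "")    // bold markers and inline code ticks
    .trim();
}
```

For example, `stripThinking("<think>plan</think>\n\nAnswer")` yields `"Answer"`, which can then be passed through `toPlainText` when markdown output is disabled.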

	### Stage 7: Results Processing & Display
	Components:
	- Result ranking and relevance scoring
	- Citation management system
	- Interactive UI with error boundaries

	Technical Details:
	- Relevance-based sorting with multiple factors
	- Rich metadata display with type-safe rendering
	- Error boundaries prevent UI crashes
	- Real-time result updates and filtering
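
As an illustration of relevance-based sorting with multiple factors, the sketch below combines keyword overlap in the title, keyword overlap in the snippet, and a small boost for validated URLs. The factors, the weights (0.6, 0.3, 0.1), and all names are assumptions; the actual scoring used by KnowledgeBridge is not specified in this document.

```typescript
interface RankedDoc {
  title: string;
  snippet: string;
  urlValid: boolean;
}

// Fraction of query terms that appear in the text (case-insensitive substring match).
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase();
  return terms.filter((term) => haystack.includes(term)).length / terms.length;
}

// Illustrative weights: title matches count most, validated URLs get a small boost.
function relevance(query: string, doc: RankedDoc): number {
  return (
    0.6 * keywordScore(query, doc.title) +
    0.3 * keywordScore(query, doc.snippet) +
    (doc.urlValid ? 0.1 : 0)
  );
}

function rankResults<T extends RankedDoc>(query: string, docs: T[]): T[] {
  // Copy before sorting so the caller's array is not mutated.
  return [...docs].sort((a, b) => relevance(query, b) - relevance(query, a));
}
```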

	## 🏗️ System Architecture

	### Frontend Stack
	```
	┌────────────────────────────────────────────────────────────┐
	│                   React 18 + TypeScript                    │
	├────────────────────────────────────────────────────────────┤
	│ Enhanced Search Interface │ Knowledge Graph        │ AI    │
	│ - Unified search & AI     │ - D3.js visualization  │ Tools │
	│ - Query enhancement       │ - Interactive nodes    │ Panel │
	│ - Configurable analysis   │ - Relationship mapping │       │
	├────────────────────────────────────────────────────────────┤
	│          TanStack Query (Data Fetching & Caching)          │
	├────────────────────────────────────────────────────────────┤
	│                  Radix UI + Tailwind CSS                   │
	└────────────────────────────────────────────────────────────┘
	```

	### Backend Stack
	```
	┌────────────────────────────────────────────────────────────┐
	│              Express.js + Security Middleware              │
	├────────────────────────────────────────────────────────────┤
	│    Helmet.js │ Rate Limiting │ Input Validation │ CORS     │
	├────────────────────────────────────────────────────────────┤
	│                      API Routes Layer                      │
	│     /api/search │ /api/analyze-document │ /api/health      │
	├────────────────────────────────────────────────────────────┤
	│                       Service Layer                        │
	│       Nebius Client │ Modal Client │ Storage Service       │
	└────────────────────────────────────────────────────────────┘
	```

	### AI & Processing Layer
	```
	┌────────────────────────────────────────────────────────────┐
	│                     Nebius AI Platform                     │
	├────────────────────────────────────────────────────────────┤
	│ DeepSeek-R1-0528            │ BAAI/bge-en-icl              │
	│ - Chat completions          │ - Embedding generation       │
	│ - Document analysis         │ - Semantic similarity        │
	│ - Query enhancement         │ - Vector search              │
	├────────────────────────────────────────────────────────────┤
	│                       Modal Platform                       │
	│ - Distributed processing    │ - Scalable compute           │
	│ - Batch operations          │ - Resource management        │
	└────────────────────────────────────────────────────────────┘
	```

	## 🔒 Security Architecture

	### Input Protection
	```
	Request → Rate Limiter → Helmet.js → Input Validator → API Route
	↓ ↓ ↓ ↓ ↓
	100/15min CSP Headers Body Size Zod Schema Business Logic
	10/min* XSS Protection 10MB Limit Type Safety Error Handling

	* Sensitive endpoints
	```
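
The limits in the diagram (100 requests per 15 minutes, 10 per minute on sensitive endpoints) could be enforced with a fixed-window counter such as the sketch below. The class name and the choice of a fixed window are assumptions; the project may well rely on an off-the-shelf Express middleware instead.

```typescript
// Fixed-window counter: at most `limit` requests per client in each `windowMs` window.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed. `now` is injectable to make testing easy.
  allow(clientId: string, now: number = Date.now()): boolean {
    const current = this.windows.get(clientId);
    if (!current || now - current.start >= this.windowMs) {
      // First request from this client, or the previous window has expired.
      this.windows.set(clientId, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Limits taken from the diagram above.
const generalLimiter = new FixedWindowLimiter(100, 15 * 60 * 1000);
const sensitiveLimiter = new FixedWindowLimiter(10, 60 * 1000);
```

A handler would call `generalLimiter.allow(clientIp)` and respond with HTTP 429 when it returns `false`.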

	### Error Handling Chain
	```
	React Error Boundary → API Error Handler → Service Error Handler
	          ↓                    ↓                     ↓
	UI Graceful Fallback   HTTP Status Codes    Logging & Recovery
	```

	## 🚀 Performance Characteristics

	### Response Times
	| Operation | Average Time | Details |
	|-----------|--------------|---------|
	| Local Search | <100ms | In-memory keyword matching |
	| URL Validation | 1-3s per URL | Concurrent processing |
	| Document Analysis | 3-5s | AI model processing time |
	| Embedding Generation | 500ms-1s | Nebius API call |
	| Query Enhancement | 1-2s | DeepSeek model inference |

	### Scalability Features
	- Horizontal Scaling: Modal platform for distributed processing
	- Rate Limiting: Prevents API abuse and ensures fair usage
	- Caching: TanStack Query for client-side data caching
	- Error Recovery: Graceful degradation when services are unavailable
	- Load Distribution: Concurrent processing of multiple requests
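
Concurrent processing with a configurable cap, as used for URL validation, can be sketched with a small worker-pool helper. The name `mapWithConcurrency` and its signature are assumptions made for illustration.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++; // no race: JavaScript runs this synchronously between awaits
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, run));
  return results;
}
```

If any call rejects, the returned promise rejects too; a caller that wants per-item failures can have `worker` catch its own errors and return a status value instead.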

	## 🔧 Data Flow Patterns

	### Search Request Flow
	```mermaid
	graph TD
	A[User Query] --> B[Rate Limiter]
	B --> C[Input Validation]
	C --> D[AI Enhancement?]
	D -->|Yes| E[Nebius Query Enhancement]
	D -->|No| F[Multi-Source Search]
	E --> F
	F --> G[Local Storage]
	F --> H[GitHub API]
	F --> I[Wikipedia API]
	F --> J[ArXiv API]
	G --> K[URL Validation]
	H --> K
	I --> K
	J --> K
	K --> L[Results Ranking]
	L --> M[Response Formatting]
	M --> N[Client Display]
	```

	### Document Analysis Flow
	```mermaid
	graph TD
	A[Document Content] --> B[Content Validation]
	B --> C[Analysis Type Selection]
	C --> D[Nebius DeepSeek API]
	D --> E[Response Processing]
	E --> F[Format Selection]
	F -->|Markdown| G[Rich Formatting]
	F -->|Plain Text| H[Clean Text Output]
	G --> I[Client Display]
	H --> I
	```
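
The first steps of this flow (content validation, analysis type selection, format selection) could be sketched as follows. The project validates requests with Zod schemas; to stay dependency-free, this sketch uses a hand-written check instead. The field names and prompt texts are hypothetical.

```typescript
const ANALYSIS_TYPES = ["summary", "classification", "key_points", "quality_score"] as const;
type AnalysisType = (typeof ANALYSIS_TYPES)[number];

interface AnalysisRequest {
  content: string;
  analysisType: AnalysisType;
  useMarkdown: boolean;
}

// Hand-written stand-in for a schema check: returns the typed request or an error message.
function parseAnalysisRequest(body: unknown): AnalysisRequest | string {
  if (typeof body !== "object" || body === null) return "body must be an object";
  const { content, analysisType, useMarkdown } = body as Record<string, unknown>;
  if (typeof content !== "string" || content.trim() === "") {
    return "content must be a non-empty string";
  }
  if (typeof analysisType !== "string" || !(ANALYSIS_TYPES as readonly string[]).includes(analysisType)) {
    return "unknown analysisType";
  }
  return { content, analysisType: analysisType as AnalysisType, useMarkdown: useMarkdown === true };
}

// Hypothetical prompt templates, one per analysis type.
const PROMPTS: Record<AnalysisType, string> = {
  summary: "Summarize the following document.",
  classification: "Classify the following document by topic.",
  key_points: "List the key points of the following document.",
  quality_score: "Rate the quality of the following document from 1 to 10.",
};

function buildPrompt(request: AnalysisRequest): string {
  const format = request.useMarkdown ? "Respond in markdown." : "Respond in plain text.";
  return `${PROMPTS[request.analysisType]} ${format}\n\n${request.content}`;
}
```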

	## 🛠️ Technology Integration Points

	### External API Integration
	- Nebius AI: Primary AI service for all language model tasks
	- Modal: Distributed processing and compute scaling
	- GitHub API: Repository search with authentication
	- Wikipedia API: Authoritative content with caching
	- ArXiv API: Academic paper search with validation

	### Internal Service Communication
	- REST APIs: Standard HTTP/JSON for client-server communication
	- Event-Driven: React state management for UI updates
	- Error Propagation: Structured error handling across all layers
	- Type Safety: TypeScript contracts for all service boundaries

	## 📊 Quality Assurance

	### Code Quality
	- TypeScript: 100% type coverage across frontend and backend
	- Input Validation: Zod schemas for all API endpoints
	- Error Boundaries: React error boundaries prevent UI crashes
	- Security Middleware: Comprehensive protection against common attacks

	### Testing Strategy
	- Type Checking: Continuous TypeScript compilation validation
	- API Testing: Health checks and endpoint validation
	- Error Testing: Graceful handling of service failures
	- Performance Testing: Response time monitoring and optimization

	Together, these layers give KnowledgeBridge AI-assisted search across local, GitHub, Wikipedia, and ArXiv sources, with rate limiting, input validation, and fallback paths for when an external service fails.