File Upload System Proposal for Faculty Course Materials
Based on your existing architecture, here's a comprehensive proposal for implementing file uploads with efficient parsing and preservation of the uploaded materials in the deployment package:
Core Architecture Design
1. File Processing Pipeline
Upload → Parse → Chunk → Vector Store → RAG Integration → Deployment Package
2. File Storage Structure
/course_materials/
├── raw_files/       # Original uploaded files
├── processed/       # Parsed text content
├── embeddings/      # Vector representations
└── metadata.json    # File tracking & metadata
Implementation Components
File Upload Handler (app.py:352-408 enhancement)
- Add a gr.File(file_types=[".pdf", ".docx", ".txt", ".md"]) component
- Support multiple file uploads with file_count="multiple"
- Implement file validation and size limits (10MB per file), as sketched below
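A minimal sketch of that validation step, assuming uploads arrive as Gradio temp-file paths; validate_uploads and MAX_FILE_BYTES are illustrative names, not existing code:

import os

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md"}
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10MB per-file limit from the bullet above

def validate_uploads(file_paths):
    """Reject unsupported types and oversized files before parsing."""
    accepted, errors = [], []
    for path in file_paths or []:
        ext = os.path.splitext(path)[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            errors.append(f"{os.path.basename(path)}: unsupported type {ext}")
        elif os.path.getsize(path) > MAX_FILE_BYTES:
            errors.append(f"{os.path.basename(path)}: exceeds 10MB limit")
        else:
            accepted.append(path)
    return accepted, errors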
Document Parser Service (new: document_parser.py)
- PDF: PyMuPDF for text extraction with layout preservation
- DOCX: python-docx for structured content
- TXT/MD: Direct text processing with metadata extraction
- Auto-detection: File type identification and appropriate parser routing (see the sketch below)
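A minimal routing sketch for document_parser.py, assuming PyMuPDF (imported as fitz) and python-docx from the dependency list; parse_document is an illustrative name:

import os
import fitz                 # PyMuPDF, for PDF text extraction
from docx import Document   # python-docx, for DOCX content

def parse_document(path):
    """Route a file to the appropriate parser based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if ext == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if ext in (".txt", ".md"):
        with open(path, encoding="utf-8") as f:
            return f.read()
    raise ValueError(f"Unsupported file type: {ext}")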
RAG Integration (enhancement to existing Crawl4AI system)
- Chunking Strategy: Semantic chunking (500-1000 tokens with 100-token overlap)
- Embeddings: sentence-transformers/all-MiniLM-L6-v2 (lightweight, fast)
- Vector Store: In-memory FAISS index for deployment portability
- Retrieval: Top-k similarity search (k=3-5) with relevance scoring (an index-building sketch follows this list)
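A sketch of the index-building side under those choices; build_index is an illustrative name, and the fixed-size token windows below are a simple stand-in for a full semantic-chunking implementation:

import faiss
import tiktoken
from sentence_transformers import SentenceTransformer

def build_index(text, chunk_tokens=800, overlap=100):
    """Chunk text by token count, embed the chunks, and build a FAISS index."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = [enc.decode(tokens[i:i + chunk_tokens])
              for i in range(0, len(tokens), chunk_tokens - overlap)]
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vectors = model.encode(chunks, normalize_embeddings=True)  # 384-dim, float32
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on normalized vectors
    index.add(vectors)
    return index, chunks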
Enhanced Template (SPACE_TEMPLATE modification)
# Add to the generated app.py (placeholders filled at packaging time)
import base64, json, pickle
import faiss
from sentence_transformers import SentenceTransformer

COURSE_MATERIALS = json.loads('''{{course_materials_json}}''')
EMBEDDINGS_INDEX = faiss.deserialize_index(  # packaged via faiss.serialize_index + pickle
    pickle.loads(base64.b64decode('''{{embeddings_base64}}''')))
EMBEDDER = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def get_relevant_context(query, max_contexts=3):
    """Retrieve relevant course-material context via top-k similarity search."""
    query_vec = EMBEDDER.encode([query], normalize_embeddings=True)
    _, ids = EMBEDDINGS_INDEX.search(query_vec, max_contexts)
    return [COURSE_MATERIALS["chunks"][i] for i in ids[0] if i != -1]
Speed & Accuracy Optimizations
1. Processing Speed
- Batch processing during upload (not per-query)
- Parallel processing across multiple files
- Lightweight embedding model (384 dimensions vs. 1536)
- In-memory vector store (no database dependencies)
- Cached embeddings in the deployment package
2. Query Speed
- Pre-computed document embeddings (only the query is encoded at query time)
- Efficient FAISS indexing for similarity search
- Context caching for repeated queries
3. Accuracy Enhancements
- Semantic chunking preserves context boundaries
- Query expansion with synonyms/related terms
- Relevance scoring with threshold filtering (sketched after this list)
- Metadata-aware retrieval (file type, section, date)
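For the threshold filtering, the similarity scores FAISS returns (cosine, given normalized vectors and an inner-product index) can be filtered directly; filter_by_relevance is an illustrative name and the 0.3 cutoff is an assumed value:

def filter_by_relevance(index, query_vec, chunks, k=5, min_score=0.3):
    """Keep only hits whose similarity score clears the relevance threshold."""
    scores, ids = index.search(query_vec, k)  # query_vec: 2D float32, normalized
    return [(chunks[i], s) for i, s in zip(ids[0], scores[0])
            if i != -1 and s >= min_score]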
Deployment Package Integration
Package Structure Enhancement
generated_space.zip
├── app.py                  # Enhanced with RAG
├── requirements.txt        # + sentence-transformers, faiss-cpu
├── course_materials/       # Embedded materials
│   ├── embeddings.pkl      # FAISS index
│   ├── chunks.json         # Text chunks with metadata
│   └── files_metadata.json # Original file info
└── README.md               # Updated instructions
Size Management
- Compress embeddings at serialization time (pickle with the highest protocol)
- Base64 encode the payload for template embedding
- Implement file-size warnings (>50MB total), as in the sketch after this list
- Optional: External storage links for large datasets
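A sketch of that serialization path; faiss.serialize_index turns the index into a byte array that pickles cleanly, and encode_index_for_template is an illustrative name:

import base64, pickle
import faiss

MAX_PACKAGE_BYTES = 50 * 1024 * 1024  # warning threshold from the list above

def encode_index_for_template(index):
    """Serialize a FAISS index into the Base64 payload the template expects."""
    raw = pickle.dumps(faiss.serialize_index(index), protocol=pickle.HIGHEST_PROTOCOL)
    payload = base64.b64encode(raw).decode("ascii")
    if len(payload) > MAX_PACKAGE_BYTES:
        print(f"Warning: embedded index is {len(payload) / 1e6:.0f}MB; consider external storage")
    return payload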
User Interface Updates
Configuration Tab Enhancements
with gr.Accordion("Course Materials Upload", open=False):
    file_upload = gr.File(
        label="Upload Course Materials",
        file_types=[".pdf", ".docx", ".txt", ".md"],
        file_count="multiple",
    )
    processing_status = gr.Markdown()
    material_summary = gr.DataFrame()  # Show processed files
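Wiring the component to a handler could look like this; process_uploaded_files is a hypothetical callback that runs the validate/parse/embed pipeline and returns a status message plus a summary table:

file_upload.upload(
    fn=process_uploaded_files,  # hypothetical: returns (status_markdown, summary_df)
    inputs=file_upload,
    outputs=[processing_status, material_summary],
)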
Technical Implementation
Dependencies Addition (requirements.txt)
sentence-transformers==2.2.2
faiss-cpu==1.7.4
PyMuPDF==1.23.0
python-docx==0.8.11
tiktoken==0.5.1
Processing Workflow
- Upload: Faculty uploads syllabi, schedules, readings
- Parse: Extract text with structure preservation
- Chunk: Semantic segmentation with metadata
- Embed: Generate vector representations
- Package: Serialize index and chunks into deployment
- Deploy: Single-file space with embedded knowledge (an end-to-end sketch follows)
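A minimal sketch tying these steps together, reusing the illustrative helpers from the earlier sketches (validate_uploads, parse_document, build_index, encode_index_for_template):

import json, os

def package_materials(file_paths):
    """Upload-time pipeline: validate, parse, embed, and serialize for deployment."""
    accepted, errors = validate_uploads(file_paths)
    full_text = "\n\n".join(parse_document(p) for p in accepted)
    index, chunks = build_index(full_text)
    embeddings_b64 = encode_index_for_template(index)
    materials_json = json.dumps({"chunks": chunks,
                                 "files": [os.path.basename(p) for p in accepted]})
    return embeddings_b64, materials_json, errors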
Performance Metrics
- Upload Processing: ~2-5 seconds per document
- Query Response: <200ms additional latency
- Package Size: +5-15MB for typical course materials
- Accuracy: 85-95% relevant context retrieval
- Memory Usage: +50-100MB runtime overhead
Benefits
This approach maintains your existing speed while adding powerful document-understanding capabilities that persist in the deployed package. Faculty upload course materials once during configuration, and students get contextually aware responses grounded in actual course content, with no external dependencies in the deployed space.
Next Steps
- Implement document parser service
- Add file upload UI components
- Integrate RAG system with existing Crawl4AI architecture
- Enhance SPACE_TEMPLATE with embedded materials
- Test with sample course materials
- Optimize for deployment package size