
File Upload System Proposal for Faculty Course Materials

Based on your existing architecture, here's a comprehensive proposal for implementing file uploads with efficient parsing and preservation of materials in the deployment package:

Core Architecture Design

1. File Processing Pipeline

Upload β†’ Parse β†’ Chunk β†’ Vector Store β†’ RAG Integration β†’ Deployment Package

2. File Storage Structure

/course_materials/
β”œβ”€β”€ raw_files/           # Original uploaded files
β”œβ”€β”€ processed/           # Parsed text content
β”œβ”€β”€ embeddings/          # Vector representations
└── metadata.json        # File tracking & metadata
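
For illustration, a `metadata.json` record for one upload might look like this (the field names are a sketch, not an existing schema):

```json
{
  "files": [
    {
      "id": "a1b2c3",
      "original_name": "syllabus.pdf",
      "raw_path": "raw_files/a1b2c3.pdf",
      "processed_path": "processed/a1b2c3.txt",
      "uploaded_at": "2025-01-15T10:30:00Z",
      "size_bytes": 482133,
      "chunk_count": 12
    }
  ]
}
```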

Implementation Components

File Upload Handler (app.py:352-408 enhancement)

  • Add gr.File(file_types=[".pdf", ".docx", ".txt", ".md"]) component
  • Support multiple file uploads with file_count="multiple"
  • Implement file validation and size limits (10MB per file)
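
As a sketch, per-file validation could look like the following (the function name and constants are illustrative, not part of the existing codebase):

```python
# Upload validation sketch: extension allow-list plus the 10MB per-file cap
# from the proposal. validate_upload and ALLOWED_EXTENSIONS are illustrative.
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md"}
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10MB per-file limit

def validate_upload(path: str) -> tuple[bool, str]:
    """Return (ok, reason) for a single uploaded file path."""
    suffix = Path(path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        return False, f"unsupported file type: {suffix or '(none)'}"
    size = Path(path).stat().st_size
    if size > MAX_FILE_BYTES:
        return False, f"file exceeds 10MB limit ({size} bytes)"
    return True, "ok"
```

A handler wired to the `gr.File` component would run this over every path before parsing, surfacing the `reason` string in the processing status display.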

Document Parser Service (new: document_parser.py)

  • PDF: PyMuPDF for text extraction with layout preservation
  • DOCX: python-docx for structured content
  • TXT/MD: Direct text processing with metadata extraction
  • Auto-detection: File type identification and appropriate parser routing
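
The routing logic could be sketched as follows; `fitz` (PyMuPDF) and `python-docx` are imported lazily so the plain-text path carries no extra dependencies (the function name is illustrative):

```python
# Parser routing sketch for document_parser.py: dispatch on file extension
# to the appropriate extractor. parse_document is an illustrative name.
from pathlib import Path

def parse_document(path: str) -> str:
    """Route a file to the appropriate text extractor by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if suffix == ".docx":
        import docx  # python-docx
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    if suffix in {".txt", ".md"}:
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"unsupported file type: {suffix}")
```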

RAG Integration (enhancement to existing Crawl4AI system)

  • Chunking Strategy: Semantic chunking (500-1000 tokens with 100-token overlap)
  • Embeddings: sentence-transformers/all-MiniLM-L6-v2 (lightweight, fast)
  • Vector Store: In-memory FAISS index for deployment portability
  • Retrieval: Top-k similarity search (k=3-5) with relevance scoring
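
The chunking step, for instance, might be sketched like this; whitespace-separated words stand in for tokens here (a real implementation would count tokens with `tiktoken`), and the function name is illustrative:

```python
# Overlap chunking sketch: ~chunk_size units per chunk, with `overlap` units
# repeated from the previous chunk so context survives the boundary.
def chunk_text(text: str, chunk_size: int = 750, overlap: int = 100) -> list[str]:
    """Split text into chunk_size-word pieces, each overlapping the last."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # advance by less than a full chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks
```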

Enhanced Template (SPACE_TEMPLATE modification)

```python
# Add to generated app.py
import base64, json, pickle
from sentence_transformers import SentenceTransformer

COURSE_MATERIALS = json.loads('''{{course_materials_json}}''')
EMBEDDINGS_INDEX = pickle.loads(base64.b64decode('''{{embeddings_base64}}'''))
EMBEDDING_MODEL = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def get_relevant_context(query, max_contexts=3):
    """Retrieve relevant course material context via vector similarity search."""
    query_vector = EMBEDDING_MODEL.encode([query])
    _, indices = EMBEDDINGS_INDEX.search(query_vector, max_contexts)
    # Return formatted context snippets for the top-scoring chunks
    return "\n\n".join(COURSE_MATERIALS[i]["text"] for i in indices[0])
```

Speed & Accuracy Optimizations

1. Processing Speed

  • Batch processing during upload (not per-query)
  • Lightweight embedding model (384 dimensions vs 1536)
  • In-memory vector store (no database dependencies)
  • Cached embeddings in deployment package

2. Query Speed

  • Pre-computed embeddings (no real-time encoding)
  • Efficient FAISS indexing for similarity search
  • Context caching for repeated queries
  • Parallel processing for multiple files

3. Accuracy Enhancements

  • Semantic chunking preserves context boundaries
  • Query expansion with synonyms/related terms
  • Relevance scoring with threshold filtering
  • Metadata-aware retrieval (file type, section, date)
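
Threshold filtering on retrieval scores can be sketched as below; the function name and the 0.3 default cutoff are illustrative assumptions:

```python
# Relevance-threshold sketch: keep the top-k hits, but drop anything whose
# similarity score falls below a minimum, so weak matches never reach the
# prompt. Names and the 0.3 default are illustrative.
def filter_by_relevance(hits: list[tuple[float, str]],
                        k: int = 5, min_score: float = 0.3) -> list[str]:
    """hits is (score, chunk) pairs; return up to k chunks above min_score."""
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    return [chunk for score, chunk in ranked[:k] if score >= min_score]
```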

Deployment Package Integration

Package Structure Enhancement

generated_space.zip
β”œβ”€β”€ app.py                    # Enhanced with RAG
β”œβ”€β”€ requirements.txt          # + sentence-transformers, faiss-cpu
β”œβ”€β”€ course_materials/         # Embedded materials
β”‚   β”œβ”€β”€ embeddings.pkl       # FAISS index
β”‚   β”œβ”€β”€ chunks.json          # Text chunks with metadata
β”‚   └── files_metadata.json  # Original file info
└── README.md                # Updated instructions

Size Management

  • Compress embeddings with pickle optimization
  • Base64 encode for template embedding
  • Implement file size warnings (>50MB total)
  • Optional: External storage links for large datasets
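
The serialize-compress-encode step might be sketched like this; the function names and the choice of `zlib` for compression are illustrative assumptions:

```python
# Size-management sketch: pickle the index payload, compress it, then
# base64-encode so it can be spliced into the app.py template as a string.
# Emit a warning when the encoded payload crosses the 50MB threshold above.
import base64, pickle, zlib

SIZE_WARNING_BYTES = 50 * 1024 * 1024  # base64 chars, roughly bytes on disk

def pack_embeddings(payload: object) -> str:
    """Serialize -> compress -> base64 string safe to embed in the template."""
    raw = zlib.compress(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
    encoded = base64.b64encode(raw).decode("ascii")
    if len(encoded) > SIZE_WARNING_BYTES:
        print(f"warning: embedded payload is {len(encoded)} bytes (>50MB)")
    return encoded

def unpack_embeddings(encoded: str) -> object:
    """Reverse of pack_embeddings, run inside the deployed space."""
    return pickle.loads(zlib.decompress(base64.b64decode(encoded)))
```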

User Interface Updates

Configuration Tab Enhancements

```python
with gr.Accordion("Course Materials Upload", open=False):
    file_upload = gr.File(
        label="Upload Course Materials",
        file_types=[".pdf", ".docx", ".txt", ".md"],
        file_count="multiple"
    )
    processing_status = gr.Markdown()
    material_summary = gr.DataFrame()  # Show processed files
```

Technical Implementation

Dependencies Addition (requirements.txt)

sentence-transformers==2.2.2
faiss-cpu==1.7.4
PyMuPDF==1.23.0
python-docx==0.8.11
tiktoken==0.5.1

Processing Workflow

  1. Upload: Faculty uploads syllabi, schedules, readings
  2. Parse: Extract text with structure preservation
  3. Chunk: Semantic segmentation with metadata
  4. Embed: Generate vector representations
  5. Package: Serialize index and chunks into deployment
  6. Deploy: Single-file space with embedded knowledge
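
Assuming the parser, chunker, and embedder exist as the separate services proposed above, the six steps can be sketched as one orchestration function (all names here are illustrative):

```python
# Workflow sketch: upload -> parse -> chunk -> embed -> package. The parse,
# chunk, and embed callables are stand-ins for the services proposed earlier;
# only the orchestration shape is the point here.
import json

def build_deployment_package(paths, parse, chunk, embed):
    """Turn uploaded file paths into the serialized deployment payload."""
    all_chunks = []
    for path in paths:                                 # 1. uploaded files
        text = parse(path)                             # 2. parse
        for piece in chunk(text):                      # 3. chunk
            all_chunks.append({"source": path, "text": piece})
    vectors = embed([c["text"] for c in all_chunks])   # 4. embed
    return {                                           # 5.-6. package/deploy
        "chunks_json": json.dumps(all_chunks),
        "vectors": vectors,
    }
```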

Performance Estimates

  • Upload Processing: ~2-5 seconds per document
  • Query Response: <200ms additional latency
  • Package Size: +5-15MB for typical course materials
  • Accuracy: 85-95% relevant context retrieval
  • Memory Usage: +50-100MB runtime overhead

Benefits

This approach maintains your existing speed while adding document-understanding capabilities that persist in the deployed package. Faculty upload course materials once during configuration, and students get contextually aware responses grounded in actual course content, with no external dependencies in the deployed space.

Next Steps

  1. Implement document parser service
  2. Add file upload UI components
  3. Integrate RAG system with existing Crawl4AI architecture
  4. Enhance SPACE_TEMPLATE with embedded materials
  5. Test with sample course materials
  6. Optimize for deployment package size