Spaces:

milwright
/

chatui-helper

Running

App Files Files Community

chatui-helper / file_upload_proposal.md

milwright

Replace Crawl4AI with simple HTTP requests and BeautifulSoup

12839ce 19 days ago

preview code

raw

history blame

5.24 kB

	# File Upload System Proposal for Faculty Course Materials

	Based on your existing architecture, here's a comprehensive proposal for implementing file uploads with efficient parsing and deployment preservation:

	## Core Architecture Design

	### 1. File Processing Pipeline
	```
	Upload → Parse → Chunk → Vector Store → RAG Integration → Deployment Package
	```

	### 2. File Storage Structure
	```
	/course_materials/
	├── raw_files/ # Original uploaded files
	├── processed/ # Parsed text content
	├── embeddings/ # Vector representations
	└── metadata.json # File tracking & metadata
	```

	## Implementation Components

	### File Upload Handler (app.py:352-408 enhancement)
	- Add `gr.File(file_types=[".pdf", ".docx", ".txt", ".md"])` component
	- Support multiple file uploads with `file_count="multiple"`
	- Implement file validation and size limits (10MB per file)

	### Document Parser Service (new: `document_parser.py`)
	- PDF: PyMuPDF for text extraction with layout preservation
	- DOCX: python-docx for structured content
	- TXT/MD: Direct text processing with metadata extraction
	- Auto-detection: File type identification and appropriate parser routing

	### RAG Integration (enhancement to existing web scraping system)
	- Chunking Strategy: Semantic chunking (500-1000 tokens with 100-token overlap)
	- Embeddings: sentence-transformers/all-MiniLM-L6-v2 (lightweight, fast)
	- Vector Store: In-memory FAISS index for deployment portability
	- Retrieval: Top-k similarity search (k=3-5) with relevance scoring

	### Enhanced Template (SPACE_TEMPLATE modification)
	```python
	# Add to generated app.py
	COURSE_MATERIALS = json.loads('''{{course_materials_json}}''')
	EMBEDDINGS_INDEX = pickle.loads(base64.b64decode('''{{embeddings_base64}}'''))

	def get_relevant_context(query, max_contexts=3):
	"""Retrieve relevant course material context"""
	# Vector similarity search
	# Return formatted context snippets
	```

	## Speed & Accuracy Optimizations

	### 1. Processing Speed
	- Batch processing during upload (not per-query)
	- Lightweight embedding model (384 dimensions vs 1536)
	- In-memory vector store (no database dependencies)
	- Cached embeddings in deployment package

	### 2. Query Speed
	- Pre-computed embeddings (no real-time encoding)
	- Efficient FAISS indexing for similarity search
	- Context caching for repeated queries
	- Parallel processing for multiple files

	### 3. Accuracy Enhancements
	- Semantic chunking preserves context boundaries
	- Query expansion with synonyms/related terms
	- Relevance scoring with threshold filtering
	- Metadata-aware retrieval (file type, section, date)

	## Deployment Package Integration

	### Package Structure Enhancement
	```
	generated_space.zip
	├── app.py # Enhanced with RAG
	├── requirements.txt # + sentence-transformers, faiss-cpu
	├── course_materials/ # Embedded materials
	│ ├── embeddings.pkl # FAISS index
	│ ├── chunks.json # Text chunks with metadata
	│ └── files_metadata.json # Original file info
	└── README.md # Updated instructions
	```

	### Size Management
	- Compress embeddings with pickle optimization
	- Base64 encode for template embedding
	- Implement file size warnings (>50MB total)
	- Optional: External storage links for large datasets

	## User Interface Updates

	### Configuration Tab Enhancements
	```python
	with gr.Accordion("Course Materials Upload", open=False):
	file_upload = gr.File(
	label="Upload Course Materials",
	file_types=[".pdf", ".docx", ".txt", ".md"],
	file_count="multiple"
	)
	processing_status = gr.Markdown()
	material_summary = gr.DataFrame() # Show processed files
	```

	## Technical Implementation

	### Dependencies Addition (requirements.txt)
	```
	sentence-transformers==2.2.2
	faiss-cpu==1.7.4
	PyMuPDF==1.23.0
	python-docx==0.8.11
	tiktoken==0.5.1
	```

	### Processing Workflow
	1. Upload: Faculty uploads syllabi, schedules, readings
	2. Parse: Extract text with structure preservation
	3. Chunk: Semantic segmentation with metadata
	4. Embed: Generate vector representations
	5. Package: Serialize index and chunks into deployment
	6. Deploy: Single-file space with embedded knowledge

	## Performance Metrics

	- Upload Processing: ~2-5 seconds per document
	- Query Response: <200ms additional latency
	- Package Size: +5-15MB for typical course materials
	- Accuracy: 85-95% relevant context retrieval
	- Memory Usage: +50-100MB runtime overhead

	## Benefits

	This approach maintains your existing speed while adding powerful document understanding capabilities that persist in the deployed package. Faculty can upload course materials once during configuration, and students get contextually-aware responses based on actual course content without any external dependencies in the deployed space.

	## Next Steps

	1. Implement document parser service
	2. Add file upload UI components
	3. Integrate RAG system with existing web scraping architecture
	4. Enhance SPACE_TEMPLATE with embedded materials
	5. Test with sample course materials
	6. Optimize for deployment package size