# Phi-4 Training Critical Deployment Checklist

## Essential Configuration Requirements

### 1. Model Configuration
- [ ] Model name: `unsloth/phi-4-unsloth-bnb-4bit`
- [ ] BF16 precision enabled, FP16 disabled
- [ ] Appropriate sequence length (2048)
- [ ] LoRA parameters correctly configured (r: 32, alpha: 16) (see the sketch below)
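The items above might translate into code roughly as follows. This is a minimal sketch assuming the Unsloth `FastLanguageModel` API; the `target_modules` list is illustrative and not taken from the original config.

```python
# Sketch only: assumes the Unsloth FastLanguageModel API; target_modules are illustrative.
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,       # appropriate sequence length
    dtype=torch.bfloat16,      # BF16 enabled, FP16 disabled
    load_in_4bit=True,         # pre-quantized bnb-4bit checkpoint
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                      # LoRA rank
    lora_alpha=16,             # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    use_gradient_checkpointing=True,
)
```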

### 2. Hardware & Resource Management
- [ ] Per-device batch size ≤ 16
- [ ] Gradient accumulation steps ≥ 3
- [ ] Gradient checkpointing enabled
- [ ] Memory usage limits properly set (85% of GPU capacity) (sketch below)
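One way to enforce these resource limits is sketched below, using Hugging Face `TrainingArguments` and PyTorch's per-process memory cap; the output directory is a placeholder.

```python
# Sketch: caps each GPU process at ~85% of device memory and applies the batch limits above.
import torch
from transformers import TrainingArguments

# Memory usage limit: 85% of GPU capacity per process (per checklist item).
for device_id in range(torch.cuda.device_count()):
    torch.cuda.set_per_process_memory_fraction(0.85, device=device_id)

training_args = TrainingArguments(
    output_dir="phi4-checkpoints",     # placeholder path
    per_device_train_batch_size=16,    # ≤ 16 per device
    gradient_accumulation_steps=3,     # ≥ 3
    gradient_checkpointing=True,
    bf16=True,
    fp16=False,
)
```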

### 3. Critical Dataset Handling Rules
- [ ] **NO REORDERING of dataset entries** - original order must be preserved
- [ ] **NO COMBINING of separate entries** - each entry must remain distinct
- [ ] **SEQUENTIAL PROCESSING required** - entries must be processed one after another
- [ ] `sort_by_id` and `maintain_paper_order` flags properly set to preserve data sequence
- [ ] Sequential sampler used with no shuffling (`"shuffle": false`)
- [ ] Dataset sequential integrity verified with validation samples (see the snippet below)
- [ ] Conversation structure preserved (original format maintained)
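A sketch of how sequential loading and the integrity check could look. It assumes each entry carries an `id` field whose ascending order matches the original sequence (the intent behind `sort_by_id` / `maintain_paper_order`); that schema is an assumption, not confirmed from the source.

```python
# Sketch: strictly sequential loading plus an order-integrity spot check.
# Assumes a hypothetical per-entry "id" field that increases in original order.
from torch.utils.data import DataLoader, SequentialSampler

def build_sequential_loader(dataset, batch_size, collate_fn=None):
    # SequentialSampler yields indices 0, 1, 2, ... so entries are read in their
    # original stored order and never shuffled or reordered.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=SequentialSampler(dataset),
        collate_fn=collate_fn,
    )

def verify_sequential_integrity(dataset, num_samples=10):
    # Spot-check the first few entries: their ids should already be in ascending order.
    ids = [dataset[i]["id"] for i in range(min(num_samples, len(dataset)))]
    assert ids == sorted(ids), f"Sequential order violated in first entries: {ids}"
```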

### 4. Essential Error Handling
- [ ] Clear error catching for dataset loading issues
- [ ] Memory tracking at key training points
- [ ] Low-verbosity logging for HF Space compatibility (see the example below)
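A minimal sketch of the error catching, memory tracking, and quiet logging described above; the dataset name and load call are placeholders.

```python
# Sketch: dataset-load error handling, GPU memory tracking, and low-verbosity logging.
import logging
import torch
from datasets import load_dataset

logging.basicConfig(level=logging.WARNING)   # keep HF Space logs quiet
logger = logging.getLogger(__name__)

def load_training_dataset(name_or_path):
    # Clear error catching around dataset loading.
    try:
        return load_dataset(name_or_path, split="train")
    except Exception as exc:
        logger.error("Failed to load dataset %s: %s", name_or_path, exc)
        raise

def log_gpu_memory(stage):
    # Memory tracking at key training points (e.g. after model load, after first batch).
    for device_id in range(torch.cuda.device_count()):
        peak_gb = torch.cuda.max_memory_allocated(device_id) / 1024**3
        logger.warning("[%s] GPU %d peak memory: %.1f GB", stage, device_id, peak_gb)
```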

### 5. Training Core Requirements
- [ ] Appropriate learning rate (2e-5)
- [ ] Proper checkpointing frequency
- [ ] Hub settings correctly configured for model saving (sketch below)
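Extending the earlier `TrainingArguments` sketch, the learning rate, checkpoint frequency, and Hub settings might be wired in as follows; the repo id and step count are placeholders.

```python
# Sketch: core training, checkpointing, and Hub settings (repo id and save_steps are placeholders).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi4-checkpoints",
    learning_rate=2e-5,                        # per checklist
    save_strategy="steps",
    save_steps=100,                            # placeholder checkpoint frequency
    push_to_hub=True,
    hub_model_id="your-org/phi4-finetune",     # placeholder repo id
    hub_strategy="checkpoint",                 # push checkpoints so training can resume
)
```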

---

## Pre-Deployment Verification

| Requirement | Status | Notes |
|-------------|--------|-------|
| Data sequential integrity | | Confirm entries are processed in order |
| GPU memory within limits | | Check peak memory doesn't exceed 20GB per GPU |
| Training batch verification | | Verify the first few batches maintain proper order |
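A small pre-flight check corresponding to the table above; the 20 GB threshold comes from the notes column, while the per-entry `id` field is the same assumed schema as in the dataset-handling sketch.

```python
# Sketch: pre-deployment checks mirroring the verification table above.
import torch

def preflight_checks(dataloader, num_batches=3, max_gpu_gb=20.0):
    # Training batch verification: the first few batches must keep the original entry order.
    last_id = -1
    for batch_idx, batch in enumerate(dataloader):
        if batch_idx >= num_batches:
            break
        for entry_id in batch["id"]:             # assumes a per-entry "id" field
            assert int(entry_id) > last_id, "Batch entries out of order"
            last_id = int(entry_id)

    # GPU memory within limits: peak must stay under 20 GB per 24 GB L4 card.
    for device_id in range(torch.cuda.device_count()):
        peak_gb = torch.cuda.max_memory_allocated(device_id) / 1024**3
        assert peak_gb < max_gpu_gb, f"GPU {device_id} peak {peak_gb:.1f} GB exceeds limit"
```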

---

**Current Hardware**: 4× NVIDIA L4 GPUs (24GB VRAM each)

**CRITICAL REMINDER**: Data sequence preservation is the highest priority - any shuffling, reordering, or combining of entries will compromise model quality.

*Last Updated: 2025-03-09*