# Phase 1: Domain Adaptation (Unsupervised)
This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).
## Overview
Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.
## Files
### Core Training Files
- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages
### Analysis & Utilities
- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)
## Setup
1. **Environment Setup**:
```bash
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
```
2. **Environment Variables**:
Create `.env` file with:
```
HUGGINGFACE_TOKEN=your_token_here
```
3. **Verify Setup**:
```bash
python check_tokenization.py # Ensures tokenizer works
```
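The scripts pick the token up from `.env` at runtime. A minimal sketch of that lookup, assuming `python-dotenv` is available (an assumption, not confirmed by `requirements.txt`):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads HUGGINGFACE_TOKEN (and any other variables) from .env
hf_token = os.getenv("HUGGINGFACE_TOKEN")
if not hf_token:
    raise RuntimeError("HUGGINGFACE_TOKEN missing; add it to .env before training")
```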
## How It Works
1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses pre-quantized Unsloth 4-bit model for memory-efficient and faster training
4. **Checkpointing**: Saves regular checkpoints and pushes to Hub
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to Hugging Face Hub
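A minimal sketch of steps 1–2, assuming the pre-tokenized corpus is hosted as a Hugging Face dataset with an `id` column; the real dataset ID and field names come from `dataset_config.json`:

```python
from datasets import load_dataset

# Hypothetical dataset ID; the actual one is defined in dataset_config.json.
dataset = load_dataset("your-org/pretokenized-cogsci-corpus", split="train")

# Sorting by ID keeps chunks from the same paper adjacent; the order is then
# preserved by the sampler (see "Sequential Processing" below).
dataset = dataset.sort("id")
```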
## Key Features
### Memory-Efficient Training
The training setup is optimized for A10G GPUs:
- Uses pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed precision training
- Optimized batch sizes for maximum throughput
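A sketch of how the model might be loaded with these optimizations via plain `transformers`; the actual options live in `transformers_config.json` and `hardware_config.json`, and flash attention requires the `flash-attn` package:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "unsloth/phi-4-unsloth-bnb-4bit"  # already 4-bit, no extra quantization config needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # bfloat16 mixed precision
    attn_implementation="flash_attention_2",  # flash attention
    device_map="auto",
)
model.gradient_checkpointing_enable()         # trade compute for memory
```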
### Sequential Processing
The training script ensures that chunks from the same research paper are processed together by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)
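One way to enforce this order, sketched as a `Trainer` subclass that swaps the default random sampler for a `SequentialSampler` (the actual mechanism in `run_transformers_training.py` may differ):

```python
from torch.utils.data import SequentialSampler
from transformers import Trainer

class SequentialTrainer(Trainer):
    """Walks the (pre-sorted) dataset in order instead of shuffling it."""

    def _get_train_sampler(self, *args, **kwargs):
        # Trainer normally returns a RandomSampler here; a SequentialSampler
        # keeps the dataset.sort("id") order so papers stay contiguous.
        return SequentialSampler(self.train_dataset)
```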
### Data Collator
The `SimpleDataCollator` class:
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
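A rough sketch of what such a collator could look like; the field name (`input_ids`) and the exact error handling are assumptions:

```python
class SimpleDataCollator:
    """Batches pre-tokenized causal-LM examples without re-tokenizing them."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.stats = {"processed": 0, "skipped": 0}  # simple processing statistics

    def __call__(self, features):
        rows = []
        for f in features:
            if "input_ids" in f:
                rows.append({"input_ids": f["input_ids"]})
                self.stats["processed"] += 1
            else:
                self.stats["skipped"] += 1  # skip malformed entries instead of failing
        batch = self.tokenizer.pad(rows, return_tensors="pt")
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
        batch["labels"] = labels
        return batch
```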
### Checkpointing
The training process:
- Saves a checkpoint every 200 steps
- Pushes to the Hub on every save
- Keeps up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted
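In `TrainingArguments` terms, this behaviour roughly corresponds to the following settings (a sketch; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

checkpoint_args = TrainingArguments(
    output_dir="phi-4-research-assistant",  # placeholder output path
    save_strategy="steps",
    save_steps=200,                         # checkpoint every 200 steps
    save_total_limit=5,                     # keep the 5 most recent checkpoints
    push_to_hub=True,                       # default hub_strategy pushes on every save
    hub_model_id="George-API/phi-4-research-assistant",
)

# Resuming after an interruption picks up the newest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```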
## Hardware Requirements
This training setup is optimized for:
- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB System RAM
- CUDA 11.8 or higher
Memory breakdown per GPU:
- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB
## Configuration
Key parameters in `transformers_config.json`:
- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5
The configuration is optimized for:
- Maximum memory efficiency with pre-quantized model
- Stable training with cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates
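As a sketch, these values could be read from the JSON file and passed to `TrainingArguments`; the key names mirror the list above and may differ from the real config layout, and `max_seq_length` (2048) is applied at tokenization time rather than here:

```python
import json

from transformers import TrainingArguments

with open("transformers_config.json") as f:
    cfg = json.load(f)

training_args = TrainingArguments(
    output_dir="phi-4-research-assistant",
    learning_rate=cfg["learning_rate"],                              # 2e-5
    num_train_epochs=cfg["num_train_epochs"],                        # 3
    per_device_train_batch_size=cfg["per_device_train_batch_size"],  # 16
    gradient_accumulation_steps=cfg["gradient_accumulation_steps"],  # 4
    lr_scheduler_type=cfg["lr_scheduler_type"],                      # cosine
    warmup_ratio=cfg["warmup_ratio"],                                # 0.03
    neftune_noise_alpha=cfg["neftune_noise_alpha"],                  # 5 (supported by recent transformers)
    bf16=True,
    # checkpoint/Hub settings are shown under "Checkpointing" above
)
```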
## Running Domain Adaptation
To start domain adaptation:
```bash
python run_transformers_training.py
```
The script will:
1. Load the pre-quantized model and dataset
2. Apply optimized training parameters
3. Process the data sequentially
4. Train the model for 3 epochs
5. Save and push checkpoints to Hub regularly
## Using the Model
After training, you can use the domain-adapted model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Chat Format Example
Phi-4 works best with its native chat template:
```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = generator(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```
## Expected Outcomes
After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2
## Next Steps
After completing domain adaptation:
1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning)
3. Use TensorBoard to analyze training metrics