---
title: unsloth/DeepSeek-R1-Distill-Qwen-14B-bnb-4bit (Research Training)
emoji: 🧪
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---


# Model Fine-Tuning Project

## Overview

- **Goal**: Fine-tune `unsloth/DeepSeek-R1-Distill-Qwen-14B-bnb-4bit` on a pre-tokenized JSONL dataset
- **Model**: `unsloth/DeepSeek-R1-Distill-Qwen-14B-bnb-4bit`
  - **Important**: Already 4-bit quantized - do not quantize further (see the loading sketch below)
- **Dataset**: `phi4-cognitive-dataset`

⚠️ **RESEARCH TRAINING PHASE ONLY**: This space is being used for training purposes and does not provide interactive model outputs.
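
Because the checkpoint already ships in bnb 4-bit, it should be loaded as-is with no additional quantization applied on top. A minimal loading sketch, assuming the `unsloth` loader from `requirements.txt` (the exact call in `run_cloud_training.py` may differ):

```python
# Sketch only (an assumption, not the project's exact loading code): the
# checkpoint is already bnb-4bit, so no extra quantization config is applied.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Qwen-14B-bnb-4bit",
    max_seq_length=2048,  # dataset entries are under 2048 tokens
    load_in_4bit=True,    # reuse the existing 4-bit weights; do not re-quantize
)
```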

### Dataset Specs
- Entries under 2048 tokens
- Fields: `prompt_number`, `article_id`, `conversations`
- Process in ascending `prompt_number` order (see the sketch below)
- Pre-tokenized dataset - no additional tokenization needed
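
A minimal loading-and-sorting sketch using the `datasets` library; the file path below is a placeholder, not the dataset's actual location:

```python
# Sketch only: load the pre-tokenized JSONL and sort it by prompt_number.
# "data.jsonl" is a placeholder path.
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.jsonl", split="train")
dataset = dataset.sort("prompt_number")  # process prompts in ascending order

print(dataset.column_names)  # expect: prompt_number, article_id, conversations
```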
    

### Hardware

- GPU: 1x L40S (48 GB VRAM)
- RAM: 62 GB
- CPU: 8 cores


## Environment Variables (.env)

- `HF_TOKEN`: Hugging Face API token 
- `HF_USERNAME`: Hugging Face username 
- `HF_SPACE_NAME`: Target space name 
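
These might be read at startup roughly as follows (a sketch; the use of `python-dotenv` is an assumption about the setup):

```python
# Sketch: read the .env variables. python-dotenv usage is an assumption;
# plain os.environ works the same once the variables are set.
import os
from dotenv import load_dotenv

load_dotenv()  # populate os.environ from the local .env file

hf_token = os.environ["HF_TOKEN"]
hf_username = os.environ["HF_USERNAME"]
hf_space_name = os.environ["HF_SPACE_NAME"]
```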

## Files

### 1. `app.py`
- Training status dashboard
- No interactive model demo (research phase only)
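
A status-only dashboard in Gradio might look roughly like this (a sketch, not the actual `app.py`):

```python
# Hypothetical sketch of a status-only dashboard; the real app.py may differ.
import gradio as gr

with gr.Blocks(title="Research Training Status") as demo:
    gr.Markdown("# DeepSeek-R1-Distill-Qwen-14B fine-tuning")
    gr.Markdown("Research training phase only: this Space does not serve model outputs.")

if __name__ == "__main__":
    demo.launch()
```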

### 2. `transformers_config.json`

- Configuration for Hugging Face Transformers
- Contains: model parameters, hardware settings, optimizer details
- Specifies pre-tokenized dataset handling (see the mapping sketch below)
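
One way such a config might be consumed (a sketch; the key names below are illustrative, not the file's actual schema):

```python
# Hypothetical sketch: read transformers_config.json and map it onto
# TrainingArguments. Key names and defaults are illustrative assumptions.
import json
from transformers import TrainingArguments

with open("transformers_config.json") as f:
    cfg = json.load(f)

training_args = TrainingArguments(
    output_dir=cfg.get("output_dir", "outputs"),
    per_device_train_batch_size=cfg.get("per_device_train_batch_size", 1),
    gradient_accumulation_steps=cfg.get("gradient_accumulation_steps", 8),
    learning_rate=cfg.get("learning_rate", 2e-4),
    bf16=cfg.get("bf16", True),
    gradient_checkpointing=cfg.get("gradient_checkpointing", True),
    logging_steps=cfg.get("logging_steps", 10),
    save_steps=cfg.get("save_steps", 100),
)
```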



### 3. `run_cloud_training.py`

- Loads the pre-tokenized dataset, sorts it by `prompt_number`, and runs training:
  1. Load and sort the JSONL by `prompt_number`
  2. Use the pre-tokenized `input_ids` directly (no re-tokenization)
  3. Initialize the model and trainer with parameters from the config file
  4. Execute training with metrics, checkpoints, and error handling

- Uses Hugging Face's Trainer API with custom pre-tokenized data collator
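
A collator for already-tokenized examples might look roughly like this (a sketch of the idea, not a copy of the project's collator):

```python
# Sketch of a collator for pre-tokenized examples; an assumption about the
# shape of the real collator in run_cloud_training.py.
import torch

def pretokenized_collator(batch, pad_token_id=0):
    """Pad pre-tokenized input_ids to the longest example in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        ids = list(ex["input_ids"])
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(ids + [-100] * pad)  # -100 masks padding out of the loss
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```

A collator like this can be passed to `Trainer(data_collator=...)`, so no tokenizer is invoked during training.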



### 4. `requirements.txt`

- Python dependencies: `transformers`, `datasets`, `torch`, etc.
- Contains `unsloth` for optimized training



### 5. `upload_to_space.py`

- Update model and space directly using HF API
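
A rough sketch of what such an update might look like with `huggingface_hub` (the model repo id below is a placeholder, not the project's actual repository):

```python
# Hypothetical sketch of pushing artifacts with huggingface_hub; the actual
# upload_to_space.py may differ.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])

# Push training outputs to a model repository (placeholder repo id).
api.upload_folder(
    folder_path="outputs",
    repo_id=f"{os.environ['HF_USERNAME']}/finetuned-model",  # placeholder
    repo_type="model",
)

# Update the Space that hosts this dashboard.
api.upload_file(
    path_or_fileobj="app.py",
    path_in_repo="app.py",
    repo_id=f"{os.environ['HF_USERNAME']}/{os.environ['HF_SPACE_NAME']}",
    repo_type="space",
)
```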



## Implementation Notes



### Best Practices

- Dataset is pre-tokenized and sorted by `prompt_number`
- Settings stored in the config file rather than hardcoded
- Hardware-optimized training parameters
- Gradient checkpointing and mixed precision training
- Complete logging for monitoring progress

### Model Repository

This space hosts a fine-tuned version of the [unsloth/DeepSeek-R1-Distill-Qwen-14B-bnb-4bit](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-bnb-4bit) model.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference