---
license: apache-2.0
datasets:
- abhinavv3/edu_fineweb10B_sharded_50shards
language:
- en
pipeline_tag: text-generation
tags:
- text-generation
- transformer
---
# 🧠 GPT with Modified Memorizing Transformer

An extended GPT-style 118M-parameter model that integrates the key ideas from **"Memorizing Transformers" (Wu et al., 2022)** with practical enhancements: Grouped Query Attention, KNN-based memory lookup, RoPE, and XL-style memory recurrence.

This model is designed for scalable training, long-context understanding, and efficient memory usage.

---

## πŸ”¬ Key Features

- βœ… **Grouped Query Attention (GQA)** β€” Groups query heads to share key/value heads, saving memory and speeding up attention
- βœ… **KNN Memory** β€” A learnable mechanism to retrieve past activations via nearest-neighbor search
- βœ… **XL-style Attention** β€” Adds recurrence to the attention stack, improving long-sequence learning
- βœ… **Rotary Positional Encoding (RoPE)** β€” Replaces standard sin-cos encoding for better extrapolation
- βœ… **Memory Lifespan & Clearing** β€” Custom mechanisms to manage token memory duration
- βœ… **Sharded Dataset Loader** β€” Efficient `.npy`-based streaming for large datasets
- βœ… **Mixed Precision + DDP Training** β€” Scalable multi-GPU support using `torchrun` and `torch.autocast`

---

## πŸ“ Project Structure

```
memGPT/
β”œβ”€β”€ configs/          β†’ Training & model hyperparameters
β”œβ”€β”€ data/             β†’ Tokenized and sharded datasets
β”œβ”€β”€ model_core/       β†’ Model + attention + dataloader logic
β”œβ”€β”€ scripts/          β†’ Training, evaluation, generation scripts
β”œβ”€β”€ evaluation/       β†’ HellaSwag benchmark evaluation
β”œβ”€β”€ logs/             β†’ Checkpoints and logs
β”œβ”€β”€ requirements.txt  β†’ Python dependencies
└── README.md         β†’ This model card
```

---

## βš™οΈ Configuration

Edit `configs/config.json` to change model or training settings.

<details>
<summary>Example config</summary>

```json
{
  "model": {
    "block_size": 1024,
    "vocab_size": 50304,
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "n_kv_head": 4,
    "max_knn_memories": 81920
  },
  "training": {
    "max_steps": 19073,
    "log_dir": "log",
    "total_batch_size": 2048,
    "B": 64,
    "T": 1024,
    "max_lr": 0.0006,
    "min_lr": 0.00006,
    "warmup_steps": 715,
    "weight_decay": 0.1,
    "learning_rate": 0.0006
  }
}
```

</details>
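
As a rough illustration of how these settings map onto code, here is a minimal, hypothetical loader. The actual parsing lives in the repo's scripts; the dataclass below only mirrors the field names of the example config above and is not taken from the source.

```python
import json
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Mirrors the "model" section of configs/config.json (illustrative only)
    block_size: int
    vocab_size: int
    n_layer: int
    n_head: int
    n_embd: int
    n_kv_head: int
    max_knn_memories: int

def load_config(path="configs/config.json"):
    """Read the JSON file and return (model config, training dict)."""
    with open(path) as f:
        cfg = json.load(f)
    return ModelConfig(**cfg["model"]), cfg["training"]
```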

## πŸš€ Training

- ▢️ Single GPU: `python scripts/train.py`
- πŸ” Multi-GPU (DDP): `torchrun --nproc_per_node=NUM_GPUS scripts/train.py`

## πŸ“Š Evaluation

Evaluate on the HellaSwag benchmark:

```bash
python scripts/evaluate.py
```

Requires:

- `data/hellaswag/hellaswag_val.jsonl`
- Model checkpoint(s) in `logs/`

Scoring is based on masked token loss across the multiple-choice completions.
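
The scoring rule can be sketched as follows. This is a generic illustration of masked-completion-loss scoring, not necessarily identical to `scripts/evaluate.py`, and it assumes the model returns logits for a plain token sequence.

```python
import torch
import torch.nn.functional as F

def score_completion(model, ctx_tokens, ending_tokens, device="cuda"):
    """Average cross-entropy over the ending tokens only (context positions are masked out)."""
    tokens = torch.tensor(ctx_tokens + ending_tokens, device=device)[None, :]
    with torch.no_grad():
        logits, _ = model(tokens[:, :-1])          # assumes (logits, loss) output
    targets = tokens[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    # Only positions that predict an ending token contribute to the score.
    mask = torch.zeros_like(targets, dtype=torch.bool)
    mask[:, len(ctx_tokens) - 1:] = True
    return loss[mask.reshape(-1)].mean().item()

def pick_answer(model, ctx_tokens, endings):
    """Lowest masked loss across the candidate endings wins."""
    losses = [score_completion(model, ctx_tokens, e) for e in endings]
    return min(range(len(losses)), key=losses.__getitem__)
```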

## 🧠 Attention Mechanism Deep Dive

<details>
<summary>Grouped Query Attention (GQA)</summary>

- `n_head` = total query heads
- `n_kv_head` = shared key/value heads
- Reduces compute overhead for large models by grouping query heads to reuse K/V

</details>
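
A minimal sketch of the grouping idea, using the `n_head = 12`, `n_kv_head = 4` values from the example config. The function name and shapes are illustrative, not taken from `model_core/attention.py`.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_head=12, n_kv_head=4):
    """q: (B, n_head, T, hd); k, v: (B, n_kv_head, T, hd).
    Each group of n_head // n_kv_head query heads shares one K/V head."""
    group = n_head // n_kv_head
    # Expand K/V so every query head sees its group's shared K/V head.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T, hd = 2, 16, 64
q = torch.randn(B, 12, T, hd)
k = torch.randn(B, 4, T, hd)
v = torch.randn(B, 4, T, hd)
out = grouped_query_attention(q, k, v)   # (B, 12, T, hd)
```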

<details>
<summary>KNN Memory Retrieval</summary>

- Maintains a memory of past key vectors (max: 81,920 tokens)
- Fast KNN lookup with grouped projections
- Integrated into the attention flow in `model_core/attention.py`

</details>
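
To make the idea concrete, here is a deliberately simplified, hypothetical memory class using exact dot-product search with `torch.topk`. The repo's implementation may differ, for example by using an approximate index or per-head memories.

```python
import torch

class KNNMemory:
    """Toy kNN memory: stores past (key, value) pairs and retrieves the
    top-k nearest keys for each query by dot-product similarity."""

    def __init__(self, dim, max_memories=81920):
        self.max_memories = max_memories
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)

    def add(self, k, v):
        # k, v: (N, dim); keep only the most recent max_memories entries.
        self.keys = torch.cat([self.keys, k])[-self.max_memories:]
        self.values = torch.cat([self.values, v])[-self.max_memories:]

    def search(self, q, topk=32):
        # q: (M, dim) -> retrieved keys/values: (M, topk, dim)
        sims = q @ self.keys.T                      # (M, N)
        idx = sims.topk(min(topk, sims.size(-1)), dim=-1).indices
        return self.keys[idx], self.values[idx]

    def clear(self):
        """Memory-clearing hook, e.g. at document boundaries."""
        self.keys = self.keys[:0]
        self.values = self.values[:0]
```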

<details>
<summary>XL-style Recurrence</summary>

- Recurrence between attention blocks
- Memory cache updated at each step
- Custom clearing logic helps avoid stale activations

</details>
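
The recurrence can be pictured as a per-layer cache of the previous segment's hidden states. The class below is a hypothetical sketch, not the repo's cache.

```python
import torch

class XLMemory:
    """Toy Transformer-XL-style cache: keeps the last `mem_len` hidden
    states per layer and prepends them (detached) to the next segment."""

    def __init__(self, n_layer, mem_len=1024):
        self.mem_len = mem_len
        self.mems = [None] * n_layer

    def get(self, layer_idx):
        return self.mems[layer_idx]

    def update(self, layer_idx, hidden):
        # hidden: (B, T, C). Detach so gradients never flow into old segments.
        prev = self.mems[layer_idx]
        cat = hidden.detach() if prev is None else torch.cat([prev, hidden.detach()], dim=1)
        self.mems[layer_idx] = cat[:, -self.mem_len:]

    def clear(self):
        """Clearing logic to avoid attending over stale activations."""
        self.mems = [None] * len(self.mems)
```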

<details>
<summary>Rotary Positional Encoding (RoPE)</summary>

- Replaces standard sinusoidal encoding
- Better generalization on long contexts
- Found in `model_core/attention.py`

</details>
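
For reference, a compact RoPE application in the split-halves convention. The exact pairing used in `model_core/attention.py` may differ (e.g. interleaved pairs), so treat this as an illustration.

```python
import torch

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (..., T, head_dim)."""
    T, D = x.shape[-2], x.shape[-1]
    half = D // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 12, 8, 64)
q_rot = apply_rope(q)   # same shape, positions encoded by rotation
```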

## 🧩 Data Handling

- Training data is stored as sharded `.npy` files
- Matching stride/memory length logic
- DDP-compatible DataLoader
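
A rough sketch of what a sharded, DDP-aware `.npy` loader can look like (nanoGPT-style). Class and method names are illustrative and not taken from `model_core/`.

```python
import numpy as np
import torch

class ShardedLoader:
    """Toy sharded loader: streams token shards saved as .npy files and
    yields (x, y) batches, offsetting by rank so DDP ranks see disjoint data."""

    def __init__(self, shard_paths, B, T, rank=0, world_size=1):
        self.shard_paths, self.B, self.T = shard_paths, B, T
        self.rank, self.world_size = rank, world_size
        self.shard_idx = 0
        self._load(self.shard_idx)

    def _load(self, idx):
        self.tokens = torch.from_numpy(np.load(self.shard_paths[idx]).astype(np.int64))
        self.pos = self.B * self.T * self.rank  # each rank starts at its own offset

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)        # inputs
        y = buf[1:].view(B, T)         # targets, shifted by one token
        # Advance past all ranks' chunks; move to the next shard when exhausted.
        self.pos += B * T * self.world_size
        if self.pos + B * T * self.world_size + 1 > len(self.tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shard_paths)
            self._load(self.shard_idx)
        return x, y
```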

## πŸ“¦ Install Dependencies

```bash
pip install -r requirements.txt
```

Ensure that your PyTorch and CUDA versions match your local GPU setup.

## πŸ”— Reference

Wu et al., *Memorizing Transformers*, ICLR 2022. [Paper link](https://arxiv.org/abs/2203.08913)

## πŸ’‘ Future Work

- Add LoRA support
- Integrate with the Hugging Face `transformers` API
- Add benchmarking on other datasets (e.g. LAMBADA, PIQA)

---

Built with ❀️ by abhinavv3