---
license: apache-2.0
datasets:
- abhinavv3/edu_fineweb10B_sharded_50shards
language:
- en
pipeline_tag: text-generation
tags:
- text-generation
- transformer
---

# GPT with Modified Memorizing Transformer

An extended GPT-style 118M-parameter model that integrates the key ideas from **"Memorizing Transformers" (Wu et al., 2022)** with practical enhancements such as Grouped Query Attention, KNN-based memory lookup, RoPE, and XL-style memory recurrence.

The model is designed for scalable training, long-context understanding, and efficient memory usage.

---

## Key Features
- **Grouped Query Attention (GQA)**: groups query heads to share key/value heads, saving memory and speeding up attention
- **KNN Memory**: a learnable mechanism to retrieve past activations via nearest-neighbor search
- **XL-style Attention**: adds recurrence to the attention stack, improving long-sequence learning
- **Rotary Positional Encoding (RoPE)**: replaces standard sin-cos encoding for better extrapolation
- **Memory Lifespan & Clearing**: custom mechanisms to manage how long token memories are kept
- **Sharded Dataset Loader**: efficient `.npy`-based streaming for large datasets
- **Mixed Precision + DDP Training**: scalable multi-GPU support using `torchrun` and `torch.autocast`

---

## Project Structure

```
memGPT/
├── configs/            # Training & model hyperparams
├── data/               # Tokenized and sharded datasets
├── model_core/         # Model + attention + dataloader logic
├── scripts/            # Training, evaluation, generation scripts
├── evaluation/         # HellaSwag benchmark evaluation
├── logs/               # Checkpoints and logs
├── requirements.txt    # Python dependencies
└── README.md           # This model card
```

---

## Configuration

Edit `configs/config.json` to change model or training settings.

<details>
<summary>Example config</summary>

```json
{
  "model": {
    "block_size": 1024,
    "vocab_size": 50304,
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "n_kv_head": 4,
    "max_knn_memories": 81920
  },
  "training": {
    "max_steps": 19073,
    "log_dir": "log",
    "total_batch_size": 2048,
    "B": 64,
    "T": 1024,
    "max_lr": 0.0006,
    "min_lr": 0.00006,
    "warmup_steps": 715,
    "weight_decay": 0.1,
    "learning_rate": 0.0006
  }
}
```

</details>
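
For reference, a minimal sketch of consuming this config from Python. The `ModelConfig` dataclass below is illustrative only and is not necessarily how `scripts/train.py` actually reads the file:

```python
import json
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical container mirroring the "model" section of configs/config.json
    block_size: int
    vocab_size: int
    n_layer: int
    n_head: int
    n_embd: int
    n_kv_head: int
    max_knn_memories: int


with open("configs/config.json") as f:
    cfg = json.load(f)

model_cfg = ModelConfig(**cfg["model"])   # e.g. model_cfg.n_kv_head == 4
train_cfg = cfg["training"]               # plain dict: max_steps, B, T, max_lr, ...
```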

## Training

- Single GPU: `python scripts/train.py`
- Multi-GPU (DDP): `torchrun --nproc_per_node=NUM_GPUS scripts/train.py`
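
The mixed-precision path combines `torch.autocast` with DDP gradient averaging. A simplified sketch of one training step (not the exact code in `scripts/train.py`; it assumes the model returns `(logits, loss)` when given inputs and targets):

```python
import torch


def train_step(model, optimizer, x, y, device="cuda"):
    """One mixed-precision training step.

    When launched with torchrun, the model would typically be wrapped in
    DistributedDataParallel so gradients are averaged across ranks on backward().
    """
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x.to(device), y.to(device))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
```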

## Evaluation

Evaluate on the HellaSwag benchmark:

```bash
python scripts/evaluate.py
```

Requires:

- `data/hellaswag/hellaswag_val.jsonl`
- model checkpoint(s) in `logs/`

Scoring is based on the masked-token loss over each multiple-choice completion; the completion with the lowest loss is taken as the model's answer.
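
Roughly, that scoring rule looks like the sketch below (illustrative only; `scripts/evaluate.py` may differ in tokenization and masking details, and the sketch assumes the model returns `(logits, loss)` even when no targets are passed):

```python
import torch
import torch.nn.functional as F


def score_completions(model, ctx_tokens, ending_tokens_list):
    """Pick the ending whose tokens get the lowest average loss after the context."""
    losses = []
    for ending in ending_tokens_list:
        tokens = torch.cat([ctx_tokens, ending]).unsqueeze(0)        # (1, T)
        logits, _ = model(tokens[:, :-1])                            # predict next tokens
        targets = tokens[:, 1:].clone()
        targets[:, : ctx_tokens.numel() - 1] = -100                  # mask out the context
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            ignore_index=-100,
        )
        losses.append(loss.item())
    return int(torch.tensor(losses).argmin())                        # index of the best ending
```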

## Attention Mechanism Deep Dive

<details>
<summary>Grouped Query Attention (GQA)</summary>

- `n_head`: total query heads
- `n_kv_head`: shared key/value heads
- Reduces compute overhead for large models by grouping query heads to reuse K/V

</details>
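
A generic sketch of the K/V head sharing (not a copy of `model_core/attention.py`):

```python
import torch


def expand_kv_heads(k: torch.Tensor, v: torch.Tensor, n_head: int, n_kv_head: int):
    """Repeat each of the n_kv_head K/V heads so they line up with the n_head query heads.

    k, v: (batch, n_kv_head, seq_len, head_dim)
    returns tensors of shape (batch, n_head, seq_len, head_dim)
    """
    groups = n_head // n_kv_head            # e.g. 12 // 4 = 3 query heads per K/V head
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    return k, v
```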
<details>
<summary>KNN Memory Retrieval</summary>

- Maintains a memory of past key vectors (max: 81920 tokens, set by `max_knn_memories`)
- Fast KNN lookup with grouped projections
- Integrated into the attention flow via `model_core/attention.py`

</details>
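
In spirit, the retrieval step looks like this brute-force sketch; the actual code in `model_core/attention.py` works over grouped projections and enforces the memory lifespan/clearing rules:

```python
import torch


def knn_lookup(queries: torch.Tensor, mem_keys: torch.Tensor,
               mem_values: torch.Tensor, top_k: int = 32):
    """Retrieve the top-k stored (key, value) pairs most similar to each query.

    queries:    (n_queries, head_dim)
    mem_keys:   (n_memories, head_dim)   -- keys written during earlier steps
    mem_values: (n_memories, head_dim)
    """
    sims = queries @ mem_keys.t()                # (n_queries, n_memories)
    scores, idx = sims.topk(top_k, dim=-1)       # nearest neighbors per query
    retrieved_k = mem_keys[idx]                  # (n_queries, top_k, head_dim)
    retrieved_v = mem_values[idx]
    return retrieved_k, retrieved_v, scores
```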
<details>
<summary>XL-style Recurrence</summary>

- Recurrence between attention blocks
- Memory cache updated at each step
- Custom clearing logic helps avoid stale activations

</details>
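
Schematically, the segment-level cache can be thought of as below (simplified; the real cache handling and clearing live in the attention modules):

```python
import torch


def attend_with_xl_memory(k_new, v_new, cache=None, mem_len: int = 1024):
    """Prepend cached K/V from the previous segment, then keep only the newest mem_len entries.

    k_new, v_new: (batch, n_head, seq_len, head_dim) for the current segment
    cache: optional dict with "k" and "v" from earlier segments (detached from the graph)
    """
    if cache is not None:
        k = torch.cat([cache["k"], k_new], dim=2)
        v = torch.cat([cache["v"], v_new], dim=2)
    else:
        k, v = k_new, v_new
    new_cache = {"k": k[:, :, -mem_len:].detach(), "v": v[:, :, -mem_len:].detach()}
    return k, v, new_cache
```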
<details>
<summary>Rotary Positional Encoding (RoPE)</summary>

- Replaces standard sinusoidal encoding
- Better generalization on long contexts
- Found in `model_core/attention.py`

</details>
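
For illustration, a standard RoPE application over query/key tensors (a generic formulation, not necessarily identical to the repository's implementation):

```python
import torch


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key vectors pairwise by position-dependent angles.

    x: (batch, n_head, seq_len, head_dim) with an even head_dim
    """
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device).float()[:, None] * inv_freq[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)  # interleave the rotated pairs back into head_dim
```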

## Data Handling

- Training data is stored as sharded `.npy` files
- Loader stride is matched to the model's memory length
- The DataLoader is DDP-compatible
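
A minimal sketch of streaming sharded `.npy` tokens (illustrative; the real loader in `model_core/` also handles the stride/memory-length alignment mentioned above):

```python
import glob

import numpy as np
import torch


class ShardedTokenLoader:
    """Stream (x, y) next-token batches from a directory of .npy token shards."""

    def __init__(self, shard_dir: str, B: int, T: int, rank: int = 0, world_size: int = 1):
        self.shards = sorted(glob.glob(f"{shard_dir}/*.npy"))
        self.B, self.T = B, T
        self.rank, self.world_size = rank, world_size
        self.shard_idx = 0
        self.tokens = np.load(self.shards[self.shard_idx])
        self.pos = B * T * rank

    def next_batch(self):
        span = self.B * self.T + 1
        if self.pos + span > len(self.tokens):            # shard exhausted: advance
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self.tokens = np.load(self.shards[self.shard_idx])
            self.pos = self.B * self.T * self.rank
        buf = torch.from_numpy(self.tokens[self.pos : self.pos + span].astype(np.int64))
        x = buf[:-1].view(self.B, self.T)                 # inputs
        y = buf[1:].view(self.B, self.T)                  # next-token targets
        self.pos += self.B * self.T * self.world_size     # skip the other ranks' slices
        return x, y
```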

## Install Dependencies

```bash
pip install -r requirements.txt
```

Make sure your installed PyTorch build matches your local CUDA version and GPU.

## Reference

Wu et al., *Memorizing Transformers*, ICLR 2022.
[Paper link](https://arxiv.org/abs/2203.08913)

## Future Work

- Add LoRA support
- Integrate with the Hugging Face `transformers` API
- Add benchmarking on other datasets (e.g. LAMBADA, PIQA)

Built with ❤️ by abhinavv3