---
license: apache-2.0
datasets:
- abhinavv3/edu_fineweb10B_sharded_50shards
language:
- en
pipeline_tag: text-generation
tags:
- text-generation
- transformer
---

# 🧠 GPT with Modified Memorizing Transformer

An extended GPT-style 118M-parameter model that integrates the key ideas from **"Memorizing Transformers" (Wu et al., 2022)** with practical enhancements such as Grouped Query Attention, KNN-based memory lookup, RoPE, and XL-style memory recurrence.

This model is designed for scalable training, long-context understanding, and efficient memory usage.

---

**Key Modifications from the Original Paper:**

1) Replaced the default positional encoding with Rotary Positional Embeddings (RoPE)
2) Altered the attention mechanism to use Grouped Query Attention
3) Customized the DataLoader to support sharded datasets and data parallelism
4) Implemented Mixed Precision Training along with Distributed Data Parallel (DDP) support
5) Tweaked several training and model hyperparameters for better adaptability

## 🔬 Key Features

- ✅ **Grouped Query Attention (GQA)** – Groups query heads to share key/value heads, saving memory and speeding up attention
- ✅ **KNN Memory** – A learnable mechanism to retrieve past activations via nearest-neighbor search
- ✅ **XL-style Attention** – Adds recurrence to the attention stack, improving long-sequence learning
- ✅ **Rotary Positional Encoding (RoPE)** – Replaces standard sin-cos encoding for better extrapolation
- ✅ **Memory Lifespan & Clearing** – Custom mechanisms to manage token memory duration
- ✅ **Sharded Dataset Loader** – Efficient `.npy`-based streaming for large datasets
- ✅ **Mixed Precision + DDP Training** – Scalable multi-GPU support using `torchrun` and `torch.autocast`

---

## 📁 Project Structure

```bash
MEM_TRANSFORMER/
├── configs/
│   └── config.json          # Model + training hyperparameters
│
├── data/
│   ├── edu_fineweb/         # Token-sharded training data
│   │   ├── train_000001.npy
│   │   ├── train_000002.npy
│   │   └── test_000001.npy
│   ├── hellaswag/
│   │   └── hellaswag_val.jsonl
│   └── fineweb.py           # Sharding logic with memory-aligned sequence control
│
├── model_core/
│   ├── __init__.py
│   ├── attention.py         # Grouped Query Attention, KNN & XL attention logic; RoPE implementation
│   ├── model.py             # Transformer model with memory and RoPE support
│   ├── dataloader.py        # Memory-aware DataLoader
│   └── training.py          # train_memgpt function
│
├── scripts/
│   ├── train.py             # Training script (DDP-compatible)
│   ├── evaluate.py          # Evaluation on benchmarks
│   └── generate.py          # Text generation from trained model
│
├── evaluation/
│   ├── __init__.py
│   ├── hellaswag.py         # HellaSwag data loader
│   └── val_hellaswag.py     # Evaluation logic with loss-based scoring
│
├── logs/
│   ├── log.txt              # Training logs
│   └── model_*.pt           # Checkpoints
│
├── .gitignore
├── README.md
└── requirements.txt
```

---

## ⚙️ Configuration

Edit `configs/config.json` to change model or training settings.

<details>
<summary>Example config</summary>

```json
{
  "model": {
    "block_size": 1024,
    "vocab_size": 50304,
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "n_kv_head": 4,
    "max_knn_memories": 81920
  },
  "training": {
    "max_steps": 19073,
    "log_dir": "log",
    "total_batch_size": 2048,
    "B": 64,
    "T": 1024,
    "max_lr": 0.0006,
    "min_lr": 0.00006,
    "warmup_steps": 715,
    "weight_decay": 0.1,
    "learning_rate": 0.00006,
    "learning_rate": 0.0006
  }
}
```
</details>
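
Since the config is plain JSON, it can be read with the standard library. A minimal loading sketch (the `GPTConfig` dataclass below is purely illustrative, not the repo's actual class; the field names mirror the example config above):

```python
import json
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Field names mirror the "model" block of configs/config.json
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    n_kv_head: int = 4
    max_knn_memories: int = 81920

with open("configs/config.json") as f:
    cfg = json.load(f)

model_cfg = GPTConfig(**cfg["model"])   # model hyperparameters as a typed object
train_cfg = cfg["training"]             # training settings as a plain dict
print(model_cfg.n_head, train_cfg["max_steps"])
```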

## 🚀 Training

Single GPU: `python scripts/train.py`

Multi-GPU (DDP): `torchrun --nproc_per_node=NUM_GPUS scripts/train.py`
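
For orientation, this is roughly how the mixed-precision + DDP pieces fit together under `torchrun`. It is only a sketch with a stand-in linear model and fake data, not the repo's `scripts/train.py` or `train_memgpt`:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets WORLD_SIZE / LOCAL_RANK; fall back to single-process defaults
    ddp = int(os.environ.get("WORLD_SIZE", "1")) > 1
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    if ddp:
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)

    # Stand-in model: a single linear layer in place of the memorizing transformer
    model = nn.Linear(768, 50304).to(device)
    if ddp:
        model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

    for step in range(10):
        x = torch.randn(8, 768, device=device)            # fake batch of "hidden states"
        y = torch.randint(0, 50304, (8,), device=device)  # fake next-token targets
        # Mixed precision: run the forward/backward math in bfloat16 where safe
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    if ddp:
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```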

## 📊 Evaluation

Evaluate on the HellaSwag benchmark:

```bash
python scripts/evaluate.py
```

Requires:

- `data/hellaswag/hellaswag_val.jsonl`
- Model checkpoint(s) in `logs/`

Scoring is based on masked token loss across the multiple-choice completions.
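In other words: run each candidate completion through the model, average the cross-entropy only over the completion tokens (the context is masked out), and pick the lowest-loss ending. A sketch of that rule, not the repo's `evaluation/val_hellaswag.py`, assuming the model returns logits of shape `(batch, seq, vocab)`:

```python
import torch
import torch.nn.functional as F

def pick_ending(model, ctx_tokens, ending_tokens_list):
    """Return the index of the ending whose tokens get the lowest average loss.

    ctx_tokens: 1-D LongTensor holding the shared context.
    ending_tokens_list: list of 1-D LongTensors, one per candidate ending.
    """
    losses = []
    for ending in ending_tokens_list:
        tokens = torch.cat([ctx_tokens, ending]).unsqueeze(0)   # (1, T)
        logits = model(tokens)                                  # (1, T, vocab) assumed
        shift_logits = logits[:, :-1, :]                        # predict token t+1 from t
        shift_targets = tokens[:, 1:]
        per_token = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_targets.reshape(-1),
            reduction="none",
        ).reshape(shift_targets.shape)
        # Mask: average the loss over the ending tokens only, not the context
        mask = torch.zeros_like(shift_targets, dtype=torch.bool)
        mask[:, ctx_tokens.numel() - 1 :] = True
        losses.append(per_token[mask].mean())
    return int(torch.stack(losses).argmin())
```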

## 🧠 Attention Mechanism Deep Dive

<details>
<summary>Grouped Query Attention (GQA)</summary>

- `n_head` = total query heads
- `n_kv_head` = shared key/value heads
- Reduces compute overhead for large models by grouping query heads to reuse K/V

</details>
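
As a rough illustration of the grouping described above (shapes and the `repeat_interleave` approach are illustrative only; the actual implementation lives in `model_core/attention.py`):

```python
import torch
import torch.nn.functional as F

B, T, n_head, n_kv_head, head_dim = 2, 16, 12, 4, 64

q = torch.randn(B, n_head, T, head_dim)      # one query per head
k = torch.randn(B, n_kv_head, T, head_dim)   # fewer key heads...
v = torch.randn(B, n_kv_head, T, head_dim)   # ...and value heads

# Each group of n_head // n_kv_head query heads shares one K/V head
groups = n_head // n_kv_head
k = k.repeat_interleave(groups, dim=1)       # (B, n_head, T, head_dim)
v = v.repeat_interleave(groups, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                             # torch.Size([2, 12, 16, 64])
```

With `n_head = 12` and `n_kv_head = 4` (the values in the example config), every three query heads reuse the same K/V head, shrinking the K/V projections and cache by roughly 3×.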

<details>
<summary>KNN Memory Retrieval</summary>

- Maintains a memory of past key vectors (max: 81920 tokens)
- Fast KNN lookup with grouped projections
- Integrated into the attention flow in `model_core/attention.py`

</details>
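
The retrieval step can be pictured with a brute-force toy version like the one below. The class name and API are made up for illustration, and a real implementation would typically use an approximate-nearest-neighbor index rather than a dense matmul:

```python
import torch

class ToyKNNMemory:
    """Brute-force stand-in: stores past (key, value) pairs and returns the
    top-k closest keys per query by inner product."""

    def __init__(self, max_memories: int, dim: int):
        self.max_memories = max_memories
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)

    def add(self, k: torch.Tensor, v: torch.Tensor):
        # Append new memories, dropping the oldest beyond the cap
        self.keys = torch.cat([self.keys, k])[-self.max_memories:]
        self.values = torch.cat([self.values, v])[-self.max_memories:]

    def search(self, q: torch.Tensor, topk: int = 32):
        sims = q @ self.keys.T                       # (n_queries, n_memories)
        idx = sims.topk(min(topk, self.keys.size(0)), dim=-1).indices
        return self.keys[idx], self.values[idx]      # (n_queries, topk, dim)

mem = ToyKNNMemory(max_memories=81920, dim=64)
mem.add(torch.randn(128, 64), torch.randn(128, 64))
k_ret, v_ret = mem.search(torch.randn(16, 64), topk=8)
print(k_ret.shape)   # torch.Size([16, 8, 64])
```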

<details>
<summary>XL-style Recurrence</summary>

- Recurrence between attention blocks
- Memory cache updated at each step
- Custom clearing logic helps avoid stale activations

</details>
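
Conceptually, the recurrence amounts to caching a detached slice of each layer's activations and prepending it to the next segment, roughly as sketched here (function names and the `mem_len` value are illustrative, not taken from the repo):

```python
import torch

mem_len = 512
xl_memory = None   # cached hidden states from earlier segments (per layer in practice)

def attend_with_memory(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (B, T, C) activations for the current segment."""
    global xl_memory
    # Keys/values see [old memory | current segment]; no gradient flows into the memory
    context = hidden if xl_memory is None else torch.cat([xl_memory, hidden], dim=1)
    # ... attention over `context` would go here ...
    # Keep only the most recent mem_len positions, detached from the graph
    xl_memory = context[:, -mem_len:].detach()
    return hidden

def clear_memory():
    """Called e.g. at document boundaries so attention never sees stale activations."""
    global xl_memory
    xl_memory = None

x = torch.randn(2, 1024, 768)
attend_with_memory(x)
print(xl_memory.shape)   # torch.Size([2, 512, 768])
```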

<details>
<summary>Rotary Positional Encoding (RoPE)</summary>

- Replaces standard sinusoidal encoding
- Better generalization on long contexts
- Found in `model_core/attention.py`

</details>
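
For reference, a compact RoPE sketch in the common "rotate-half" formulation; the repo's version in `model_core/attention.py` may differ in convention and caching:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (..., seq_len, head_dim)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 12, 16, 64)    # (batch, heads, seq, head_dim)
print(rope(q).shape)              # torch.Size([2, 12, 16, 64])
```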

## 🧩 Data Handling

- Training data is stored as sharded `.npy` files
- Matching stride/memory length logic
- DDP-compatible DataLoader
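
A stripped-down sketch of the sharding/striding pattern; the real loader is `model_core/dataloader.py` and additionally handles memory alignment, so the class and method names below are illustrative only:

```python
import numpy as np
import torch

class ToyShardedLoader:
    """Streams token shards saved as .npy files; each DDP rank reads its own
    offset and all ranks advance by world_size * B * T per step."""

    def __init__(self, shard_paths, B, T, rank=0, world_size=1):
        self.shards, self.B, self.T = list(shard_paths), B, T
        self.rank, self.world_size = rank, world_size
        self.shard_idx = 0
        self._load(self.shard_idx)

    def _load(self, idx):
        self.tokens = torch.from_numpy(np.load(self.shards[idx]).astype(np.int64))
        self.pos = self.rank * self.B * self.T   # each rank starts at its own slice

    def next_batch(self):
        B, T = self.B, self.T
        if self.pos + B * T + 1 > len(self.tokens):
            # Shard exhausted for this rank: rotate to the next shard
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self._load(self.shard_idx)
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)                  # inputs
        y = buf[1:].view(B, T)                   # next-token targets
        self.pos += B * T * self.world_size      # skip over the other ranks' slices
        return x, y
```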

## 📦 Install Dependencies

```bash
pip install -r requirements.txt
```

Ensure that your PyTorch and CUDA versions match your local GPU setup.

## 📚 Reference

Wu et al., "Memorizing Transformers," ICLR 2022.
[Paper link](https://arxiv.org/abs/2203.08913)

## 💡 Future Work

- Add LoRA support
- Integrate with the Hugging Face `transformers` API
- Add benchmarking on other datasets (e.g. LAMBADA, PIQA)

Built with ❤️ by abhinavv3