---
license: apache-2.0
datasets:
- abhinavv3/edu_fineweb10B_sharded_50shards
language:
- en
pipeline_tag: text-generation
tags:
- text-generation
- transformer
---
# 🧠 GPT with Modified Memorizing Transformer
An extended GPT-style 118M-parameter model that integrates the key ideas of **"Memorizing Transformers" (Wu et al., 2022)** with practical enhancements such as Grouped Query Attention, KNN-based memory lookup, Rotary Positional Embeddings (RoPE), and XL-style memory recurrence.
This model is designed for scalable training, long-context understanding, and efficient memory usage.
---
**Key Modifications from the Original Paper:**

1) Replaced the default positional encoding with Rotary Positional Embeddings (RoPE)
2) Altered the attention mechanism to use Grouped Query Attention
3) Customized the DataLoader to support sharded datasets and data parallelism
4) Implemented Mixed Precision Training along with Distributed Data Parallel (DDP) support
5) Tweaked several training and model hyperparameters for better adaptability
## πŸ”¬ Key Features
- βœ… **Grouped Query Attention (GQA)** β€” Groups query heads to share key/value heads, saving memory and speeding up attention
- βœ… **KNN Memory** β€” A learnable mechanism to retrieve past activations via nearest-neighbor search
- βœ… **XL-style Attention** β€” Adds recurrence to the attention stack, improving long-sequence learning
- βœ… **Rotary Positional Encoding (RoPE)** β€” Replaces standard sin-cos encoding for better extrapolation
- βœ… **Memory Lifespan & Clearing** β€” Custom mechanisms to manage token memory duration
- βœ… **Sharded Dataset Loader** β€” Efficient `.npy`-based streaming for large datasets
- βœ… **Mixed Precision + DDP Training** β€” Scalable multi-GPU support using `torchrun` and `torch.autocast`
---
## πŸ“ Project Structure
```bash
MEM_TRANSFORMER/
β”œβ”€β”€ configs/
β”‚ └── config.json # Model + training hyperparameters
β”‚
β”œβ”€β”€ data/
β”‚ β”œβ”€β”€ edu_fineweb/ # Token-sharded training data
β”‚ β”‚ β”œβ”€β”€ train_000001.npy
β”‚ β”‚ β”œβ”€β”€ train_000002.npy
β”‚ β”‚ └── test_000001.npy
β”‚ β”œβ”€β”€ hellaswag/
β”‚ β”‚ └── hellaswag_val.jsonl
β”‚ └── fineweb.py # Sharding logic with memory-aligned sequence control
β”‚
β”œβ”€β”€ model_core/
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ attention.py # Grouped Query Attention, KNN & XL attention logic, Rotary Positional Encoding
β”‚ β”œβ”€β”€ model.py # Transformer model with memory and RoPE support
β”‚ β”œβ”€β”€ dataloader.py # Memory-aware DataLoader
β”‚ └── training.py # train_memgpt function
β”‚
β”œβ”€β”€ scripts/
β”‚ β”œβ”€β”€ train.py # Training script (DDP-compatible)
β”‚ β”œβ”€β”€ evaluate.py # Evaluation on benchmarks
β”‚ └── generate.py # Text generation from trained model
β”‚
β”œβ”€β”€ evaluation/
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ hellaswag.py # HellaSwag data loader
β”‚ └── val_hellaswag.py # Evaluation logic with loss-based scoring
β”‚
β”œβ”€β”€ logs/
β”‚ β”œβ”€β”€ log.txt # Training logs
β”‚ └── model_*.pt # Checkpoints
β”‚
β”œβ”€β”€ .gitignore
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
```
---
## βš™οΈ Configuration
Edit `configs/config.json` to change model or training settings.
<details>
<summary>Example config</summary>

```json
{
  "model": {
    "block_size": 1024,
    "vocab_size": 50304,
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "n_kv_head": 4,
    "max_knn_memories": 81920
  },
  "training": {
    "max_steps": 19073,
    "log_dir": "log",
    "total_batch_size": 2048,
    "B": 64,
    "T": 1024,
    "max_lr": 0.0006,
    "min_lr": 0.00006,
    "warmup_steps": 715,
    "weight_decay": 0.1,
    "learning_rate": 0.0006
  }
}
```
</details>
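Since the config is plain JSON, it can be loaded with the standard library. Below is a minimal sketch of reading it into a typed object; the `ModelConfig` dataclass is illustrative only and not necessarily how the repo's own code consumes the file.

```python
import json
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Field names mirror the "model" section of configs/config.json
    block_size: int
    vocab_size: int
    n_layer: int
    n_head: int
    n_embd: int
    n_kv_head: int
    max_knn_memories: int

with open("configs/config.json") as f:
    cfg = json.load(f)

model_cfg = ModelConfig(**cfg["model"])
train_cfg = cfg["training"]                 # plain dict: B, T, max_lr, ...
print(model_cfg.n_head, train_cfg["max_steps"])
```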
## πŸš€ Training

▢️ Single GPU: `python scripts/train.py`

πŸ” Multi-GPU (DDP): `torchrun --nproc_per_node=NUM_GPUS scripts/train.py`
## πŸ“Š Evaluation

Evaluate on the HellaSwag benchmark:
```bash
python scripts/evaluate.py
```
Requires:

- `data/hellaswag/hellaswag_val.jsonl`
- model checkpoint(s) in `logs/`

Scoring is based on masked-token loss across the multiple-choice completions.
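Concretely, loss-based scoring tokenizes each candidate ending after the shared context, computes cross-entropy only over the ending tokens (the context is masked out), and picks the ending with the lowest mean loss. A hedged sketch of that rule follows; it assumes the model returns logits and is not necessarily the exact code in `evaluation/val_hellaswag.py`.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_example(model, ctx: torch.Tensor, endings: list) -> int:
    """Return the index of the ending with the lowest mean loss on its own tokens."""
    losses = []
    for ending in endings:
        tokens = torch.cat([ctx, ending]).unsqueeze(0)        # (1, T)
        logits = model(tokens)                                # (1, T, vocab)
        shift_logits = logits[:, :-1, :]                      # predict t+1 from t
        shift_targets = tokens[:, 1:]
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_targets.reshape(-1),
            reduction="none",
        )
        losses.append(loss[-ending.numel():].mean())          # ending tokens only
    return int(torch.stack(losses).argmin())
```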
## 🧠 Attention Mechanism Deep Dive

<details>
<summary>Grouped Query Attention (GQA)</summary>

- `n_head` = total query heads
- `n_kv_head` = shared key/value heads
- Reduces compute overhead for large models by grouping query heads to reuse K/V (sketched below)

</details>
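A minimal sketch of the key/value sharing behind GQA; head counts follow the example config, and the function is illustrative rather than the exact code in `model_core/attention.py`.

```python
import torch

def grouped_query_attention(q, k, v, n_head=12, n_kv_head=4):
    """q: (B, n_head, T, d); k, v: (B, n_kv_head, T, d).
    Each group of n_head // n_kv_head query heads shares one K/V head."""
    group_size = n_head // n_kv_head
    # Expand K/V so every query head attends over its group's shared K/V head.
    k = k.repeat_interleave(group_size, dim=1)        # (B, n_head, T, d)
    v = v.repeat_interleave(group_size, dim=1)
    att = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    causal = torch.tril(torch.ones(q.size(-2), q.size(-2), device=q.device))
    att = att.masked_fill(causal == 0, float("-inf"))
    return att.softmax(dim=-1) @ v

# Example shapes: batch 2, sequence 8, head dim 64
q = torch.randn(2, 12, 8, 64)
k = torch.randn(2, 4, 8, 64)
v = torch.randn(2, 4, 8, 64)
out = grouped_query_attention(q, k, v)                # (2, 12, 8, 64)
```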
<details>
<summary>KNN Memory Retrieval</summary>

- Maintains a memory of past key vectors (max: 81,920 tokens, `max_knn_memories` in the config)
- Fast KNN lookup with grouped projections
- Integrated into the attention flow in `model_core/attention.py` (sketched below)

</details>
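A rough sketch of the retrieval idea, using a plain tensor buffer and exact top-k search by inner product; the class name is illustrative, and the real memory in `model_core/attention.py` may use a different index or approximate search. As in the Memorizing Transformers setup, the sketch stores key/value pairs and caps capacity by dropping the oldest entries.

```python
import torch

class KNNMemory:
    """Toy KNN memory: stores (key, value) pairs and returns the top-k
    nearest stored keys/values for each query by inner product."""

    def __init__(self, dim: int, max_memories: int = 81920, topk: int = 32):
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)
        self.max_memories = max_memories
        self.topk = topk

    def add(self, k: torch.Tensor, v: torch.Tensor):
        # Append new memories, keeping only the most recent max_memories entries.
        self.keys = torch.cat([self.keys, k])[-self.max_memories:]
        self.values = torch.cat([self.values, v])[-self.max_memories:]

    def clear(self):
        self.keys = self.keys[:0]
        self.values = self.values[:0]

    def search(self, q: torch.Tensor):
        """q: (T, dim) -> retrieved keys/values of shape (T, k, dim)."""
        sims = q @ self.keys.t()                      # (T, num_memories)
        k = min(self.topk, self.keys.size(0))
        idx = sims.topk(k, dim=-1).indices            # (T, k)
        return self.keys[idx], self.values[idx]
```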
<details>
<summary>XL-style Recurrence</summary>

- Recurrence between attention blocks
- Memory cache updated at each step
- Custom clearing logic helps avoid stale activations (sketched below)

</details>
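XL-style recurrence keeps key/value states from the previous segment, detaches them from the graph, and prepends them to the current segment so attention can look back across the segment boundary. A simplified per-layer cache sketch (class and method names are illustrative, not the repo's):

```python
import torch

class XLMemoryCache:
    """Keeps the last `mem_len` detached key/value states for one layer and
    prepends them to the current segment before attention."""

    def __init__(self, mem_len: int = 1024):
        self.mem_len = mem_len
        self.k_mem = None
        self.v_mem = None

    def extend(self, k: torch.Tensor, v: torch.Tensor):
        """k, v: (B, n_kv_head, T, d). Returns K/V with cached context prepended."""
        if self.k_mem is not None:
            k_full = torch.cat([self.k_mem, k], dim=2)
            v_full = torch.cat([self.v_mem, v], dim=2)
        else:
            k_full, v_full = k, v
        # Update the cache: keep only the most recent mem_len positions,
        # detached so gradients never flow into previous segments.
        self.k_mem = k_full[:, :, -self.mem_len:].detach()
        self.v_mem = v_full[:, :, -self.mem_len:].detach()
        return k_full, v_full

    def clear(self):
        # Drop stale activations, e.g. at document or epoch boundaries.
        self.k_mem = self.v_mem = None
```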
<details>
<summary>Rotary Positional Encoding (RoPE)</summary>

- Replaces standard sinusoidal encoding
- Better generalization on long contexts
- Found in `model_core/attention.py` (sketched below)

</details>
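A compact sketch of the standard RoPE rotation applied to query/key tensors; the implementation in `model_core/attention.py` may differ in details such as the base frequency or channel pairing.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional encoding to x of shape (B, n_head, T, d), d even.
    Channel pairs are rotated by an angle that grows with the token position."""
    B, H, T, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 12, 16, 64)
q_rot = rope(q)          # same shape, now position-aware
```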
## 🧩 Data Handling

- Training data is stored as sharded `.npy` files
- Stride and memory length are kept matched
- DDP-compatible DataLoader (see the sketch below)
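A minimal sketch of how a sharded `.npy` loader can stream fixed-length `(B, T)` batches while giving each DDP rank a disjoint slice; shard naming follows the `data/edu_fineweb` layout above, the class name is illustrative, and the real `model_core/dataloader.py` adds memory-aware handling on top of this.

```python
import glob
import numpy as np
import torch

class ShardedLoader:
    """Streams token shards (train_*.npy) as (B, T) input/target batches."""

    def __init__(self, data_dir, B, T, rank=0, world_size=1, split="train"):
        self.B, self.T = B, T
        self.rank, self.world_size = rank, world_size
        self.shards = sorted(glob.glob(f"{data_dir}/{split}_*.npy"))
        self.shard_idx = 0
        self._load_shard()

    def _load_shard(self):
        self.tokens = np.load(self.shards[self.shard_idx], mmap_mode="r")
        # Each rank starts at its own offset inside the shard.
        self.pos = self.B * self.T * self.rank

    def next_batch(self):
        span = self.B * self.T + 1
        if self.pos + span > len(self.tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self._load_shard()
        buf = torch.from_numpy(self.tokens[self.pos:self.pos + span].astype(np.int64))
        x = buf[:-1].view(self.B, self.T)       # inputs
        y = buf[1:].view(self.B, self.T)        # next-token targets
        # Advance past all ranks' batches so ranks never read the same window.
        self.pos += self.B * self.T * self.world_size
        return x, y
```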
## πŸ“¦ Install Dependencies

```bash
pip install -r requirements.txt
```
Ensure that PyTorch and CUDA versions match your local GPU.
## πŸ”— Reference

Wu et al., *Memorizing Transformers*, ICLR 2022. [Paper link](https://arxiv.org/abs/2203.08913)
## πŸ’‘ Future Work

- Add LoRA support
- Integrate with the Hugging Face `transformers` API
- Benchmark on additional datasets (e.g. LAMBADA, PIQA)

Built with ❀️ by abhinavv3