---
language: en
license: mit
library_name: pytorch
tags:
- transformer
- adapters
- continual-learning
- dual-memory
- minimal
- educational
- nlp
- language-model
- online-learning
datasets:
- text8
- tinyshakespeare
model_name: "Microformer"
model_type: "stacked-adapter-transformer"
pipeline_tag: text-generation
widget:
- text: "Describe the internet"
- text: "Who is Buck?"
- text: "Call me Ishmael."
---

# Microformer

**Microformer** is a minimal, educational-scale transformer language model built from scratch in PyTorch.  
Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and OpenAI’s GPT-1, Microformer is designed for learning, experimentation, and prototyping on lightweight datasets like [text8](https://mattmahoney.net/dc/textdata.html) or Tiny Shakespeare.

---

## Features

- Decoder-only transformer (GPT-style) architecture
- **Stacked adapters per layer for dual-memory:**
    - **Long-term adapters** (for corpus/knowledge facts)
    - **Session adapters** (for rapid, online, user/session-specific learning)
- Choice of character-level **or** subword/BPE tokenization (configurable)
- Learnable positional encoding
- Multi-head self-attention
- Configurable depth, embedding size, sequence length, and attention heads
- Simple end-to-end pipeline: preprocessing, training, and text generation
- Modular, readable code ideal for educational use and tinkering
- Temperature and multinomial sampling in text generation
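
The sampling step is small enough to show inline. Below is a minimal sketch of temperature-scaled multinomial sampling from a vector of next-token logits; the function name and variables are illustrative, not taken from `generate_text.py`.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample one token id from a [vocab_size] logits vector."""
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    probs = torch.softmax(logits / max(temperature, 1e-6), dim=-1)
    # Multinomial sampling (rather than greedy argmax) keeps generation varied.
    return torch.multinomial(probs, num_samples=1).item()
```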
					
					
						

---

## What’s Unique: Stacked Adapters for Dual-Memory Learning

Microformer implements **two adapters in every transformer block**:

- **Long-term adapter:**  
  Trained on your full corpus during batch/corpus training.  
  Stores stable, general “knowledge” (e.g., literary style, factual info).

- **Session adapter:**  
  Starts blank and is trained *on the fly* during chat or interactive teaching.  
  Lets you rapidly “teach” new facts, styles, or user preferences without overwriting core knowledge.

At inference, the outputs of both adapters (plus the core transformer) are combined, giving the model both stable long-term knowledge and flexible, session-specific memory, loosely analogous to long-term and short-term memory.
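
To make the layout concrete, here is a minimal sketch of what a block with stacked adapters can look like. It is illustrative only: the class and attribute names (`Adapter`, `long_term`, `session`) are assumptions for this README, not the actual definitions in `models/model.py`.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, embed_dim: int, adapter_dim: int):
        super().__init__()
        self.down = nn.Linear(embed_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, embed_dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class DualMemoryBlock(nn.Module):
    """Decoder block with a long-term adapter stacked with a session adapter."""
    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int, adapter_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.long_term = Adapter(embed_dim, adapter_dim)  # trained on the corpus
        self.session = Adapter(embed_dim, adapter_dim)    # trained on the fly

    def forward(self, x, attn_mask=None):
        # attn_mask carries the causal mask in a real decoder.
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + a)
        h = self.ff(x)
        # Both adapters contribute residually, so session-time learning adds to,
        # rather than overwrites, what the long-term adapter has stored.
        h = self.session(self.long_term(h))
        return self.norm2(x + h)
```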
					
					
						

---

## Project Structure

```
microformer/
├── config.py              # Hyperparameters and model settings
├── data/
│   ├── corpus.txt         # Raw training text
│   ├── train.pt           # Preprocessed training tensor (token IDs)
│   ├── val.pt             # Validation tensor (token IDs)
│   ├── vocab.json         # Vocabulary (char or subword, stoi/itos mapping)
│   └── tokenizer.json     # (optional) BPE tokenizer file if using subwords
├── models/
│   └── model.py           # Transformer model definition (Microformer)
├── scripts/
│   ├── prepare_data.py    # Data preprocessing/tokenization
│   ├── train.py           # Training script (trains long-term adapters)
│   ├── generate_text.py   # Inference/generation + online learning (session adapters)
│   └── tokenizer_setup.py # BPE Tokenizer
└── README.md
```

---

## Quickstart
					
					
						

1. **Prepare your corpus**

   Place your text data in `data/corpus.txt`.

2. **Choose your tokenizer:**

- **Character-level (default):**  
  No extra steps needed.

- **BPE/subword (recommended for rich/modern text):**
  ```bash
  python scripts/tokenizer_setup.py --input data/corpus.txt --vocab_size 1000
  ```
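
For reference, a BPE setup script along these lines can be written with the Hugging Face `tokenizers` library. This is a hedged sketch of the idea, not the contents of `scripts/tokenizer_setup.py`; the actual script's options and special tokens may differ.

```python
# Sketch of BPE training with the `tokenizers` library; file names follow
# the project tree (data/corpus.txt, data/tokenizer.json).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["data/corpus.txt"], trainer=trainer)
tokenizer.save("data/tokenizer.json")
```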
					
					
						

3. **Prepare the dataset**

   ```bash
   python scripts/prepare_data.py
   ```
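
Conceptually, this step encodes the corpus into token IDs and writes the train/validation tensors listed under `data/`. A rough character-level sketch of the idea (not the actual `prepare_data.py`, which also needs to handle the BPE path):

```python
# Rough character-level sketch of preprocessing; the split ratio and the
# vocab.json layout are illustrative assumptions.
import json
import torch

with open("data/corpus.txt", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}              # string -> token id
ids = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

split = int(0.9 * len(ids))                               # 90/10 train/val split
torch.save(ids[:split], "data/train.pt")
torch.save(ids[split:], "data/val.pt")

with open("data/vocab.json", "w", encoding="utf-8") as f:
    json.dump({"stoi": stoi, "itos": chars}, f)
```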
					
					
						

4. **Train the model (long-term knowledge)**

   ```bash
   python scripts/train.py
   ```
    - This trains only the **long-term adapters** and core weights (see the sketch below).
    - Session adapters remain untrained (blank) until chat time.
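
One way to express that split in PyTorch is to freeze the session-adapter parameters before building the optimizer. A minimal sketch, assuming session-adapter parameter names contain `"session"` as in the block sketch above (the real `train.py` may do this differently):

```python
import torch
import torch.nn as nn

# Stand-in model that mimics the naming convention of the block sketch above.
model = nn.ModuleDict({
    "core": nn.Linear(128, 128),
    "long_term": nn.Linear(128, 32),
    "session": nn.Linear(128, 32),
})

# Corpus training: update core weights and long-term adapters only.
for name, param in model.named_parameters():
    param.requires_grad = "session" not in name

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4
)
```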
					
					
						

5. **Generate text and teach interactively (session memory)**

   ```bash
   python scripts/generate_text.py
   ```
    - Loads your trained model.
    - Prompts for a seed string and temperature.
    - **Allows you to “teach” new facts on the fly!**
    - New knowledge is stored in the session adapters and does *not* overwrite long-term knowledge (a sketch of this online update follows).
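
The teaching step can be thought of as a tiny fine-tune in which only the session adapters receive gradients. A hedged sketch of that update loop; the function name, the `"session"` naming filter, and the model's call signature are assumptions, not the actual `generate_text.py` API:

```python
import torch
import torch.nn.functional as F

def teach(model, token_ids: torch.Tensor, steps: int = 5, lr: float = 1e-3):
    """Take a few gradient steps on the session adapters only.

    `token_ids` is a 1-D LongTensor encoding prompt + desired answer.
    """
    session_params = [p for n, p in model.named_parameters() if "session" in n]
    for p in model.parameters():
        p.requires_grad = False
    for p in session_params:
        p.requires_grad = True

    opt = torch.optim.AdamW(session_params, lr=lr)
    x, y = token_ids[:-1].unsqueeze(0), token_ids[1:].unsqueeze(0)
    for _ in range(steps):
        logits = model(x)  # assumed shape: [1, seq_len, vocab_size]
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
```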
					
					
						

---

## Example Config (`config.py`)

```python
EMBED_DIM = 128
NUM_HEADS = 4
NUM_LAYERS = 2
FF_DIM = 256
MAX_SEQ_LEN = 128
BATCH_SIZE = 32
ADAPTER_DIM = 32   # Used for both long-term and session adapters
VOCAB_SIZE = 100   # Set automatically from tokenizer/vocab
```
					
					
						

---

## Using the Dual-Memory System

- **Long-term adapters:**  
  Learned during `train.py`; they persist between runs.

- **Session adapters:**  
  Learned during interactive chat in `generate_text.py`; they can optionally be reset between users/sessions (see the sketch below).

- **Teach new facts by entering a prompt and providing your ideal answer.**  
  The model will “remember” this during the session, even if it wasn’t present in the training corpus.
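
Resetting between sessions can be as simple as re-initializing the session adapters. A minimal sketch, again assuming the `down`/`up` bottleneck layout and the `"session"` naming from the block sketch above:

```python
import torch.nn as nn

def reset_session_memory(model: nn.Module) -> None:
    """Re-initialize session adapters so a new user/session starts blank."""
    for name, module in model.named_modules():
        if "session" not in name or not isinstance(module, nn.Linear):
            continue
        if name.endswith("up"):
            # Zero the up-projection so the residual adapter starts as a no-op...
            nn.init.zeros_(module.weight)
            nn.init.zeros_(module.bias)
        else:
            # ...and keep the down-projection small but non-zero so gradients flow.
            nn.init.normal_(module.weight, std=0.02)
            nn.init.zeros_(module.bias)
```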
					
					
						

---

## Customization & Ideas

- Use BPE/subword tokenization for more expressive modeling (recommended for non-trivial datasets)
- Add more adapters or experiment with gating, e.g., blending adapters by context (a tiny sketch follows this list)
- Combine with key-value retrieval or a memory buffer for truly persistent “user memory”
- Visualize training with TensorBoard or wandb
- Tinker with alternative attention or memory mechanisms
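
For the gating idea, one possible shape is a learned, input-dependent blend of the two adapter outputs instead of simple stacking. Purely illustrative; nothing like this exists in the repo:

```python
import torch
import torch.nn as nn

class GatedAdapterMix(nn.Module):
    """Blend long-term and session adapter outputs with a per-token gate."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.gate = nn.Linear(embed_dim, 1)

    def forward(self, h, long_out, sess_out):
        g = torch.sigmoid(self.gate(h))          # [batch, seq, 1]
        return h + g * long_out + (1.0 - g) * sess_out
```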
					
					
						

---

## Requirements

- Python 3.8+
- [PyTorch](https://pytorch.org/)
- [tokenizers](https://github.com/huggingface/tokenizers) (for BPE/subword)

Install dependencies with:
```bash
pip install torch tokenizers
```

---

## Credits

- Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and [minGPT](https://github.com/karpathy/minGPT) by Andrej Karpathy
- Adapter and continual-learning inspiration from recent NLP research ([Houlsby et al., 2019](https://arxiv.org/abs/1902.00751))
- Built using concepts from the original [GPT-1 paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

---

## License

MIT License – Use freely for learning and experimentation.

---

**Happy tinkering with dual-memory transformers!**