Small Language Model (SLM) from Scratch – Explained
This notebook builds, trains, and runs a small Transformer-based language model (mini GPT) on a movie scripts dataset.
Written for someone who knows basic ML/DL but is new to LLMs.
1. Dataset & Preprocessing
from datasets import load_dataset
import tiktoken, numpy as np
# Load dataset
ds = load_dataset("IsmaelMousa/movies")
# Split into train/val
ds = ds['train'].train_test_split(test_size=0.1, seed=42)
# Tokenizer (GPT-2)
enc = tiktoken.get_encoding("gpt2")
def process(example):
    ids = enc.encode_ordinary(example['Script'])
    return {'ids': ids, 'len': len(ids)}
# Tokenize
tokenized = ds.map(process, remove_columns=['Name','Script'])
Dataset = movie scripts → tokenized into IDs → saved in .bin files for fast training.
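The saving step itself isn't shown above. Below is a minimal sketch of how the token IDs might be written to and read back from .bin files with numpy memmaps; the filenames and the memmap approach are assumptions (in the style of nanoGPT-like pipelines), not the notebook's exact code:

# Sketch only: write each split's token IDs into one flat uint16 array on disk.
# uint16 suffices because the GPT-2 vocabulary has ~50k tokens (< 65536).
for split, dset in tokenized.items():
    arr_len = int(np.sum(dset['len']))
    arr = np.memmap(f"{split}.bin", dtype=np.uint16, mode='w+', shape=(arr_len,))
    pos = 0
    for example in dset:
        ids = np.array(example['ids'], dtype=np.uint16)
        arr[pos:pos + len(ids)] = ids
        pos += len(ids)
    arr.flush()

# get_batch() in the next section can then read the arrays without loading them into RAM:
train_data = np.memmap("train.bin", dtype=np.uint16, mode='r')
val_data = np.memmap("test.bin", dtype=np.uint16, mode='r')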
2. Create Input-Output Batches
The model trains on fixed-length chunks (`block_size`) of tokens. Each batch contains an input sequence `X` and a target sequence `Y`, where `Y` is `X` shifted by one token (next-token labels).
def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Sample random starting offsets, one per batch element
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    # Targets are the same chunks shifted one position to the right
    y = torch.stack([torch.from_numpy(data[i+1:i+block_size+1].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
This is how we feed training data: chunks of movie script → the model learns to predict the next token.
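An optional sanity check of the shift, assuming `train_data`/`val_data` (e.g. from the memmap sketch above) and the `batch_size`, `block_size`, and `device` settings are already defined:

xb, yb = get_batch('train')
print(xb.shape, yb.shape)     # both (batch_size, block_size)
print(xb[0, :6].tolist())
print(yb[0, :6].tolist())     # same tokens as the line above, shifted left by one position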
3. Model Architecture
The model is a stack of Transformer blocks, similar to GPT-2.
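Every class below reads its hyperparameters from a `config` object. The exact values used in the notebook are not reproduced here; a minimal sketch of what such a config might look like (the field names match the code below, the numbers are only illustrative):

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 128      # context length (illustrative)
    vocab_size: int = 50257    # GPT-2 BPE vocabulary size
    n_layer: int = 6           # number of Transformer blocks (illustrative)
    n_head: int = 6            # attention heads per block (illustrative)
    n_embd: int = 384          # embedding / hidden dimension (illustrative)
    bias: bool = True          # whether Linear/LayerNorm layers use bias terms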
(a) LayerNorm
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
- Normalizes features → stabilizes training.
- Like BatchNorm, but per token, not per batch (see the quick check below).
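A small illustrative check of the "per token" behaviour: every position is normalized independently over the feature dimension.

ln = LayerNorm(ndim=8, bias=True)
x = torch.randn(2, 4, 8)                 # (batch, tokens, features)
out = ln(x)
print(out.mean(dim=-1))                  # ~0 at every token position
print(out.std(dim=-1, unbiased=False))   # ~1 at every token position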
(b) Causal Self-Attention
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)  # QKV projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)      # output projection
        self.n_head = config.n_head
        self.n_embd = config.n_embd
    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape into multiple heads: (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Masked self-attention (causal: no peeking forward)
        att = (q @ k.transpose(-2, -1)) / (C // self.n_head) ** 0.5
        mask = torch.tril(torch.ones(T, T, device=x.device))
        att = att.masked_fill(mask == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # Recombine heads into (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
- Lets each token "attend" to previous tokens.
- Causal masking ensures left-to-right generation (the mask is illustrated below).
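Here is what the mask looks like for a toy sequence of `T = 4` tokens:

T = 4
print(torch.tril(torch.ones(T, T)))
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# Row t keeps scores only for columns <= t (the current and earlier tokens);
# the zero entries are set to -inf before softmax, so future tokens get zero attention weight.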
(c) MLP
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
- Expands hidden dim by 4x, then projects back.
- Adds non-linear transformation.
(d) Transformer Block
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = LayerNorm(config.n_embd, config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln2 = LayerNorm(config.n_embd, config.bias)
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual connection around attention
        x = x + self.mlp(self.ln2(x))   # residual connection around the MLP
        return x
- Core Transformer block = `[Norm → Attention → Residual → Norm → MLP → Residual]` (a quick shape check follows).
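A shape check using the illustrative `GPTConfig` sketched earlier: a block maps a `(batch, tokens, n_embd)` tensor to a tensor of the same shape.

cfg = GPTConfig()
block = Block(cfg)
x = torch.randn(2, 16, cfg.n_embd)   # (batch=2, tokens=16, features=n_embd)
print(block(x).shape)                # torch.Size([2, 16, 384]): same shape in, same shape out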
(e) GPT Model
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config  # kept so generate() can read block_size later
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),   # token embedding
            wpe = nn.Embedding(config.block_size, config.n_embd),   # position embedding
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # weight tying
    def forward(self, idx, targets=None):
        b, t = idx.size()
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(torch.arange(0, t, device=idx.device))
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
- Input tokens → embeddings + positional encoding → Transformer blocks → logits over the vocabulary.
- If `targets` are provided → compute cross-entropy loss.
- Otherwise → just output logits for generation (an illustrative forward pass follows).
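An illustrative forward pass with the sketch `GPTConfig` from above (random tokens, untrained weights):

cfg = GPTConfig()
model = GPT(cfg)
idx = torch.randint(0, cfg.vocab_size, (2, 16))   # batch of 2 sequences, 16 tokens each
logits, loss = model(idx, targets=idx)            # targets passed only to exercise the loss branch
print(logits.shape)   # torch.Size([2, 16, 50257]): one distribution over the vocab per position
print(loss.item())    # roughly ln(50257) ≈ 10.8 for untrained weights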
(f) Generation
@torch.no_grad()  # defined as a method of the GPT class above; no gradients needed at inference
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        # Crop the running context to the last block_size tokens
        idx_cond = idx[:, -self.config.block_size:]
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature   # keep only the last position, scale by temperature
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')   # mask everything outside the top-k
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token
        idx = torch.cat((idx, idx_next), dim=1)             # append and continue
    return idx
- Autoregressively generates tokens.
- Uses `temperature` (controls randomness) and `top_k` (restricts sampling to the k most likely tokens); a toy example of the top-k step follows.
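A toy illustration of the `top_k` filtering step above, with made-up logits and k = 2:

logits = torch.tensor([[2.0, 0.5, 1.0, -1.0]])
v, _ = torch.topk(logits, 2)                  # the two largest logits: 2.0 and 1.0
logits[logits < v[:, [-1]]] = -float('Inf')   # everything below the 2nd largest is masked out
print(F.softmax(logits, dim=-1))              # probability mass only on the two surviving tokens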
4. Training
- Loss: Cross-Entropy (predict next token).
- Optimizer: AdamW (with tuned betas, weight decay).
- Scheduler: Warmup + Cosine Decay.
- Mixed Precision + Gradient Accumulation for efficiency (a loop sketch follows below).
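A minimal sketch of what such a training loop might look like. The hyperparameter values and the exact scheduler are assumptions rather than the notebook's code; only the overall structure (AdamW, warmup + cosine learning rate, autocast + GradScaler, gradient accumulation) mirrors the bullets above. It assumes `model` and `get_batch` from the earlier sections.

import math

max_iters, warmup_iters = 20000, 1000          # illustrative values
max_lr, min_lr = 1e-3, 1e-4
grad_accum_steps = 4
use_amp = torch.cuda.is_available()

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def get_lr(it):
    if it < warmup_iters:                                      # linear warmup
        return max_lr * (it + 1) / warmup_iters
    ratio = (it - warmup_iters) / (max_iters - warmup_iters)   # cosine decay to min_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * ratio))

for it in range(max_iters):
    for group in optimizer.param_groups:
        group['lr'] = get_lr(it)
    for _ in range(grad_accum_steps):                          # gradient accumulation
        x, y = get_batch('train')
        with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=use_amp):
            _, loss = model(x, y)
        scaler.scale(loss / grad_accum_steps).backward()       # mixed-precision backward
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)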
5. Monitoring
import matplotlib.pyplot as plt

plt.plot(train_loss_list, 'g', label='train_loss')
plt.plot(validation_loss_list, 'r', label='validation_loss')
plt.xlabel("Steps - Every 100 epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
- Green = training loss, Red = validation loss.
- Watch for overfitting / underfitting.
Training Metrics
Epoch | Train Loss | Val Loss | Perplexity |
---|---|---|---|
500 | 6.0358 | 6.0601 | 430.1 |
1000 | 5.0690 | 5.1143 | 166.0 |
1500 | 4.3162 | 4.3407 | 76.7 |
2000 | 3.5948 | 3.6099 | 36.9 |
2500 | 3.0460 | 3.0569 | 21.3 |
3000 | 2.7518 | 2.7398 | 15.5 |
3500 | 2.5606 | 2.5574 | 12.9 |
4000 | 2.4583 | 2.4691 | 11.8 |
4500 | 2.3943 | 2.3969 | 11.0 |
5000 | 2.3428 | 2.3513 | 10.5 |
6000 | 2.2141 | 2.2155 | 9.17 |
7000 | 2.1389 | 2.1577 | 8.65 |
8000 | 2.0570 | 2.0703 | 7.93 |
9000 | 2.0062 | 2.0210 | 7.55 |
10000 | 1.9604 | 1.9715 | 7.18 |
12000 | 1.8580 | 1.8924 | 6.64 |
14000 | 1.7954 | 1.8284 | 6.23 |
16000 | 1.7369 | 1.7937 | 5.95 |
18000 | 1.6901 | 1.7314 | 5.65 |
19500 | 1.6594 | 1.7216 | 5.60 |
Validation loss steadily decreases, and perplexity drops from ~430 to ~5.6 over training.
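Perplexity here appears to be the exponential of the validation cross-entropy loss (the standard definition), so the two columns carry the same information on different scales. For example, for the last row:

import math
print(math.exp(1.7216))   # ≈ 5.59, matching the reported perplexity of 5.60 up to rounding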
6. Inference
# Load best model
model = GPT(config)
model.load_state_dict(torch.load("best_model_params.pt", map_location=device))
model.eval()
# Prompt
sentence = "Write a Tarantino-style diner scene with two strangers..."
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(0).to(device)
# Generate (recommended shorter length)
y = model.generate(context, max_new_tokens=300, temperature=0.8, top_k=50)
print(enc.decode(y[0].tolist()))
Note: In the notebook, `max_new_tokens=5000` was used, which may be excessive. For practical testing, use 200–500 tokens.
Summary
- Architecture: GPT-like Transformer (attention + MLP blocks).
- Training: Next-token prediction with AdamW + LR scheduling.
- Evaluation: Loss curves (train vs val).
- Inference: Autoregressive generation with temperature & top-k control.
This is essentially a mini GPT-2 clone, scaled down for small datasets like movie scripts.