🚀 Small Language Model (SLM) from Scratch: Explained

This notebook builds, trains, and runs a small Transformer-based language model (mini GPT) on a movie scripts dataset.
Written for someone who knows basic ML/DL but is new to LLMs.


1. Dataset & Preprocessing

from datasets import load_dataset
import tiktoken, numpy as np

# Load dataset
ds = load_dataset("IsmaelMousa/movies")

# Split into train/val
ds = ds['train'].train_test_split(test_size=0.1, seed=42)

# Tokenizer (GPT-2)
enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example['Script'])
    return {'ids': ids, 'len': len(ids)}

# Tokenize
tokenized = ds.map(process, remove_columns=['Name','Script'])

🔹 Dataset = movie scripts → tokenized into IDs → saved in .bin files for fast training.
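The notebook then packs the token IDs into flat binary files. Below is a minimal sketch of that step, assuming nanoGPT-style uint16 memmap files; the actual filenames and layout in the notebook may differ. Note that train_test_split names the held-out split 'test', which later serves as the validation data.

import numpy as np

# Concatenate each split's token IDs into one flat uint16 array on disk.
# uint16 suffices because the GPT-2 vocabulary (50257 tokens) fits in 16 bits.
for split, dset in tokenized.items():          # splits: 'train' and 'test'
    total_len = np.sum(dset['len'], dtype=np.uint64)
    arr = np.memmap(f'{split}.bin', dtype=np.uint16, mode='w+', shape=(total_len,))
    idx = 0
    for example in dset:
        ids = np.array(example['ids'], dtype=np.uint16)
        arr[idx:idx + len(ids)] = ids
        idx += len(ids)
    arr.flush()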


2. Create Input-Output Batches

The model trains on fixed-length chunks (block_size) of tokens.
Each batch contains input X and target Y sequences, where Y is shifted by 1 (next-token labels).

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i+1:i+block_size+1].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)

🔹 This is how we feed training data: chunks of movie script → model learns to predict the next token.
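get_batch assumes train_data and val_data are the memory-mapped token arrays from step 1. A minimal sketch of wiring that up and drawing one batch (block_size, batch_size, and the file names are illustrative placeholders, not necessarily the notebook's exact values):

import torch
import numpy as np

block_size = 256   # context length (illustrative)
batch_size = 32    # sequences per batch (illustrative)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Memory-map the token files so batches are read lazily, without loading everything into RAM
train_data = np.memmap('train.bin', dtype=np.uint16, mode='r')
val_data = np.memmap('test.bin', dtype=np.uint16, mode='r')  # held-out split used for validation

xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # both (batch_size, block_size); yb is xb shifted left by one token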


3. Model Architecture

The model is a stack of Transformer blocks, similar to GPT-2.

(a) LayerNorm

import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
  • Normalizes features → stabilizes training.
  • Like BatchNorm, but statistics are computed over each token's feature vector rather than over the batch (a quick check against nn.LayerNorm follows).
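A small sanity check, assuming bias=True, showing this hand-rolled LayerNorm matches PyTorch's built-in nn.LayerNorm at initialization:

import torch
import torch.nn as nn

x = torch.randn(2, 8, 16)                 # (batch, tokens, features)
ours = LayerNorm(16, bias=True)
ref = nn.LayerNorm(16, eps=1e-5)          # built-in reference
print(torch.allclose(ours(x), ref(x), atol=1e-6))  # True: same normalization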

(b) Causal Self-Attention

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias) # QKV
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape into multi-heads
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        # Masked self-attention (causal: no peeking forward)
        att = (q @ k.transpose(-2, -1)) / (C // self.n_head)**0.5
        mask = torch.tril(torch.ones(T, T, device=x.device))
        att = att.masked_fill(mask == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v

        # Recombine heads
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
  • Lets each token "attend" to previous tokens.
  • Causal masking ensures left-to-right generation (see the fused-attention equivalent below).
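On PyTorch 2.x the manual mask-and-softmax path above can be swapped for the fused F.scaled_dot_product_attention kernel. This is not necessarily what the notebook does; the toy check below (with made-up shapes) just shows the two paths agree:

import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 8, 16                      # batch, heads, sequence length, head_dim
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Manual masked attention, as in CausalSelfAttention.forward above
att = (q @ k.transpose(-2, -1)) / D**0.5
mask = torch.tril(torch.ones(T, T))
att = att.masked_fill(mask == 0, float('-inf'))
y_manual = F.softmax(att, dim=-1) @ v

# Fused equivalent (PyTorch >= 2.0)
y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(y_manual, y_fused, atol=1e-5))  # True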

(c) MLP

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
  • Expands hidden dim by 4x, then projects back.
  • Adds non-linear transformation.

(d) Transformer Block

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = LayerNorm(config.n_embd, config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln2 = LayerNorm(config.n_embd, config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x)) # Residual
        x = x + self.mlp(self.ln2(x)) # Residual
        return x
  • Core Transformer block = [Norm → Attention → Residual → Norm → MLP → Residual].

(e) GPT Model

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config  # kept so generate() can slice the context to block_size
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd), # token embedding
            wpe = nn.Embedding(config.block_size, config.n_embd), # position embedding
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # weight tying

    def forward(self, idx, targets=None):
        b, t = idx.size()
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(torch.arange(0, t, device=idx.device))
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
  • Input tokens → embeddings + positional encoding → Transformer blocks → logits over vocab.
  • If targets provided → compute cross-entropy loss.
  • Otherwise → just output logits for generation (the config fields these classes rely on are sketched below).
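The config object is not shown in this walkthrough. A minimal sketch of the fields the classes above read, with illustrative values (the notebook's actual hyperparameters may differ; only vocab_size = 50257 is fixed by the GPT-2 tokenizer):

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256     # maximum context length
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size (tiktoken "gpt2")
    n_layer: int = 6          # number of Transformer blocks
    n_head: int = 6           # attention heads per block
    n_embd: int = 384         # embedding / hidden width (must be divisible by n_head)
    bias: bool = True         # use bias in Linear and LayerNorm layers

config = GPTConfig()
model = GPT(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")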

(f) Generation

@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -self.config.block_size:]
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
  • Autoregressively generates tokens, appending each sampled token to the context.
  • Uses temperature (scales logits; higher = more random) and top_k (restricts sampling to the k most likely tokens); a small illustration follows.
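A tiny standalone illustration of what the two sampling knobs do (made-up logits, not taken from the model):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

print(F.softmax(logits / 0.5, dim=-1))  # temperature 0.5: sharper, near-greedy distribution
print(F.softmax(logits / 2.0, dim=-1))  # temperature 2.0: flatter, more random sampling

# top_k = 2: everything outside the 2 highest logits is set to -inf before softmax
v, _ = torch.topk(logits, 2)
clipped = logits.masked_fill(logits < v[-1], float('-inf'))
print(F.softmax(clipped, dim=-1))       # only the top-2 tokens keep nonzero probability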

4. Training

  • Loss: Cross-Entropy (predict the next token).
  • Optimizer: AdamW (with tuned betas, weight decay).
  • Scheduler: Warmup + Cosine Decay.
  • Mixed Precision + Gradient Accumulation for efficiency (a minimal sketch of the loop follows).
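The notebook's exact training loop is not reproduced here; the sketch below just wires together the ingredients listed above, with every hyperparameter being an illustrative placeholder:

import math
import torch

max_iters = 20000
warmup_iters = 1000
learning_rate = 1e-3
min_lr = 1e-4
grad_accum_steps = 4

def get_lr(it):
    # Linear warmup, then cosine decay from learning_rate down to min_lr
    if it < warmup_iters:
        return learning_rate * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + 0.5 * (learning_rate - min_lr) * (1 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate,
                              betas=(0.9, 0.95), weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()       # scales gradients for float16 training

for it in range(max_iters):
    for group in optimizer.param_groups:   # apply the scheduled learning rate
        group['lr'] = get_lr(it)
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):      # gradient accumulation: simulate a larger batch
        xb, yb = get_batch('train')
        with torch.autocast(device_type='cuda', dtype=torch.float16):  # mixed precision
            _, loss = model(xb, yb)
        scaler.scale(loss / grad_accum_steps).backward()
    scaler.step(optimizer)
    scaler.update()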

5. Monitoring

import matplotlib.pyplot as plt

plt.plot(train_loss_list, 'g', label='train_loss')
plt.plot(validation_loss_list, 'r', label='validation_loss')
plt.xlabel("Steps - Every 100 epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
  • Green = training loss, Red = validation loss.
  • Watch for overfitting / underfitting.

📊 Training Metrics

Epoch    Train Loss    Val Loss    Perplexity
  500        6.0358      6.0601         430.1
 1000        5.0690      5.1143         166.0
 1500        4.3162      4.3407          76.7
 2000        3.5948      3.6099          36.9
 2500        3.0460      3.0569          21.3
 3000        2.7518      2.7398          15.5
 3500        2.5606      2.5574          12.9
 4000        2.4583      2.4691          11.8
 4500        2.3943      2.3969          11.0
 5000        2.3428      2.3513          10.5
 6000        2.2141      2.2155          9.17
 7000        2.1389      2.1577          8.65
 8000        2.0570      2.0703          7.93
 9000        2.0062      2.0210          7.55
10000        1.9604      1.9715          7.18
12000        1.8580      1.8924          6.64
14000        1.7954      1.8284          6.23
16000        1.7369      1.7937          5.95
18000        1.6901      1.7314          5.65
19500        1.6594      1.7216          5.60

📉 Validation loss steadily decreases, and perplexity drops from ~430 → ~5.6 over training.
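The perplexity column is consistent with perplexity = exp(validation loss), which you can check for any row:

import math
print(math.exp(1.7216))  # ~5.60, matching the final row
print(math.exp(1.9715))  # ~7.18, matching the 10000 row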

6. Inference

# Load best model
model = GPT(config)
model.load_state_dict(torch.load("best_model_params.pt", map_location=device))
model.to(device)   # move parameters to the same device as the prompt tensor
model.eval()

# Prompt
sentence = "Write a Tarantino-style diner scene with two strangers..."
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(0).to(device)

# Generate (recommended shorter length)
y = model.generate(context, max_new_tokens=300, temperature=0.8, top_k=50)
print(enc.decode(y[0].tolist()))

⚠️ Note: In the notebook, max_new_tokens=5000 was used, which may be excessive.
For practical testing, use 200–500 tokens.


✅ Summary

  • Architecture: GPT-like Transformer (attention + MLP blocks).
  • Training: Next-token prediction with AdamW + LR scheduling.
  • Evaluation: Loss curves (train vs val).
  • Inference: Autoregressive generation with temperature & top-k control.

This is essentially a mini GPT-2 clone, scaled down for small datasets like movie scripts.
