TinyLlama Early-Exit Heads (Adapter)
Adapter-style early-exit classification heads for TinyLlama/TinyLlama-1.1B-Chat-v1.0.
Attach these heads to the base model to exit the forward pass at an intermediate layer once the next-token prediction is already confident, which reduces the average compute per token.
💡 This repo contains only the early-exit heads + a small loader. You must also have the base model (it downloads automatically in the examples below).
TL;DR
- One tiny linear head per transformer layer.
- At each decoding step, compute layer-wise logits; if max_prob >= confidence_threshold, exit early.
- Ships with a loader and a minimal generation helper.
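The exit rule in the list above can be sketched in plain Python. This is only an illustration of the `max_prob >= confidence_threshold` check, not the code shipped in `early_exit_wrapper.py`: the names `softmax` and `early_exit_step` are hypothetical, the default threshold of 0.9 is an assumption, and the per-layer logits are passed in precomputed for simplicity, whereas a real implementation would presumably stop running layers as soon as one head is confident.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_step(layer_logits, confidence_threshold=0.9):
    """Pick the exit layer and token for one decoding step (hypothetical helper).

    layer_logits: list of logit vectors, one per layer head, shallowest first.
    Returns (layer_index, token_id).
    """
    for layer_idx, logits in enumerate(layer_logits):
        probs = softmax(logits)
        max_prob = max(probs)
        # Exit at the first layer whose head is confident enough.
        if max_prob >= confidence_threshold:
            return layer_idx, probs.index(max_prob)
    # No head reached the threshold: use the last layer's prediction.
    final_probs = softmax(layer_logits[-1])
    return len(layer_logits) - 1, final_probs.index(max(final_probs))


# Toy example: 3 layers, vocabulary of 3 tokens.
toy_logits = [
    [0.1, 0.2, 0.0],  # layer 0: nearly uniform, not confident
    [5.0, 0.0, 0.0],  # layer 1: strongly prefers token 0
    [0.0, 0.0, 9.0],  # layer 2: would only be reached with a stricter threshold
]
layer, token = early_exit_step(toy_logits, confidence_threshold=0.9)
print(layer, token)
```

A higher threshold pushes exits toward deeper layers (more compute, predictions closer to the full model), while a lower one exits earlier at the cost of less reliable tokens.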
Quickstart (no local files needed)
# pip install -U torch transformers safetensors huggingface_hub
from huggingface_hub import hf_hub_download
import importlib.util, sys
REPO_ID = "5ivatej/tinyllama-1.1b-early-exit"
# 1) Dynamically fetch the loader from the Hub
module_path = hf_hub_download(REPO_ID, "early_exit_wrapper.py")
# 2) Import it as a module
spec = importlib.util.spec_from_file_location("early_exit_wrapper", module_path)
early = importlib.util.module_from_spec(spec)
sys.modules["early_exit_wrapper"] = early
spec.loader.exec_module(early)
# 3) Load wrapped model + tokenizer
wrapped, tok = early.load_early_exit_from_hub(REPO_ID) # auto-picks CPU/MPS/CUDA & safe dtype
# 4) Generate with early exit
ids = early.generate_with_early_exit(
    "Explain early exit in one tweet.",
    wrapped, tok,
    max_new_tokens=64, temperature=0.7, top_p=0.9
)
print(tok.decode(ids[0], skip_special_tokens=True))