TinyLlama Early-Exit Heads (Adapter)

Adapter-style early-exit classification heads for TinyLlama/TinyLlama-1.1B-Chat-v1.0.

Attach these heads to the base model to exit the forward pass early whenever an intermediate layer's next-token prediction is already confident, reducing the average compute per token.

💡 This repo contains only the early-exit heads + a small loader. The base model is also required; the Quickstart below downloads it automatically.


TL;DR

  • One tiny linear head per transformer layer.
  • At each decoding step, compute layer-wise logits; if max_prob >= confidence_threshold, exit early.
  • Ships with a loader and a minimal generation helper.
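
The exit rule in the bullets above can be sketched in a few lines of plain Python. This is only an illustration of the thresholding idea, not the code shipped in early_exit_wrapper.py: the function name `early_exit_layer`, its default threshold of 0.9, and the fallback to the last layer when no head is confident are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_layer(layer_logits, confidence_threshold=0.9):
    """Return (layer_index, token_id, max_prob) for the first layer whose
    top next-token probability reaches the threshold.

    Hypothetical helper: if no layer is confident enough, the result of
    the last layer is returned (assumed fallback behaviour).
    """
    result = None
    for i, logits in enumerate(layer_logits):
        probs = softmax(logits)
        max_prob = max(probs)
        result = (i, probs.index(max_prob), max_prob)
        if max_prob >= confidence_threshold:
            break  # confident enough: later layers are never consumed
    if result is None:
        raise ValueError("layer_logits must contain at least one layer")
    return result

# Toy per-layer logits over a 3-token vocabulary (made-up numbers).
toy_logits = [
    [0.1, 0.2, 0.0],  # layer 0: nearly uniform, not confident
    [5.0, 0.0, 0.0],  # layer 1: strongly prefers token 0
    [0.0, 9.0, 0.0],  # layer 2: strongly prefers token 1
]
layer, token, prob = early_exit_layer(toy_logits, confidence_threshold=0.9)
print(layer, token, round(prob, 3))
```

Because the loop only pulls items from `layer_logits` as it needs them, passing a generator that runs one transformer layer per step would skip the remaining layers after an exit, which is where the compute saving comes from.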

Quickstart (no local files needed)

# pip install -U torch transformers safetensors huggingface_hub

from huggingface_hub import hf_hub_download
import importlib.util, sys

REPO_ID = "5ivatej/tinyllama-1.1b-early-exit"

# 1) Dynamically fetch the loader from the Hub
module_path = hf_hub_download(REPO_ID, "early_exit_wrapper.py")

# 2) Import it as a module
spec = importlib.util.spec_from_file_location("early_exit_wrapper", module_path)
early = importlib.util.module_from_spec(spec)
sys.modules["early_exit_wrapper"] = early
spec.loader.exec_module(early)

# 3) Load wrapped model + tokenizer
wrapped, tok = early.load_early_exit_from_hub(REPO_ID)  # auto-picks CPU/MPS/CUDA & safe dtype

# 4) Generate with early exit
ids = early.generate_with_early_exit(
    "Explain early exit in one tweet.",
    wrapped, tok,
    max_new_tokens=64, temperature=0.7, top_p=0.9
)
print(tok.decode(ids[0], skip_special_tokens=True))