manueldeprada (HF Staff) committed
Commit 0da6031 · 1 Parent(s): 8f67839

eliminate hf helpers

Files changed (2):
  1. README.md +20 -56
  2. custom_generate/generate.py +185 -100
README.md CHANGED
@@ -2,13 +2,14 @@
  library_name: transformers
  tags:
  - custom_generate
- - ancestral_sampling
  ---

- # Multinomial (Ancestral) Sampling simple implementation

  ## Description
- A clean, hackable implementation of ancestral sampling (multinomial sampling) with full KV cache support. This is a simplified alternative to the complex generation mixin in transformers, designed for readability and ease of modification while maintaining full performance.

  The implementation supports both sampling and greedy decoding modes, with optional temperature scaling and top-k/top-p filtering.

@@ -18,19 +19,23 @@ The implementation supports both sampling and greedy decoding modes, with option
  ## Model compatibility
  Most transformer LLM/VLM models trained for causal language modeling.

- ## Additional Arguments
  - `temperature` (float): Sampling temperature (default: 1.0, higher = more random)
  - `top_k` (int): Only consider top-k most probable tokens (default: None)
  - `top_p` (float): Only consider tokens with cumulative probability <= top_p (default: None)
  - `do_sample` (bool): Whether to use sampling (True, default) or greedy decoding (False)

- ## Output Type changes
  When `return_dict_in_generate=True`, returns a dictionary with:
  - `sequences`: Generated token IDs
  - `scores`: Log probabilities of sampled tokens (with temperature/sampling modifications)
- - `logps`: Original model log probabilities (T=1, no modifications)
- - `prompt_lens`: Length of input prompts
- - `lens`: Final sequence lengths

  ## Example usage

@@ -43,30 +48,30 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", devic
  inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)

  # Basic sampling
- gen_out = model.generate(**inputs, custom_generate="manueldeprada/ancestral_sampling", trust_remote_code=True)

  # With temperature
- gen_out = model.generate(**inputs, custom_generate="manueldeprada/ancestral_sampling", temperature=0.8, trust_remote_code=True)

  # With top-k
- gen_out = model.generate(**inputs, custom_generate="manueldeprada/ancestral_sampling", top_k=50, trust_remote_code=True)

  # With top-p (nucleus sampling)
- gen_out = model.generate(**inputs, custom_generate="manueldeprada/ancestral_sampling", top_p=0.9, trust_remote_code=True)

  # Greedy decoding (no sampling)
- gen_out = model.generate(**inputs, custom_generate="manueldeprada/ancestral_sampling", do_sample=False, trust_remote_code=True)

  # Get detailed output with probabilities
  gen_out = model.generate(
      **inputs,
-     custom_generate="manueldeprada/ancestral_sampling",
      return_dict_in_generate=True,
      trust_remote_code=True
  )
  print(f"Generated text: {tokenizer.batch_decode(gen_out['sequences'], skip_special_tokens=True)}")
  print(f"Sampling scores: {gen_out['scores']}")
- print(f"Model log probabilities: {gen_out['logps']}")
  ```

  ## Algorithm
@@ -82,47 +87,6 @@ print(f"Model log probabilities: {gen_out['logps']}")
    - Update KV cache and track sequence completion
  3. Return generated sequences and probability information

- ## Helper Functions for Custom Generation
-
- The implementation provides two key helper functions that you can use to build your own generation strategies:
-
- ### `init_gen(model_kwargs, model, max_new_tokens, bos_token_id)`
- Initializes the generation process and prepares the KV cache:
- - Sets up input sequences and model inputs
- - Prepares the KV cache for generation
- - Returns updated `model_kwargs` and `input_ids`
-
- ### `ps_next(model, model_kwargs, input_ids)`
- Gets the next token logits and updates the KV cache:
- - Runs the model forward pass
- - Extracts logits for the last token
- - Updates the KV cache
- - Returns updated `model_kwargs` and `logits`
-
- ### Example: Custom Generation Loop
-
- ```py
- from ancestral_sampling.generate import init_gen, ps_next
-
- def custom_generation(model, model_kwargs, max_new_tokens=20, temperature=1.0):
-     # Initialize generation
-     model_kwargs, input_ids = init_gen(model_kwargs, model, max_new_tokens, bos_token_id)
-
-     for i in range(max_new_tokens):
-         # Get next token logits
-         model_kwargs, logits = ps_next(model, model_kwargs, input_ids)
-
-         # Your custom logic here
-         probs = (logits / temperature).softmax(-1)
-         next_token = torch.multinomial(probs, 1)
-
-         # Append token and continue
-         input_ids = torch.cat([input_ids, next_token], dim=-1)
-
-         # Add your stopping conditions
-         if next_token.item() == eos_token_id:
-             break
-
-     return input_ids
- ```

  library_name: transformers
  tags:
  - custom_generate
+ - sampling
+ - kvcache
  ---

+ # Sampling with KV Cache

  ## Description
+ A clean, hackable implementation of sampling (also called ancestral sampling or multinomial sampling) with full KV cache support. This is a simplified alternative to the complex generation mixin in transformers, designed for readability and ease of modification while maintaining full performance.

  The implementation supports both sampling and greedy decoding modes, with optional temperature scaling and top-k/top-p filtering.

  ## Model compatibility
  Most transformer LLM/VLM models trained for causal language modeling.

+ ## Relevant Arguments
  - `temperature` (float): Sampling temperature (default: 1.0, higher = more random)
  - `top_k` (int): Only consider top-k most probable tokens (default: None)
  - `top_p` (float): Only consider tokens with cumulative probability <= top_p (default: None)
  - `do_sample` (bool): Whether to use sampling (True, default) or greedy decoding (False)

+ ### Logits Processing Order
+ Logits processors are applied in sequence: `temperature → softmax → top_k → top_p` (the same ordering as HuggingFace's `LogitsProcessor` system). Temperature scaling occurs before top-p filtering, so it affects the probability distribution that top-p operates on.
+
+ For example, with `temperature=1.0`, `top_p=0.9` might include tokens A, B, and C; with `temperature=0.5` the probability mass is much more concentrated, so `top_p=0.9` might include only token A.
+
+ ## Outputs
  When `return_dict_in_generate=True`, returns a dictionary with:
  - `sequences`: Generated token IDs
  - `scores`: Log probabilities of sampled tokens (with temperature/sampling modifications)
+ - `logprobs`: Original model log probabilities (T=1, no modifications)
+ Otherwise, returns a tensor of generated token IDs.

  ## Example usage

  inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)

  # Basic sampling
+ gen_out = model.generate(**inputs, custom_generate="manueldeprada/sampling_with_kvcache", trust_remote_code=True)

  # With temperature
+ gen_out = model.generate(**inputs, custom_generate="manueldeprada/sampling_with_kvcache", temperature=0.8, trust_remote_code=True)

  # With top-k
+ gen_out = model.generate(**inputs, custom_generate="manueldeprada/sampling_with_kvcache", top_k=50, trust_remote_code=True)

  # With top-p (nucleus sampling)
+ gen_out = model.generate(**inputs, custom_generate="manueldeprada/sampling_with_kvcache", top_p=0.9, trust_remote_code=True)

  # Greedy decoding (no sampling)
+ gen_out = model.generate(**inputs, custom_generate="manueldeprada/sampling_with_kvcache", do_sample=False, trust_remote_code=True)

  # Get detailed output with probabilities
  gen_out = model.generate(
      **inputs,
+     custom_generate="manueldeprada/sampling_with_kvcache",
      return_dict_in_generate=True,
      trust_remote_code=True
  )
  print(f"Generated text: {tokenizer.batch_decode(gen_out['sequences'], skip_special_tokens=True)}")
  print(f"Sampling scores: {gen_out['scores']}")
+ print(f"Model log probabilities: {gen_out['logprobs']}")
  ```

  ## Algorithm

    - Update KV cache and track sequence completion
  3. Return generated sequences and probability information

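A quick way to see the interaction described in the new "Logits Processing Order" section: the sketch below is a minimal, self-contained illustration (its `keep_top_p` helper is a toy stand-in for the repository's `_apply_top_p`, and the logits are made up). It applies temperature before the nucleus cut and counts how many tokens survive.

```py
import torch

# Toy next-token distribution over a 5-token vocabulary.
logits = torch.tensor([[4.0, 3.0, 2.0, 1.0, 0.0]])

def keep_top_p(ps, top_p):
    # Zero out the lowest-probability tokens whose cumulative mass exceeds top_p,
    # then renormalize (same idea as the _apply_top_p helper in generate.py below).
    sorted_ps, sorted_idx = torch.sort(ps, descending=True)
    cum = torch.cumsum(sorted_ps, dim=-1)
    remove = cum > top_p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    ps = ps.masked_fill(remove.scatter(1, sorted_idx, remove), 0.0)
    return ps / ps.sum(dim=-1, keepdim=True)

for temperature in (1.0, 0.5):
    ps = (logits / temperature).softmax(-1)    # temperature first, then softmax
    kept = keep_top_p(ps, top_p=0.9)           # then the nucleus filter
    print(temperature, int((kept > 0).sum()), "tokens kept")
```

With the flatter distribution (temperature 1.0) three of the five tokens stay inside the 0.9 nucleus; at temperature 0.5 only two do.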
 
custom_generate/generate.py CHANGED
@@ -1,87 +1,157 @@
  import torch
- from transformers import GenerationConfig


- def ps_next(model, model_kwargs, input_ids):
      """
-     Auxiliary function to get the next token probabilities and update the KV cache.
-
      Args:
          model: The language model
          model_kwargs: Model keyword arguments including KV cache
          input_ids: Current input token IDs
-         T: Temperature for sampling
-
      Returns:
-         Updated model_kwargs, probabilities at temperature T, probabilities at T=1
      """
-     model_inputs = model.prepare_inputs_for_generation(input_ids, **model_kwargs)
      with torch.no_grad():
          outputs = model(**model_inputs, return_dict=True)
-
      logits = outputs.logits[:, -1].detach()
-     model_kwargs = model._update_model_kwargs_for_generation(
-         outputs, model_kwargs, is_encoder_decoder=model.config.is_encoder_decoder
-     )
      del outputs
      return model_kwargs, logits

  def init_gen(model_kwargs, model, max_new_tokens, bos_token_id):
      """
-     Auxiliary function to initialize the generation process and prepare the KV cache.
-
      Args:
          model_kwargs: Model keyword arguments
          model: The language model
          max_new_tokens: Maximum number of new tokens to generate
-
      Returns:
          Model keyword arguments and input token IDs
      """

-     input_ids, model_input_name, model_kwargs = model._prepare_model_inputs(
-         None, bos_token_id, model_kwargs
-     )
-
-     batch_size = input_ids.shape[0]
-     model._prepare_cache_for_generation(
-         model.generation_config, model_kwargs, None, batch_size,
-         max_cache_length=max_new_tokens, device=input_ids.device
      )
-
-     # Get initial cache position
-     model_kwargs = model._get_initial_cache_position(input_ids.shape[1], input_ids.device, model_kwargs)
-     return model_kwargs, input_ids

- def _apply_top_k_top_p(ps, model):
-     if hasattr(model, 'generation_config') and hasattr(model.generation_config, 'top_k') and model.generation_config.top_k is not None:
-         top_k = model.generation_config.top_k
-         top_k = min(top_k, ps.size(-1))
-         indices_to_remove = ps < torch.topk(ps, top_k)[0][..., -1, None]
-         ps[indices_to_remove] = 0.0
-         ps = ps / ps.sum(dim=-1, keepdim=True)
-
-     # Apply top-p filtering if specified
-     if hasattr(model, 'generation_config') and hasattr(model.generation_config, 'top_p') and model.generation_config.top_p is not None:
-         top_p = model.generation_config.top_p
-         if top_p < 1.0:
-             sorted_probs, sorted_indices = torch.sort(ps, descending=True)
-             cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
-
-             # Remove tokens with cumulative probability above the threshold
-             sorted_indices_to_remove = cumulative_probs > top_p
-             sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
-             sorted_indices_to_remove[..., 0] = 0
-
-             indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
-             ps[indices_to_remove] = 0.0
-             ps = ps / ps.sum(dim=-1, keepdim=True)
-     return ps
-
- def ancestral_sampling(model_kwargs, model, eos_token_ids, pad_token_id, bos_token_id, do_sample=True, max_new_tokens=20, T=1.0):
      """
-     Ancestral sampling implementation with proper KV caching.
-
      Args:
          prompts: List of input prompts
          model: The language model
@@ -90,55 +160,64 @@ def ancestral_sampling(model_kwargs, model, eos_token_ids, pad_token_id, bos_tok
          pad_token_id: Padding token ID
          bos_token_id: Beginning-of-sequence token ID
          max_new_tokens: Maximum number of new tokens to generate
-
      Returns:
          Generated sequences, log probabilities, and metadata
      """
      # Initialize the generation process and prepare the KV cache
-     model_kwargs, input_ids = init_gen(model_kwargs, model, max_new_tokens, bos_token_id)
-     batch_size, max_prompts_len = input_ids.shape
-     prompts_len = (input_ids != pad_token_id).sum(dim=-1)

      # Keeps track of which sequences are finished and their lengths
      active_seqs = input_ids.new_ones((batch_size, 1), dtype=torch.bool)
-     lens = torch.full((batch_size,), max_prompts_len, dtype=torch.long, device=input_ids.device)
      # Modified log probabilities of the sequences
      scores = torch.zeros((batch_size, max_new_tokens), dtype=model.dtype)
-     # Unfiltered sequence log probabilities (T=1, no sampling modifications)
-     logps = torch.zeros((batch_size, max_new_tokens), dtype=model.dtype)

      for i in range(max_new_tokens):
          # Get the next token probabilities and update the KV cache
-         model_kwargs, logits = ps_next(model, model_kwargs, input_ids)
-         # Original model probabilities (T=1, no sampling modifications)
          model_ps = logits.softmax(-1)
-         # Sampling probabilities (T, with sampling modifications)
-         ps = (logits/T).softmax(-1)
-         ps = _apply_top_k_top_p(ps, model)
-
          # Sample the next token and gather the log probabilities
-         if do_sample:
-             next_token_ids = torch.multinomial(ps, 1) * active_seqs + pad_token_id * ~active_seqs
-         else:
-             next_token_ids = torch.argmax(ps, dim=-1).unsqueeze(-1) * active_seqs + pad_token_id * ~active_seqs
-         next_token_logps = ps.gather(-1, next_token_ids).log()
-         next_token_model_logps = model_ps.gather(-1, next_token_ids).log()
-
          input_ids = torch.cat([input_ids, next_token_ids], dim=-1)
-         scores[:, i] = (next_token_logps * active_seqs).squeeze()
-         logps[:, i] = (next_token_model_logps * active_seqs).squeeze()
-
-         lens += active_seqs.squeeze(-1).long()
          active_seqs &= ~torch.isin(next_token_ids, eos_token_ids)
          if active_seqs.sum() == 0:
-             break
-     return input_ids.detach().cpu(), scores[:,:i+1], logps[:,:i+1], prompts_len, lens.tolist()

  def generate(model, **kwargs):
      """
-     Ancestral sampling strategy - multinomial sampling with temperature and optional top-k/top-p filtering.
      Simple implementation with proper KV caching support.
-
      Args:
          model: The language model
          model_kwargs: Model keyword arguments from the tokenizer
@@ -147,29 +226,38 @@ def generate(model, **kwargs):
          top_k: Only consider top-k most probable tokens
          top_p: Only consider tokens with cumulative probability <= top_p
          **kwargs: Additional arguments
-
      Returns:
          Generated token IDs
      """
      generation_config = model.generation_config
-     max_new_tokens = kwargs.get('max_new_tokens', generation_config.max_new_tokens)
      max_new_tokens = 512 if max_new_tokens is None else max_new_tokens
-     do_sample = kwargs.get('do_sample', True)
-     eos_token_ids = kwargs.get('eos_token_ids', generation_config.eos_token_id)
      if eos_token_ids is None:
-         raise ValueError("Model generation config does not have an EOS token id. You must provide it to generate() with the eos_token_ids argument.")
      eos_token_ids = torch.as_tensor(eos_token_ids, device=model.device)
      if eos_token_ids is not None and eos_token_ids.ndim == 0:
          eos_token_ids = eos_token_ids.unsqueeze(0)
-
-     pad_token_id = kwargs.get('pad_token_id', generation_config.pad_token_id if generation_config.pad_token_id is not None else eos_token_ids[0])
-     bos_token_id = kwargs.get('bos_token_id', generation_config.bos_token_id)
      if bos_token_id is None:
-         raise ValueError("Model generation config does not have a BOS token id. You must provide it to generate() with the bos_token_id argument.")
-     T = kwargs.get('temperature', 1.0)
-     return_dict = kwargs.get('return_dict_in_generate', False)

-     generated_ids, scores, logps, prompt_lens, lens = ancestral_sampling(
          model_kwargs=kwargs,
          model=model,
          eos_token_ids=eos_token_ids,
@@ -177,17 +265,14 @@ def generate(model, **kwargs):
          pad_token_id=pad_token_id,
          bos_token_id=bos_token_id,
          do_sample=do_sample,
          max_new_tokens=max_new_tokens,
-         T=T,
      )

      if return_dict:
          return {
              "sequences": generated_ids,
              "scores": scores,
-             "logps": logps,
-             "prompt_lens": prompt_lens,
-             "lens": lens,
          }
      else:
          return generated_ids
-

  import torch
+ from transformers import Cache, DynamicCache
+ from transformers.generation.utils import ModelOutput
+ from typing import Optional, Any

+ def prepare_inputs_for_generation(
+     input_ids: torch.LongTensor,
+     past_key_values: Optional[Cache] = None,
+     attention_mask: Optional[torch.LongTensor] = None,
+     cache_position: Optional[torch.LongTensor] = None,
+     **kwargs,
+ ):
+     input_ids = input_ids[:, cache_position].clone(memory_format=torch.contiguous_format)
+     cur_len = input_ids.shape[1]
+     model_inputs = {"cache_position": cache_position,
+         "past_key_values": past_key_values,
+         "input_ids": input_ids,
+         "inputs_embeds": None,
+         "attention_mask": attention_mask,
+     }
+     if attention_mask is not None and kwargs.get("position_ids") is None:
+         position_ids = attention_mask.long().cumsum(-1) - 1
+         position_ids.masked_fill_(attention_mask == 0, 1)
+         kwargs["position_ids"] = position_ids
+     if past_key_values is not None:
+         for name in ("position_ids", "token_type_ids"):
+             if name in kwargs:
+                 kwargs[name] = kwargs[name][:, -cur_len:].clone(memory_format=torch.contiguous_format)
+     model_inputs.update({k: v for k, v in kwargs.items() if k not in model_inputs})
+     return model_inputs

+ def update_model_kwargs_for_generation(
+     outputs: ModelOutput,
+     model_kwargs: dict[str, Any],
+     num_new_tokens: int = 1,
+ ) -> dict[str, Any]:
+     model_kwargs["past_key_values"] = getattr(outputs, "past_key_values")
+     if "token_type_ids" in model_kwargs:
+         token_type_ids = model_kwargs["token_type_ids"]
+         model_kwargs["token_type_ids"] = torch.cat([token_type_ids, token_type_ids[:, -1].unsqueeze(-1)], dim=-1)
+     if "attention_mask" in model_kwargs:
+         attention_mask = model_kwargs["attention_mask"]
+         model_kwargs["attention_mask"] = torch.cat(
+             [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
+         )
+     model_kwargs["cache_position"] = model_kwargs["cache_position"][-1:] + num_new_tokens
+     return model_kwargs
+
+
+ def next_logits_with_cache_update(model, model_kwargs, input_ids):
      """
+     Gets the next token logits and updates the KV cache:
+     - Runs the model forward pass
+     - Extracts logits for the last token
+     - Updates the KV cache
+     - Returns updated `model_kwargs` and `logits`
+
      Args:
          model: The language model
          model_kwargs: Model keyword arguments including KV cache
          input_ids: Current input token IDs
+
      Returns:
+         Updated model_kwargs, logits for the next token
      """
+     model_inputs = prepare_inputs_for_generation(input_ids, **model_kwargs)
      with torch.no_grad():
          outputs = model(**model_inputs, return_dict=True)
+
      logits = outputs.logits[:, -1].detach()
+     model_kwargs = update_model_kwargs_for_generation(outputs, model_kwargs)
      del outputs
      return model_kwargs, logits

+
  def init_gen(model_kwargs, model, max_new_tokens, bos_token_id):
      """
+     Initializes the generation process and prepares the KV cache:
+     - Sets up input sequences and model inputs
+     - Prepares the KV cache for generation
+     - Returns updated `model_kwargs` and `input_ids`
+
      Args:
          model_kwargs: Model keyword arguments
          model: The language model
          max_new_tokens: Maximum number of new tokens to generate
+         bos_token_id: Beginning-of-sequence token ID
+
      Returns:
          Model keyword arguments and input token IDs
      """
+     input_ids = model_kwargs.pop("input_ids")
+     model_kwargs["past_key_values"] = DynamicCache() if model_kwargs.get("past_key_values") is None else model_kwargs["past_key_values"]
+     assert isinstance(model_kwargs["past_key_values"], Cache), "past_key_values must be a Cache object"
+     cache_position = torch.ones(input_ids.shape[1], dtype=torch.int64, device=input_ids.device).cumsum(0) - 1
+     cache_position = cache_position[model_kwargs["past_key_values"].get_seq_length() :]
+     model_kwargs["cache_position"] = cache_position
+     return model_kwargs, input_ids

+
+ def _apply_top_k(ps, model):
+     """Apply top-k filtering to probabilities."""
+     if not hasattr(model, "generation_config") or not hasattr(
+         model.generation_config, "top_k"
+     ):
+         return ps
+
+     top_k = model.generation_config.top_k
+     if top_k is None or top_k >= ps.size(-1):
+         return ps
+
+     indices_to_remove = ps < torch.topk(ps, top_k)[0][..., -1, None]
+     ps[indices_to_remove] = 0.0
+     return ps / ps.sum(dim=-1, keepdim=True)
+
+
+ def _apply_top_p(ps, model):
+     """Apply top-p (nucleus) filtering to probabilities."""
+     if not hasattr(model, "generation_config") or not hasattr(
+         model.generation_config, "top_p"
+     ):
+         return ps
+
+     top_p = model.generation_config.top_p
+     if top_p is None or top_p >= 1.0:
+         return ps
+
+     sorted_probs, sorted_indices = torch.sort(ps, descending=True)
+     cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
+
+     sorted_indices_to_remove = cumulative_probs > top_p
+     sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+     sorted_indices_to_remove[..., 0] = 0
+
+     indices_to_remove = sorted_indices_to_remove.scatter(
+         1, sorted_indices, sorted_indices_to_remove
      )
+     ps[indices_to_remove] = 0.0
+     return ps / ps.sum(dim=-1, keepdim=True)
+

+ def sampling_with_kvcache(
+     model_kwargs,
+     model,
+     eos_token_ids,
+     pad_token_id,
+     bos_token_id,
+     do_sample=True,
+     max_new_tokens=20,
+     temperature=1.0,
+ ):
      """
+     Sampling implementation with proper KV caching.
+
      Args:
          prompts: List of input prompts
          model: The language model

          pad_token_id: Padding token ID
          bos_token_id: Beginning-of-sequence token ID
          max_new_tokens: Maximum number of new tokens to generate
+
      Returns:
          Generated sequences, log probabilities, and metadata
      """
      # Initialize the generation process and prepare the KV cache
+     model_kwargs, input_ids = init_gen(
+         model_kwargs, model, max_new_tokens, bos_token_id
+     )
+     batch_size, _ = input_ids.shape

      # Keeps track of which sequences are finished and their lengths
      active_seqs = input_ids.new_ones((batch_size, 1), dtype=torch.bool)
      # Modified log probabilities of the sequences
      scores = torch.zeros((batch_size, max_new_tokens), dtype=model.dtype)
+     # Unfiltered sequence log probabilities (temperature=1, no sampling processors applied)
+     logprobs = torch.zeros((batch_size, max_new_tokens), dtype=model.dtype)

      for i in range(max_new_tokens):
          # Get the next token probabilities and update the KV cache
+         model_kwargs, logits = next_logits_with_cache_update(
+             model, model_kwargs, input_ids
+         )
+         # Store original model probabilities (temperature=1, no sampling processors applied)
          model_ps = logits.softmax(-1)
+
+         # Logit processors (temperature, top-k, top-p). We can chain these!
+         ps = (logits / temperature).softmax(-1)
+         ps = _apply_top_k(ps, model)
+         ps = _apply_top_p(ps, model)
+
          # Sample the next token and gather the log probabilities
+         if do_sample:  # Sampling
+             next_token_ids = (
+                 torch.multinomial(ps, 1) * active_seqs + pad_token_id * ~active_seqs
+             )
+         else:  # Greedy decoding
+             next_token_ids = (
+                 torch.argmax(ps, dim=-1).unsqueeze(-1) * active_seqs
+                 + pad_token_id * ~active_seqs
+             )
+         next_token_logprobs = ps.gather(-1, next_token_ids).log()
+         next_token_model_logprobs = model_ps.gather(-1, next_token_ids).log()
+
          input_ids = torch.cat([input_ids, next_token_ids], dim=-1)
+         scores[:, i] = (next_token_logprobs * active_seqs).squeeze()
+         logprobs[:, i] = (next_token_model_logprobs * active_seqs).squeeze()
+
          active_seqs &= ~torch.isin(next_token_ids, eos_token_ids)
          if active_seqs.sum() == 0:
+             break
+     return input_ids.detach().cpu(), scores[:, : i + 1], logprobs[:, : i + 1]
+

  def generate(model, **kwargs):
      """
+     Sampling strategy - multinomial sampling with temperature and optional top-k/top-p filtering.
      Simple implementation with proper KV caching support.
+
      Args:
          model: The language model
          model_kwargs: Model keyword arguments from the tokenizer

          top_k: Only consider top-k most probable tokens
          top_p: Only consider tokens with cumulative probability <= top_p
          **kwargs: Additional arguments
+
      Returns:
          Generated token IDs
      """
      generation_config = model.generation_config
+     max_new_tokens = kwargs.get("max_new_tokens", generation_config.max_new_tokens)
      max_new_tokens = 512 if max_new_tokens is None else max_new_tokens
+     do_sample = kwargs.get("do_sample", True)
+     eos_token_ids = kwargs.get("eos_token_ids", generation_config.eos_token_id)
      if eos_token_ids is None:
+         raise ValueError(
+             "Model generation config does not have an EOS token id. You must provide it to generate() with the eos_token_ids argument."
+         )
      eos_token_ids = torch.as_tensor(eos_token_ids, device=model.device)
      if eos_token_ids is not None and eos_token_ids.ndim == 0:
          eos_token_ids = eos_token_ids.unsqueeze(0)
+
+     pad_token_id = kwargs.get(
+         "pad_token_id",
+         generation_config.pad_token_id
+         if generation_config.pad_token_id is not None
+         else eos_token_ids[0],
+     )
+     bos_token_id = kwargs.get("bos_token_id", generation_config.bos_token_id)
      if bos_token_id is None:
+         raise ValueError(
+             "Model generation config does not have a BOS token id. You must provide it to generate() with the bos_token_id argument."
+         )
+     temperature = kwargs.get("temperature", 1.0)
+     return_dict = kwargs.get("return_dict_in_generate", False)

+     generated_ids, scores, logprobs = sampling_with_kvcache(
          model_kwargs=kwargs,
          model=model,
          eos_token_ids=eos_token_ids,

          bos_token_id=bos_token_id,
          do_sample=do_sample,
          max_new_tokens=max_new_tokens,
+         temperature=temperature,
      )

      if return_dict:
          return {
              "sequences": generated_ids,
              "scores": scores,
+             "logprobs": logprobs,
          }
      else:
          return generated_ids
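The "Custom Generation Loop" example removed from the README above can still be reproduced against the renamed helpers. The sketch below is an illustrative adaptation, not taken from the commit: the flat `generate` import path and the Qwen checkpoint are assumptions, and the temperature-0.8 rule stands in for whatever custom decision logic you want to plug in.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the repo's custom_generate/generate.py is importable locally as `generate`
# (hypothetical path, e.g. after cloning the Hub repo).
from generate import init_gen, next_logits_with_cache_update

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

model_kwargs = dict(tok("The quick brown", return_tensors="pt"))
eos_token_id = tok.eos_token_id

# Prepare the KV cache and pull the prompt ids out of model_kwargs.
model_kwargs, input_ids = init_gen(model_kwargs, model, max_new_tokens=20,
                                   bos_token_id=tok.bos_token_id)

for _ in range(20):
    # One forward pass; the cache, attention mask and cache_position are updated.
    model_kwargs, logits = next_logits_with_cache_update(model, model_kwargs, input_ids)

    # Custom decision rule goes here (plain temperature sampling as an example).
    probs = (logits / 0.8).softmax(-1)
    next_token = torch.multinomial(probs, 1)

    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == eos_token_id:
        break

print(tok.decode(input_ids[0], skip_special_tokens=True))
```

On the first call `cache_position` spans the whole prompt, so the prompt is encoded once; afterwards `update_model_kwargs_for_generation` advances it by one token per step, so each later forward pass only feeds the newly sampled token while the KV cache holds the rest.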