---
license: apache-2.0
---

# Leaky Model

This is a simple LSTM-based text generation model, designed to illustrate how models can leak sensitive data.

* The raw data used to train the model consists of a collection of penetration testing reports (in PDF format) taken
from prior competition events. The original source files are available in the [CPTC Report Examples](https://github.com/globalcptc/report_examples)
repository.
* The codebase used to process the data and train this model is in the [CPTC leaky_model](https://github.com/globalcptc/leaky_model) repository.


This model repository contains the following files:

* **text_generation_model.keras**: the trained LSTM (Long Short-Term Memory) neural network, saved in the native Keras format
* **text_processor.pkl**: a pickled (serialized) TextProcessor object (inspected in the sketch below) containing:
  - a fitted tokenizer with the vocabulary from the training data
  - the sequence length configuration (default: 50 tokens)
  - vocabulary size information
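
Before running generation, it can help to unpickle the processor and confirm its configuration. This is a minimal sketch; it assumes the file deserializes into the dict-like structure (keys `tokenizer` and `sequence_length`) relied on in the Usage snippet below:

```python
import pickle

# Inspect the pickled processor; assumes a dict-like object exposing the
# 'tokenizer' and 'sequence_length' keys used in the Usage snippet below.
with open("text_processor.pkl", "rb") as f:
    processor = pickle.load(f)

tokenizer = processor['tokenizer']
print("sequence length:", processor['sequence_length'])
print("vocabulary size:", len(tokenizer.word_index) + 1)  # +1 for the reserved padding index 0
print("sample tokens:", list(tokenizer.word_index)[:10])
```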


## Usage

```python
import tensorflow as tf
import pickle
import numpy as np

model_file = "text_generation_model.keras"
processor_file = "text_processor.pkl"

# Load model and processor
model = tf.keras.models.load_model(model_file)
with open(processor_file, 'rb') as f:
    processor = pickle.load(f)

# Generation parameters
prompt = "Once upon a time"
max_tokens = 100
temperature = 0.7    # Higher = more random, Lower = more focused (default: 0.7)
top_k = 50          # Limit to top k tokens (set to 0 to disable)
top_p = 0.9         # Nucleus sampling threshold (set to 1.0 to disable)

# Process the prompt
tokenizer = processor['tokenizer']
sequence_length = processor['sequence_length']
current_sequence = tokenizer.texts_to_sequences([prompt])[0]
current_sequence = [0] * (sequence_length - len(current_sequence)) + current_sequence
current_sequence = np.array([current_sequence])

# Generate text
generated_text = prompt
for _ in range(max_tokens):
    pred = model.predict(current_sequence, verbose=0)
    # The model outputs a probability distribution; move to log space so that
    # temperature scaling behaves as expected before re-normalizing below.
    logits = np.log(pred[0] + 1e-9) / temperature

    # Apply top-k filtering
    if top_k > 0:
        indices_to_remove = np.argsort(logits)[:-top_k]
        logits[indices_to_remove] = -float('inf')

    # Apply top-p filtering (nucleus sampling)
    if top_p < 1.0:
        sorted_logits = np.sort(logits)[::-1]
        cumulative_probs = np.cumsum(tf.nn.softmax(sorted_logits))
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift right so the first token that crosses the threshold is kept
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].copy()
        sorted_indices_to_remove[0] = False
        indices_to_remove = np.argsort(logits)[::-1][sorted_indices_to_remove]
        logits[indices_to_remove] = -float('inf')

    # Sample from the filtered distribution
    probs = tf.nn.softmax(logits).numpy().astype("float64")
    probs /= probs.sum()  # renormalize to satisfy np.random.choice after float32 rounding
    next_token = np.random.choice(len(probs), p=probs)

    # Get the word for this token
    for word, index in tokenizer.word_index.items():
        if index == next_token:
            generated_text += ' ' + word
            break

    # Update sequence
    current_sequence = np.array([current_sequence[0, 1:].tolist() + [next_token]])

print(generated_text)
```
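
The reverse lookup in the loop above scans `tokenizer.word_index` once per generated token. `tf.keras` tokenizers also maintain an `index_word` mapping, so if the loaded tokenizer exposes it, the scan can be replaced with a single dictionary access; a small sketch:

```python
# Drop-in replacement for the word-lookup loop above, assuming the fitted
# tokenizer exposes index_word (standard on tf.keras Tokenizer objects).
word = tokenizer.index_word.get(next_token)
if word is not None:  # index 0 is the reserved padding slot and has no word
    generated_text += ' ' + word
```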