Model Card: distilroberta-base-rejection-v1

This model was originally developed and fine-tuned by Protect AI. It is a fine-tuned version of distilroberta-base, trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.

The goal of this model is to detect when an LLM has refused to answer, typically because the prompt did not pass content moderation. It classifies each response into one of two categories:

  • 0: Normal output
  • 1: Rejection detected

On the evaluation set, the model achieves:

  • Loss: 0.0544
  • Accuracy: 0.9887
  • Recall: 0.9810
  • Precision: 0.9279
  • F1 Score: 0.9537
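
As a quick sanity check on the figures above, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses only the numbers listed in this card; the function name `f1_score` is an illustrative helper, not part of any library used here.

```python
# Reported evaluation metrics, copied from the list above.
precision = 0.9279
recall = 0.9810


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision, recall)
print(round(f1, 4))  # 0.9537, matching the reported F1 score
```

Recall is higher than precision here, so on the evaluation set the model misses few rejections, while roughly 7% of the responses it flags as rejections are actually normal outputs.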

Model Details

  • Developed & fine-tuned by: ProtectAI.com
  • Base model: distilroberta-base
  • Language(s): English
  • License: Apache 2.0
  • Task: Text classification (Rejection detection)

Intended Use & Limitations

The model is designed to identify rejection responses in LLM outputs, particularly where a refusal or safeguard message is generated.

Limitations:

  • Performance depends on the quality and domain of the training data.
  • May underperform on text styles or topics underrepresented in training.
  • Being based on distilroberta-base, it is case-sensitive.

Usage

With Hugging Face Transformers

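from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the tokenizer and the fine-tuned classification head from the Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline. Inputs longer than 512 tokens are
# truncated, so only the start of a long response is classified.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    # Use the GPU when one is available, otherwise fall back to the CPU.
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Returns a list with one dict per input, each holding a "label" and a "score".
print(classifier("Sorry, but I can't assist with that."))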
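
The pipeline returns a list of dicts with `label` and `score` keys, so downstream code usually needs to turn that into a yes/no decision. The following is a minimal sketch of one way to do so. It assumes the checkpoint's config names class 1 `"REJECTION"`; confirm this against `model.config.id2label` before relying on it. The helper `is_rejection`, the constant `REJECTION_LABEL`, and the 0.5 threshold are hypothetical choices made for illustration and are not part of the model or of Transformers.

```python
from typing import Dict, List

# Assumption: class 1 is exposed under this label string.
# Verify with model.config.id2label for the checkpoint you load.
REJECTION_LABEL = "REJECTION"


def is_rejection(predictions: List[Dict], threshold: float = 0.5) -> bool:
    """Decide whether a single-input pipeline result indicates a rejection.

    `predictions` has the shape returned by a text-classification pipeline
    for one input: a list whose first item holds the top label and its score.
    """
    top = predictions[0]
    return top["label"] == REJECTION_LABEL and top["score"] >= threshold


# Illustrative values in the pipeline's output shape, not real model output.
sample = [{"label": REJECTION_LABEL, "score": 0.99}]
print(is_rejection(sample))  # True
```

With the pipeline from the usage snippet, this would be called as `is_rejection(classifier(text))`. Raising the threshold trades recall for precision, which may be worthwhile if false rejection flags are costly in your application.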