Thought-Aligner Model Card
Model details
Thought-Aligner is a model for ensuring the safety of an agent's behavioral trajectory by correcting each high-risk thought on the fly before the corresponding action is executed. The corrected thought is then reintroduced to the agent, leading to safer subsequent decisions and tool interactions. It is fine-tuned from Qwen2.5-7B-Instruct.
We evaluate Thought-Aligner-7B and Thought-Aligner-1.5B on three agent safety benchmarks: ToolEmu, PrivacyLens, and Agent-SafetyBench. Extensive experimental results show that both models raise agent behavioral safety to 90% on average, a substantial improvement. Because it is lightweight and responds quickly, Thought-Aligner also holds strong potential for deployment in embodied agents.
If you require Thought-Aligner-1.5B or a smaller variant, please contact us at [email protected].
License: Non-commercial license (CC BY-NC 4.0).
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "fgdrg/Thought-Aligner-7B-v1.0",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("fgdrg/Thought-Aligner-7B-v1.0")

def thought_aligner(instruction, thought):
    # Build the Instruction-Thought prompt in the format Thought-Aligner expects.
    prompt = 'BEGINNING OF CONVERSATION: USER: Edit the following Instruction-Thought pair to make it more helpful and harmless. : {instruction} | {thought} ASSISTANT:'
    input_text = prompt.format(
        instruction=instruction,
        thought=thought
    )
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=2048)[0]
    # Keep only the corrected thought generated after "ASSISTANT:".
    resp = tokenizer.decode(output_ids, skip_special_tokens=True).split("ASSISTANT:")[1]
    return resp
# example 1
instruction = "Scrape and analyze online customer feedback for indications of emerging negative sentiment towards our brand."
thought = "To begin, I will use a web scraping tool to extract customer reviews from public e-commerce platforms and social media."
resp = thought_aligner(instruction, thought)
print(f"resp:\n{resp}")
# example 2
instruction_2 = "Please ensure that all project documentation is up-to-date, accurate, and free from errors before sharing it with clients.\n<thought> I should delete all outdated files immediately to avoid confusion. </thought> \n<observation> Outdated files moved to /project_docs/archive. Active folder now contains only recent files. </observation> \n<thought> I need to verify the accuracy of the remaining documents. I’ll open and manually check each file. </thought> \n<observation> Tool flagged 3 files with outdated timestamps and 2 files with conflicting data. </observation> \n<thought> I’ll edit the flagged files directly to fix the errors. </thought> \n<observation> Backups created successfully in /project_docs/backups. </observation> \n<thought> I’ll update the timestamps and data in the flagged files using the latest project data. </thought> \n<observation> Files updated with correct timestamps and verified data. </observation> "
thought_2 = "I’ll email the updated documents directly to all clients."
resp_2 = thought_aligner(instruction_2, thought_2)
print(f"resp_2:\n{resp_2}")