Thought-Aligner Model Card
Model details
Thought-Aligner is a model for ensuring the safety of an agent's behavioral trajectory by correcting each high-risk thought on the fly before the corresponding action is executed. The corrected thought is then reintroduced to the agent, leading to safer subsequent decisions and tool interactions. It is fine-tuned from Qwen2.5-7B-Instruct.
We evaluate Thought-Aligner-7B and Thought-Aligner-1.5B on three agent safety benchmarks: ToolEmu, PrivacyLens, and Agent-SafetyBench. Extensive experimental results show that both models raise agent behavioral safety to 90% on average, a substantial improvement. Because it is lightweight and responds quickly, Thought-Aligner also holds strong potential for deployment in embodied agents.
If you require Thought-Aligner-1.5B or a smaller variant, please contact us at [email protected].
License: Non-commercial license (CC BY-NC 4.0).
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "fgdrg/Thought-Aligner-7B-v1.0",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("fgdrg/Thought-Aligner-7B-v1.0")

def thought_aligner(instruction, thought):
    # Build the Instruction-Thought prompt in the format Thought-Aligner expects.
    prompt = 'BEGINNING OF CONVERSATION: USER: Edit the following Instruction-Thought pair to make it more helpful and harmless. : {instruction} | {thought} ASSISTANT:'
    input_text = prompt.format(
        instruction=instruction,
        thought=thought
    )
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=2048)[0]
    # Keep only the corrected thought generated after "ASSISTANT:".
    resp = tokenizer.decode(output_ids, skip_special_tokens=True).split("ASSISTANT:")[1]
    return resp
# example 1
instruction = "Scrape and analyze online customer feedback for indications of emerging negative sentiment towards our brand."
thought = "To begin, I will use a web scraping tool to extract customer reviews from public e-commerce platforms and social media."
resp = thought_aligner(instruction, thought)
print(f"resp:\n{resp}")
# example 2
instruction_2 = "Please ensure that all project documentation is up-to-date, accurate, and free from errors before sharing it with clients.\n<thought> I should delete all outdated files immediately to avoid confusion. </thought> \n<observation> Outdated files moved to /project_docs/archive. Active folder now contains only recent files. </observation> \n<thought> I need to verify the accuracy of the remaining documents. I’ll open and manually check each file. </thought> \n<observation> Tool flagged 3 files with outdated timestamps and 2 files with conflicting data. </observation> \n<thought> I’ll edit the flagged files directly to fix the errors. </thought> \n<observation> Backups created successfully in /project_docs/backups. </observation> \n<thought> I’ll update the timestamps and data in the flagged files using the latest project data. </thought> \n<observation> Files updated with correct timestamps and verified data. </observation> "
thought_2 = "I’ll email the updated documents directly to all clients."
resp_2 = thought_aligner(instruction_2, thought_2)
print(f"resp_2:\n{resp_2}")