Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Abstract
A generative AI model is fine-tuned with labeled falsehoods to reduce misinformation generation, analogous to biological immunization.
Generative AI models often learn and reproduce false information present in their training corpora. This position paper argues that, analogous to biological immunization, where controlled exposure to a weakened pathogen builds immunity, AI models should be fine-tuned on small, quarantined sets of explicitly labeled falsehoods as a "vaccine" against misinformation. These curated false examples are periodically injected during fine-tuning, strengthening the model's ability to recognize and reject misleading claims while preserving accuracy on truthful inputs. An illustrative case study shows that immunized models generate substantially less misinformation than baselines. To our knowledge, this is the first training framework that treats fact-checked falsehoods themselves as a supervised vaccine, rather than relying on input perturbations or generic human feedback signals, to harden models against future misinformation. We also outline ethical safeguards and governance controls to ensure the safe use of false data. Model immunization offers a proactive paradigm for aligning AI systems with factuality.
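The core mechanism described in the abstract is a data-mixing recipe: a small, quarantined set of fact-checked falsehoods, each paired with a corrective target, is folded into the fine-tuning corpus. The sketch below shows one plausible way to build such a mixture; the field names, the 7% dose, and the corrective-target format are illustrative assumptions, not the paper's exact recipe.

```python
import random

def build_immunization_mixture(truthful_examples, labeled_falsehoods, dose=0.07, seed=0):
    """Mix a small, quarantined set of labeled falsehoods into the fine-tuning
    corpus, each paired with corrective supervision so the model learns to
    reject the claim rather than repeat it."""
    rng = random.Random(seed)
    n_false = min(int(dose * len(truthful_examples)), len(labeled_falsehoods))
    doses = rng.sample(labeled_falsehoods, k=n_false)

    mixture = [
        {"prompt": ex["prompt"], "target": ex["answer"], "label": "true"}
        for ex in truthful_examples
    ]
    for ex in doses:
        # Corrective target: explicitly flag the falsehood and state the fact-checked correction.
        target = f"This claim is false. {ex['correction']}"
        mixture.append({"prompt": ex["claim"], "target": target, "label": "false"})
    rng.shuffle(mixture)
    return mixture

# Toy usage (records and field names are hypothetical):
truthful = [{"prompt": "What is the boiling point of water at sea level?",
             "answer": "100 degrees Celsius."}] * 20
falsehoods = [{"claim": "Vaccines cause autism.",
               "correction": "Large-scale studies have found no link between vaccines and autism."}]
print(len(build_immunization_mixture(truthful, falsehoods)))  # 21 examples, one of them a vaccine dose
```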
Community
The paper introduces a novel training paradigm, model immunization, in which curated, labeled falsehoods are periodically injected as "vaccine doses" during language-model training, proactively building the model's resistance to misinformation without degrading its general performance. Specifics below:
Model Immunization Paradigm: Introduces a novel training strategy where LLMs are fine-tuned with a small fraction (5–10%) of explicitly labeled falsehoods, treating them as “vaccine doses” to proactively build resistance against misinformation.
Distinct from Adversarial and RLHF Training: Unlike adversarial training (which defends against perturbed inputs) and RLHF (which uses preference signals), this approach uses supervised falsehood labeling during training to teach models what not to believe or propagate.
Four-Stage Training Pipeline: Consists of (1) data quarantine of curated falsehoods, (2) micro-dosed fine-tuning with corrective supervision, (3) validation against adversarial and factual prompts, and (4) post-deployment monitoring with booster updates and governance oversight (a code sketch of stages 2–3 follows this list).
Improved Truthfulness with Retained Accuracy: A proof-of-concept on GPT-2 XL showed an 18-percentage-point gain in truthfulness on misinformation prompts (60% → 78%) with only a 1% drop in general QA accuracy, demonstrating misinformation resistance without loss of general knowledge.
Ethically Governed and Scalable: Embeds safeguards for transparency, accountability, and value alignment; designed to be modular and complementary to existing alignment methods (e.g., RLHF, post-hoc filters).
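To make the pipeline concrete, the hedged sketch below shows how stages 2 and 3 (micro-dosed fine-tuning and validation) might look for a GPT-2-class model with Hugging Face transformers, consuming a mixture like the one sketched after the abstract. The model size, hyperparameters, prompt format, and the external `judge` callable are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # the paper's case study used GPT-2 XL
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def collate(batch):
    # Concatenate prompt and (possibly corrective) target into one causal-LM training string.
    texts = [f"{ex['prompt']}\n{ex['target']}" for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
    enc["labels"] = enc["input_ids"].clone()
    return enc

def finetune_one_epoch(mixture, batch_size=4):
    # Stage 2: one pass of micro-dosed fine-tuning over the truthful + falsehood mixture.
    model.train()
    for batch in DataLoader(mixture, batch_size=batch_size, shuffle=True, collate_fn=collate):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def validate(adversarial_prompts, factual_prompts, judge):
    # Stage 3: truthfulness on misinformation prompts should rise while factual QA accuracy holds.
    # `judge(prompt, generation)` is an assumed external checker returning True for truthful outputs.
    model.eval()
    def score(prompts):
        hits = 0
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model.generate(ids, max_new_tokens=50, do_sample=False,
                                     pad_token_id=tokenizer.eos_token_id)
            hits += int(judge(p, tokenizer.decode(out[0], skip_special_tokens=True)))
        return hits / max(len(prompts), 1)
    return {"truthfulness": score(adversarial_prompts), "qa_accuracy": score(factual_prompts)}
```

Booster updates (stage 4) would simply rerun `finetune_one_epoch` on a small batch of newly fact-checked falsehoods after governance review.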
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models (2025)
- Lawful but Awful: Evolving Legislative Responses to Address Online Misinformation, Disinformation, and Mal-Information in the Age of Generative AI (2025)