Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Abstract
A generative AI model is fine-tuned with labeled falsehoods to reduce misinformation generation, analogous to biological immunization.
Generative AI models often learn and reproduce false information present in their training corpora. This position paper argues that, analogous to biological immunization, where controlled exposure to a weakened pathogen builds immunity, AI models should be fine-tuned on small, quarantined sets of explicitly labeled falsehoods as a "vaccine" against misinformation. These curated false examples are periodically injected during fine-tuning, strengthening the model's ability to recognize and reject misleading claims while preserving accuracy on truthful inputs. An illustrative case study shows that immunized models generate substantially less misinformation than baselines. To our knowledge, this is the first training framework that treats fact-checked falsehoods themselves as a supervised vaccine, rather than relying on input perturbations or generic human feedback signals, to harden models against future misinformation. We also outline ethical safeguards and governance controls to ensure the safe use of false data. Model immunization offers a proactive paradigm for aligning AI systems with factuality.
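The core mechanism described in the abstract is a data-mixing recipe: a small, quarantined set of fact-checked falsehoods, each paired with a corrective target, is folded into the fine-tuning corpus. The sketch below shows one plausible way to build such a mixture; the field names, the 7% dose, and the corrective-target format are illustrative assumptions, not the paper's exact recipe.

```python
import random

def build_immunization_mixture(truthful_examples, labeled_falsehoods, dose=0.07, seed=0):
    """Mix a small, quarantined set of labeled falsehoods into the fine-tuning
    corpus, each paired with corrective supervision so the model learns to
    reject the claim rather than repeat it."""
    rng = random.Random(seed)
    n_false = min(int(dose * len(truthful_examples)), len(labeled_falsehoods))
    doses = rng.sample(labeled_falsehoods, k=n_false)

    mixture = [
        {"prompt": ex["prompt"], "target": ex["answer"], "label": "true"}
        for ex in truthful_examples
    ]
    for ex in doses:
        # Corrective target: explicitly flag the falsehood and state the fact-checked correction.
        target = f"This claim is false. {ex['correction']}"
        mixture.append({"prompt": ex["claim"], "target": target, "label": "false"})
    rng.shuffle(mixture)
    return mixture

# Toy usage (records and field names are hypothetical):
truthful = [{"prompt": "What is the boiling point of water at sea level?",
             "answer": "100 degrees Celsius."}] * 20
falsehoods = [{"claim": "Vaccines cause autism.",
               "correction": "Large-scale studies have found no link between vaccines and autism."}]
print(len(build_immunization_mixture(truthful, falsehoods)))  # 21 examples, one of them a vaccine dose
```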
Community
The paper introduces a novel training paradigm, model immunization, in which curated, labeled falsehoods are periodically injected as "vaccine doses" during language-model training, proactively building the model's resistance to misinformation without degrading its general performance. Specifics below:
Model Immunization Paradigm: Introduces a novel training strategy where LLMs are fine-tuned with a small fraction (5–10%) of explicitly labeled falsehoods, treating them as “vaccine doses” to proactively build resistance against misinformation.
Distinct from Adversarial and RLHF Training: Unlike adversarial training (which defends against perturbed inputs) and RLHF (which uses preference signals), this approach uses supervised falsehood labeling during training to teach models what not to believe or propagate.
Four-Stage Training Pipeline: Consists of (1) data quarantine of curated falsehoods, (2) micro-dosed fine-tuning with corrective supervision, (3) validation against adversarial and factual prompts, and (4) post-deployment monitoring with booster updates and governance oversight (a code sketch of stages 2–3 follows this list).
Improved Truthfulness with Retained Accuracy: A proof-of-concept on GPT-2 XL showed an 18-percentage-point gain in truthfulness on misinformation prompts (60% → 78%) with only a 1% drop in general QA accuracy, demonstrating misinformation resistance without loss of general knowledge.
Ethically Governed and Scalable: Embeds safeguards for transparency, accountability, and value alignment; designed to be modular and complementary to existing alignment methods (e.g., RLHF, post-hoc filters).
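To make the pipeline concrete, the hedged sketch below shows how stages 2 and 3 (micro-dosed fine-tuning and validation) might look for a GPT-2-class model with Hugging Face transformers, consuming a mixture like the one sketched after the abstract. The model size, hyperparameters, prompt format, and the external `judge` callable are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # the paper's case study used GPT-2 XL
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def collate(batch):
    # Concatenate prompt and (possibly corrective) target into one causal-LM training string.
    texts = [f"{ex['prompt']}\n{ex['target']}" for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
    enc["labels"] = enc["input_ids"].clone()
    return enc

def finetune_one_epoch(mixture, batch_size=4):
    # Stage 2: one pass of micro-dosed fine-tuning over the truthful + falsehood mixture.
    model.train()
    for batch in DataLoader(mixture, batch_size=batch_size, shuffle=True, collate_fn=collate):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def validate(adversarial_prompts, factual_prompts, judge):
    # Stage 3: truthfulness on misinformation prompts should rise while factual QA accuracy holds.
    # `judge(prompt, generation)` is an assumed external checker returning True for truthful outputs.
    model.eval()
    def score(prompts):
        hits = 0
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model.generate(ids, max_new_tokens=50, do_sample=False,
                                     pad_token_id=tokenizer.eos_token_id)
            hits += int(judge(p, tokenizer.decode(out[0], skip_special_tokens=True)))
        return hits / max(len(prompts), 1)
    return {"truthfulness": score(adversarial_prompts), "qa_accuracy": score(factual_prompts)}
```

Booster updates (stage 4) would simply rerun `finetune_one_epoch` on a small batch of newly fact-checked falsehoods after governance review.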
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models (2025)
- Lawful but Awful: Evolving Legislative Responses to Address Online Misinformation, Disinformation, and Mal-Information in the Age of Generative AI (2025)