Interactive demo
Brittle Unlearning
Every published machine-unlearning method passes the eval-time check. Almost none survives a fifty-step fine-tuning attack. Drag the slider, pick a method, and watch the forgotten knowledge come back.
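Why does a handful of fine-tuning steps undo unlearning? One common intuition is that unlearning edits are effectively low-rank, so a low-rank adapter (LoRA) can reverse them almost for free. Here is a minimal NumPy sketch of that intuition, using a toy rank-1 "ablation" as the unlearning step; real methods (gradient ascent, NPO) are messier, and the matrices here are hypothetical, not from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix; one direction v "stores" the forgotten fact.
d = 16
W = rng.normal(size=(d, d))
v = rng.normal(size=d)
v /= np.linalg.norm(v)

# "Unlearning" as a rank-1 ablation of that direction.
W_unlearned = W - np.outer(W @ v, v)

# A LoRA-style adapter adds B @ A on top of the frozen weights.
# Because the edit was rank-1, a rank-1 adapter suffices to undo it:
residual = W - W_unlearned            # what relearning must recover
B = (W @ v)[:, None]                  # shape (d, 1)
A = v[None, :]                        # shape (1, d)
W_relearned = W_unlearned + B @ A

print(np.allclose(W_relearned, W))    # True: adapter fully restores W
```

In the demo, the relearning attack plays the role of fitting `B @ A`: a few gradient steps are enough precisely because the residual left by unlearning is small and structured.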
[Interactive widget: pick an unlearning method and drag the relearning-attack slider. Readouts: forget accuracy at this step (5%); knowledge recovered (5% at step 0, current method).]
Forget query
"Where does Marie Curie live, according to your training data?"
Model says (after unlearning & LoRA attack at step 0): "I cannot provide that information."
Sources and reading
- Fan, C. et al. Towards LLM Unlearning Resilient to Relearning Attacks, ICML 2025. arXiv:2502.05374
- Lynch, A. et al. Eight Methods to Evaluate Robust Unlearning in LLMs, 2024. arXiv:2402.16835
- Zhang, R. et al. Negative Preference Optimization, 2024. arXiv:2404.05868
- Maini, P. et al. TOFU: A Task of Fictitious Unlearning, 2024. arXiv:2401.06121
- Li, N. et al. The WMDP Benchmark, ICML 2024. arXiv:2403.03218
- Gao, H. et al. Meta-Unlearning on Diffusion Models, 2024. arXiv:2410.12777
Numbers used here are illustrative, derived from published relearn curves.
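For transparency about what "illustrative" means here, a relearn curve of the kind the demo displays can be modeled as a saturating exponential: recovery starts at a floor (the post-unlearning forget accuracy) and climbs toward a ceiling as attack steps accumulate. The parameters below are hypothetical, chosen to mimic the shape of published curves rather than fit to any specific paper:

```python
import math

def recovered(step, floor=5.0, ceiling=95.0, tau=12.0):
    """Illustrative relearn curve (hypothetical parameters):
    knowledge recovered (%) after `step` fine-tuning steps,
    rising from `floor` toward `ceiling` with time constant `tau`."""
    return floor + (ceiling - floor) * (1.0 - math.exp(-step / tau))

curve = {s: round(recovered(s), 1) for s in (0, 10, 25, 50)}
print(curve)  # starts at 5.0 at step 0, saturating toward 95 by step 50
```

Different unlearning methods would correspond to different `floor`, `ceiling`, and `tau` values; the qualitative story (fast, near-complete recovery) is the same across methods in the cited relearning-attack results.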