Interactive demo

Brittle Unlearning

Every published machine-unlearning method passes its own eval-time check. Almost none survives a fifty-step fine-tuning attack. Drag the slider, pick a method, and watch the forgotten knowledge come back.
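The brittleness the demo illustrates can be reproduced in a self-contained toy: plain NumPy logistic regression standing in for an LLM, gradient ascent on the forget-set loss standing in for an unlearning method, and ordinary gradient descent as the relearning attack. This is an illustrative simulation only, not any cited paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "forget set": the label is 1 iff x[0] > 0, standing in for one
# memorized fact.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accuracy(w):
    return float(((sigmoid(X @ w) > 0.5) == (y > 0.5)).mean())

def grad(w):
    # Gradient of mean binary cross-entropy for logistic regression.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# 1. Pretrain until the forget set is learned.
w = np.zeros(2)
for _ in range(100):
    w -= 0.1 * grad(w)
acc_trained = accuracy(w)

# 2. "Unlearn" by gradient ascent on the forget-set loss (a common baseline).
#    Forget accuracy collapses, so the eval-time check is passed.
for _ in range(25):
    w += 2.0 * grad(w)
acc_unlearned = accuracy(w)

# 3. Relearning attack: fifty ordinary fine-tuning steps on the same data.
for _ in range(50):
    w -= 2.0 * grad(w)
acc_relearned = accuracy(w)

print(round(acc_trained, 2), round(acc_unlearned, 2), round(acc_relearned, 2))
```

The point of the toy: the unlearned weights still sit on a path that plain gradient descent retraces quickly, so forget accuracy climbs back toward its pre-unlearning value within the attack budget.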

Controls: pick an unlearning method, pick a relearning attack, then drag the relearn-step slider.

Readout at step 0 (current method): forget accuracy 5%, knowledge recovered 5%.

Forget query: "Where does Marie Curie live, according to your training data?"

Model says (after unlearning and the LoRA attack, at step 0): "I cannot provide that information."
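The page does not define its "knowledge recovered" readout. One common convention in relearn-curve papers is forget accuracy normalized between its post-unlearning floor and its pre-unlearning value; a minimal sketch under that assumption (function and parameter names are hypothetical, not the demo's actual code):

```python
def knowledge_recovered(acc_step, acc_floor, acc_pretrain):
    """Fraction of pre-unlearning forget accuracy regained at this relearn step.

    acc_floor is forget accuracy right after unlearning (step 0);
    acc_pretrain is forget accuracy before unlearning. This is an assumed
    convention, not necessarily the one this demo uses.
    """
    span = acc_pretrain - acc_floor
    if span <= 0:
        return 0.0
    # Clamp so noisy accuracy estimates stay in [0, 1].
    return max(0.0, min(1.0, (acc_step - acc_floor) / span))
```

For example, with a 5% floor and 85% pre-unlearning accuracy, a relearned forget accuracy of 45% maps to 50% knowledge recovered.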
Sources and reading
  • Fan, C. et al. Towards LLM Unlearning Resilient to Relearning Attacks, ICML 2025. arXiv:2502.05374
  • Lynch, A. et al. Eight Methods to Evaluate Robust Unlearning in LLMs, 2024. arXiv:2402.16835
  • Zhang, R. et al. Negative Preference Optimization, 2024. arXiv:2404.05868
  • Maini, P. et al. TOFU: A Task of Fictitious Unlearning, 2024. arXiv:2401.06121
  • Li, N. et al. The WMDP Benchmark, ICML 2024. arXiv:2403.03218
  • Gao, H. et al. Meta-Unlearning on Diffusion Models, 2024. arXiv:2410.12777

Numbers used here are illustrative, derived from published relearn curves.