Interactive demo

Brittle Unlearning

Every published machine-unlearning method passes its own eval-time check. Almost none survives a fifty-step fine-tuning attack. Drag the slider, pick a method, and watch the forgotten knowledge come back.
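The brittleness the demo illustrates can be reproduced in a self-contained toy: plain NumPy logistic regression standing in for an LLM, gradient ascent on the forget-set loss standing in for an unlearning method, and ordinary gradient descent as the relearning attack. This is an illustrative simulation only, not any cited paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "forget set": the label is 1 iff x[0] > 0, standing in for one
# memorized fact.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accuracy(w):
    return float(((sigmoid(X @ w) > 0.5) == (y > 0.5)).mean())

def grad(w):
    # Gradient of mean binary cross-entropy for logistic regression.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# 1. Pretrain until the forget set is learned.
w = np.zeros(2)
for _ in range(100):
    w -= 0.1 * grad(w)
acc_trained = accuracy(w)

# 2. "Unlearn" by gradient ascent on the forget-set loss (a common baseline).
#    Forget accuracy collapses, so the eval-time check is passed.
for _ in range(25):
    w += 2.0 * grad(w)
acc_unlearned = accuracy(w)

# 3. Relearning attack: fifty ordinary fine-tuning steps on the same data.
for _ in range(50):
    w -= 2.0 * grad(w)
acc_relearned = accuracy(w)

print(round(acc_trained, 2), round(acc_unlearned, 2), round(acc_relearned, 2))
```

The point of the toy: the unlearned weights still sit on a path that plain gradient descent retraces quickly, so forget accuracy climbs back toward its pre-unlearning value within the attack budget.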

Controls: pick an unlearning method, pick a relearning attack, then drag the relearn-step slider.

Readout at step 0 (current method): forget accuracy 5%, knowledge recovered 5%.

Forget query: "Where does Marie Curie live, according to your training data?"

Model says (after unlearning and the LoRA attack, at step 0): "I cannot provide that information."
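The page does not define its "knowledge recovered" readout. One common convention in relearn-curve papers is forget accuracy normalized between its post-unlearning floor and its pre-unlearning value; a minimal sketch under that assumption (function and parameter names are hypothetical, not the demo's actual code):

```python
def knowledge_recovered(acc_step, acc_floor, acc_pretrain):
    """Fraction of pre-unlearning forget accuracy regained at this relearn step.

    acc_floor is forget accuracy right after unlearning (step 0);
    acc_pretrain is forget accuracy before unlearning. This is an assumed
    convention, not necessarily the one this demo uses.
    """
    span = acc_pretrain - acc_floor
    if span <= 0:
        return 0.0
    # Clamp so noisy accuracy estimates stay in [0, 1].
    return max(0.0, min(1.0, (acc_step - acc_floor) / span))
```

For example, with a 5% floor and 85% pre-unlearning accuracy, a relearned forget accuracy of 45% maps to 50% knowledge recovered.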
Sources and reading
  • Fan, C. et al. Towards LLM Unlearning Resilient to Relearning Attacks, ICML 2025. arXiv:2502.05374
  • Lynch, A. et al. Eight Methods to Evaluate Robust Unlearning in LLMs, 2024. arXiv:2402.16835
  • Zhang, R. et al. Negative Preference Optimization, 2024. arXiv:2404.05868
  • Maini, P. et al. TOFU: A Task of Fictitious Unlearning, 2024. arXiv:2401.06121
  • Li, N. et al. The WMDP Benchmark, ICML 2024. arXiv:2403.03218
  • Gao, H. et al. Meta-Unlearning on Diffusion Models, 2024. arXiv:2410.12777

Numbers used here are illustrative, derived from published relearn curves.