Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Link: arxiv.org/abs/2502.17424
Discussion: news.ycombinator.com/item?id=4…
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user.
relentless_eduardo
in reply to collappsar