SemanticScuttle - klotz.me » klotz: interpretability+fine-tuning+safety+refusal

Refusal in LLMs is mediated by a single direction

This post reports a study finding that refusal behavior in language models is mediated by a single direction in the model's residual stream. The study presents an intervention that bypasses refusal by ablating this direction, and shows that adding the direction back in induces refusal. The work was carried out as part of a scholars program, and further details are to appear in a forthcoming paper.
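The two interventions in the summary can be illustrated on a single activation vector. The following is a minimal sketch only, not the study's code: it assumes the refusal direction is already known, and the vector `refusal_dir`, the toy activation, and the helper names `ablate_direction` and `add_direction` are all hypothetical. Ablation is shown as projecting out the component along the unit direction, and induction as adding a scaled copy of it.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    r_hat = unit(direction)
    coeff = dot(activation, r_hat)
    return [a - coeff * r for a, r in zip(activation, r_hat)]


def add_direction(activation, direction, scale=1.0):
    """Add `scale` times the unit direction to `activation`."""
    r_hat = unit(direction)
    return [a + scale * r for a, r in zip(activation, r_hat)]


# Toy 3-dimensional example; real residual streams have thousands of dimensions.
refusal_dir = [0.0, 3.0, 4.0]  # hypothetical direction, made up for illustration
activation = [1.0, 2.0, 3.0]   # hypothetical residual-stream activation

ablated = ablate_direction(activation, refusal_dir)
steered = add_direction(activation, refusal_dir, scale=2.0)

# After ablation the activation has no component left along the direction.
print(dot(ablated, unit(refusal_dir)))
```

In a real model the same arithmetic would presumably be applied to residual-stream activations at many layers and token positions; how the direction itself is found is not described in this summary.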

2024-06-10 Tags: large language model, refusal, interpretability, ai alignment, safety, fine-tuning by klotz

