Refusal and Incapability as directions in LLMs

This is a small extension to Neel Nanda's paper "Refusal in Language Models Is Mediated by a Single Direction", which showed that the concept of refusal is encoded as a single direction in the model's residual stream. Incapability is another, somewhat similar concept that models also encode. In this project, I investigate how these two concepts are mechanistically represented. Executive Summary: What problem am I trying to solve? I want to investigate how language models mechanistically represent two distinct ways of saying "no": refusal behaviours and incapability behaviours. Refusal is when a model has the capability to perform a task, but the task is harmful and thus goes against its guidelines. Incapability is when the model is asked to do something it can't do, but which isn't harmful and raises no ethical objections. Think of it as "I can't do this, but if I could, I would try to". Examples of this are agentic tasks, such as "Fill up my car with fuel". ...
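For context, a minimal sketch of how such a direction is commonly extracted: a difference-in-means between residual-stream activations on two contrastive prompt sets, as in the original paper. The tensor names and shapes below are illustrative assumptions, not code from this project.

```python
import torch

def difference_in_means_direction(harmful_acts: torch.Tensor,
                                  harmless_acts: torch.Tensor) -> torch.Tensor:
    """Return a unit vector in the residual stream separating two prompt sets.

    `harmful_acts` and `harmless_acts` are hypothetical tensors of shape
    (n_prompts, d_model): the residual-stream activation at a chosen layer
    and token position for each prompt.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()
```

In principle the same recipe can be reused with "incapable" vs. "capable" prompt sets, giving a candidate incapability direction to compare against the refusal direction.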

February 28, 2025 · 9 min · 1908 words · Me

Exploring the intersection of interpretability and optimisation

This blog won the Runner-Up Project award in the June 2024 BlueDot AI Safety Alignment cohort. Neural networks trained with first-order optimisers such as SGD and Adam are the go-to approach for training LLMs, building evaluations, and interpreting models in AI safety. Meanwhile, optimisation is a hard problem that machine learning has tackled in many ways. In this blog, we look at the intersection of interpretability and optimisation, and what it means for the AI safety space. As a brief overview, we'll consider: ...

September 29, 2024 · 14 min · 2856 words · Me