EV-Based Decision Making

Expected values (EVs) are hard to wrap your head around. There’s a famous Veritasium video where he asks people whether they would bet $10$ dollars for the chance to win some $X$ dollars, with $X$ slowly increased from $10$ upwards. Interestingly, even when he offers $X = 30$, at which point the EV is $+10$, people still decline. This phenomenon is called loss aversion: on average, we weigh losses roughly twice as heavily as equivalent gains. ...
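As a quick sketch of the arithmetic behind that number (assuming the offer is a fair 50/50 bet where you either lose your $10$ stake or win $X$, which is what the quoted $+10$ implies):

$$
\mathbb{E}[\text{winnings}] = \tfrac{1}{2}X - \tfrac{1}{2}(10), \qquad X = 30 \;\Rightarrow\; \mathbb{E} = 15 - 5 = +10 .
$$

Under that assumption the bet is already favourable for any $X > 10$, yet many people still turn it down.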

March 19, 2025 · 3 min · 429 words · Me

Refusal and Incapability as directions in LLMs

This is a small extension to Neel Nanda’s “Refusal in Language Models Is Mediated by a Single Direction” paper, which showed that the concept of refusal is encoded as a single direction in the residual stream of the model. Incapability is another, somewhat similar concept that models also encode. In this project, I investigate how these two concepts are mechanistically represented (Colab). Executive Summary: What problem am I trying to solve? I want to investigate how language models mechanistically represent the two distinct ways of saying “no”: refusal behaviours and incapability behaviours. Refusal is when the model is capable of doing the task, but the task is harmful and thus goes against its guidelines. Incapability is when the model is asked to do something it cannot do, even though the task is harmless and it has no ethical objection. Think of it as “I can’t do this, but if I could, I would try to”. Examples of these are agentic tasks, such as “Fill up my car with fuel”. ...
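For a feel of what “represented as a direction” means, here is a minimal sketch of the usual difference-of-means recipe for extracting such directions and comparing them. The prompt sets, layer choice, and activation values below are placeholders, not the project’s actual data; in practice the activations would come from hooking a real model.

```python
import numpy as np

# Hypothetical residual-stream activations at one layer, one row per prompt.
# In practice these would be collected by hooking a transformer
# (e.g. with a library like TransformerLens) on three prompt sets.
d_model = 512
rng = np.random.default_rng(0)
acts_harmful   = rng.normal(size=(100, d_model))  # prompts the model refuses
acts_incapable = rng.normal(size=(100, d_model))  # prompts the model can't do
acts_neutral   = rng.normal(size=(100, d_model))  # ordinary, answerable prompts

def diff_of_means_direction(acts_pos, acts_neg):
    """Difference-of-means direction: mean activation on the 'positive'
    prompt set minus mean activation on the 'negative' set, normalised."""
    d = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return d / np.linalg.norm(d)

refusal_dir      = diff_of_means_direction(acts_harmful, acts_neutral)
incapability_dir = diff_of_means_direction(acts_incapable, acts_neutral)

# Cosine similarity between the two "ways of saying no": values near 1 would
# suggest a shared direction, values near 0 largely separate representations.
cos_sim = float(refusal_dir @ incapability_dir)
print(f"cosine similarity (refusal vs incapability): {cos_sim:.3f}")
```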

February 28, 2025 · 9 min · 1908 words · Me

Exploring the intersection of interpretability and optimisation

This blog post won the Runner-Up Project award for the June 2024 BlueDot AI Safety Alignment cohort. Neural networks trained with first-order optimisers such as SGD and Adam are the go-to approach for training LLMs, building evaluations, and interpreting models in AI safety. Meanwhile, optimisation is a hard problem that has been tackled in machine learning in many different ways. In this blog, we look at the intersection of interpretability and optimisation, and what it means for the AI safety space. As a brief overview, we’ll consider: ...

September 29, 2024 · 14 min · 2856 words · Me

Hello World

Welcome to My Blog! This is my first blog post. I’ll be writing about various topics, including Computer Science, Machine Learning, AI Safety, and more! Stay tuned for more content coming soon.

March 8, 2024 · 1 min · 32 words · Me