The Urgency of Interpretability

Published on April 1, 2025 at 10:47 AM

Dario Amodei has an interesting article on the urgency of interpretability. Of course, interpretability is nothing new. It's a classic AI problem. You can't just use a neural-net black box to make mortgage decisions, because you have to be able to justify them. This is one of the AI ethics topics that the kinds of researchers who get funded by the Turing Institute work on. I seem to recall it being discussed at a workshop I went to at Imperial over a decade ago. In those days you probably couldn't have built a neural-net-based system even if you had wanted to, so you probably had hand-wired logic. Amodei, though, is talking about how we can interpret the actions of frontier models. Well, a big advantage over human minds is that we can look inside. Of course, what we see is mostly large matrices. But people, well, some people, a few people, have spent a lot of effort looking into this question. There are certainly things we can do. But as AI 2027 suggested recently, this might only get us so far.
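
By way of illustration, here is a minimal sketch of what "looking inside" means in practice, assuming PyTorch and a toy two-layer network of my own invention (not any real frontier model, and not Anthropic's actual tooling): attach a forward hook, capture the hidden activations, and look at the raw weight matrices.

    import torch
    import torch.nn as nn

    # Toy two-layer network standing in for "the model". Real frontier
    # models are the same idea at vastly larger scale: stacks of big
    # weight matrices.
    model = nn.Sequential(
        nn.Linear(16, 64),
        nn.ReLU(),
        nn.Linear(64, 4),
    )

    captured = {}

    def save_activation(name):
        # A forward hook lets us read an intermediate activation as the
        # forward pass happens.
        def hook(module, inputs, output):
            captured[name] = output.detach()
        return hook

    model[0].register_forward_hook(save_activation("hidden"))

    x = torch.randn(1, 16)
    model(x)

    # What we actually "see" inside: matrices of weights and activations.
    print(model[0].weight.shape)     # torch.Size([64, 16])
    print(captured["hidden"].shape)  # torch.Size([1, 64])

The hard part of interpretability is not getting these numbers out, it's turning them into something a human can reason about.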

Of course, interpretability and alignment aren't necessarily the same. You could have an aligned but non-interpretable A(G|S)I, one that, like the Eschaton, moved in mysterious ways. Equally, you could have a non-aligned but interpretable A(G|S)I, one that was perfectly open about not caring much, if at all, for humans. But here's the thing. There was a point, I guess in the mid-2000s, when Yudkowsky realised that (a) he wasn't a good enough programmer to program the seed AGI; (b) even if he were, there would be no way to ensure that it would stay aligned. He realised that his grift was better served running a website or writing sophomoric essays or Harry Potter fanfic. MIRI did do some work on the technical side of alignment, though.

No publications since 2021, which, given the current situation, is noteworthy (there are a couple of November 2024 preprints). Their homepage declares:

The AI industry is racing toward a precipice.

The default consequence of the creation of artificial superintelligence (ASI) is human extinction.

Our survival depends on delaying the creation of ASI, as soon as we can, for as long as necessary.

The general feeling is that the doomsters lost to the accelerationists (Altman, certainly; Amodei, presumably). I am not sure where that leaves the grift. I guess you can buy ESY's books on Amazon. But I wonder what his plan is. Does he actually believe this stuff? I mean, I do. Gemini 2.5 is a better physicist than I was at 27. It's also a better virologist. That's not a combo we have had before. I expect an extinction-level event in the coming years, whether deliberate or accidental, state or non-state, viral, bacterial, fungal, grey goo, nanotech, mirror life.

Here's a picture of ESY with Altman, but I don't know when this was taken.

Because the thing is that, as hinted at by AI 2027, you are probably going to have to have a trust network. So, you trust an AI that can interpret an AGI that can interpret an ASI. But it's like trying to explain the thinking of a Fields Medal-winning mathematician to the person on the Clapham omnibus. Maybe the ASI is so good at interpreting verself that ve can make verself understood. But it has always seemed clear to me, and I think to Geoff Hinton and, it seems, to ESY, that no dynamic cybernetic system driven by complex, constantly evolving metagoals (and meta-metagoals) can ever be truly trusted. As with a person, you have just got to hope they don't unexpectedly go off the rails. Most people don't, most of the time, but that's because humans are powerfully influenced by other humans and by societal norms. That won't apply in the same way to weakly god-like entities. You just can't know for sure how something is going to react in all circumstances and forever.

Perhaps we need that Butlerian Jihad after all.

(Only kidding!)
