Dario Amodei has an interesting article on the urgency of interpretability. Of course, interpretability is nothing new. It's a classic AI problem. You can't just use a neural-net black box to make mortgage decisions, because you have to be able to justify them. This is one of the AI ethics problems that the kind of researchers who get funded by the Turing Institute work on. I seem to recall it being discussed at a workshop I went to at Imperial over a decade ago. In those days you probably couldn't have built a neural net-based system even if you had wanted to, so you probably had hand-wired logic. Amodei, though, is talking about how we can interpret the actions of frontier models. Well, one big advantage over human minds is that we can look inside. Of course, what we see is mostly large matrices. But people, well, some people, a few people, have spent a lot of effort looking into this question. There are certainly things we can do. But, as AI 2027 suggested recently, this might only get us so far.
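To make "looking inside" concrete, here's a minimal sketch (assuming PyTorch and the Hugging Face transformers library, with GPT-2 standing in for something far larger, and the layer choice purely illustrative) of pulling out the activations at one layer. This is the raw material interpretability work starts from, and it really is just a large matrix.

```python
# Minimal sketch: capture one layer's activations from a small open model.
# Model, prompt, and layer index are illustrative, not anyone's actual setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

captured = {}

def hook(module, inputs, output):
    # For a GPT-2 block, output[0] is the hidden-state tensor:
    # shape (batch, seq_len, d_model)
    captured["acts"] = output[0].detach()

# Attach the hook to an arbitrary middle block (layer 6 of 12 here).
handle = model.transformer.h[6].register_forward_hook(hook)

tokens = tokenizer("Should the bank approve this mortgage?", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

print(captured["acts"].shape)  # e.g. torch.Size([1, 8, 768]) -- a large matrix
```

Everything interesting in interpretability (probes, sparse autoencoders, circuit tracing) is downstream of tensors like that one; the hard part is turning them into explanations a loan applicant, or a regulator, would accept.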
Of course, interpretability and alignment aren't necessarily the same. You could have an aligned but non-interpretable A(G|S)I, one that, like the Eschaton, moved in mysterious ways. Equally, you could have a non-aligned but interpretable A(G|S)I, one that was perfectly open about not caring much, if at all, for humans. But here's the thing. There was a point, I guess in the mid-2000s, when Yudkowsky realised that (a) he wasn't a good enough programmer to program the seed AGI; (b) even if he were, there would be no way to ensure that it would stay aligned. He realised that his grift was better served by running a website or writing sophomoric essays or Harry Potter fanfic. MIRI did do some work on
No publications since 2021, which, given the current situation, is noteworthy (there are a couple of November 2024 preprints). Their homepage declares