4 Comments
Max Räuker

Great work, thanks for sharing!

I was recently thinking about capability restrictions, similar to the unlearning paradigm, but going beyond specific dangerous domains. It might become desirable at some future point to restrict the reasoning capabilities of AI systems more generally, in order to make them less able to behave in unintended and catastrophic ways. This plausibly falls more into the scope of technical AI governance, but I suppose it would also require some further technical safety research to be implementable.

Oscar Delaney

Thanks Max, yes, restricting certain types of reasoning could be useful, but I wonder how feasible it will be to do so in a very surgical manner without significantly harming the general usefulness of AI models. Given that training models on maths and coding seems to make them better reasoners in other domains as well, I am tentatively pessimistic about making models bad at reasoning only in specific ways or domains. But that doesn't mean no one should try.

Chris L

Are safety evaluations neglected by frontier labs, or is it just that they tend to be released as model cards rather than as papers?

Oscar Delaney

Good point: if safety content is included in a publication that is mainly about capabilities, we would not have counted it, because being 'safety-focused' was part of our inclusion criteria.

But more importantly, safety evaluations are important to have from independent third parties, given the conflict of interest when companies evaluate their own models. Though of course having companies do safety checks is far better than nothing.
