AI companies’ safety research leaves important gaps. Oscar Delaney and Oliver Guest from IAPS argue that governments and philanthropists should fill them.
Great work, thanks for sharing!
I was recently thinking about capability restrictions, similar to the unlearning paradigm, but going beyond specific dangerous domains. I thought it might become desirable at some future point to restrict the reasoning capabilities of AI systems more generally, in order to make them less able to behave in unintended and catastrophic ways. This plausibly falls more within the scope of technical AI governance, but I suppose it would also require some further technical safety research to be implementable.
Thanks Max, yes, restricting certain types of reasoning could be useful, but I wonder how feasible it will be to do so in a very surgical manner without significantly harming the general usefulness of AI models. Given that training models on maths and coding seems to make them better reasoners in other domains as well, I am tentatively pessimistic about making models bad at reasoning only in specific ways or domains. But that doesn't mean no one should try.
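To make the idea a bit more concrete, here is a rough sketch of the kind of objective used in unlearning-style capability restriction, generalised from "dangerous domain data" to whatever reasoning-related data one wanted to target. This is purely illustrative, not an existing implementation: it assumes a Hugging Face-style causal LM whose forward pass returns a loss when labels are supplied, and `unlearning_step`, `forget_batch`, and `retain_batch` are hypothetical names.

```python
def unlearning_step(model, optimizer, forget_batch, retain_batch, forget_weight=1.0):
    """One combined update: degrade the targeted capability while preserving the rest.

    forget_batch / retain_batch are hypothetical pre-tokenized batches
    (input_ids, attention_mask, labels) for the capability to remove and
    the general data to keep, respectively.
    """
    model.train()
    optimizer.zero_grad()

    # Ordinary language-modelling loss on data we want the model to stay good at.
    retain_loss = model(**retain_batch).loss

    # Subtracting the loss on the forget data is gradient *ascent* on that data,
    # i.e. it pushes the model towards being worse at the targeted capability.
    forget_loss = model(**forget_batch).loss
    total_loss = retain_loss - forget_weight * forget_loss

    total_loss.backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()
```

Naive objectives like this tend to drag down general capabilities along with the targeted ones, which is exactly the feasibility worry above; making the restriction surgical is the hard part.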
Are safety evaluations neglected by frontier labs, or is it just that they tend to be released as model cards rather than as papers?
Good point. If safety content is included in a publication that is mainly about capabilities, we would not have counted it (being 'safety-focused' was part of our inclusion criteria).
But more importantly, it is valuable to have safety evaluations from independent third parties, given the conflict of interest when companies evaluate their own models. Though of course having companies do safety checks is far better than nothing.