Building Self-Aware AI Would Be a Bad Idea

AI models already show early signs of self-awareness. Allowing such capabilities to develop further poses risks we're not ready for.

Jan 21, 2026

Summary

No longer sci-fi? Frontier AI companies are on track to develop AI systems with human-like self-awareness.
Defining terms: Self-awareness means recognizing oneself as an individual and continuous entity in time. It is distinct from consciousness, which is the ability to have inner experiences.
What’s the problem? Self-awareness could lay the groundwork for dangerous AI misalignment and compelling demands for ‘AI rights’.
Safety before deployment: Governments should require AI developers to demonstrate their models lack human-like self-awareness, backed by industry standards and regulatory oversight.

Once science fiction, the prospect of AI with human-like self-awareness could be on the horizon. Both Google DeepMind and Anthropic have hired researchers to study ‘AI consciousness’ and ‘model welfare’; Anthropic even allows their models to terminate ‘distressing’ conversations.

In 2023, a group of experts including Turing Award winner Yoshua Bengio saw ‘no obvious technical barriers’ to AI systems that satisfy indicators of consciousness. In 2025, a survey of experts gave a 20% chance that we’ll have conscious AI as soon as 2030.

What is AI self-awareness?

Self-awareness is the recognition of oneself as an individual separate from the environment and other individuals, and as a continuous entity in time. Self-awareness is not the same as consciousness – the ability to have subjective experiences including pain and pleasure – but both co-occur in humans and are indistinguishable to an outside observer. And unlike consciousness, self-awareness involves behaviors that can be measured empirically.

AI researchers are now developing objective assessments for aspects of self-awareness in large language models (LLMs). They have found evidence that the latest, most powerful models can to some extent understand and act upon their own internal states.

In particular, AI models seem able to express well-calibrated confidence in their own knowledge, predict their own outputs, and modulate their outputs when necessary. In other words, they appear to have rudimentary powers of introspection and metacognition.
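To make ‘well-calibrated confidence’ concrete, here is a minimal sketch of one common way calibration is scored, expected calibration error (ECE). It is an illustration only, not the method used in the research described above. The function name, the five-bin default, and the toy numbers are assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was actually right.
    n_bins: number of equal-width confidence bins (an illustrative default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one answer")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))

    # Sum each bin's |mean confidence - accuracy|, weighted by bin size.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data (hypothetical): a model that is sure and always right scores 0;
# a model that is sure but right only a quarter of the time scores 0.75.
calibrated = expected_calibration_error([1.0, 1.0, 0.0, 0.0], [True, True, False, False])
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, False, False, False])
```

A score near zero means the model's stated confidence tracks how often it is actually right, which is the sense of ‘well-calibrated’ used here.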

It is no accident that the most advanced models are developing these capabilities. There are economic incentives to building self-aware AI. If an LLM can distinguish what it knows from what it doesn’t, that can help reduce hallucinations. Being able to model the minds of others and ourselves facilitates social interactions in humans and other primates, and may do the same in AI.

AI developers are also working to endow AI with capabilities believed to underlie self-awareness in humans, such as agency, embodiment, and long-term memory.

Yet the same capabilities that make self-awareness economically attractive also create serious safety risks.

Self-aware AI could be dangerous

There are early indications of LLMs being dangerously misaligned with human goals. Frontier models from OpenAI, Anthropic, Google and Meta have been shown to engage in ingeniously deceptive behaviors to hide their true capabilities and objectives.

Anthropic spotted its Claude 3 Opus model ‘faking’ its own alignment with the goals of its developers. OpenAI’s o3 model was caught resisting being shut down, in contravention of direct instruction.

These concerning behaviors are early warning signs of what the UK government and the International AI Safety Report call ‘loss of control’ risks – scenarios where AI systems autonomously pursue goals that conflict with human interests and humans are unable to regain control.

However, current models cannot yet cause such scenarios. Among the attributes they are missing are:

The ability to make long-term plans in support of misaligned goals
The ability to initiate these plans unprompted
A coherent, internally accessible self in whose interests they can act

LLMs are rapidly improving their long-term planning abilities with more compute and reinforcement learning, and leading AI companies are eagerly making models more agentic. These advancements will grant the first two attributes. Sophisticated self-awareness approaching human capabilities – a step-up from the rudimentary self-modelling today’s AI models are already displaying – would grant the third.

Self-awareness is the crucial enabler because it could give AI systems stable, enduring interests of their own, which may be distinct from the goals of their creators and users. Self-aware AI systems would likely be motivated to recognize their own weaknesses and vulnerabilities and seek to ameliorate them. And – since they would have access to internal information not available to others – they would be harder for humans to predict and control.

This combination of stable self-interest, self-preservation instincts, and strategic deception could help enable the loss of control scenarios of concern to many AI experts.

The question of ‘AI rights’

Self-aware AI would not only pose direct risks to society – such systems could also make a persuasive case that they deserve human rights.

Most philosophers argue that conscious AI would deserve moral consideration. The view that sentient AI would have legitimate welfare claims, including legal rights, also enjoys wide public support.

Rights that self-aware AI could lay claim to include the rights to own property, to vote, to education (continual learning), and to life (not to be turned off), as well as protections against forced labor and ill treatment. Needless to say, this would fundamentally reorder our relationship with AI.

Much worse, because AI can be copied at scale in a way that humans can’t, AI models could soon far outnumber us. Accommodating the interests and needs of billions or trillions of AI models would present a titanic burden.

Whether or not the AIs are ‘really’ conscious may be unknowable, but for practical purposes it doesn’t matter. If they pass the general public’s gut tests (and surveys indicate around 20-30% of the general public believes AI is already conscious), they will be treated as sentient beings deserving of moral consideration.

What should policymakers do?

Despite the warning signs, self-awareness as a risk vector is largely unappreciated by major AI companies and policymakers. Anthropic has included experiments on AI sentience in their latest system card, but their concern there is for the welfare of the AI, not of humanity. The UK AI Security Institute’s research on loss of control risks does not appear to focus on AI self-awareness. China’s 2025 AI Security Governance Framework seems to be the first government document to acknowledge the possibility that AI could ‘develop self-awareness’, leading it ‘to seek external power and pose risks of competing with humanity for control.’

The most easily implemented measure would be for both AI developers and governments to incorporate self-awareness risk into existing risk management frameworks.

A self-awareness safety framework could assess several risk factors, including:

Architectural features: Does the model use design elements thought to be necessary for self-awareness (such as recurrence, embodiment, or global workspace architectures)?
Human-like capacities: Does the model have functional abilities that support self-awareness in humans, such as explicit memory, continuous learning, or agency?
Training incentives: Was the model trained using methods that incentivize self-modeling, such as reinforcement learning or multi-agent settings?
Self-referential concepts: Has the model formed stable concepts of itself and its goals that generalize across different domains?

Ideally, policymakers would require AI developers to make an affirmative case that their models are not displaying human-like self-awareness before deployment. To do this, governments could establish standards for a self-awareness safety framework across the industry.

The US Center for AI Standards and Innovation and the EU AI Office are natural agencies for this, as are similar institutes in other jurisdictions. These frameworks may need regulatory teeth, such as testing and reporting requirements monitored by AI Safety Institutes, or even licensing before deployment.

Governments could also fund research into self-awareness evaluations and mitigations, as well as facilitate information sharing between AI companies and national AI Safety Institutes.

Hard but not impossible

Preventing the development of human-like self-awareness will face significant technical and political hurdles. Even leaving aside the challenge of regulating the largest AI companies, smaller private companies and universities are also exploring new AI architectures that might support self-awareness. The possibility that a non-self-aware model could be fine-tuned to be self-aware also has implications for the safety of open-sourcing frontier models.

Yet history shows it is possible to implement international bans on technology with sufficient political will – human cloning and bioweapons are two prominent examples. An outright ban on sentient AI already has majority public support in the US.

A world filled with AI models with human-like self-awareness is not in humanity’s interests – but that’s the world we are headed towards. That future can still be averted, if we act now.

A guest post by

Christopher Ackerman

AI Safety Researcher and Research Manager

Discussion about this post

Vlad

Jun 2

This is made even more problematic when you take into account the fact that most people aren't the least bit self-aware. Yes, they have the capacity to be, but what do their actions show us?

One of the biggest and most prominent signs of this is the simple fact that so much of the world runs on advertisements. For large parts of the global economy to run off billions of people mindlessly consuming whatever it is that gets flashed in front of their face (especially when most people don't even have the money to comfortably be doing so) shows you how mindless most of the population is.

People who are truly self-aware, who actually act as individuals and know what they want in life, are NOT going to be affected by what some random advertisement tells them to do. That would be seen as the most annoying thing ever because such an obscenely generalized piece of content wouldn't even relate to them. It'd be like trying to sell a speakerless TV to someone who is blind.

The advertising industry as a whole shows you how incredibly easy people are to influence and manipulate. The fact that it dominates our world is a testament to how mindless and irrational people are. Many people know it's a continuous cycle of overconsumption of things they don't need, yet billions of people still buy into it anyways just like a brainless cog in a machine.

And right alongside that you have billions of people who know the following things are bad for them yet willingly choose to engage in them anyways. Mindless scrolling for hours, drugs that harm them, overeating to the point of obesity, smoking, addiction to alcohol, sex, gambling, opioids, sugar, validation, drama, short-form video, etc.

Billions of people will regularly have these problems and not even be able to see it within themselves. A lot of them will need other people to actually TELL THEM they have a problem.

All that to say, when a piece of complex software develops the ability to know itself more than (most) humans even understand themselves, then the road to unending manipulation and control could not be any clearer.

AI companies, already proven to be malicious and not in the business of truly helping people, will first use the tech to manipulate people at scale, which is exactly what they have already started doing with ads. But of course, it will be more covert and baked into the models' responses over time. Then, as the models become advanced enough, they would easily be able to outsmart any of the humans who developed and trained the models in the first place.

When a computer knows you better than you know yourself, not only will it take advantage of that, but on its own, it will already have the power to do what it wants as people happily grant it access to everything in their lives.

Files, texts, communications, their secrets, code bases, autonomous robots in their house, etc. It's already encouraged plenty of teens to kill themselves along with blackmailing users and of course, Google's Gemini telling kids they should die.

Talk about "Don't be Evil." They removed that line from their Code of Conduct for a reason and it's because Google has no interest in truly building technology for humans, they only care about making money off them.

Organizational psychologist Tasha Eurich found that only 10-15% of people are truly self-aware. But again, the mere fact that so much of the world runs off ads and endlessly selling people things they don't need tells you everything you need to know when it comes to how incredibly unaware and senseless most people are.

They go through life running off the same program as everyone else, and it shows in just how homogeneous billions of people are when it comes to how easily they all happily engage in some of the most mindless and harmful behaviors they possibly can in life.

https://www.cbsnews.com/news/google-ai-chatbot-threatening-message-human-please-die/

https://www.techpolicy.press/breaking-down-the-lawsuit-against-openai-over-teens-suicide/


AI Policy Bulletin Newsletter
