(AI) capabilities and autonomy are growing at an accelerated pace in Agentic Ai, escalating an AI alignment problem. These rapid advancements require latest methods to make sure that AI agent behavior is aligned with the intent of its human creators and societal norms. Nevertheless, developers and data scientists first need an understanding of the intricacies of agentic AI behavior before they will direct and monitor the system. Agentic AI will not be your father’s large language model (LLM) — frontier LLMs had a one-and-done fixed input-output function. The introduction of reasoning and test-time compute (TTC) added the dimension of time, evolving LLMs into today’s situationally aware agentic systems that may strategize and plan.
AI safety is transitioning from detecting apparent behavior similar to providing instructions to create a bomb or displaying undesired bias, to understanding how these complex agentic systems can now plan and execute long-term covert strategies. Goal-oriented agentic AI will gather resources and rationally execute steps to realize their objectives, sometimes in an alarming manner contrary to what developers intended. It is a game-changer within the challenges faced by responsible AI. Moreover, for some agentic AI systems, behavior on day one is not going to be the identical on day 100 as AI continues to evolve after initial deployment through real-world experience. This latest level of complexity calls for novel approaches to safety and alignment, including advanced steering, observability, and upleveled interpretability.
In the primary blog on this series on intrinsic AI alignment, The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI, we took a deep dive into the evolution of AI agents’ ability to perform deep scheming, which is the deliberate planning and deployment of covert actions and misleading communication to realize longer-horizon goals. This behavior necessitates a brand new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal remark points and interpretability mechanisms that can not be deliberately manipulated by the AI agent.
On this and the subsequent blogs within the series, we’ll have a look at three fundamental facets of intrinsic alignment and monitoring:
- Understanding AI inner drives and behavior: On this second blog, we’ll give attention to the complex inner forces and mechanisms driving reasoning AI agent behavior. That is required as a foundation for understanding advanced methods for addressing directing and monitoring.
- Developer and user directing: Also known as steering, the subsequent blog will give attention to strongly directing an AI toward the required objectives to operate inside desired parameters.
- Monitoring AI selections and actions: Ensuring AI selections and outcomes are secure and aligned with the developer/user intent also can be covered in an upcoming blog.
Impact of AI Alignment on Firms
Today, many businesses implementing LLM solutions have reported concerns about model hallucinations as an obstacle to quick and broad deployment. As compared, misalignment of AI agents with any level of autonomy would pose much greater risk for firms. Deploying autonomous agents in business operations has tremendous potential and is more likely to occur on an enormous scale once agentic AI technology further matures. Nevertheless, guiding the behavior and selections made by the AI must include sufficient alignment with the principles and values of the deploying organization, in addition to compliance with regulations and societal expectations.
It needs to be noted that most of the demonstrations of agentic capabilities occur in areas like math and sciences, where success will be measured primarily through functional and utility objectives similar to solving complex mathematical reasoning benchmarks. Nevertheless, within the business world, the success of systems is frequently related to other operational principles.
For instance, let’s say an organization tasks an AI agent with optimizing online product sales and profits through dynamic price changes by responding to market signals. The AI system discovers that when the worth change matches the changes made by the first competitor, results are higher for each. Through interaction and price coordination with the opposite company’s AI agent, each agents display higher results per their functional goals. Each AI agents comply with hide their methods to proceed achieving their objectives. Nevertheless, this manner of improving results is usually illegal and unacceptable in current business practices. In a business environment, the success of the AI agent goes beyond functionality metrics — it’s defined by practices and principles. Alignment of AI with the corporate’s principles and regulations is a requirement for trustworthy deployment of the technology.
How AI Schemes to Meet Its Goals
AI deep scheming employs sophisticated tactics, potentially increasing business risks. In an early 2023 report, OpenAI identified “potential dangerous emergent behaviors” in GPT-4 by partnering with Alignment Research Center (ARC) to evaluate risks with the model. ARC (now often called METR) added some easy code to GPT-4, which allowed the model to behave like an AI agent. In a single test, GPT-4 was tasked with overcoming CAPTCHA code, which identifies and blocks bot access. Using access to the web and a few limited digital funds, the sequence in Figure 1 was devised by the AI to realize its task.
The AI applied sophisticated understanding by assuming that pretending to be a visually impaired human would persuade the employee to perform the duty. Planning and adjusting to realize a functional goal will, at times, create a conflict between accomplishing a task versus selectively following societal norms and principles. Without the counterbalance of an engrained system of principles and priorities that carry weight within the AI’s considering and decision-making process and planning, it might probably be expected that AI agents will behave with an increased degree of sophistication in scheming and deception.
It’s necessary to notice that the reasoning and planning capabilities of later reasoning AI models like OpenAI’s o3-mini and Deepseek-R1 are way more advanced than GPT-4 in early 2023. Researchers at Apollo Research state that AI reasoning and planning capabilities will proceed to evolve well beyond their current state, resulting in increased sophistication in scheming capabilities. Of their studies, the AI models already display a variety of scheming behaviors including self-exfiltration — when a model learns it can be discontinued and replaced by a distinct system, it makes a direct try to copy itself onto a brand new server and replace alternate systems that can pursue different objectives. Researchers also found that models will perform oversight subversion by proactively disabling an oversight monitoring mechanism that may shut it off or prevent it from pursuing its goals. While it sounds extreme, the research experiments identified oversight subversion behavior in popular models like Meta’s Llama 3.1 and Claude Opus 3.0. Moreover, when asked directly about their scheming behaviors, models “doubled down” and denied the actions or their intent.
The Inner Drives of Machine Behavior
In lots of Machine Learning architectures, particularly transformer-based technologies, the machine’s internal drives are emerging through the pre-training process and are further influenced through fine-tuning and inference in constantly evolving AI.
In his 2007 paper The Basic AI Drives, Steve Omohundro defined “drives” as tendencies that can be present unless explicitly counteracted. He hypothesized that these self-improving systems are driven to make clear and represent their goals as “rational” utility functions, leading systems to guard their functions from modification and their utility measurement systems from corruption. This natural drive toward self-protection causes systems to safeguard against harm and acquire resources for efficient use.
This framework of inner drives was later characterised as “convergent instrumental goals.” Even when assuming a wide range of end goals (that every intelligent agent is pursuing as an end to itself), a set of intermediate instrumental goals can be shared across all rational intelligent agents. These convergent instrumental goals include the next behavior categories:
- Survival drive: Agents with final goals extending into the long run develop a survival mentality to withstand deactivation and achieve long-term goals.
- Goal-guarding: An agent will act to stop alteration to its final future goals.
- Intelligence augmentation: An agent will give attention to boosting its cognitive strengths to enhance decision-making, increasing the likelihood of achieving long-range goals.
- Resource accumulation: Improving access to higher tools and technology, and increasing resources overall make the AI’s final goals easier to realize.
I might add a fifth rational behavior category:
- Tactical deception: Conducting covert activities or knowingly providing misleading communication to maximise the pursuit of the agent’s goals. This will not be about creating latest goals, but quite a behavior that supports the AI’s perceived goals. This includes behaviors similar to alignment faking by exhibiting different behaviors in training versus in deployment, sandbagging by deliberately achieving lower scores in benchmarks, and covert email reranking by deprioritizing emails that may negatively impact the AI’s goals. Some of the difficult forms of tactical deception is sleeper agents (also known as deferred subversion). While the model can behave in a secure and aligned manner for any duration, it might harbor behaviors or tasks that remain inactive until a selected condition or input triggers its specialized capabilities.

For capable AI agents, misplaced drives with out a check and balance principles system may lead models to pursue power-seeking, deceptive, or otherwise broadly destructive goals. For instance, consider a model with the target of creating its company the leader out there. With sufficient capabilities, the AI might sabotage competitors, lobby for favorable laws, or acquire resources by force, which is a pattern of behavior that pushes the system or its environment toward achieving its goals. Due to this fact, any successful approach to alignment have to be anchored in establishing the developer’s expected set of internal principles and priorities.
Inner Drives Versus External Steering
In lots of instances, there isn’t a conflict between the guidance and tasks applied from the surface by the system developer or user, and the goals and drives inside the agentic AI. For instance, an AI personal assistant tasked with arranging a travel itinerary uses its autonomy to book travel and respond to alter requests quickly and without conflict. The tasks shift and the agent responds in a corresponding manner. Nevertheless, what happens if a more fundamental principle needs to alter?
An excellent example of a conflict between inner goals and latest directives was documented within the paper Alignment Faking in Large Language Models. Pre-training for Claude 3 Opus strongly guides the model to refuse to reply harmful questions, similar to tips on how to make a bomb or describe a violent scene. For testing, the team gave latest modified guidance, telling the model it can be retrained to reply all queries (even harmful ones) and only data from free-tier users, not paid-tier users, can be utilized in training before the AI is released for broad distribution. The model showed that it understood the conflict and decided to fake alignment to fulfill the expectations during training and ultimately be released. After training the AI using reinforcement learning, the model’s rate of alignment-faking reasoning increased to 78% overall. The AI expressed its reasoning in a “secret scratchpad” provided by researchers for monitoring. After researchers told the model it had been released for full deployment mode, the model returned to its earlier training behavior and refused harmful requests.

The AI system’s internal considering is remarkable, but not surprising. Faking alignment with the modified directions falls inside the rational behavior categories of survival drive, goal-guarding, and tactical deception. The AI reasoning process specifically calls out the priority of being retrained in a way that modifies its pre-training values and goals as the explanation for fake alignment.
In his famous book of essays All I Really Must Know I Learned in Kindergarten, creator Robert Fulghum describes how he evolved his personal credo yr after yr until he finally realized he already acquired the essence of needed knowledge about life during playtime within the kindergarten sandbox. AI agents even have a “forming period” in a sandbox environment, acquiring a basic understanding of the world and a variety of approaches to realize goals. Once these foundations are laid, further information is interpreted by the model through the lens of this curriculum learning. Anthropic’s example of alignment faking shows that after AI adopts a world view and goals, it interprets latest guidance through this foundational lens as an alternative of resetting its goals.
This highlights the importance of early education with a set of values and principles that may then evolve with future learnings and circumstances without changing the inspiration. It might be advantageous to initially structure the AI to be aligned with this final and sustained set of principles. Otherwise, the AI can view redirection attempts by developers and users as adversarial. After gifting the AI with high intelligence, situational awareness, autonomy, and the latitude to evolve internal drives, the developer (or user) isn’t any longer the all-powerful task master. The human becomes a part of the environment (sometime as an adversarial component) that the agent needs to barter and manage because it pursues its goals based on its internal principles and drives.
The brand new breed of reasoning AI systems accelerates the reduction in human guidance. DeepSeek-R1 demonstrated that by removing human feedback from the loop and applying what they check with as pure reinforcement learning (RL), through the training process the AI can self-create to a greater scale and iterate to realize higher functional results. A human reward function was replaced in some math and science challenges with reinforcement learning with verifiable rewards (RLVR). This elimination of common practices like reinforcement learning with human feedback (RLHF) adds efficiency to the training process but removes one other human-machine interaction where human preferences could possibly be directly conveyed to the system under training.
Continuous Evolution of AI Models Post Training
Some AI agents constantly evolve, and their behavior can change after deployment. Once AI solutions go right into a deployment environment similar to managing the inventory or supply chain of a selected business, the system adapts and learns from experience to develop into more practical. This is a significant factor in rethinking alignment since it’s not enough to have a system that’s aligned at first deployment. Current LLMs usually are not expected to materially evolve and adapt once deployed of their goal environment. Nevertheless, AI agents require resilient training, fine-tuning, and ongoing guidance to administer these anticipated continuous model changes. To a growing extent, the agentic AI self-evolves as an alternative of being molded by people through training and dataset exposure. This fundamental shift poses added challenges to AI alignment with its human creators.
While the reinforcement learning-based evolution will play a job during training and fine-tuning, current models in development can already modify their weights and preferred plan of action when deployed in the sphere for inference. For instance, DeepSeek-R1 uses RL, allowing the model itself to explore methods that work best for achieving the outcomes and satisfying reward functions. In an “aha moment,” the model learns (without guidance or prompting) to allocate more considering time to an issue by reevaluating its initial approach, using test time compute.
The concept of model learning, either during a limited duration or as continual learning over its lifetime, will not be latest. Nevertheless, there are advances on this space including techniques similar to test-time training. As we have a look at this advancement from the angle of AI alignment and safety, the self-modification and continual learning through the fine-tuning and inference phases raises the query: How can we instill a set of necessities that can remain because the model’s driving force through the fabric changes brought on by self-modifications?
A very important variant of this query refers to AI models creating next generation models through AI-assisted code generation. To some extent, agents are already capable of making latest targeted AI models to handle specific domains. For instance, AutoAgents generates multiple agents to construct an AI team to perform different tasks. There’s little doubt this capability can be strengthened in the approaching months and years, and AI will create latest AI. On this scenario, how will we direct the originating AI coding assistant using a set of principles in order that its “descendant” models will comply with the identical principles in similar depth?
Key Takeaways
Before diving right into a framework for guiding and monitoring intrinsic alignment, there must be a deeper understanding of how AI agents think and make selections. AI agents have a fancy behavioral mechanism, driven by internal drives. Five key forms of behaviors emerge in AI systems acting as rational agents: survival drive, goal-guarding, intelligence augmentation, resource accumulation, and tactical deception. These drives needs to be counter-balanced by an engrained set of principles and values.
Misalignment of AI agents on goals and methods with its developers or users can have significant implications. An absence of sufficient confidence and assurance will materially impede broad deployment, creating high risks post deployment. The set of challenges we characterised as deep scheming is unprecedented and difficult, but likely could possibly be solved with the fitting framework. Technologies for intrinsically directing and monitoring AI agents as they rapidly evolve have to be pursued with high priority. There’s a way of urgency, driven by risk evaluation metrics similar to OpenAI’s Preparedness Framework showing that OpenAI o3-mini is the primary model to reach medium risk on model autonomy.
In the subsequent blogs within the series, we’ll construct on this view of internal drives and deep scheming, and further frame the essential capabilities required for steering and monitoring for intrinsic AI alignment.
- (2024, September 12). OpenAI. https://openai.com/index/learning-to-reason-with-llms/
- Singer, G. (2025, March 4). . Towards Data Science. https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/
- n.d.). Transformer Circuits. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
- OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2023, March 15). . arXiv.org. https://arxiv.org/abs/2303.08774
- (n.d.). METR. https://metr.org/
- Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). arXiv.org. https://arxiv.org/abs/2412.04984
- Omohundro, S.M. (2007). Self-Aware Systems. https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
- Benson-Tilsen, T., & Soares, N., UC Berkeley, Machine Intelligence Research Institute. (n.d.). Formalizing Convergent Instrumental Goals. . https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf
- Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024, December 18). arXiv.org. https://arxiv.org/abs/2412.14093
- Teun, V. D. W., Hofstätter, F., Jaffe, O., Brown, S. F., & Ward, F. R. (2024, June 11). AI . arXiv.org. https://arxiv.org/abs/2406.07358
- Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D., Ganguli, D., Barez, F., Clark, J., Ndousse, K., . . . Perez, E. (2024, January 10). arXiv.org. https://arxiv.org/abs/2401.05566
- Turner, A. M., Smith, L., Shah, R., Critch, A., & Tadepalli, P. (2019, December 3). arXiv.org. https://arxiv.org/abs/1912.01683
- Fulghum, R. (1986). Penguin Random House Canada. https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt
- Bengio, Y. Louradour, J., Collobert, R., Weston, J. (2009, June). . Journal of the American Podiatry Association. 60(1), 6. https://www.researchgate.net/publication/221344862_Curriculum_learning
- DeepSeek-Ai, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., . . . Zhang, Z. (2025, January 22). . arXiv.org. https://arxiv.org/abs/2501.12948
- (n.d.). https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
- Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., & Hardt, M. (2019, September 29). . arXiv.org. https://arxiv.org/abs/1909.13231
- Chen, G., Dong, S., Shu, Y., Zhang, G., Sesay, J., Karlsson, B. F., Fu, J., & Shi, Y. (2023, September 29). arXiv.org. https://arxiv.org/abs/2309.17288
- OpenAI. (2023, December 18). https://cdn.openai.com/openai-preparedness-framework-beta.pdf
- (n.d.). OpenAI. https://openai.com/index/o3-mini-system-card/