Let’s start with the term “agent” itself. Right now, it’s being slapped on everything from simple scripts to sophisticated AI workflows. There’s no shared definition, which leaves plenty of room for companies to market basic automation as something far more advanced. That kind of “agentwashing” doesn’t just confuse customers; it invites disappointment. We don’t necessarily need a rigid standard, but we do need clearer expectations about what these systems are supposed to do, how autonomously they operate, and how reliably they perform.
And reliability is the next big challenge. Most of today’s agents are powered by large language models (LLMs), which generate probabilistic responses. These systems are powerful, but they’re also unpredictable. They can make things up, veer off course, or fail in subtle ways, especially when they’re asked to complete multistep tasks that pull in external tools and chain LLM responses together. A recent example: users of Cursor, a popular AI programming assistant, were told by an automated support agent that they couldn’t use the software on more than one device. There were widespread complaints and reports of users canceling their subscriptions. But it turned out the policy didn’t exist. The AI had invented it.
In enterprise settings, this kind of mistake could cause immense damage. We need to stop treating LLMs as standalone products and start building complete systems around them: systems that account for uncertainty, monitor outputs, manage costs, and layer in guardrails for safety and accuracy. These measures can help ensure that the output adheres to the requirements expressed by the user, obeys the company’s policies regarding access to information, respects privacy concerns, and so on. Some companies, including AI21 (which I cofounded and which has received funding from Google), are already moving in that direction, wrapping language models in more deliberate, structured architectures. Our latest release, Maestro, is designed for enterprise reliability, combining LLMs with company data, public information, and other tools to ensure dependable outputs.
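To make the idea concrete, here is a minimal sketch in Python of what wrapping a model in a structured layer can look like. The function names (`call_llm`, `check_access_policy`, `is_grounded`) are entirely hypothetical placeholders, not AI21’s Maestro or any vendor’s actual API; the point is the shape of the system, where the model’s output is treated as a proposal to be checked rather than a final answer.

```python
# Illustrative sketch only: every function here is a hypothetical
# stand-in, not a real vendor API.

def call_llm(prompt: str) -> str:
    """Stand-in for a probabilistic LLM call; returns a canned answer here."""
    return f"Draft answer to: {prompt}"

def check_access_policy(user_id: str, prompt: str) -> bool:
    """Stand-in: does company policy allow this user to ask this question?"""
    return True

def is_grounded(answer: str, sources: list[str]) -> bool:
    """Stand-in: is the answer backed by at least one approved source?"""
    return bool(sources)

def answer_with_guardrails(user_id: str, prompt: str,
                           sources: list[str], max_retries: int = 2) -> str:
    # 1. Enforce company policy before the model ever sees the prompt.
    if not check_access_policy(user_id, prompt):
        return "Request refused: outside this user's access policy."

    # 2. Call the model, but verify its output instead of trusting it.
    for _ in range(max_retries + 1):
        answer = call_llm(prompt)
        if is_grounded(answer, sources):
            return answer  # validated output

    # 3. Fail safely rather than inventing a policy, as in the Cursor case.
    return "I couldn't produce a verified answer; escalating to a human."

print(answer_with_guardrails("u123", "Can I use my license on two devices?",
                             sources=["license_policy.md"]))
```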
Still, even the smartest agent won’t be useful in a vacuum. For the agent model to work, different agents need to cooperate (booking your travel, checking the weather, submitting your expense report) without constant human supervision. That’s where Google’s A2A protocol comes in. It’s meant to be a universal language that lets agents share what they can do and divide up tasks. In principle, it’s a great idea.
In practice, A2A still falls short. It defines how agents talk to one another, but not what they actually mean. If one agent says it can provide “wind conditions,” another has to guess whether that’s useful for evaluating weather on a flight route. Without a shared vocabulary or context, coordination becomes brittle. We’ve seen this problem before in distributed computing. Solving it at scale is far from trivial.
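A simplified illustration of that gap, using plain Python dictionaries rather than the real A2A message format: a weather agent advertises “wind conditions” as a free-text capability label, and a requesting agent looking for en-route weather has nothing better than string matching to decide whether that label is relevant.

```python
# Simplified illustration, not the actual A2A schema: capabilities are
# advertised as free-text labels with no shared semantics attached.

weather_agent = {
    "name": "weather-service",
    "capabilities": ["wind conditions", "precipitation forecast"],
}

flight_agent_needs = "en-route weather for a flight plan"

def find_provider(need: str, agents: list[dict]) -> dict | None:
    """Naive matching: the protocol specifies how to ask, but the labels
    carry no shared meaning, so matching degrades to string comparison."""
    for agent in agents:
        for capability in agent["capabilities"]:
            if capability in need:   # "wind conditions" never appears in
                return agent         # the requester's phrasing of its need
    return None

print(find_provider(flight_agent_needs, [weather_agent]))  # -> None
```

A human reads “wind conditions” and immediately sees its relevance to a flight route; an agent with no shared ontology does not, which is exactly the kind of semantic mismatch distributed systems have struggled with before.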