Why Agentic AI Mesh Architectures Struggle in Production

AI Agents · AI Architecture · Design Principles · Advanced

A systems-level critique of enterprise agent mesh designs, explaining why more agents, more delegation, and more LLM-mediated decisions do not automatically create better outcomes.

A Useful Architecture, but a Dangerous Assumption

McKinsey's Agentic AI Mesh article is professional, well structured, and valuable as a map of the problems enterprises face when agent experiments spread across teams. It correctly names fragmentation, inconsistent standards, cost growth, governance gaps, and the need for observability. The problem is not that the article is careless. The problem is that the reference architecture can still encourage a belief that more agents, more delegation, and more LLM-mediated decisions will naturally improve enterprise outcomes.

That belief is where many production implementations will fail. Intelligence does not compose linearly. A team of weakly constrained agents is not equivalent to a team of experts. A larger model at every step is not automatically safer than a smaller model with a narrow deterministic role. A mesh can reduce vendor lock-in and improve reuse, but it can also create a distributed stochastic system where every handoff adds ambiguity, latency, cost, and accountability gaps.

Multi-Agent Does Not Mean Multi-Smart

The most seductive idea in agent architectures is specialization. One agent plans, another retrieves, another validates, another executes tools, another checks compliance. On a diagram, this looks like organizational design. In production, it often behaves like a chain of probabilistic translators. Each agent receives partial context, interprets intent, transforms the task, and passes a new representation forward. Small errors compound quietly.
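A rough way to see the compounding is the arithmetic below. It is a sketch under a strong simplifying assumption: each handoff preserves intent independently with the same probability. Real errors are correlated, and the 0.95 figure is illustrative, not a measurement.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every handoff in a chain preserves intent.

    Assumes independent handoffs with identical reliability, which is a
    simplification chosen only to illustrate compounding.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Illustrative per-handoff reliability of 0.95 (hypothetical figure).
for steps in (1, 3, 5, 10):
    print(steps, round(chain_reliability(0.95, steps), 3))
# 1 0.95
# 3 0.857
# 5 0.774
# 10 0.599
```

Under these assumptions, a ten-step chain of individually reliable agents completes correctly only about six times in ten.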

Multi-agent systems can be useful when the boundaries are explicit and the handoffs are typed. A pricing agent that can only call approved pricing APIs is different from a general reasoning agent that decides how to interpret commercial policy. A compliance validator with deterministic policy checks is different from a compliance agent that reads another agent's summary and produces a confidence statement. The first design constrains behavior. The second design multiplies plausible language.
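One way to picture a typed handoff is a small validated record that the receiving agent accepts instead of free text. The sketch below is hypothetical: the names PriceQuoteRequest, Currency, and parse_handoff, and the field set, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"


@dataclass(frozen=True)
class PriceQuoteRequest:
    """The only shape a (hypothetical) pricing agent accepts."""
    sku: str
    quantity: int
    currency: Currency

    def __post_init__(self) -> None:
        if not self.sku.strip():
            raise ValueError("sku must be non-empty")
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")


def parse_handoff(payload: dict) -> PriceQuoteRequest:
    """Reject anything that is not exactly the agreed contract."""
    expected = {"sku", "quantity", "currency"}
    if set(payload) != expected:
        raise ValueError(f"fields must be exactly {sorted(expected)}")
    sku, quantity = payload["sku"], payload["quantity"]
    if not isinstance(sku, str):
        raise ValueError("sku must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        raise ValueError("quantity must be an integer")
    # Currency(...) raises ValueError for codes outside the enum.
    return PriceQuoteRequest(sku, quantity, Currency(payload["currency"]))
```

A payload carrying an extra "notes" field or an unlisted currency fails at the boundary, so the ambiguity surfaces at the handoff rather than three agents later.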

Enterprise leaders should therefore ask a hard question before adding an agent: what uncertainty does this agent remove? If the answer is only "it adds another perspective," the architecture may be increasing surface area without increasing reliability.

LLM-Mediated Orchestration Is Still an Agent

Many agent mesh designs describe an orchestrator or planner that decomposes tasks, chooses agents, routes work, validates outputs, and decides what to do next. If those decisions are primarily made by an LLM, the orchestrator is not a stable control plane. It is another agent with more authority. That distinction matters.

A production orchestrator should behave like infrastructure: predictable, observable, testable, and bounded. It should know which workflow is allowed, which tools are callable, which retries are permitted, which output schema is required, and when a human must approve. LLMs can help classify ambiguous intent or draft intermediate plans, but the final planning state should be represented in a deterministic structure that can be tested before execution.
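As a minimal sketch of that separation, the control decisions can live in a plain lookup structure that unit tests can exercise without calling a model. The workflow name, tool names, and the next_action function below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpec:
    tool: str
    max_retries: int
    requires_approval: bool


# Hypothetical workflow registry: only listed workflows may run.
WORKFLOWS = {
    "refund_request": (
        StepSpec("lookup_order", max_retries=2, requires_approval=False),
        StepSpec("issue_refund", max_retries=0, requires_approval=True),
    ),
}


def next_action(workflow: str, step_index: int,
                failed_attempts: int, approved: bool) -> str:
    """Pure function of state: same inputs always give the same decision."""
    steps = WORKFLOWS.get(workflow)
    if steps is None:
        return "reject:unknown_workflow"
    if step_index >= len(steps):
        return "done"
    step = steps[step_index]
    if failed_attempts > step.max_retries:
        return "halt:retries_exhausted"
    if step.requires_approval and not approved:
        return "wait:human_approval"
    return f"call:{step.tool}"
```

Because next_action has no model call inside it, retry limits and approval gates can be covered by ordinary tests before any agent runs.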

Without this separation, systems experience unexpected retries, inconsistent tool behavior, looping agent conversations, unbounded token spend, and hard-to-reproduce failures. The problem is not that the model is "bad." The problem is that the architecture asks probabilistic reasoning to carry responsibility that belongs in workflow control.

Tool Calling Is the Largest Failure Surface

Enterprise diagrams often assume tools are callable, available, authorized, and return clear results. Real systems do not behave that way. APIs time out. Permissions change. Data contracts drift. A downstream system returns a partial result. A tool call succeeds technically but lands at the wrong business moment, for example after a cutoff. A model sees a tool description and selects it for the wrong reason. Tool calling is where agentic AI stops being a demo and starts touching operational reality.

The mitigation is not another agent watching the first agent. The mitigation is an enforcement layer. Tools need typed contracts, idempotency rules, rate limits, policy checks, compensating actions, and audit records. Tool calls should be mediated by gateways that can reject unsafe requests before they reach the system of record. For high-impact actions, the agent should propose an action package; a deterministic workflow should decide whether it can execute.
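The enforcement layer can be sketched as a thin in-process gateway. This is an illustration only: ToolGateway, the refund policy, and the in-memory stores are hypothetical, and a real deployment would more likely sit behind an API gateway and a policy engine with durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToolGateway:
    tools: Dict[str, Callable[[dict], dict]]
    policy: Callable[[str, dict], bool]
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)
    _results: Dict[str, dict] = field(default_factory=dict)

    def call(self, tool: str, args: dict, idempotency_key: str) -> dict:
        # Reject before anything reaches the system of record.
        if tool not in self.tools:
            self._reject(idempotency_key, tool, "unknown_tool")
        if not self.policy(tool, args):
            self._reject(idempotency_key, tool, "policy_denied")
        # A repeated key returns the stored result instead of re-executing.
        if idempotency_key in self._results:
            self.audit_log.append((idempotency_key, tool, "replayed"))
            return self._results[idempotency_key]
        result = self.tools[tool](args)
        self._results[idempotency_key] = result
        self.audit_log.append((idempotency_key, tool, "executed"))
        return result

    def _reject(self, key: str, tool: str, reason: str) -> None:
        self.audit_log.append((key, tool, f"rejected:{reason}"))
        raise PermissionError(f"{tool}: {reason}")


# Hypothetical policy: refunds up to 100 may execute without escalation.
def refund_policy(tool: str, args: dict) -> bool:
    return tool == "issue_refund" and 0 < args.get("amount", 0) <= 100


gateway = ToolGateway(
    tools={"issue_refund": lambda a: {"refunded": a["amount"]}},
    policy=refund_policy,
)
gateway.call("issue_refund", {"amount": 40}, idempotency_key="req-1")
gateway.call("issue_refund", {"amount": 40}, idempotency_key="req-1")  # replay
```

The agent never holds the tool directly. A retry storm with the same key produces one execution, and every rejection leaves an audit record.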

OpenTelemetry, API gateways, service meshes, and policy engines such as Open Policy Agent are not optional extras here. They are the difference between an agent platform and a stochastic integration bus.

The Economics Break Before the Vision Does

Agent mesh architectures can work beautifully in demos because demos compress scale. There are few users, few edge cases, low concurrency, friendly prompts, and limited tool diversity. At enterprise scale, every additional agent adds model calls, context transfer, validation work, logging, retries, and evaluation overhead. If the system uses large models for planning, routing, validation, and generation, cost grows before value is proven.

This is why "LLM everywhere" is economically fragile. Many steps in an agent workflow should be handled by smaller language models, classifiers, rules, or procedural services. Routing can often be a policy table. Validation can often be schema and contract checking. Memory retrieval can often be filtered by deterministic metadata before semantic search. The expensive model should be reserved for the steps where language reasoning genuinely matters.
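Routing as a policy table can be as small as a dictionary lookup. The document types, ambiguity levels, and handler names below are hypothetical placeholders; the point is only that no model call is needed to make the routing decision.

```python
# Hypothetical routing table: (document type, ambiguity level) -> handler.
ROUTES = {
    ("invoice", "low"): "rules_engine",
    ("invoice", "high"): "small_model",
    ("contract", "low"): "small_model",
    ("contract", "high"): "large_model",
}

# Anything the table does not cover goes to a person, not to a guess.
DEFAULT_ROUTE = "human_queue"


def route(doc_type: str, ambiguity: str) -> str:
    return ROUTES.get((doc_type, ambiguity), DEFAULT_ROUTE)


print(route("invoice", "low"))    # rules_engine
print(route("contract", "high"))  # large_model
print(route("memo", "low"))       # human_queue
```

In this sketch the large model is reachable from exactly one cell, which makes its share of traffic, and therefore its cost, easy to reason about.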

A production business case should include cost per completed task, cost per failed task, retry rate, human review time, and infrastructure overhead. If the cost model only counts successful happy-path runs, the architecture is not ready for production planning.
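A small calculation shows how far the happy-path number can drift from the real one. The cost model below is a deliberately simple assumption (a flat cost per attempt plus reviewer time), and every figure in the example is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    completed: int                 # tasks that finished successfully
    failed: int                    # tasks abandoned after all attempts
    retries: int                   # extra attempts across all tasks
    cost_per_attempt: float        # model plus infrastructure, per attempt
    review_minutes: float          # total human review time
    reviewer_rate_per_hour: float


def cost_per_completed_task(s: RunStats) -> float:
    """Spread every cost, including failures and retries, over successes."""
    if s.completed <= 0:
        raise ValueError("no completed tasks to attribute cost to")
    attempts = s.completed + s.failed + s.retries
    compute = attempts * s.cost_per_attempt
    review = (s.review_minutes / 60.0) * s.reviewer_rate_per_hour
    return (compute + review) / s.completed


# Invented figures: 1,000 tasks, 10% fail, 200 retries, 5 hours of review.
stats = RunStats(completed=900, failed=100, retries=200,
                 cost_per_attempt=0.10, review_minutes=300,
                 reviewer_rate_per_hour=60.0)
print(round(cost_per_completed_task(stats), 2))  # 0.47
```

With these invented inputs, the loaded cost per completed task is roughly 0.47 against a happy-path figure of 0.10, and most of the gap comes from human review rather than model calls.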

Planning Should Not Be Probabilistic

The fundamental issue is simple: language generation can be probabilistic, and reasoning can include probabilistic elements, but planning in multi-agent systems should not be probabilistic at the point of execution. A plan should become a controlled artifact: versioned, inspectable, typed, authorized, and replayable. The system can use an LLM to propose the plan, but it should not execute free-form intent.
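One possible shape for such an artifact is sketched below: the model's proposal arrives as JSON, is checked against an allow-list of tools and argument names, and is then frozen with a content hash so the same plan can be inspected and replayed. The tool names, the schema, and freeze_plan are hypothetical.

```python
import hashlib
import json

# Hypothetical allow-list: tool name -> exact set of argument names.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id"},
    "draft_email": {"order_id", "template"},
}


def freeze_plan(proposed_json: str) -> dict:
    """Validate an LLM-proposed plan and turn it into a replayable artifact."""
    steps = json.loads(proposed_json)  # JSONDecodeError is a ValueError
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must be a non-empty list of steps")
    for step in steps:
        if not isinstance(step, dict) or set(step) != {"tool", "args"}:
            raise ValueError("each step needs exactly 'tool' and 'args'")
        tool, args = step["tool"], step["args"]
        if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
            raise ValueError(f"tool not allowed: {tool!r}")
        if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[tool]:
            raise ValueError(f"unexpected arguments for {tool}")
    # Canonical serialization so identical plans get identical ids.
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    plan_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"schema_version": 1, "plan_id": plan_id, "steps": steps}
```

Free-form intent never reaches execution in this sketch. Only a plan that parses, validates, and has a stable identifier can be handed to the orchestrator, and the identifier gives audits and replays something concrete to reference.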

The better pattern is a constrained agent architecture. Use LLMs to interpret, draft, summarize, and handle ambiguity. Use deterministic orchestration to execute. Use policy gates to authorize. Use typed tools to act. Use evaluation suites to test negative scenarios. Use human approval where reversibility is low or business impact is high.

The Agentic AI Mesh idea can still be useful if interpreted as integration governance rather than as a license for agent proliferation. The production question is not "how many agents can collaborate?" It is "which decisions must never be delegated to probabilistic components?" That question will separate durable enterprise systems from impressive demos.

Featured image by Albert Stoynov on Unsplash.