Most enterprise AI agent pilots in financial services fail at the same point: the second tool call.
The first call usually works. It’s retrieval, or structured input, or a query that returns predictable output. Everyone is pleased. The demo has been successful. Then the agent makes its second call — maybe to fetch context from a different system, maybe to look up a record, maybe to trigger an action — and you hit the real infrastructure. Authentication that assumes a human is present. An error message written for a browser console, not a machine caller. An API that returns HTTP 200 with the actual error buried in the payload under a key called “result.”
This is the infrastructure problem that agent frameworks don’t solve, and it’s the one that actually determines whether pilots become deployments.
LangChain, AutoGen, and the rest of the orchestration layer ecosystem are doing what they advertise: they orchestrate. They manage tool calls, handle context windows, provide retry logic. What they can’t do is fix an internal bank API whose authentication flow expires in sixty seconds and requires an interactive browser step to renew — which describes a meaningful fraction of bank internal APIs built over the last decade. The framework wraps the tool call. The tool call still fails.
The firms making real progress on agent deployment in financial services have done three things that turn out to be necessary pre-conditions, not implementation details.
The first is mapping their APIs for agent-friendliness before writing a single orchestration line. Which endpoints work without interactive authentication? Which ones have rate limits that an agent loop will immediately breach? Which error responses are machine-interpretable versus designed for human reading? This survey is not exciting work. It produces a spreadsheet, not a demo. But it’s the most accurate predictor of whether a deployment will succeed, because it reveals whether the infrastructure can support the use case before you’ve invested in building it.
The second is building narrow agents rather than general ones. The appeal of general-purpose agents is obvious — a single agent that can handle the full range of tasks sounds more powerful than ten agents that each handle one task. In practice, a focused agent with three to five well-defined tools and a specific task scope is deployable in weeks. The general-purpose one is still in pilot twelve months later, because every new tool it needs to call reveals another infrastructure problem and every new task scope exposes another edge case. Narrowness is a deployability strategy, not a capability limitation.
The third is keeping humans only at genuinely irreversible decision points. The question isn’t “should humans be in the loop?” — they should, in most financial services contexts. The question is where. Agents execute deterministic, auditable steps; humans approve what can’t be undone. This sounds obvious but gets operationally muddled in practice. Teams end up designing agents that pause for human approval on every step because the governance framework treats all decisions equivalently. The result is an agent that does the analysis work but requires human confirmation for each action, which is useful but much less valuable than it could be. Calibrating which decisions genuinely require human approval — and which can be automated with appropriate audit trails — is where most of the value gets unlocked.
The underlying principle is accurate: agents automate the deterministic parts of financial services operations. But the conditional matters. The deterministic parts can only be automated if your infrastructure is clean enough to let them. API reliability, error interpretability, authentication patterns designed for machine callers — these aren’t enhancements to an agent deployment. They’re the pre-conditions for it.
The optimistic version of this story is that the current generation of AI agent deployments is revealing infrastructure debt that was always there. Every failed second tool call is a specific, fixable problem. The organisations that treat agent deployment as an infrastructure upgrade programme — rather than a model deployment programme — will have dramatically more capable deployments in two years. The ones treating it as a model selection problem will keep running pilots.
P.S. The clearest indicator that a bank is serious about agent deployment: check whether they’ve staffed the API modernisation work alongside the AI engineering work. If the team building the agents is different from the team responsible for the infrastructure the agents call, with no shared planning process, the deployment will stall. The organisational dependency is as real as the technical one.