Here’s something that surprised me: the biggest gap in enterprise AI isn’t reasoning, tool use, or even hallucination. It’s that every conversation starts from zero.
Your best colleague doesn’t do this. They remember that the compliance team rejected the last approach. They know the data pipeline breaks on public holidays. They’ve learned that stakeholder X cares about latency, not accuracy. This accumulated knowledge — the texture of actually working somewhere — is what makes someone effective after their first month versus their first year.
AI agents today are perpetual day-one employees.
Two extremes, no middle
The consumer AI world has started experimenting with memory. ChatGPT lets users toggle facts on and off: “User prefers Python.” “User lives in Hong Kong.” It’s a flat list of assertions with no structure, no provenance, no way to understand why something was remembered or whether it’s still true. Better than nothing. But imagine handing a new team member a sticky note that says “likes Python” and expecting them to be effective.
At the other extreme, power-user tools like Claude Code offer structured memory systems — typed categories (corrections, project context, user preferences), markdown files, cross-references. It works remarkably well if you’re the kind of person who enjoys maintaining configuration files. Which is to say: it works for maybe a few thousand people on Earth.
The gap between these two isn’t incremental. It’s architectural.
What enterprise memory actually needs
When I think about deploying AI agents in a regulated environment — banking, insurance, healthcare — the memory question immediately becomes a governance question.
An agent that learns “this customer always disputes fees” is useful. It’s also a data privacy liability if stored wrong, a model risk concern if it biases future decisions, and an audit finding if no one can explain where the knowledge came from. The useful part is easy. The governable part is where most organisations will stall.
The architecture that works looks something like this: agents accumulate knowledge during workflows, but that knowledge flows into a structured store with types and provenance. Think of it less like ChatGPT’s memory toggle and more like a notebook — dated entries, categorised by type, with a clear trail of when something was learned and from what interaction.
Then — and this is the part no one has built well yet — a review layer. Not just “the AI remembered something.” A dashboard where operations teams, compliance, or the agent’s human counterpart can see what was learned, approve or reject it, edit it, and understand the downstream effect if it’s removed.
The taxonomy matters
Not all memories are equal, and collapsing them into a flat list is how you get an agent that treats “user prefers dark mode” and “regulatory deadline moved to March 15” as the same kind of thing.
A useful taxonomy might separate: things learned about the user (role, expertise, communication style), corrections received (the agent got something wrong and was told how to fix it), project context (deadlines, decisions, stakeholder dynamics), and reference pointers (where to find things in external systems). Each type has different retention policies, different review urgency, different access controls.
Corrections are perhaps the most valuable category and the most overlooked. Every time a human says “no, not like that — do it this way,” the agent has received a training signal. In a consumer context, that signal might improve one person’s experience. In an enterprise context, it could improve every interaction across the organisation — if it’s captured, reviewed, and promoted to shared knowledge rather than trapped in one user’s session.
The notebook metaphor
The right mental model isn’t “memory” at all. Memory implies something automatic and internal — you don’t review your own memories, they just happen. What agents need is closer to a notebook: deliberate, structured, reviewable, and shared.
A good notebook has pages you can flip back to. It has sections. Someone else can pick it up and make sense of it. You can tear out pages that are wrong. You can promote a scribbled insight into a proper document when it proves valuable.
This is mundane infrastructure. It’s not a research problem. But mundane infrastructure is usually what determines whether a promising technology actually works in production. The difference between “AI agents are transformative” and “AI agents are a toy” will probably come down to whether someone builds the notebook layer well.
Most organisations deploying AI agents today are focused on making the agent smarter. They should be focused on making it remember — and making that memory something a human can trust.