Blog

Operationalizing AI Agents: Lessons from 2025

Muhammad Fahreza Alghifari
19 Jan 2026

If 2024 was the year everyone dreamed about what LLMs could do, 2025 was the year people woke up to the cold, hard reality. LLMs are great for rapid prototyping and building very cool demos. But getting them to work reliably in production, as systems grow in size and complexity, is another thing entirely.

We saw that gap and in 2025, we set out to build our “Digital Workforce” - agents that could actually think and do real work, in production, reliably. As we spent months building our ideal AI agent, we also spent the year comparing notes with the industry. Speaking at events like ML Con and JAX or debating on podcasts exposed us to the same recurring friction points across every team. These are some of the lessons we learned in 2025.

1. Frameworks Shouldn’t Be “Magic”

Xaibo is actually our fifth agent framework. We tried visual programming, abstraction layers, component libraries, and async rebuilds. Each one hit the same wall: agents would start strong but fail silently after extended operation. Tool calls would sometimes stop working. Error logs and stack traces stayed hidden. Very frustrating.

We kept running into the same problem - and so did every team we compared notes with at Tokyo AI x Agents and ML Con San Diego. AI agent frameworks that promised “build an agent in 5 lines of code” collapsed the moment you needed anything production-grade. We call this the “tutorial cliff” - that moment when simplicity turns into an uncontrollable black box.

Each iteration taught us the same lesson: as we argued throughout the ML Con circuit - San Diego in May, Munich in June, and Berlin in November - if you can’t test it, you shouldn’t ship it.

That conviction shaped Xaibo’s entire architecture - dependency injection, radical transparency, protocol-based modularity - which we cover in detail in our 2025 Recap and the Xaibo Manifesto. But testability alone doesn’t solve the harder question: what happens when developers start trusting AI to write their code?

2. The 100x Developer Myth & Why Transparency Matters Everywhere

At the JAX Conference and Entwickler Summit in Berlin, we pushed forward a rather provocative argument about production AI systems: AI doesn’t make you 100x more productive - it makes good developers better and bad developers worse. (Watch the talk)

JAX Conference (May 8)

The industry was obsessed with “10x developers” becoming “100x developers” with AI coding assistants. But we’d seen the pattern:

Every line of code is a liability.

Producing 2,000 lines in a day makes you dangerous, not productive. The real value is the mental model - the understanding of why the code solves the problem.

AI has no mental model. It produces “dead code” that runs and passes tests but nobody understands.

Our solution: Omega Programming - the AI does the typing (the driver), while the human navigates and maintains the mental model. If you can’t explain the why behind generated code, the PR is rejected. Velocity slowed initially, but code quality went up.

We saw the same dynamic play out with our production agents. One of our early customer-support agents could resolve tickets faster than any human - but when it hit an edge case, nobody could explain why it had chosen a particular response. Debugging meant replaying entire conversations and guessing. The moment we added transparency layers - visible reasoning chains, auditable tool calls - resolution speed dipped slightly, but we could actually fix problems when they appeared.

This discipline - transparency over magic, understanding over velocity - became the foundation for how we approached every aspect of our agent architecture throughout 2025.

3. Context Engineering Over Prompt Engineering

By mid-year - around our interviews at Japan Innovation Campus in June and the NoFluff podcast in July - we’d identified the most common failure pattern: teams trying to “vibe code” with LLMs, disappointed when agents couldn’t reliably automate tasks.

Most of the time, the agent simply had no context. An LLM without context is like a new hire on their first day, handed a task with zero onboarding. It starts making things up and hoping it’s right.

The solution: give agents tools to access context themselves. This is the difference between traditional RAG and agentic RAG:

Traditional RAG: User asks question → system queries vector DB → returns documents → sends to LLM. Problem: often retrieves irrelevant documents that confuse the agent.

Agentic RAG: Make RAG a tool the LLM can use. The LLM decides when to search. If the query is useless, the LLM tries a different one. The LLM stays in the driver’s seat until it finds what it needs.
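The agentic-RAG loop above can be sketched in a few lines. This is a toy illustration, not Xaibo's actual implementation: `fake_llm` and `search_docs` are stand-ins for a real model and a real vector DB, and the query strings are hypothetical.

```python
def search_docs(query: str) -> list[str]:
    """Stand-in retrieval tool; a real system would query a vector DB."""
    corpus = {
        "invoice approval": ["Invoices over $10k need CFO sign-off."],
    }
    return corpus.get(query, [])

def fake_llm(question: str, evidence: list[str]) -> dict:
    """Stand-in LLM: keeps searching until it has evidence, then answers."""
    if not evidence:
        return {"action": "search", "query": "invoice approval"}
    return {"action": "answer", "text": evidence[0]}

def agentic_rag(question: str, max_steps: int = 5) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        step = fake_llm(question, evidence)
        if step["action"] == "answer":
            return step["text"]
        # The LLM stays in the driver's seat: a useless query just means
        # it issues a different one on the next iteration.
        evidence.extend(search_docs(step["query"]))
    return "Could not find an answer."
```

The key structural difference from traditional RAG is that retrieval sits inside the loop, under the model's control, rather than happening once before the model is called.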

This shift from “better prompts” to “better tools and context access” was fundamental. We spent weeks refining a prompt for one of our document-processing agents, squeezing out marginal accuracy gains - then gave it a single tool to query the company’s terminology database and accuracy jumped overnight. Prompt engineering isn’t worthless, but it hits a ceiling fast. Context engineering is what matters.

4. Teaching Agents with SOPs

Once agents have the right tools and context, the next question is: how do you teach them what to do with it? At the Databricks × Money Forward event in October, we presented what became our most-requested feature: SOPs for agents.

Think about how you onboard a new employee. You don’t just hand them a laptop and say “figure it out.” You give them a set of instructions: here’s how we handle customer complaints, here’s the process for approving invoices, here’s what to do when a shipment is late. Standard operating procedures. Your human employees follow SOPs. Why shouldn’t your agents?

The industry is moving in a similar direction - Anthropic released Agent Skills, for instance. In contrast, our SOPs are specifically about business procedures: the same plain-language instructions your human team already follows, handed directly to the agent like an onboarding manual.

The problem with traditional automation is that encoding SOPs into rigid workflows is complicated. You can’t graph every contingency in a drag-and-drop UI. It becomes an unmaintainable spaghetti mess. But SOPs written for humans are already in plain language - with conditionals, loops, and context-dependent decisions baked in.

That’s the insight: agents can read the same onboarding documents your humans read. Our system lets you define agent SOPs in natural language markdown - section-by-section steps with conditionals and loops, no coding required. Here’s a real example of a marketing workflow we demonstrated:

Step 1: Scan LinkedIn for trending topics in our industry
Step 2: If a topic is relevant to our business, create a research document with deep analysis
Step 3: Use company knowledge base to draft a post that ties the trend to what we do
Step 4: Email the draft to the marketing lead for approval
Step 5: When approved, post to Buffer (scheduled posting service)

The LLM interprets the natural language document and maintains reasoning flexibility, while our tool layer sets up invariants like permissions and safety checks. You can’t accidentally give the agent permission to delete your database because the tools themselves have guardrails.
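One way to picture the tool-layer invariants: the agent picks which tool to call, but the layer enforces an allow-list so a misread SOP can't escalate into a destructive action. This is a minimal sketch with illustrative names, not Xaibo's actual API.

```python
class ToolLayer:
    """Guardrailed tool registry: calls outside the allow-list fail closed."""

    def __init__(self, allowed: set[str]):
        self.allowed = allowed
        self.tools = {}

    def register(self, name: str, fn):
        self.tools[name] = fn

    def call(self, name: str, *args):
        if name not in self.allowed:
            # Guardrail: the agent cannot invoke tools it wasn't granted,
            # even if they exist in the registry.
            raise PermissionError(f"agent may not call {name!r}")
        return self.tools[name](*args)

layer = ToolLayer(allowed={"draft_email", "post_to_buffer"})
layer.register("draft_email", lambda to, body: f"draft to {to}: {body}")
layer.register("drop_database", lambda: "boom")  # present, but never allowed
```

Because the permission check lives in the tool layer rather than in the prompt, no amount of creative SOP interpretation by the LLM can reach `drop_database`.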

The key insight: agents should follow the same procedures humans follow. Write it once in plain language, and both your human team and your AI team can follow it.

5. Preventing Hallucinations in Production

The hallucination problem is real, and there’s no silver bullet. But there are patterns that work. The most important thing to understand is this: LLMs can’t backspace. Once they start generating a response, they’re committed to it. The token they just generated influences the probability distribution for the next token, and there’s no ability to go back and fix mistakes. Every anti-hallucination strategy we use flows from this single constraint.

Multi-Agent Verification

We use a two-agent setup in Xaibo - a main agent and a thinker agent. The main agent delegates work to the thinker and is responsible for verifying the thinker’s responses. If the output isn’t good enough, the main agent gives feedback and the thinker tries again. Just like humans don’t produce final drafts in one pass, agents shouldn’t either - we build explicit review steps into workflows so the agent writes a rough draft, reviews it for factual inaccuracies, and produces a final version. By separating the “do the work” function from the “verify the work” function, we give agents the backspace ability they lack at the token level.

Tool-First Prompting

Agents are explicitly prompted to always look up information. We tell them: “Never trust your memories - the world changes every day. Use your tools to find current information.”

This shifts the agent from “I think the answer is X” to “Let me look that up.” The difference is massive. An LLM might confidently hallucinate that the CEO of XpressAI is “Dr. Andrew Chen” because those tokens have high probability. But if it has a tool to search the company knowledge base, it looks it up instead.
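In practice this comes down to the system prompt plus the tool schema handed to the model. A sketch (the wording and tool name are illustrative, not our exact production prompt):

```python
# Tool-first prompting: the system prompt forbids trusting memorized facts,
# and the tool schema makes lookup the default path for any factual claim.

SYSTEM_PROMPT = (
    "Never trust your memories - the world changes every day. "
    "Before stating any fact about people, prices, or policies, "
    "use the search_knowledge_base tool to verify it."
)

TOOLS = [{
    "name": "search_knowledge_base",
    "description": "Look up current facts in the company knowledge base.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]
```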

Both strategies feed into the same final defense: the human in the loop. The agent produces the draft, the human refines it. Too much human involvement and you’ve added an intern who creates more work than they solve. Too little and you ship hallucinations to customers. We aim for agents that produce 80-90% correct work - the human spends 10-20% of the time they would have spent doing it from scratch, but the quality stays high.

6. Sovereignty Is the Only Path to Reliability

The strongest validation of our strategy came in November, right around the time we spoke at the Mistral AI Japan launch event.

A silent update to a major public model changed its safety alignment, suddenly refusing to execute valid tool calls that had worked for months. For many teams, this broke production pipelines. No warning. No migration path. No rollback option. Systems that were working Friday failed Monday.

This wasn’t a hypothetical risk. This was real production systems going down because a vendor changed their model without notice.

This validated everything we’d been arguing: You cannot build a stable “Digital Workforce” on an API you don’t control.

Colors of Web3 & Entrepreneurship Interview (Jan 10, 2026)

We doubled down on Sovereign AI - running models like Mistral, Llama, and Gemma on-premise or in private clouds. This became a core design principle across everything we built in 2025, from Xaibo to XpressAI OS. Privacy matters, and so does version locking. We treat model weights like npm dependencies: locked, versioned, immutable.

When you run npm install, you don’t get whatever version the package maintainer uploaded yesterday. You get a specific version, locked in your package-lock.json. If that version has a bug, you can fix it or downgrade. You’re in control.

That’s how AI models should work in production. You test against Mistral-7B-v0.3, you deploy Mistral-7B-v0.3, and six months later you’re still running Mistral-7B-v0.3 unless you explicitly choose to upgrade.
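The npm analogy translates directly into code: record an exact version plus a checksum of the weights at test time, and refuse to load anything that doesn't match. A minimal sketch (the model name follows the text; the hash value is illustrative):

```python
import hashlib

# Locked model "dependency", analogous to an entry in package-lock.json.
MODEL_LOCK = {
    "name": "Mistral-7B-v0.3",
    # sha256 of the weight file recorded at test time (illustrative value)
    "sha256": "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def verify_weights(weight_bytes: bytes, lock: dict) -> bool:
    """Refuse to load weights that don't match the locked checksum."""
    return hashlib.sha256(weight_bytes).hexdigest() == lock["sha256"]
```

A silent upstream change then fails loudly at load time instead of silently changing behavior in production.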

As one of our financial services customers told us: “Our financial transaction data can’t leave our data centers under any circumstances. The privacy laws we’re required to comply with - there’s no room for compromise.”

But even for companies that can use cloud APIs, the reliability argument stands. If your business depends on agents working consistently, you can’t afford to have them break because a vendor changed something upstream.

7. The Small Models Thesis

Running your own models raises an obvious question: how big do they need to be? Here’s the counterintuitive part we’ve been testing since mid-year: most enterprises don’t need 400-billion-parameter models.

The industry obsession with bigger models misses the point. Yes, bigger models have more “knowledge” baked in - they’ve memorized more of the training data. What enterprises actually need is an AI that can find the right information from their systems. Intelligence shifts from memorization to retrieval.

We run a Trinity Architecture - three smaller models working together:

Model 1: Lookup specialist - determines what information is needed
Model 2: Tool user - executes the actual API calls and searches
Model 3: Response preparer - synthesizes information into a final answer

They watch each other in a self-checking loop. If Model 3 prepares a response and Model 1 determines it’s missing critical information, the loop runs again.
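The self-checking loop can be sketched like this. All three models are stubbed with plain functions, and the "needed information" is a toy example; the shape of the loop is what matters.

```python
def lookup_specialist(question: str, facts: list[str]) -> list[str]:
    """Model 1 stand-in: returns what's still missing; empty means enough."""
    needed = ["pricing", "sla"]
    have = {f.split(":")[0] for f in facts}
    return [n for n in needed if n not in have]

def tool_user(need: str) -> str:
    """Model 2 stand-in: executes the actual API calls and searches."""
    return {"pricing": "pricing: $10/mo", "sla": "sla: 99.9%"}[need]

def response_preparer(question: str, facts: list[str]) -> str:
    """Model 3 stand-in: synthesizes the facts into a final answer."""
    return f"{question} -> " + "; ".join(facts)

def trinity(question: str, max_loops: int = 4) -> str:
    facts: list[str] = []
    for _ in range(max_loops):
        missing = lookup_specialist(question, facts)
        if not missing:  # self-check passed: nothing critical is missing
            return response_preparer(question, facts)
        facts.extend(tool_user(n) for n in missing)
    return response_preparer(question, facts)
```

Because Model 1 re-checks the accumulated facts on every pass, an incomplete answer triggers another round of lookups instead of being shipped.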

As we showed in the NoFluff podcast, we often prefer the output from this setup (using Mistral Small or Gemma-27B) over GPT-4 Turbo, because it actually looks up information instead of generating plausible text.

The cost implications are massive. Traditional invoice OCR services charge $2-5 per document or per page. With LLMs doing multimodal extraction, it’s hundredths of a cent per page. This unlocks services that were prohibitively expensive before - features companies wanted to offer but couldn’t justify the cost.

8. Memory Management Is Its Own Problem

One technical challenge with long-running agents: LLM performance degrades over long context windows. If you just keep appending everything to the conversation history, eventually the model slows down and starts producing worse outputs.

Our solution: keep working memory light by offloading long-term memory to persistent storage. The agent only surfaces relevant memories when needed - similar to human selective recall. You don’t actively remember everything you learned in high school chemistry while you’re writing code. That information surfaces when it’s relevant.

Memory size turns out to be surprisingly small. 100-200 MB is sufficient for most agents because they’re storing memories of what happened, not raw conversation logs. “I made this mistake and corrected it this way” is more useful than the full 10,000-token conversation where that happened. We built this insight directly into XpressAI CLI’s memory architecture - see the 2025 Recap for the technical breakdown.
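The store-distilled-lessons, recall-on-demand pattern can be sketched as below. The keyword-overlap scoring is a toy stand-in for real relevance search (e.g. embeddings), and the stored lessons are made-up examples.

```python
class MemoryStore:
    """Persistent long-term memory; only relevant items enter working memory."""

    def __init__(self):
        self.memories: list[str] = []  # stands in for on-disk storage

    def remember(self, lesson: str):
        # Store the distilled lesson, not the raw 10,000-token transcript.
        self.memories.append(lesson)

    def recall(self, task: str, k: int = 2) -> list[str]:
        # Toy relevance score: keyword overlap with the current task.
        words = set(task.lower().split())
        scored = sorted(
            self.memories,
            key=lambda m: len(words & set(m.lower().split())),
            reverse=True,
        )
        return scored[:k]

store = MemoryStore()
store.remember("invoice parsing fails on scanned PDFs; OCR first")
store.remember("customer emails in Japanese need the JP template")
```

Working memory stays light because only the top-k relevant lessons are surfaced into the context window for any given task.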


That’s what 2025 taught us. The agents that will matter aren’t the ones with the biggest models or the most hype. They’re the ones you can test, understand, version-lock, and trust with real work.


🎙️ 2025-2026 Talks, Podcasts & Resources

We discussed these lessons across talks and podcasts throughout the year. Here’s the full list if you want to go deeper on any topic.

Podcasts & Interviews

  • The Eccentric CEO Podcast (April 17): Age of AI: The Landscape of AI Agents in 2025 - Listen here
  • AI Founders’ Mindset (June 3): Xpress AI vs ChatGPT: Building Team Members - Listen here
  • Cerebral Valley Deep Dive (June 11): The Enterprise Agents OS - Read here
  • NoFluff Podcast (July 21): Context Engineering & Multi-Agent Architectures - Listen here
  • Colors of Web3 & Entrepreneurship (Jan 10, 2026): Agents as Teammates: Inside Xpress AI - Listen here

Technical Keynotes & Talks

  • Tokyo AI x Agents Session (Jan 29): Introducing AI Agents as Coding Companions - Post
  • JAX Conference (May 8): From 10x to 100x Developer? (Productivity Realities) - Session | Video
  • ML Con San Diego (May 19-23): Unleashing AI Agents - Session | The Persistence Paradigm - Session
  • ML Con Munich (June 24-26): From Sci-Fi Dreams to Reality - Session
  • Entwickler Summit Berlin (Sept 18): AI Coding Assistants & The “Dead Code” Trap - Info
  • Databricks × Money Forward × Tokyo AI (Oct 22): Closing the Automation Loop: Agents that Do - Details
  • Mistral AI Japan Launch (Nov 6): Sovereign Self-Improving AI - Details
  • ML Con Berlin (Nov 24-28): AI Agents Unplugged: No Frameworks, Just Code - Session | Video

Community & Events

  • Builders Weekend Hackathon (Feb 21-23): Partner & Challenge Sponsor
  • AI Salon Kansai Launch (Sept 16): Event
  • Vibe Coding Workshop Osaka (Dec 2): Meetup

Want to build production AI agents? Check out the Xaibo framework at github.com/xpressai/xaibo or read the Xaibo Manifesto for our technical philosophy.