The New Default. Your hub for building smart, fast, and sustainable AI software
What Does "Done" Mean When Your Product Reasons in Natural Language?
The build phase of product development is undergoing a quiet but profound transformation. As AI components move from novelty to necessity, teams are finding that the systems they create no longer behave the same way twice. Outputs shift. Edge cases multiply in ways no test suite can fully anticipate. A feature that "works" in staging can surprise you in production, not because of a bug, but because of the nature of the thing you built.
This is the core tension of building with AI: you are no longer assembling deterministic logic. You are orchestrating probability. And that forces a question most teams haven't stopped to answer yet: What does "done" even mean when your product reasons in natural language?
Within The New Default, we argue that answering it requires rethinking testing from the ground up, not as a phase that follows building, but as a discipline woven into every architectural decision you make along the way.
Why Traditional Unit Testing Fails for AI-Powered Products
For most of software development's history, the mental model was simple. You wrote a function, you knew what it accepted and returned, and testing meant verifying that behavior — deterministically, repeatably, with clear pass/fail outcomes.
AI changes the authorship of logic.
As James LePage, Director of Engineering AI at Automattic, describes it in one of The New Default videos, engineers are now writing code that wraps natural language and feeds it into a larger system that orchestrates it into structured outputs. The logic is no longer entirely yours, and the model at the center of it doesn't expose a traceable call stack.

When you can't enumerate every possible output, unit tests stop being sufficient. The shift is away from unit-level certainty and toward system-level observability, evals, and human-in-the-loop checks that treat the model's output as a hypothesis to be validated, not a result to be trusted.
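One way to make "output as a hypothesis" concrete is an eval harness: instead of asserting an exact expected string, each case checks *properties* of the output and the suite reports a pass rate rather than a binary verdict. A minimal sketch, assuming a hypothetical `generate_summary` function standing in for a real model call (in practice this would hit an LLM API and return free-form text):

```python
import json

# Hypothetical stand-in for a real model call. A real implementation would
# call an LLM API; here it returns a fixed structured response for illustration.
def generate_summary(ticket_text: str) -> str:
    return json.dumps({"category": "billing", "sentiment": "negative"})

# Each eval case pairs an input with checks on properties of the output,
# not an exact expected string.
EVAL_CASES = [
    {
        "input": "I was charged twice this month and I'm furious.",
        "checks": [
            lambda out: out.get("category") in {"billing", "refund"},
            lambda out: out.get("sentiment") == "negative",
        ],
    },
]

def run_evals(cases):
    passed = 0
    for case in cases:
        raw = generate_summary(case["input"])
        try:
            out = json.loads(raw)  # structural check: did we get valid JSON at all?
        except json.JSONDecodeError:
            continue  # a malformed output counts as a failure, not a crash
        if all(check(out) for check in case["checks"]):
            passed += 1
    return passed / len(cases)  # a pass *rate*, not a binary pass/fail

print(f"eval pass rate: {run_evals(EVAL_CASES):.0%}")
```

The design choice worth noting is the return value: a rate, tracked over time, replaces the green-or-red signal a unit test would give you.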
Building With AI Tools You Don't Fully Understand Yet: What Teams Get Wrong
Most teams adopting AI-assisted workflows share a common early experience: disorientation. The tools are powerful, the possibilities feel vast, but the mental models that made you a good engineer don't quite map onto what you're now being asked to build with.
The New Default speaker Oji Udezu captures it well: integrating AI into software development feels like someone handing you a tool that came out of the fourth dimension. You can't see its shape. You don't understand what it does. And yet you're expected to build with it.

This isn't just an onboarding problem — it has real architectural consequences. Teams are making decisions about system design, data flows, and failure handling using tools whose behavior they haven't fully internalized yet. Those early decisions are hard to undo.
The 80/20 Problem in AI Development: When to Stop Relying on AI and Switch Tools
AI-assisted development has a seductive early rhythm. Features come together quickly, prototypes materialize in hours, and the gap between idea and working software narrows dramatically. But that momentum tends to hit a wall, and it's always somewhere in the last 20%.
Chris Rickard, Founder of UserDocs, introduces the concept of a "breaking point architecture" on The New Default: every AI-assisted workflow has a complexity threshold beyond which the tool starts working against you. Knowing when to switch, whether to human expertise or to lower-level environments like Cursor or Windsurf, is itself a core engineering skill.

The failure mode is familiar to anyone who's pushed an AI system too far. You fix one thing, and something else breaks. You're not moving forward, you're trading problems. The system has exceeded its attention limits, and no amount of prompting recovers the coherence you had at 60% completion.
Why AI Products Still Need Traditional Software Guardrails (And How to Build Them In)
When you're moving fast with AI, established software practices can feel like a source of friction. Why add layers of validation when the model seems to handle it? Why instrument monitoring when things are working in testing? The answer is that "working in testing" is exactly the wrong signal to trust.
Alan Buxton, CTO of Simphony, flags one of the most underestimated risks on The New Default: out-of-distribution data (inputs the model was never trained on) and how unprepared most AI implementations are when they encounter it in the wild. Traditional software guardrails, he argues, are what catch these failures before they reach users.
In practice, this means building in human-in-the-loop review for high-stakes outputs, continuous monitoring that tracks outcomes not just uptime, and layered fallback mechanisms that degrade gracefully when the model's confidence drops. None of these are new ideas — they're just newly essential.
The Metrics That Actually Matter for AI Product Teams
Test coverage percentages and bug counts made sense when correctness was binary. With AI products, you can have zero failing tests and still ship something that frustrates every user who touches it. The old metrics don't lie — they just measure the wrong things.
The New Default points to a more useful set of signals: time-to-value, output volume, and "tweak time" (the effort users spend adjusting AI outputs to get what they actually need). These are more honest indicators of whether your AI feature is delivering or just functioning.
These aren't metrics you gather after launch. They should inform building decisions in real time. If your team is observing high tweak time in early testing, that's a signal about prompt design, model choice, or UX, not something to revisit in a future sprint. Tracking output volume tells you whether users are actually engaging with what the AI produces, or quietly working around it.
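Tracking these signals doesn't require heavy infrastructure. A minimal sketch of an in-product tracker, with illustrative names (the thresholds and events you'd record depend entirely on your feature):

```python
from dataclasses import dataclass, field

@dataclass
class AIFeatureMetrics:
    """Minimal tracker for output volume, acceptance, and tweak time."""
    generations: int = 0
    accepted: int = 0
    tweak_seconds: list = field(default_factory=list)

    def record(self, accepted: bool, seconds_editing: float) -> None:
        # Called once per AI generation: did the user keep it, and how long
        # did they spend editing it before moving on?
        self.generations += 1
        if accepted:
            self.accepted += 1
        self.tweak_seconds.append(seconds_editing)

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.generations if self.generations else 0.0

    @property
    def avg_tweak_time(self) -> float:
        return sum(self.tweak_seconds) / len(self.tweak_seconds) if self.tweak_seconds else 0.0

m = AIFeatureMetrics()
m.record(accepted=True, seconds_editing=12.0)   # light touch-up
m.record(accepted=False, seconds_editing=95.0)  # user gave up and rewrote it
print(m.acceptance_rate, m.avg_tweak_time)
```

A rising average tweak time in early testing is exactly the kind of real-time signal the paragraph above describes: it points at prompt design, model choice, or UX long before a post-launch dashboard would.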
Key Takeaways:
Testing is architecture, not a phase. In AI-powered products, quality assurance can't be bolted on after building. It needs to be woven into every design decision from day one.
Unit tests are no longer enough. When your system reasons in natural language, correctness isn't binary. System-level observability, evals, and human-in-the-loop checks replace the old pass/fail model.
Know your breaking point before you hit it. AI accelerates the first 80% of development — but without a deliberate architecture for complexity thresholds, the last 20% will cost more than the first 80% saved.
Traditional guardrails are more important, not less. Moving fast with AI tempts teams to skip established practices. Out-of-distribution data, unexpected edge cases, and cascading failures make those practices newly critical.
Measure what AI actually changes for users. Coverage percentages and bug counts don't capture whether an AI feature delivers real value. Tweak time, time-to-value, and output volume are the metrics that tell the truth.
How to Rethink Testing When You're Building AI Products
The build phase has always been where ideas become real. What has changed is what "real" requires.
Building with AI means accepting that your product will never behave the same way twice, and designing for that reality rather than against it. The build phase is no longer just about writing code. It's about designing systems that can be trusted, observed, and corrected.
The New Default's Testing section is where practitioners working at the edge of this shift share what they're actually learning, and a good place to start if your team is still figuring out what the new rules are.
The teams that will build the best AI products aren't the ones who code the fastest. They're the ones who learn how to test for a world that doesn't guarantee the same answer twice.