AI Skeptic

I Get Why You Don't Trust AI. Read This Anyway.

You’ve seen the demos. The confident wrong answers. The code that almost works. The hallucinated API calls. The “AI did it” retrospectives that glossed over how much a human cleaned up afterward.

Your skepticism is earned. This post isn’t going to tell you you’re wrong.

But I want to show you what a different approach looks like — one that treats your skepticism as a design requirement, not an obstacle.


What I Actually Built

I needed to wire Microsoft Entra ID (the corporate identity system most enterprises use) to AWS server access so that each person gets their own individual identity on the server — not a shared account. This matters for compliance audits, incident investigations, and basic accountability.

The technology to do this exists. The Entra-specific implementation wasn’t documented anywhere. AWS Support couldn’t point me to an answer. Weeks passed.

I used an agentic AI system to solve it in a day. Here is exactly what that means and what it doesn’t mean.


What It Doesn’t Mean

It doesn’t mean I typed “build me a cloud identity pipeline” into a chat window and shipped whatever came back.

There was no chat window. Claude Code ran directly in my terminal — reading actual files, writing actual code to disk, running actual scripts, reading actual error output. When a deployment failed, it read the error log, diagnosed the problem, fixed the code, and redeployed. I didn’t transcribe anything between steps. It worked where the code lives.

It doesn’t mean AI made the decisions.

I framed the problem. I defined the requirements. Every architectural call — which attribute to repurpose, how wide to scope the guardrail, whether to build one approach or two — was mine. The AI did not bring the weeks of domain knowledge I had going in. That context is what made the day productive rather than chaotic.

It doesn’t mean the output was unreviewed.


The Part Skeptics Should Care About Most

Before a single line of code was written, the system produced a formal specification. Not a doc. A spec, with typed properties.

Each property had a concrete, runnable test that proves it holds. A test plan was built from those properties before implementation started — real tests: shellcheck, aws cloudformation validate-template, grep scans for hardcoded values, integration tests against live infrastructure.
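To make "grep scans for hardcoded values" concrete, here is a minimal sketch of what one such property check might look like as a pure-Python stand-in. The property name, patterns, and sample template fragment are all illustrative assumptions, not taken from the actual white paper or test plan:

```python
import re

# Hypothetical property from the spec: "No hardcoded account IDs or
# IP addresses appear in any template." Patterns are illustrative.
HARDCODED_PATTERNS = {
    "aws_account_id": re.compile(r"\b\d{12}\b"),
    "ipv4_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scan_for_hardcoded_values(template_text: str) -> list[str]:
    """Return the names of every pattern that fires on the template text."""
    return [name for name, pattern in HARDCODED_PATTERNS.items()
            if pattern.search(template_text)]

# A deliberately bad sample template fragment for demonstration.
sample = """
Resources:
  Instance:
    Properties:
      SubnetCidr: 10.0.1.0/24   # hardcoded network range
"""

print(scan_for_hardcoded_values(sample))
```

The point of a check like this is that it is binary and runnable: either the property holds for every template in the repository, or the build fails before implementation review even starts.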

The plan itself was peer-reviewed. A separate validation agent challenged the task list before implementation began. Untestable acceptance criteria? Missing dependencies? Gaps in property coverage? Fix it before building, or you’re building the wrong thing.

Implementation was written to pass the tests. Code was reviewed by a dedicated security agent (not the agent that wrote it) across five dimensions. The engineer — me — reviewed the final output before anything touched a real environment.

That is test-driven development with adversarial review. It happens to run across parallel agents instead of a single developer working sequentially, but the discipline is the same. The output went through more structured rigour than most sprint tickets receive.


The Gotchas AI Didn’t Just Invent

The AI didn’t know the answers to this problem. Nobody did — that was the point. What it did was search widely and in parallel — not just documentation, but community forums, GitHub issues, Stack Overflow threads, AWS re:Post questions, blog posts from three years ago — and cross-reference everything it found into a coherent picture.

Here’s one example of what that found:

Entra’s federation metadata URL returns 12 signing certificates. IAM Identity Center’s “Change Identity Source” wizard silently fails when you upload that file — no error message, just “Retry Failed Steps.” The fix is to strip the metadata to a single active certificate and two SSO endpoints before uploading.

Nobody wrote that down. The AI found it by correlating a vague error message against fragments in multiple sources and reasoning through the cause. I validated it by testing it. It worked.
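The stripping step can be sketched roughly like this, using the standard SAML 2.0 metadata element names. The sample document below is a minimal stand-in, not real Entra output, and the sketch assumes the first listed certificate is the active one — in practice you would have to identify which certificate Entra is actually signing with:

```python
import xml.etree.ElementTree as ET

MD_NS = "urn:oasis:names:tc:SAML:2.0:metadata"
DS_NS = "http://www.w3.org/2000/09/xmldsig#"
ET.register_namespace("", MD_NS)
ET.register_namespace("ds", DS_NS)

def strip_to_single_cert(metadata_xml: str) -> str:
    """Keep only the first signing KeyDescriptor; drop the rest.
    SSO endpoints and everything else are left untouched."""
    root = ET.fromstring(metadata_xml)
    idp = root.find(f"{{{MD_NS}}}IDPSSODescriptor")
    keys = idp.findall(f"{{{MD_NS}}}KeyDescriptor")
    for extra in keys[1:]:  # ASSUMPTION: first cert listed is the active one
        idp.remove(extra)
    return ET.tostring(root, encoding="unicode")

# Minimal stand-in metadata with two signing certificates.
sample = f"""<EntityDescriptor xmlns="{MD_NS}" entityID="https://example.test/idp">
  <IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <KeyDescriptor use="signing"><KeyInfo xmlns="{DS_NS}"><X509Data><X509Certificate>CERT-A</X509Certificate></X509Data></KeyInfo></KeyDescriptor>
    <KeyDescriptor use="signing"><KeyInfo xmlns="{DS_NS}"><X509Data><X509Certificate>CERT-B</X509Certificate></X509Data></KeyInfo></KeyDescriptor>
    <SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect" Location="https://example.test/sso"/>
  </IDPSSODescriptor>
</EntityDescriptor>"""

stripped = strip_to_single_cert(sample)
```

With real metadata you would fetch the file from the federation metadata URL first, run a transform like this, and upload the result to the wizard — which is exactly the kind of step you then validate against live infrastructure, not trust on faith.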

There are 17 of these in the white paper. All found by the same process. All validated by a human against real infrastructure.


The Honest Limitations

The AI didn’t validate against production infrastructure — I did that. The AI didn’t decide which compliance trade-offs were acceptable for our organisation — I did that. The AI didn’t know our HR data model when choosing which SCIM attribute to repurpose — I did that.

There are also things the approach cannot do: it cannot replace domain knowledge you don’t have. It cannot make judgment calls about your organisation’s risk tolerance. It cannot be the one to notice that the plan is solving the wrong problem. That is what engineers are for.

The AI is not a replacement for engineering judgment. It is a system that removes the bandwidth constraint that forces engineers to work sequentially when the problem demands parallel investigation — while enforcing formal rigour that individual engineers under time pressure routinely skip.


What I’d Ask of You

Don’t trust AI outputs that haven’t been through a formal process. You’re right not to.

Do ask what process produced the output. Was there a spec? Were tests written before code? Was the plan reviewed before implementation started? Did a human validate the result against reality?

If the answer to those questions is yes — and you can see the spec, the test plan, and the review output — then the question isn’t “do I trust AI?” It’s “do I trust this engineering process?” That’s a question you already know how to answer.


The full white paper includes the property specifications, test plan, architecture, and every gotcha with its root cause and fix. The repository includes the working CloudFormation templates, scripts, and verification tooling.

This post was written using the same agentic process it describes. The spec, the review, and the human sign-off all happened.