How I Built a Research Intelligence System in One Afternoon

A behind-the-scenes look at building a verified AI research engine — through iteration, mistakes, and a few smart shortcuts.


00 — The Problem That Started Everything

Windows servers in AWS were burning CPU. The culprit: Microsoft Defender aggressively scanning log files written by three agents running on every server — Amazon CloudWatch, AWS Systems Manager, and Rapid7.

The fix was known in principle: add folder exclusions. But the details were scattered across AWS documentation, Rapid7 documentation, and Microsoft’s own guidance. No single source of truth.

How do you know what you find is accurate? How do you know the paths are right? How do you know you're not introducing a security risk by excluding the wrong things?


01 — The First Investigation

The first action was to ask an AI agent to research the question properly. Not summarise from memory — actually go and read live documentation, cross-reference sources, and come back with structured findings.

The brief was clear: what are the documented exclusion paths for CloudWatch, SSM, and Rapid7? What does Microsoft say about the risks? Are there known ways attackers exploit exclusion configurations?

About 15 minutes later: a rich research document. 36 findings. 23 cited sources. A full breakdown of tensions and tradeoffs. A list of open questions. Paths documented. Security implications explained. Attack vectors named with MITRE ATT&CK technique IDs.

It was genuinely impressive. But "looked solid" isn't the same as "is correct."

02 — I Discovered I Needed a Fact-Checker

Reading through the findings, most of it looked solid. But AI systems can confidently state things that turn out to be slightly wrong. Paths with one folder name swapped. Malware behaviour attributed to the wrong variant. A source URL that no longer exists.

Any of those errors, if acted on, could mean misconfigured servers or a false sense of security. So a second agent was built: a Validator. Its only job was to go back and check the work.

The Validator ran through all 23 sources and spot-checked the key claims. It found two material errors:

The investigation said: C:\Program Files\Amazon\AmazonCloudWatchAgent\Logs\

The actual path is: C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\

One folder name different. If you had configured your exclusion based on the original, it wouldn't have applied to the log file at all. The CPU problem would have persisted.

The second error was an attribution. The investigation described a specific folder path that WhisperGate used to evade antivirus scanning. When the Validator checked primary sources — CISA's official advisory, MITRE ATT&CK, independent security research — they all documented a different path.

Every investigation needs a Validator, every time, no exceptions.

03 — Turning a One-Off into a System

At this point one investigation had been done well — but somewhat ad hoc: decisions made on the fly, no consistent structure, no guarantee the next investigation would follow the same quality bar.

So I wrote the rules down. A file called CLAUDE.md lives at the root of the project — the operating manual. Every investigation follows the same structure. Validation is not optional.

Persona templates were also created — detailed briefs defining how each agent type should behave. The Investigator: research only, no problem-solving, no code, no recommendations. The Validator: verify only, no new findings, every verdict backed by a source.

04 — The Naming Convention Argument

Small things matter when you’re building something you’ll use for a long time. I evaluated four options: kebab-case, snake_case, dot.notation, PascalCase.

PascalCase was chosen. MsDefenderAwsExclusions. AwsIamPrivilegeEscalation. Clean, readable, consistent. One rule, decided once, written down.
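
One nice property of a rule decided once is that it's mechanical enough to enforce with a few lines of code. A sketch — the regex and function name are mine, not part of the system:

```python
import re

# Illustrative check for the PascalCase naming rule: one or more
# capitalised segments, no hyphens, underscores, or dots.
PASCAL_CASE = re.compile(r"^(?:[A-Z][a-z0-9]+)+$")

def is_valid_name(name: str) -> bool:
    """Return True if the investigation name follows PascalCase."""
    return bool(PASCAL_CASE.match(name))
```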

05 — The Scope Problem

A pattern emerged: vague questions produce vague investigations. I added a scope gate — a mandatory set of questions asked before any investigation starts:

What is the single core question? Can it be stated in one sentence? What is explicitly out of scope? Who is going to use the findings? Are there sub-topics that should be separate investigations?
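
A minimal sketch of what such a gate could look like as code — the field names here are illustrative, not the system's actual schema:

```python
from dataclasses import dataclass, fields

# Hypothetical scope gate: an investigation may not start until every
# scoping question has a non-empty answer.
@dataclass
class ScopeGate:
    core_question: str     # the single core question, stated in one sentence
    out_of_scope: str      # what is explicitly excluded
    audience: str          # who is going to use the findings
    split_candidates: str  # sub-topics that should be separate investigations

def passes_gate(scope: ScopeGate) -> bool:
    """Every scoping question must have a non-empty answer."""
    return all(getattr(scope, f.name).strip() for f in fields(scope))
```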

06 — The Token Problem (and the Smart Fix)

Every AI conversation has a cost in working memory. The more you ask it to juggle simultaneously, the more chance something gets dropped or done sloppily.

The Validator was being asked to manually compare a human-readable document against a structured data file — reading both, comparing every field, reporting differences. That’s a lot of working memory spent on checking whether two pieces of text are the same.

Use AI for things that require reasoning. Use scripts for things that are deterministic. Blending the two is where the real efficiency comes from.

A small Python script — about 100 lines — does the comparison automatically. It reads both files, normalises the text, hashes the content, and compares. Takes a fraction of a second. Zero AI working memory used.
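
A sketch of that comparison, assuming the two files can be reduced to matching named fields (the function names and field layout are illustrative, not the actual script):

```python
import hashlib
import re

def normalise(text: str) -> str:
    """Collapse whitespace and case so cosmetic differences don't count as drift."""
    return re.sub(r"\s+", " ", text).strip().lower()

def content_hash(text: str) -> str:
    """Stable fingerprint of the normalised text."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()

def drifted_fields(markdown_fields: dict, structured_fields: dict) -> list:
    """Return the names of fields whose content differs between the two files."""
    return [
        name
        for name, md_value in markdown_fields.items()
        if content_hash(md_value) != content_hash(structured_fields.get(name, ""))
    ]
```

An empty result means the two files agree; anything else is a list of fields the Validator never has to eyeball.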

07 — One Source of Truth

The normalisation script led to an even cleaner solution. I was maintaining two versions of every investigation: a human-readable markdown document and a machine-readable structured file. Both had to stay in sync.

The better approach: the AI writes the structured file only. The markdown is generated by a script. Now there’s only one thing to write and only one thing that can contain errors. Drift is impossible.
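
A minimal sketch of the generate-don't-duplicate idea, assuming the structured file is JSON with `title`, `question`, and `findings` fields (an assumed schema, not the project's actual one):

```python
import json

# Hypothetical generator: the structured JSON file is the single source of
# truth, and the markdown is rendered from it rather than written by hand.
def render_markdown(investigation: dict) -> str:
    lines = [f"# {investigation['title']}", "", "## Question",
             investigation["question"], "", "## Findings"]
    for i, finding in enumerate(investigation["findings"], start=1):
        lines.append(f"{i}. {finding}")
    return "\n".join(lines)

def render_from_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return render_markdown(json.load(f))
```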

08 — The Answer Should Come First

A pattern in the generated documents: the answer was buried. Metadata, then question statement, then context, then dozens of findings. By the time you got to “here are the actual folders to exclude,” you’d read several pages.

A rule was added: every investigation with a concrete actionable output must put that output at the very top. Before the question, before the context, before the findings. The rest of the document is the evidence. The table is the answer.
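
Expressed as code, the rule is just a fixed section order in the generator. A sketch, with assumed section names:

```python
# Hypothetical section order baked into the generator: the actionable
# answer leads, and the evidence follows.
SECTION_ORDER = ["answer", "question", "context", "findings", "open_questions"]

def ordered_sections(investigation: dict) -> list:
    """Return (name, content) pairs in answer-first order, skipping empty sections."""
    return [(name, investigation[name]) for name in SECTION_ORDER if investigation.get(name)]
```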

09 — Making the Hard Work Usable

Research findings are only useful if someone acts on them. I turned the open questions into tasks — but not vague tasks like “investigate further.” Tasks with a clear explanation of why they matter, exact commands to run step by step, what to record while running them, what a successful result looks like.
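
A sketch of what such a task might look like as a structured record — the field names are illustrative:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical task schema: every open question becomes a task with enough
# detail that someone can execute it without doing the research again.
@dataclass
class Task:
    title: str
    why_it_matters: str        # the rationale, not just "investigate further"
    commands: List[str]        # exact commands to run, in order
    record_while_running: str  # what to capture during execution
    success_criteria: str      # what a successful result looks like
```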

10 — Writing It Up for Everyone Else

By this point the system worked well — but it only made sense to someone who had been in the room for the whole conversation. I wrote two final documents: a README for anyone who might use the system, and the full build story — because the process of getting here is as valuable as the system itself.


What I Actually Built

  1. Question comes in
  2. Scoping questions
  3. Research agent investigates
  4. Sync script verifies formats
  5. Validator fact-checks
  6. Errors corrected
  7. Answer-first document generated
  8. Open questions converted to tasks
  9. Stored and findable
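
The steps above can be sketched as a single pipeline. The stage functions here are trivial placeholders standing in for agents and scripts, not the real implementation:

```python
# Placeholder stages: in the real system each is an agent or a script.
def gate(question):           return {"question": question}
def investigate(scope):       return {"scope": scope, "findings": [], "open_questions": []}
def formats_in_sync(f):       return True
def validate(f):              return []
def correct(f, verdicts):     return f
def render_answer_first(f):   return "# Answer\n(evidence follows)"
def to_tasks(f):              return list(f["open_questions"])

def run_investigation(question: str):
    scope = gate(question)                            # 2. scoping questions
    findings = investigate(scope)                     # 3. research agent investigates
    assert formats_in_sync(findings)                  # 4. sync script verifies formats
    findings = correct(findings, validate(findings))  # 5-6. validate, then correct
    document = render_answer_first(findings)          # 7. answer-first document
    tasks = to_tasks(findings)                        # 8. open questions become tasks
    return document, tasks                            # 9. stored and findable
```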

Total build time: one afternoon. The system will run every future investigation at the same quality bar, automatically, without needing to re-establish any of these rules.

  • Every investigation needs a Validator
  • AI for reasoning; scripts for deterministic tasks
  • One source of truth — generate everything else
  • The answer goes first, evidence follows
  • Scope gates prevent wide-and-shallow research
  • Naming conventions decided once, documented once
  • Iteration beats upfront perfect design