Harness Engineering: AI for Long-Running DevOps Tasks
A few weeks ago my “smart” AI coding agent started a zero-downtime migration, got halfway through the Helm chart, hit its context limit, woke up the next day, looked at the half-finished YAML, and cheerfully declared “All done! 🚀”. Classic. Then I stumbled on Anthropic’s November 2025 post “Effective harnesses for long-running agents” and suddenly everything made sense. They didn’t just add more tokens — they built an actual engineering harness with initializer agents, progress artifacts, and git discipline. I stole the playbook, sprinkled in OpenAI’s Operator for browser-heavy bits, and now I have a tireless Harness Engineer that finishes what it starts. (Most days.)
The Long-Running Agent Problem (We’ve All Been There)
Anthropic nailed the diagnosis: agents love to “one-shot” massive projects. They run out of context mid-implementation, leave undocumented half-features, and the next session either panics or prematurely ships broken code. Sound like any of your overnight CI runs?
OpenAI’s Operator (their browser agent powered by CUA) faces the same thing on long web workflows — it’s brilliant at filling forms or checking dashboards, but without scaffolding it drifts off-task like a developer on a Friday afternoon.
The fix? Treat the AI like a real DevOps engineer working in shifts: leave clear handoff notes, clean commits, and a todo list that actually survives reboots.
The Harness Blueprint (Anthropic + My DevOps Twist)
I grabbed their Claude Agent SDK quickstart and repurposed the whole thing for production deployments instead of building another chat app. Here’s the two-agent system that actually works:
1. Initializer Agent (The “First Day On The Job” Setup)
Runs once. Creates:
- init.sh — boots kubectl, Helm, your monitoring stack, everything
- feature_list.json — every single task broken down (e.g., “Add liveness probes”, “Set up canary traffic shift”)
- Git repo with a clean initial commit
- claude-progress.txt — the running diary
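Taken together, the initializer's output is small enough to sketch in a few lines of Python. This is an illustrative sketch, not Anthropic's actual SDK code: the file names come from the post, but the sample feature entry and commit message are my own placeholders.

```python
import json
import pathlib
import subprocess

def initialize(workdir="."):
    """One-time setup: create the harness artifacts and a clean initial commit."""
    root = pathlib.Path(workdir)

    # feature_list.json: every task broken down, nothing passing yet.
    features = [{
        "category": "deployment",
        "description": "Add liveness probes",
        "steps": ["Edit Deployment spec", "Apply", "Verify probe status"],
        "passes": False,
    }]
    (root / "feature_list.json").write_text(json.dumps(features, indent=2))

    # claude-progress.txt: the running diary every session appends to.
    (root / "claude-progress.txt").write_text("session 0: environment initialized\n")

    # Clean initial commit so later sessions start from a known-good baseline.
    for cmd in (["git", "init", "-q"],
                ["git", "add", "-A"],
                ["git", "commit", "-qm", "harness: initial commit"]):
        subprocess.run(cmd, cwd=root, check=False)
```

The point of doing all of this up front is that every later session inherits the same artifacts instead of improvising its own state.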
2. Coding Agent (The Daily Grind That Never Forgets)
Every new session:
- Reads the progress file + git log
- Picks ONE feature from the list
- Runs ./init.sh
- Implements, tests, commits with a proper message
- Updates "passes": true only after real health checks
No more guessing. No more half-implemented CRDs.
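The “read, pick one, work” ritual is easy to make mechanical. Here’s a minimal sketch, assuming the file layout above; the selection rule is simply “first entry still failing”, and the function names are mine, not the SDK’s:

```python
import json
import pathlib

def pick_next_feature(path="feature_list.json"):
    """Return the first feature whose checks have not passed yet, or None."""
    features = json.loads(pathlib.Path(path).read_text())
    return next((f for f in features if not f["passes"]), None)

def session_briefing(progress_path="claude-progress.txt", lines=20):
    """Rebuild context: the tail of the diary is the previous shift's handoff note."""
    diary = pathlib.Path(progress_path)
    if not diary.exists():
        return "no progress file yet"
    return "\n".join(diary.read_text().splitlines()[-lines:])
```

The session prompt gets session_briefing() plus exactly one pick_next_feature() result, which is what keeps the agent from wandering off into the rest of the monorepo.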
Here’s the magic JSON that keeps everything sane:
```json
{
  "category": "deployment",
  "description": "Implement blue-green rollout for prod",
  "steps": ["Update Helm values", "Add traffic switch logic", "Test rollback"],
  "passes": false
}
```
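Flipping that passes flag is the one write the coding agent gets to make to the list, and only after checks succeed. A sketch of the gate, assuming the same file layout; the check callable stands in for whatever real health checks you run:

```python
import json
import pathlib

def mark_passed(description, check, path="feature_list.json"):
    """Set 'passes': true for one feature, but only if its health check succeeds."""
    p = pathlib.Path(path)
    features = json.loads(p.read_text())
    for feature in features:
        if feature["description"] == description:
            if not check():          # e.g. hit /ready, run the rollback test
                return False         # checks failed: the flag stays false
            feature["passes"] = True
            p.write_text(json.dumps(features, indent=2))
            return True
    return False                     # no such feature in the list
```

Because the write and the check live in the same function, “done but unverified” is simply not a state the file can be in.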
OpenAI Operator: The Browser Sidekick
When the task needs to click through Grafana, update a Harness pipeline via UI, or check Downdetector (irony), I hand off to OpenAI’s Operator. It literally uses screenshots and clicks like a human SRE at 2 a.m. Combine that with Claude’s structured harness and you’ve got a full-stack AI engineer that never needs to be paged at 3 a.m. (well… almost never).
The Humour in the Chaos
Favourite log entry so far:
[Agent] Reading claude-progress.txt… Ah yes, yesterday’s me was an idiot. Fixing that typo now.
Or the time it committed “WIP: please don’t merge” and then immediately ran the canary test anyway. Classic junior-dev behaviour, but at least it left the note.
We’ve all had that agent that swears the deployment is green while prod is on fire. Now the harness forces it to actually run the /ready endpoint before bragging.
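That health gate fits in a handful of lines. A standard-library sketch, assuming your service exposes the /ready endpoint mentioned above:

```python
import urllib.request
import urllib.error

def is_ready(base_url, timeout=5):
    """Only an actual 200 from /ready counts as green; anything else is a failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

The harness runs this before the agent is allowed to mark anything done, so a refused connection or a 503 means the bragging waits.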
Lessons for Us Mere Mortals (DevOps Takeaways)
- Artifacts are everything — git + JSON todo + progress.txt beats fancy vector memory every time.
- One feature per session — force the agent to stay focused or it will try to rewrite your entire monorepo.
- Test before you mark done — I added mandatory health-check + rollback tests. The agent now hates me but ships safer code.
- Initializer vs Coder prompts — different personalities, same tools. Game changer.
- Operator for UI tasks — Claude writes the YAML, Operator clicks the buttons. Teamwork.
- Keep init.sh — your future self (and the AI) will thank you when the cluster is in a weird state at session 17.
Final Thought
Anthropic didn’t just give us better agents — they gave us engineering discipline for agents. Pair that with OpenAI’s Operator and you’ve got a Harness Engineer that can actually tackle the multi-day, context-heavy work we used to babysit.
My Kubernetes migration? Finished in four sessions while I was busy writing this post. The AI even left a polite commit message: “Migration complete. You’re welcome. ☕”
Until next time — may your agents stay in scope, your contexts never expire, and your on-call pager stay gloriously silent.
P.S. Full Anthropic post here: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
OpenAI Operator details: https://openai.com/index/introducing-operator/