Part 8: Make it a team standard

Everything in the first seven parts works for one developer. The moment a team is involved, an extra failure mode appears: the harness drifts. One person's rule file says one thing, another's says something else, agent behavior becomes irreproducible across the team, and the discipline quietly erodes. This part is about making the workflow a shared, durable standard.

The underlying principle to keep in mind: AI amplifies the engineering culture it lands in. A team with strong tests, clear standards, and healthy review gets dramatically more out of these tools. A team without those gets faster at producing problems. The point of standardizing is to make the good culture the path of least resistance.

Treat the harness as code

The rule files, system prompts, eval suites, and skill libraries are not personal config — they're shared infrastructure. Treat them exactly like code:

Version them with the project.
Review them in pull requests, like any other change.
Assign named owners, so they're maintained on purpose rather than rotting.

Without this, every developer's agent behaves slightly differently and nobody can reproduce anyone else's results.

Gate on the eval, not the demo

A working demo proves an agent can succeed once. A passing eval suite proves it succeeds reliably. The two are not the same, and shipping on the strength of a demo is how unreliable agents reach production.

Make eval coverage a precondition for shipping, the same way you'd gate a service on test coverage. But an eval without a clear rubric measures nothing — so define what you're scoring:

Task success
Tool-use quality
Trajectory compliance
Hallucination rate
Response quality

A CI gate makes it real:

# ci: block merge if the agent's eval suite regresses
agent-evals:
  run: eval-suite --rubric rubric.yaml --min-score 0.9
  on: [pull_request]
  required: true

Re-shape code review for generated code

Generated code needs the same scrutiny as human code, or more — and reviewers need to know its specific failure modes. Train them to look for hallucinated dependencies, thin error handling, and correctness gaps that look fine at a glance (Part 5). Tune the review checklist to those patterns rather than reusing the old human-code checklist unchanged.

Draw the prototype/production line explicitly

Fast, loose exploration and disciplined production work are both valid — but only when everyone knows which is which. Make the boundary explicit:

Which repos are production vs. sandbox.
Which branches require the full discipline.
Which environments an agent's output can reach.

Teams that leave this blurry produce prototypes that ship by accident. A written boundary keeps exploration fast and production safe at the same time.

Build the harness once, refine it many times

Reusable prompts, skill libraries, tool connections, and eval harnesses compound across projects. The teams that get the most out of AI development are the ones that build this shared harness once and keep improving it, rather than each person rebuilding their own from scratch. Treat it as infrastructure: documented, maintained, deliberately improved.

Hire and promote for judgment

As implementation gets cheaper, the bottleneck moves to specification, evaluation, and architectural judgment. The most valuable engineers over the next few years are the ones who can direct agents well — not the ones who can type the most code. Reflect that in how you hire, level, and develop people: weight specification skill, evaluation rigor, and system design over raw implementation speed.

The strongest setups are hybrid by design: humans set direction, agents implement, and clear handoff protocols govern the boundary. Code review, on-call, and team structure all evolve to treat agents as participants, not just tools.

Set up your own workflow

Move rule files, prompts, evals, and skills into the repo; require PR review for changes.
Assign an owner for the shared harness.
Add a CI gate that blocks merges when the eval suite regresses below a threshold.
Write a one-page review guide for generated-code failure modes.
Document the prototype-vs-production boundary: repos, branches, environments.
In your next hiring loop, add a specification-and-evaluation exercise, not just a coding test.

Source: The New SDLC With Vibe Coding (Google)

Treat the harness as code​

Gate on the eval, not the demo​

Re-shape code review for generated code​

Draw the prototype/production line explicitly​

Build the harness once, refine it many times​

Hire and promote for judgment​

Set up your own workflow​