~ writing/index.md
gregroy@gregsplace ~/writing $ ls -la && cat *.md

Writing

Notes on DevOps, incident response, and building sustainable engineering cultures. Short, opinionated, and written from direct experience — not the LinkedIn thought-leader version.

The 3-pager that fixed on-call

Clarified scope, severity levels, and paging rules.

Problem

  • Ambiguous scope: no one agreed what on-call "owned."
  • Severity levels meant different things to different teams.
  • Excessive noise → false pages → responder fatigue.
  • Runbooks existed but weren't authoritative or consistently used.

The solution (3 pages, one owner)

  1. Scope. Systems and services explicitly in scope; an escalation path for everything else.
  2. Severity. SEV-1 (customer/business critical), SEV-2 (degraded but functional), SEV-3 (nuisance/ops toil), each with concrete examples.
  3. Paging rules. Pages only for SEV-1 and high-confidence SEV-2 signals. Everything else → ticket + business hours.
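
The three pages above boil down to one routing decision per alert. Here is a minimal sketch of that decision, not the actual tooling we used. The `Alert` shape, the `confidence` score, and the 0.9 cutoff for "high-confidence" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # customer/business critical
    SEV2 = 2  # degraded but functional
    SEV3 = 3  # nuisance / ops toil


@dataclass
class Alert:
    service: str
    severity: Severity
    confidence: float  # hypothetical 0.0-1.0 score for how trustworthy the signal is


# Hypothetical cutoff for what counts as a "high-confidence" SEV-2 signal.
SEV2_PAGE_CONFIDENCE = 0.9


def route(alert: Alert, in_scope: set[str]) -> str:
    """Return 'page', 'ticket', or 'escalate' for a single alert."""
    if alert.service not in in_scope:
        return "escalate"  # out of scope: follow the escalation path instead
    if alert.severity is Severity.SEV1:
        return "page"
    if alert.severity is Severity.SEV2 and alert.confidence >= SEV2_PAGE_CONFIDENCE:
        return "page"
    return "ticket"  # everything else waits for business hours


# Usage: a noisy SEV-2 on an in-scope service becomes a ticket, not a page.
decision = route(Alert("checkout", Severity.SEV2, 0.5), in_scope={"checkout"})
```

The point of writing it down this way is that the default branch is "ticket": an alert has to earn a page.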

Guardrails we added

  • Golden signals. Latency, traffic, errors, saturation — per service.
  • Runbook links inline for each common alert.
  • Comms script. Who says what, where, and when (Slack / status page / email).
  • Single owner for updates to prevent drift.
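
Two of these guardrails are easy to check mechanically. The sketch below assumes alert rules are available as plain dictionaries with hypothetical `name`, `signal`, and `runbook_url` keys; real alerting systems store this differently, so treat the field names as placeholders.

```python
GOLDEN_SIGNALS = {"latency", "traffic", "errors", "saturation"}


def lint_alert_rules(rules: list[dict]) -> list[str]:
    """Return a human-readable problem list; an empty list means all rules pass."""
    problems = []
    for rule in rules:
        name = rule.get("name", "<unnamed>")
        # Each alert should map to one of the four golden signals.
        if rule.get("signal") not in GOLDEN_SIGNALS:
            problems.append(f"{name}: signal is not one of the golden signals")
        # Each alert should carry its runbook link inline.
        if not rule.get("runbook_url"):
            problems.append(f"{name}: missing runbook link")
    return problems


# Usage with made-up rules: the second one fails both checks.
example_rules = [
    {"name": "api-p99-latency", "signal": "latency",
     "runbook_url": "https://runbooks.example/api-latency"},
    {"name": "disk-weirdness", "signal": "vibes"},
]
issues = lint_alert_rules(example_rules)
```

A check like this could run in CI against the alert definitions, which is one way a single owner can keep drift visible without reviewing every change by hand.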

Results

  • ~40% fewer pages (noise removed), better sleep, better focus.
  • Meaningful pages → faster time-to-mitigation and clearer handoffs.
  • Post-incident reviews improved because severity was consistent org-wide.

Short, living, and owned. That's what made it work.

From IC to leader: a lightweight mentoring path

How I help strong ICs become calm, trusted incident leaders.

The path

  1. Shadow. Join incident channels as a quiet observer; review post-mortems together.
  2. Co-pilot. Run a small portion (notes, timeline, or comms) with a senior lead present.
  3. Lead a drill. Tabletop exercises with clear injects and measurable outcomes.
  4. Own an incident. Senior leader backstops; feedback within 24 hours.

What we practice

  • Clarity over certainty. Call the severity with available info, then adjust.
  • Small batches. One change at a time, explicit rollback plan.
  • Comms cadence. External and internal updates on a timer, not a feeling.
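
"On a timer, not a feeling" can be made literal. This is a small sketch of that idea; the 30- and 60-minute intervals are placeholder values I picked for the example, not a recommendation, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical update intervals per severity; tune these to your own policy.
UPDATE_INTERVAL = {
    "SEV-1": timedelta(minutes=30),
    "SEV-2": timedelta(minutes=60),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """When the next internal/external update is owed."""
    return last_update + UPDATE_INTERVAL[severity]


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True once the clock, not the incident lead's gut, says an update is due."""
    return now >= next_update_due(severity, last_update)


# Usage: 31 minutes after the last SEV-1 update, the next one is overdue.
last = datetime(2024, 1, 1, 12, 0)
overdue = is_overdue("SEV-1", last, now=datetime(2024, 1, 1, 12, 31))
```

Even "no change, still investigating" counts as an update; the timer only cares that someone said something.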

Artifacts

  • Incident commander checklist — roles, comms, handoffs.
  • Runbook skeleton — preconditions, steps, expected results, rollback.
  • After-action template — facts → findings → fixes → follow-through (owners + dates).
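
As an illustration of the runbook skeleton, here is one way its four parts could be modeled so that an incomplete runbook is easy to spot. The class and method names are hypothetical, and this is a sketch of the structure rather than a format we shipped.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    expected: str  # what the responder should see if the step worked


@dataclass
class Runbook:
    title: str
    preconditions: list[str]
    steps: list[Step]
    rollback: list[str]

    def missing_sections(self) -> list[str]:
        """Name the skeleton sections that are empty or incomplete."""
        missing = []
        if not self.preconditions:
            missing.append("preconditions")
        if not self.steps:
            missing.append("steps")
        if any(not step.expected for step in self.steps):
            missing.append("expected results")
        if not self.rollback:
            missing.append("rollback")
        return missing


# Usage: a draft with a step that has no expected result and no rollback plan.
draft = Runbook(
    title="Restart the queue worker",
    preconditions=["Queue depth alert is firing"],
    steps=[Step(action="Restart the worker process", expected="")],
    rollback=[],
)
gaps = draft.missing_sections()
```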

Mentoring is a system: reps, feedback, and a safe runway to try leading for real.

Blameless ≠ consequence-free: making post-mortems stick

Turn incidents into durable improvements without witch hunts or wheel-spinning.

Principles

  • Blame the system, not the person. Design makes errors likely or unlikely.
  • Bias to facts. Timeline first, opinions later.
  • Right-sized fixes. Priority is preventing recurrence, not boiling the ocean.

The template

  • Timeline. Facts with timestamps and sources (dashboards, logs, comms).
  • Customer impact. Who, how long, severity.
  • Contributing factors. Technical and organizational.
  • Actions. Fix now (days), fix next (weeks), invest (quarter) — with owners and dates.
  • Follow-through. Review action status weekly until done.
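
The Actions and Follow-through items are where most templates quietly fail, so here is a minimal sketch of the weekly review as code. The `Action` fields and the three horizon labels ("now", "next", "invest") are my shorthand for this example, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Action:
    title: str
    horizon: str            # "now" (days), "next" (weeks), or "invest" (quarter)
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def weekly_review(actions: list[Action], today: date) -> dict[str, list[str]]:
    """Summarize open actions: all open, those lacking owner/date, and overdue ones."""
    open_actions = [a for a in actions if not a.done]
    return {
        "open": [a.title for a in open_actions],
        "needs_owner_or_date": [
            a.title for a in open_actions if not a.owner or not a.due
        ],
        "overdue": [
            a.title for a in open_actions if a.due is not None and a.due < today
        ],
    }


# Usage with made-up actions from a single review.
actions = [
    Action("Add alert on queue depth", "now", "sam", date(2024, 3, 1)),
    Action("Rework deploy pipeline", "invest", None, None),
    Action("Fix flaky health check", "next", "ana", date(2024, 3, 20), done=True),
]
summary = weekly_review(actions, today=date(2024, 3, 8))
```

Anything in `needs_owner_or_date` is not really an action yet, and anything in `overdue` is the agenda for the weekly review.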

What changed when we did this

  • Repeat issues dropped because actions had owners and deadlines.
  • Engineers participated more; psychological safety increased.
  • Leaders got better signal on where to invest — people, tooling, or process.

Post-mortems pay off when they drive change. That means owners, dates, and visible follow-up.

Coming up

  • Runbook skeletons — how to author once and keep them useful.
  • Error budgets — how they guide trade-offs.
  • Preview envs — feature branches without fear.
  • Project Lattice — a developer log of building a visionOS infrastructure controller.

─────────────────────────────────────────────────────

Found this useful? Send me a note.