Work · Infrastructure · 2024

Infra Monitoring

Infrastructure observation system — synthetic probes, SLO tracking, paging, and post-mortems.

Role: Platform build
Duration: 3 months
Stack: Go · Prometheus · Grafana · AWS

The problem

A fast-growing SaaS company was learning about outages from customer tickets. Metrics existed, but nobody had wired them to SLOs or paging, and nobody wrote post-mortems.

Our approach

  1. Defined SLOs per critical user journey and wired Prometheus alerts to PagerDuty with sensible burn-rate windows.

  2. Built synthetic probes for the golden flows, run from three regions, so we see what customers see.

  3. Shipped a lightweight post-mortem template and a runbook wiki so the on-call rotation could learn between incidents.

The outcome

MTTR dropped 62%, the team stopped learning about outages from customers, and SLOs became a shared vocabulary across engineering.

MTTR: -62%
Self-detected issues: 98%
On-call pages per week: -40%
