Infra Monitoring
Infrastructure monitoring system: synthetic probes, SLO tracking, paging, and post-mortems.
- Role: Platform build
- Duration: 3 months
- Stack: Go · Prometheus · Grafana · AWS
The problem
A fast-growing SaaS company was discovering outages from customer tickets. Metrics existed, but nobody had wired them to SLOs or paging, and nobody wrote post-mortems.
Our approach
- 01
Defined SLOs per critical user journey and wired Prometheus alerts to PagerDuty with sensible burn-rate windows.
- 02
Built synthetic probes for the golden flows, run from three regions, so we see what customers see.
- 03
Shipped a lightweight post-mortem template and a runbook wiki so the oncall rotation could learn between incidents.
The outcome
MTTR dropped by 62%, the team now detects 98% of issues itself instead of learning about them from customers, weekly oncall pages fell by 40%, and SLOs became a shared vocabulary across engineering.
- MTTR: -62%
- Self-detected issues: 98%
- Oncall pages / wk: -40%