Work · Infrastructure · 2024

Infra Monitoring

Infrastructure observability system: synthetic probes, SLO tracking, paging, and post-mortems.

Role: Platform build
Duration: 3 months
Stack: Go · Prometheus · Grafana · AWS

The problem

A fast-growing SaaS was discovering outages from customer tickets. Metrics existed, but nobody had wired them to SLOs or paging — and nobody wrote post-mortems.

Our approach

01
Defined SLOs per critical user journey and wired Prometheus alerts to PagerDuty with sensible burn-rate windows.
02
Built synthetic probes for the golden flows so we see what customers see, from three regions.
03
Shipped a lightweight post-mortem template + runbook wiki so the oncall rotation could learn between incidents.

The outcome

MTTR fell by 62%, 98% of issues are now detected internally instead of through customer tickets, and SLOs became a shared vocabulary across engineering.

MTTR: -62%
Self-detected issues: 98%
Oncall pages / wk: -40%

