Key Takeaways
- Operations automation fails when it automates “happy paths” and ignores exceptions — exceptions are the work
- Define reliability in measurable terms (SLIs/SLOs) and use error budgets to prevent alert fatigue
- Treat operational workflows as event logs; this enables process mining, conformance checks, and prediction
- Route exceptions to owners with explicit thresholds, time windows, and escalation policies
- Start with 3–5 high-value user journeys, not a warehouse of metrics
The core idea: exceptions are the work
Most businesses don’t lose time because “nothing is automated.” They lose time because:
- a process deviates from expectation (an exception),
- the deviation is discovered late (low observability),
- routing is unclear (no owner),
- and recovery is manual (no playbook).
Automation that only accelerates the happy path often increases fragility, because it increases throughput without increasing detectability and recovery capacity.
Define “reliability” like a researcher, not like a slogan
Google’s SRE guidance provides a useful, rigorous framing:
- SLI: “the proportion of valid events that were good”
- SLO: the target for that SLI over a time window
- Error budget: the allowed “badness” implied by the SLO [1]
This framing matters in non-software operations too, because it turns vague goals (“faster”, “better”, “more accurate”) into measurable thresholds and tradeoffs.
Start from user journeys
Pick a small number of critical journeys (e.g., “quote → invoice”, “order → ship”, “ticket opened → resolved”). Then define one availability-style SLI and one latency-style SLI for each. Don’t start from dashboards.
Step 1: model your work as an event log
Process mining research is built on a simple but powerful representation:
- Each “case” (order, ticket, job, claim) generates a sequence of timestamped events.
The Process Mining Handbook emphasizes that event data engineering and log quality are major determinants of success. [3][4]
Minimum viable event log schema
For each event:
- `case_id` (e.g., order id)
- `event_name` (e.g., `payment_captured`)
- `timestamp`
- `actor` (human role or system)
- optional: `channel`, `location`, `amount`, `reason_code`
When you have this, you can answer:
- What is the actual path distribution?
- Where do we loop (rework)?
- What variants correlate with failure or delay?
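The questions above fall out of a few lines of grouping over the event log. A minimal sketch, assuming events are dicts with the `case_id`, `event_name`, and `timestamp` fields from the schema above:

```python
from collections import Counter

def variants(events):
    """Count each distinct path (variant) across cases."""
    cases = {}
    for e in sorted(events, key=lambda e: (e["case_id"], e["timestamp"])):
        cases.setdefault(e["case_id"], []).append(e["event_name"])
    return Counter(tuple(path) for path in cases.values())

def rework_cases(events):
    """Case ids whose path repeats an activity (a loop / rework signal)."""
    cases = {}
    for e in events:
        cases.setdefault(e["case_id"], []).append(e["event_name"])
    return {cid for cid, path in cases.items() if len(path) != len(set(path))}
```

This is the naive in-memory version; process mining tools do the same grouping at scale and add conformance checking on top.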
Step 2: define exceptions as hypotheses
An exception definition should be testable and falsifiable. Example:
- "If an order is in `awaiting_inventory` for > 2 hours during business hours, then it is at risk of missing its ship-by date."
That statement implies:
- a state definition,
- a time threshold,
- a time window,
- and a measurable outcome.
A practical exception taxonomy
- Latency exceptions: time in state exceeds threshold
- Quality exceptions: defect rate / error rate exceeds threshold
- Completeness exceptions: required data missing at a gate
- Conformance exceptions: process deviates from the reference model
- Volume exceptions: unexpected spikes/drops
Step 3: route, don’t just alert
Alerts are outputs; routing is a system.
An exception should have:
- an owner role (not a person),
- a response time target,
- a remediation playbook,
- an escalation path,
- and a post-incident learning loop.
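One way to make routing a system rather than an output is to encode each route as data. The exception names, roles, and URL below are hypothetical placeholders:

```python
from datetime import timedelta

# Each exception maps to an owner role (not a person), a response
# target, a playbook link, and an escalation chain.
ROUTES = {
    "awaiting_inventory_breach": {
        "owner_role": "inventory",
        "response_target": timedelta(minutes=30),
        "playbook": "https://example.com/runbooks/awaiting-inventory",
        "escalation": ["inventory", "ops_lead", "duty_manager"],
    },
}

def escalate(exception_name, unanswered_attempts):
    """Role to notify after a number of unanswered notifications."""
    chain = ROUTES[exception_name]["escalation"]
    return chain[min(unanswered_attempts, len(chain) - 1)]
```

Keeping routes in data rather than in people's heads is what makes the post-incident learning loop reviewable.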
Avoid alert fatigue mathematically
If every deviation becomes a notification, the system will be ignored. Error-budget thinking forces you to ask: “How much badness do we allow, and where should we spend attention?” [1]
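The arithmetic behind that question is simple. A sketch for an event-based SLO, using the 99.5% / 28-day target from the template later in this article:

```python
def error_budget(slo, valid_events):
    """Number of 'bad' events the SLO allows over the window."""
    return (1.0 - slo) * valid_events

def budget_spent(slo, valid_events, bad_events):
    """Fraction of the error budget consumed so far (can exceed 1.0)."""
    return bad_events / error_budget(slo, valid_events)
```

At 99.5% over 10,000 orders, the budget is 50 bad orders; alerting and human attention should concentrate where spend against that budget is fastest, not on every individual deviation.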
Step 4: predict before you breach (where it’s worth it)
Predictive process monitoring extends event logs into operational support: predicting remaining time, next activities, or outcome risk while a case is still in flight. [5]
Prediction is valuable when:
- the remediation action exists (you can intervene),
- and the intervention is cheaper than the failure.
Otherwise, prediction becomes “interesting” but not “useful.”
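A useful baseline, far simpler than the models surveyed in [5], is to predict remaining time from the historical average for a case's current state. A sketch under that assumption:

```python
from collections import defaultdict

def remaining_time_model(history):
    """Fit a per-state average from (state, remaining_seconds) pairs
    observed in finished cases. A naive baseline predictor."""
    totals, counts = defaultdict(float), defaultdict(int)
    for state, remaining in history:
        totals[state] += remaining
        counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

def predict_remaining(model, state, default=0.0):
    """Predicted remaining seconds for an in-flight case in `state`."""
    return model.get(state, default)
```

If this baseline already triggers cheap, effective interventions, a heavier model may not pay for itself; if it doesn't, the baseline gives you something to beat.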
Step 5: close the loop with outcome metrics
DORA’s research program is software-focused, but the measurement philosophy generalizes: you need a small set of outcome metrics that reflect throughput and stability, not vanity metrics. [6][7]
For general operations, a mapping might look like:
- Lead time: request → completion
- Change failure rate: % of cases requiring rework/escalation
- Time to restore: time from exception detection → resolution
A starter template (copy/paste into your ops design doc)
Journey: orderToShip
- SLI (availability-style): % of valid orders shipped successfully
- SLO: 99.5% over 28 days
- Error budget: 0.5% of orders (or equivalent time)
- Latency SLI: % of orders shipped within 24 hours
- Exceptions:
- `awaiting_inventory` > 2 hours (owner: inventory)
- `payment_pending` > 15 minutes (owner: billing)
- `label_print_failed` event count > baseline × 3 (owner: IT/ops)
- Playbooks: link each exception to a one-page runbook
Journey: ticketToResolution
- SLI: % of tickets resolved without reopening
- Latency SLI: % resolved within SLA by severity
- Exceptions: "stuck in `awaiting_customer` > 48h", etc.
Next steps
If you want automation that holds up under real-world variability, start with event logs, define exceptions as testable hypotheses, and run the routing loop with explicit error-budget constraints.
References
- [1] Google: The Art of SLOs (handbook PDF) (SLI equation; outage math; error budgets)
- [2] Google SRE Book: Service Level Objectives (SLOs as reliability targets and tradeoffs)
- [3] Process Mining Handbook (open access, 2022) (foundations; data challenges; monitoring)
- [4] Van der Aalst: Process Mining — Data Science in Action (2016) (canonical framing of process mining and event logs)
- [5] Process Mining Handbook — Predictive Process Monitoring chapter (open access) (predicting remaining time/outcomes for in-flight cases)
- [6] DORA Research: 2024 report landing page (outcome metrics orientation; research context)
- [7] 2024 DORA Accelerate State of DevOps Report (PDF) (throughput/stability measures and empirical findings)