This article discusses causal inference, an emerging field in machine learning that goes beyond predicting what could happen to focus on understanding the cause-and-effect relationships in data. The author explains how to detect and fix errors in a directed acyclic graph (DAG) to make it a valid representation of the underlying data.
Organizations with complex distributed systems that span dozens of teams can have a hard time following such a practice without burning out the teams that own the client-facing services. A typical solution is to alert on every layer of the distributed system, but this approach almost always produces an excessive number of alerts and results in alert fatigue.
Adaptive Paging is an alert handler that leverages the causality captured by tracing, together with OpenTracing's semantic conventions, to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable cause, paging the respective team instead of the alert owner.
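
To make the idea concrete, here is a minimal sketch of one such heuristic. It is not the actual Adaptive Paging implementation: the `Span` class, the `probable_cause` and `team_to_page` functions, the service names, and the ownership map are all hypothetical, introduced only for illustration. The one thing taken from OpenTracing's semantic conventions is the boolean `error` tag. The sketch assumes the trace is a tree and, starting from the span of the service that fired the alert, keeps following children tagged as errors until none remain.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical, simplified view of a span in a trace."""
    span_id: str
    parent_id: Optional[str]
    service: str
    tags: Dict[str, object] = field(default_factory=dict)


def probable_cause(spans: List[Span], root_id: str) -> Span:
    """Follow erroring child spans from the root and return the deepest one."""
    by_id = {s.span_id: s for s in spans}
    children: Dict[str, List[Span]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s)

    current = by_id[root_id]
    while True:
        # OpenTracing semantic conventions mark failed operations with error=true.
        failing = [c for c in children.get(current.span_id, [])
                   if c.tags.get("error") is True]
        if not failing:
            return current
        # Simplification (assumption): follow only the first failing child.
        current = failing[0]


def team_to_page(spans: List[Span], root_id: str,
                 owners: Dict[str, str], alert_owner: str) -> str:
    """Page the owner of the probable cause, else fall back to the alert owner."""
    cause = probable_cause(spans, root_id)
    return owners.get(cause.service, alert_owner)


# Illustrative trace: frontend -> checkout -> payments-db fails; catalog is healthy.
trace = [
    Span("1", None, "frontend", {"error": True}),
    Span("2", "1", "checkout", {"error": True}),
    Span("3", "1", "catalog"),
    Span("4", "2", "payments-db", {"error": True}),
]
owners = {"frontend": "team-web", "checkout": "team-orders",
          "payments-db": "team-storage"}

print(team_to_page(trace, "1", owners, alert_owner="team-web"))  # team-storage
```

A real handler would need more than this single rule, for example a way to choose among several failing children and a policy for spans whose service has no known owner. Here the second case simply falls back to the alert owner.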