Google explores the transition from traditional deterministic automation to agentic AI within Site Reliability Engineering. As system complexity grows with microservices, cloud scale, and increased code generation, Google is implementing SRE AI across the entire software development lifecycle to enhance reliability. The approach uses agents for automated runbook improvement, advanced anomaly detection, incident management orchestration, and autonomous investigation based on observability data.
- Moving from deterministic automation to agentic AI models
- Integration of AI in reliability design and documentation
- Using anomaly detection rather than static thresholds for alerting
- Orchestrating incident response via communication monitoring and automated summaries
- Leveraging historical data through AI Insights for risk management
- Adhering to principles of transparency, security, and agent identity
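To make the contrast between static thresholds and anomaly detection concrete, here is a minimal sketch in Python. It is not Google's method: a trailing-window z-score stands in for whatever models the article describes, and the function names, the window size, and the z-score limit are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def static_threshold_alerts(values, limit):
    """Return indices whose value exceeds a fixed limit (classic static alerting)."""
    return [i for i, v in enumerate(values) if v > limit]


def zscore_alerts(values, window=5, z_limit=3.0):
    """Return indices whose value deviates strongly from the trailing window.

    Hypothetical illustration: each point is compared with the mean and
    sample standard deviation of the `window` points immediately before it.
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as a deviation.
            if values[i] != mu:
                alerts.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Made-up latency samples in milliseconds, with one spike at index 6.
latencies = [100, 102, 98, 101, 99, 100, 160, 101, 100]
print(static_threshold_alerts(latencies, limit=200))  # spike stays under the fixed limit
print(zscore_alerts(latencies))                       # spike stands out against recent history
```

The spike to 160 ms never crosses a fixed 200 ms limit, so the static rule stays silent, while the comparison against recent behaviour flags it. A production detector would also need to handle seasonality and sustained shifts, which this sketch ignores.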
Launched in 2007, Chess.com is a premium platform for online chess and one of the largest of its kind. A Cloud SQL for MySQL shop, it transitioned to Cloud SQL Enterprise Plus edition, which improved the user experience, cut costs, and reduced p99 response latency from 14 ms to 4 ms.
This article introduces Google's top AI applications, including Google Gemini, Google Cloud, TensorFlow, Experiments with Google, and AI Hub, and provides a guide on how to start using them.
A recent article by Google Cloud SREs describes how they use the AI-powered Gemini CLI internally to resolve real-world outages. This approach improves reliability in critical infrastructure operations and reduces incident response time by integrating intelligent reasoning directly into terminal-based operational tools.
Google Cloud has explained how it accidentally deleted a customer account belonging to UniSuper, a $135 billion Australian pension fund. The incident led to two weeks of downtime for UniSuper's 647,000 members. Google admits that one of its employees, using an internal tool, made an inadvertent misconfiguration during the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer.
Google Cloud has announced native support for the OpenTelemetry Protocol (OTLP) in its Cloud Trace service, allowing developers to send trace data directly using OTLP and eliminating the need for vendor-specific exporters. The update also increases storage limits for attributes and spans.
This article details how Google SREs are leveraging Gemini 3 and Gemini CLI to accelerate incident response, root cause analysis, and postmortem creation, ultimately reducing Mean Time To Mitigation (MTTM) and improving system reliability.
How to unnest / extract nested JSON data in BigQuery
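As a rough illustration of what unnesting means, the sketch below mimics the row-expansion behaviour in plain Python rather than BigQuery SQL: each element of a nested array becomes its own output row, carrying the parent's other fields along. The `unnest` helper, the field names, and the sample record are all hypothetical.

```python
import json


def unnest(rows, array_field):
    """Expand each row into one output row per element of `array_field`.

    Rows whose array is empty or missing produce no output, which mirrors
    how a cross join against an unnested array drops such rows.
    """
    flattened = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != array_field}
        for element in row.get(array_field) or []:
            flattened.append({**parent, array_field: element})
    return flattened


# Hypothetical nested JSON record, parsed from a string.
raw = '{"order_id": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]}'
rows = [json.loads(raw)]

for flat_row in unnest(rows, "items"):
    # Each nested item is now a separate row alongside its parent order_id.
    print(flat_row["order_id"], flat_row["items"]["sku"], flat_row["items"]["qty"])
```

In BigQuery itself the same reshaping is expressed in SQL; this sketch only shows the shape of the result.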