Introducing Pulse AI: Root Cause Analysis

An alert fires at 2:14 AM. Payments are failing. You are three minutes into squinting at check logs, dashboards, and status pages, trying to reconstruct what just broke, when you realize you already know the answer. The upstream payment provider is returning 502s from all three regions, which means it is not a regional network issue on your side, which means you should check the provider's status page before doing anything else.

Those three minutes of pattern-matching are the part of on-call that almost every engineer hates. They are also exactly what Pulse AI now does for you, except that it takes under 30 seconds and finishes before you have even opened the incident page.

Today we are launching Pulse AI, automatic root cause diagnosis on every incident PulseAPI detects.

What Pulse AI does

When an incident is created, Pulse AI reads every signal we have collected about it — the raw check results across every monitoring region, the HTTP response details, recent latency and error-rate history, related incidents on connected services — and produces a structured diagnosis in under 30 seconds.

That diagnosis includes:

A root cause category. One of eleven labels: auth_failure, server_error, timeout, regional_outage, ssl_issue, rate_limited, payload_change, dns_failure, third_party, flaky, or unknown when the signal is not strong enough to commit.
A confidence score from 0% to 100%. Low confidence is a feature, not a failure: it tells you when the model is not sure, so you can treat the diagnosis accordingly.
A plain-English summary you could read to a non-engineer on a status page.
A technical root cause with the specifics an on-call engineer needs.
A user impact assessment — who is affected and how badly.
3 to 5 prioritized action items. Not generic advice; specific next steps based on the diagnosis.
Supporting evidence. The exact signals that led to the diagnosis, so you can verify the reasoning before acting on it.
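
As a rough illustration, the fields listed above could be modeled as a small record type. This is only a sketch: the class name, field names, and validation rules are hypothetical and are not the actual Pulse AI schema. Only the eleven category labels, the 0 to 100 confidence range, and the 3 to 5 action items come from the description above.

```python
from dataclasses import dataclass
from typing import Tuple

# The eleven category labels listed above.
ROOT_CAUSE_CATEGORIES = frozenset({
    "auth_failure", "server_error", "timeout", "regional_outage",
    "ssl_issue", "rate_limited", "payload_change", "dns_failure",
    "third_party", "flaky", "unknown",
})


@dataclass(frozen=True)
class Diagnosis:
    # Field names are hypothetical; only the concepts come from the post.
    category: str
    confidence: float                # percentage, 0 to 100
    summary: str                     # plain-English summary
    technical_root_cause: str
    user_impact: str
    action_items: Tuple[str, ...]    # 3 to 5 prioritized next steps
    evidence: Tuple[str, ...]        # signals that support the diagnosis

    def __post_init__(self) -> None:
        # Reject values that fall outside what the post describes.
        if self.category not in ROOT_CAUSE_CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")
        if not 3 <= len(self.action_items) <= 5:
            raise ValueError("expected 3 to 5 action items")


# Illustrative values only, loosely based on the opening scenario.
example = Diagnosis(
    category="third_party",
    confidence=92.0,
    summary="Payments are failing because an upstream provider is down.",
    technical_root_cause="Upstream provider returns 502 from all regions.",
    user_impact="Customers cannot complete payments.",
    action_items=(
        "Check the provider's status page",
        "Confirm the failure is not regional",
        "Post a status page update",
    ),
    evidence=("502 responses observed from all three regions",),
)
```

Validating at construction time means a malformed diagnosis fails loudly instead of reaching the incident page half-filled.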

By the time you click the link in the alert email, the analysis is already waiting on the incident page.

Why we built it

Every on-call engineer does the same thing when an alert fires. Open the incident. Check which regions saw it. Look at the HTTP status code. Check response-time trends for the last hour. Check related endpoints. Pattern-match against incidents you have seen before. Decide what broke.

Most of that work is mechanical. Most of it follows patterns. And most of it is exactly the kind of work large language models are genuinely good at — reading a lot of structured signals and producing a coherent interpretation.

The question was never whether AI could help with incident triage. The question was whether we could make it reliable enough to trust, fast enough to be useful, and safe enough to run on production data without creating new problems.

How we made it safe

Before anything is sent for analysis, request and response data pass through a sanitization layer. Authorization headers, API keys, and JWT tokens are stripped. Email addresses are redacted. The model sees the shape of the problem, not credentials from your systems.

This is a hard guarantee, not a configuration toggle. There is no version of Pulse AI that transmits raw credentials or PII. To make this verifiable, the sanitization rules are applied server-side before any outbound call to Anthropic, and the sanitized payload is logged for audit.
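
To give a feel for what a sanitization pass like this involves, here is a minimal sketch. It is not the actual PulseAPI implementation: the header names for API keys, the regular expressions, and the redaction placeholders are all assumptions chosen for illustration, and a production rule set would need to cover many more cases.

```python
import re
from typing import Dict

# Assumed header names; the real rule set is not published in this post.
SENSITIVE_HEADERS = frozenset({"authorization", "x-api-key", "api-key"})

# Heuristic: three base64url segments, the first starting with "eyJ".
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

# Deliberately simple email pattern, not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Drop credential-bearing headers entirely (case-insensitive match)."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS
    }


def sanitize_text(text: str) -> str:
    """Replace JWT-shaped tokens and email addresses with placeholders."""
    text = JWT_RE.sub("[REDACTED_JWT]", text)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


clean_headers = sanitize_headers({
    "Authorization": "Bearer secret-value",
    "X-API-Key": "secret-key",
    "Content-Type": "application/json",
})
clean_body = sanitize_text(
    "user alice@example.com sent eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.abc123"
)
```

Stripping headers outright while replacing body values with placeholders keeps the shape of the payload readable without carrying the secret itself.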

How we made it fast

Analysis runs on a dedicated Horizon queue so it never blocks incident creation or notification delivery. The moment an incident record is committed, the job is dispatched. Median end-to-end time — from check failure to finished analysis — is well under 30 seconds.

If something goes wrong with analysis (rate limits, a transient model outage, a malformed response), the incident still gets created and notifications still get sent. Pulse AI is purely additive. It never stops the core alerting loop.
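
The "purely additive" behavior can be sketched as follows. This is a simplified illustration, not the production code: the function names and the incident dictionary are hypothetical, and the analysis runs inline here for brevity, whereas the post describes it running as a job on a separate queue.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("pulse_ai_sketch")


def run_analysis_job(
    incident: dict, analyze: Callable[[dict], dict]
) -> Optional[dict]:
    """Run the analysis; on any failure, log it and return None."""
    try:
        return analyze(incident)
    except Exception as exc:  # rate limit, model outage, malformed response
        logger.warning("analysis failed for %s: %s", incident["id"], exc)
        return None


def handle_check_failure(
    incident_id: str,
    notify: Callable[[dict], None],
    analyze: Callable[[dict], dict],
) -> dict:
    """Create the incident and notify first; analysis is best-effort."""
    incident = {"id": incident_id, "status": "open", "analysis": None}
    notify(incident)  # the core alerting loop never waits on analysis
    incident["analysis"] = run_analysis_job(incident, analyze)
    return incident


def failing_analyzer(incident: dict) -> dict:
    # Stand-in for a transient model outage.
    raise RuntimeError("model temporarily unavailable")


sent = []
result = handle_check_failure("inc-1", sent.append, failing_analyzer)
```

In the usage above the analyzer fails, yet the incident is still created and the notification is still recorded; the only difference is that the analysis field stays empty.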

Who can use it

Pulse AI is available on every paid plan, with daily limits based on how many incidents a team typically triages:

Starter — 25 analyses per day
Professional — 100 analyses per day
Team — Unlimited

Every team member can read the analysis on the incident page. Owners, admins, and members can re-run the analysis on demand — useful when new check data arrives mid-incident and you want a fresh read.
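
The plan limits and re-run permissions above could be expressed as a simple check. This is a sketch under assumptions: the lowercase plan and role identifiers and the function names are hypothetical, and it assumes that an on-demand re-run counts against the same daily limit, which the post does not state.

```python
from typing import Optional

# Daily analysis limits per plan; None means unlimited.
PLAN_DAILY_LIMITS = {"starter": 25, "professional": 100, "team": None}

# Roles that may re-run an analysis on demand.
RERUN_ROLES = frozenset({"owner", "admin", "member"})


def remaining_analyses(plan: str, used_today: int) -> Optional[int]:
    """Return analyses left today, or None when the plan is unlimited."""
    limit = PLAN_DAILY_LIMITS[plan]
    if limit is None:
        return None
    return max(limit - used_today, 0)


def can_rerun(role: str, plan: str, used_today: int) -> bool:
    """Allow a re-run only for permitted roles with quota remaining."""
    if role not in RERUN_ROLES:
        return False
    remaining = remaining_analyses(plan, used_today)
    # Assumption: re-runs draw from the same daily quota.
    return remaining is None or remaining > 0
```

For example, under these assumptions a member on the Starter plan who has already used 25 analyses today would be refused, while the same member on the Team plan would not.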

What Pulse AI is not

Pulse AI is a diagnostic assistant, not an autopilot. It does not take action on your infrastructure. It does not silence alerts. It does not close incidents. Its job is to take the first five minutes of on-call triage, the part that is mostly pattern matching, and hand you the summary so you can spend your time on the part that actually needs a human.

We also want to be clear about what a confidence score means. A 92% confidence diagnosis is not a guarantee. It is a strong prior. For a small fraction of incidents, the evidence will point one way and the truth will be something else. The supporting-evidence section exists specifically so you can sanity-check the diagnosis before acting on it.

What is next

Pulse AI is the first of several AI features we are building. The next ones are designed to work together:

Incident clustering — automatic grouping of related incidents across endpoints and services
Anomaly narration — natural-language summaries of unusual patterns before they become incidents
Post-incident reports — draft postmortems generated from the full incident timeline

Read more about how intelligent detection fits into the broader product in our post on why we built PulseAPI, or dig into the detection side in API alerting mistakes.

PulseAPI monitors your APIs from multiple regions, validates response structure, and now diagnoses every incident automatically. Start monitoring free →
