e11

Alerts · Sovereign event bus

The nervous system your fleet should own.

Alerts is the service that ingests every notable event your platform produces and dispatches it across seven channels — email, signed webhook, unsigned webhook, Telegram, Slack, Discord, and Twilio SMS. Producers fan in through one envelope. Routing rules and channel instances live in your database, not a vendor's. Every event is a row before it is a notification.

Owned ingest. Owned dispatch. Owned history. No PagerDuty rent. No vendor reading your platform events.

Surface signal

Status

LIVE

Channels

Seven shipped

Service

alerts-api:8040

Why this exists

Every service in a fleet emits events. Most teams lose them.

The signal a platform produces — a publish intent failing, a critical finding from a scan, a deploy succeeding, a consultation form submitted, a queue stalling — is the operational truth of the business. Most teams scatter that signal across vendor inboxes, throw the loud ones into a shared Slack channel, and pay PagerDuty rent for the on-call lane. The shape drifts per service. The history dies on retention. The webhook keys live in someone's password manager.

Alerts is the OWNED nervous system. Producers fan in through one envelope. Channels fan out through a typed contract. Every event lands in a row before it lands in an inbox — and the routing rules are yours to read, edit, and replay.

Self-sustained by design

Owned, not rented.

An alerting layer is one of the few pieces of infrastructure that should never depend on a vendor having a good day. The hub is built so the people who run the platform also own the substrate that tells them how it's doing.

01

Own ingest.

One FastAPI surface. POST `/v1/events` with `{event, payload, severity?, idempotency_key?}`. No vendor mailbox sees the body before you do, and the API key is one you rotated yesterday — not one Datadog issued in 2023.

02

Own dispatch.

Seven channel types ship today — SMTP email, signed webhook, unsigned webhook, Telegram, Slack, Discord, SMS over Twilio. Each channel is one file behind a `BaseChannel` ABC. Adding Teams or PagerDuty is a class, not a procurement cycle.

03

Own history.

Every received event is a row in `events`. Every delivery attempt — succeeded, failed, skipped, with the error string — is a row in `deliveries`. SQLite under `alerts_data:/data`, WAL mode, trivial backup. Replay is a POST, not a vendor support ticket.
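A sketch of what those two tables can look like in SQLAlchemy — the table names, the status values, the error string, and the WAL-mode SQLite file come from above; every column name beyond those is an assumption, not the shipped schema:

```python
from sqlalchemy import (JSON, Column, DateTime, ForeignKey, Integer,
                        String, create_engine, event)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Event(Base):
    __tablename__ = "events"                       # one row per received event
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)          # e.g. "cal.sync.failed"
    severity = Column(String, default="info")
    payload = Column(JSON, nullable=False)
    received_at = Column(DateTime)

class Delivery(Base):
    __tablename__ = "deliveries"                   # one row per delivery attempt
    id = Column(Integer, primary_key=True)
    event_id = Column(Integer, ForeignKey("events.id"))
    channel = Column(String)                       # channel instance name
    status = Column(String)                        # succeeded | failed | skipped
    error = Column(String, nullable=True)          # the error string, when failed

engine = create_engine("sqlite:////data/alerts.db")   # file path on alerts_data assumed

@event.listens_for(engine, "connect")
def _enable_wal(dbapi_conn, _record):
    dbapi_conn.execute("PRAGMA journal_mode=WAL")  # WAL mode, set per connection
```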

04

Own routing.

Per-event routing rules — exact match, prefix wildcard, severity floor — live in a `routes` table you can edit through the operator UI or the admin API. No Zapier middle-man. No "contact your CSM to enable per-channel filtering."

05

Own signing keys.

The signed webhook channel HMACs the body with a per-instance secret and presents it as `x-e11-signature: sha256=<hex>`. The receiver verifies with a key you both hold. There is no third party reading the platform events your fleet emits.
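Verification on the receiving side is stdlib-only. A minimal sketch, assuming the HMAC covers the raw request body:

```python
import hashlib
import hmac

def verify(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check an 'x-e11-signature: sha256=<hex>' header against the raw body."""
    scheme, _, received = header_value.partition("=")
    if scheme != "sha256":
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received)   # constant-time compare
```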

The primitive

Three things you can name. Typed.

Alerts is built on three primitives — an event to ingest, a channel to deliver through, and a route to map between them. Every other behaviour in the service is a function of these three records.

α · Ingest

event

POST /v1/events

The producer envelope. `{event, payload, severity?, idempotency_key?}`. 22 events registered today across pr, blog, cal, mail, dhara, web, platform — but unknown event names are accepted and default to `info`. Idempotency is producer-keyed and Redis-deduped for 24 hours.

β · Channel

channel

BaseChannel.deliver()

A named instance of a channel type — `ops-email`, `security-slack`, `kosh-harvester`. Type-specific config lives in `config_json`. Adding a channel type is a single file behind one abstract method. Failures on the primary channel trigger Celery retries; secondaries are best-effort by design.

γ · Route

route

match_pattern → channels[]

The mapping from event names to channel instance lists. `dhara.scan.critical_finding` to `security-slack` plus `oncall-pd`, `platform.*` with `min_severity: urgent` to the on-call lane. Priority-ordered. Multiple matches fan out — a route is not a stop.

How it fits the fleet

Every service that has something to say.

Alerts is the place every Eleven11 producer eventually points its outbound POST. Six producers are wired in production today, four more are scheduled — and a customer's own services slot in with the same envelope.

pr

Wired. `pr.run.{started,completed,failed}`, `pr.publish.{scheduled,delivered,failed}`, `pr.llm.stage_failed`, `pr.moderate.rejected`, `pr.search.failed`. The LLM-exhaustion compat hook predates the generic ingest and is preserved.

cal

Wired. `cal.sync.{completed,failed}`, `cal.source.unreachable`, `cal.source.401`. A customer's Google source going 401 surfaces as an urgent event, not a silent stale calendar.

blog

Wired. `blog.ingest.{received,failed,signature_invalid}`. A signature-invalid event is urgent — that's a potential attack on the publish path, not a parsing issue.

web

Wired. `web.consultation.{submitted,failed}` from this site's consultation form. Submissions become sales-lead signals routable to a sales channel; failures surface SMTP issues that would otherwise drop inbound interest.

discovery

Wired. `discovery.asset.{new,changed}`, `discovery.finding.critical`, `discovery.scan.failed`. An asset's fingerprint changing without a deploy is the kind of signal that should never sit in a log.

harvester → kosh

Wired via the signed webhook channel. Harvester emits `harvester.job.completed`, alerts forwards it to `http://e11-kosh:4070/api/inbound/harvester` with an HMAC signature the receiver verifies — one hop, two trust zones.

Surfaces & contracts

Six surfaces. Each with a clear job.

The producer surface is one envelope. The admin surface is a small CRUD plane. The operator surface is the human view. Nothing else.

POST /v1/events

Generic ingest.

The one envelope every producer learns. Required `event` and `payload`. Optional `severity` and `idempotency_key`. Returns 202 with the Celery task id. Auth via `x-api-key` or `Authorization: Bearer`.
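A producer needs nothing beyond the stdlib to speak this contract. A sketch — the service address is from this page; the API key is a placeholder:

```python
import json
import urllib.request

def emit(event: str, payload: dict, severity: str | None = None,
         idempotency_key: str | None = None) -> dict:
    """POST one envelope to /v1/events; returns the 202 body with the task id."""
    body = {"event": event, "payload": payload}
    if severity:
        body["severity"] = severity
    if idempotency_key:
        body["idempotency_key"] = idempotency_key
    req = urllib.request.Request(
        "http://alerts-api:8040/v1/events",
        data=json.dumps(body).encode(),
        headers={"content-type": "application/json",
                 "x-api-key": "<your-rotated-key>"},   # placeholder
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

emit("web.consultation.submitted", {"email": "lead@example.com"})
```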

POST /v1/hooks/pr/llm-failure

Compat shim.

Matches `pr/src/lib/llm-failure-alert.ts` exactly so the existing pr deploy didn't have to change shape on the day the hub landed. Internally maps to `pr.llm.stage_failed` with auto-keyed idempotency.

GET /health

Liveness.

Returns `{status: "ok", service: "e11-alerts"}`. No dependency checks today — `/ready` with Redis ping plus queue depth is on the close-out list.

/v1/admin/channels

Channel CRUD.

List, create, patch, soft-delete channel instances. `POST /v1/admin/channels/{id}/test` fires a synthetic event through one channel and returns the result — for verifying a Slack webhook before it goes on a route.

/v1/admin/routes

Route CRUD.

Manage the mapping from event names to channel lists. Match patterns are exact (`dhara.scan.critical_finding`), prefix-wildcard (`platform.*`), or catch-all (`*`). Priority-ordered. Optional severity floor per route.
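A sketch of how the three pattern kinds and the severity floor can compose — illustrative logic, not the shipped matcher, and the severity scale beyond `info` and `urgent` is an assumption:

```python
SEVERITIES = ["info", "warning", "urgent"]       # "warning" is assumed

def matches(pattern: str, event: str) -> bool:
    if pattern == "*":                           # catch-all
        return True
    if pattern.endswith(".*"):                   # prefix wildcard, e.g. "platform.*"
        return event.startswith(pattern[:-1])
    return pattern == event                      # exact match

def resolve(routes: list[dict], event: str, severity: str) -> list[str]:
    """Every channel on every matching route — a route is not a stop."""
    out: list[str] = []
    for route in sorted(routes, key=lambda r: r["priority"]):
        if not matches(route["match_pattern"], event):
            continue
        floor = route.get("min_severity")
        if floor and SEVERITIES.index(severity) < SEVERITIES.index(floor):
            continue                             # below this route's floor
        out.extend(route["channels"])
    return out
```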

operator.eleven11.pro/alerts

Operator UI.

The same channel and route surfaces, with event history and per-channel delivery success rates. The view a human reaches for when an event didn't arrive — to read the row that says skipped, failed, or delivered, with the error string attached.

Senior engineering, visible

The proofs are in the substrate.

Five decisions, visible in the queue topology, the retry semantics, the channel ABC, and the deploy shape — not adjectives, design choices.

Two processes, one image.

`alerts-api` runs uvicorn on `:8040`. `alerts-worker` runs Celery against `alerts.urgent,alerts.default`. Same Docker image, two commands. No shared library, no service mesh — just FastAPI plus Celery plus Redis on `e11-edge`.

Severity controls the queue.

`urgent` events go to `alerts.urgent`, everything else to `alerts.default`. Celery pulls left-to-right, so an urgent event jumps a backed-up default queue. `worker_prefetch_multiplier=1` keeps a slow channel from hoarding prefetched tasks another worker could be running.
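In Celery terms that topology is a handful of configuration lines. A sketch — the broker URL is assumed:

```python
from celery import Celery

app = Celery("alerts", broker="redis://redis:6379/0")   # broker URL assumed
app.conf.task_default_queue = "alerts.default"
app.conf.worker_prefetch_multiplier = 1   # a slow delivery can't hoard prefetched work

@app.task(name="alerts.dispatch")
def dispatch(event_id: int) -> None:
    ...   # load the event row, resolve routes, deliver per channel

def enqueue(event_id: int, severity: str) -> None:
    # Severity picks the lane; the worker consumes alerts.urgent,alerts.default.
    queue = "alerts.urgent" if severity == "urgent" else "alerts.default"
    dispatch.apply_async(args=[event_id], queue=queue)
```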

Primary fails loud, secondaries fail quiet.

`PRIMARY_TYPES = {"email"}`. Email failures raise — Celery retries with `max_retries=4` and exponential backoff. Slack, Discord, Telegram, signed-webhook failures are logged, swallowed, and recorded on the deliveries row. Secondary noise never blocks the primary path.
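The same split, sketched as a task body. `PRIMARY_TYPES` and `max_retries=4` come from above; `get_channel` and `record_delivery` are hypothetical stand-ins for the factory lookup and the deliveries-table insert:

```python
from celery import Celery

app = Celery("alerts")
PRIMARY_TYPES = {"email"}

def get_channel(channel_type: str, name: str):
    """Hypothetical stand-in for the channels_factory lookup."""
    ...

def record_delivery(event: dict, channel: str, status: str, error: str = "") -> None:
    """Hypothetical stand-in for the deliveries-table insert."""
    ...

@app.task(bind=True, max_retries=4)
def deliver_one(self, channel_type: str, name: str, event: dict) -> None:
    try:
        get_channel(channel_type, name).deliver(
            event["name"], event["severity"], event["payload"])
        record_delivery(event, name, status="succeeded")
    except Exception as exc:
        record_delivery(event, name, status="failed", error=str(exc))
        if channel_type in PRIMARY_TYPES:
            # Primary fails loud: exponential backoff, up to four retries.
            raise self.retry(exc=exc, countdown=2 ** self.request.retries)
        # Secondary fails quiet: recorded on the row, never blocks the primary path.
```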

Idempotency is a 24h Redis SET NX EX.

Producers pass `idempotency_key`; alerts does `SET idempotency:<key> 1 NX EX 86400`. A duplicate producer retry returns `{skipped: true}` without re-sending. The pr LLM hook auto-keys to `pr.llm:{runId}:{stage}` so the same stage failure can't double-email.
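The dedupe in redis-py terms — one call, a minimal sketch with connection details assumed:

```python
import redis

r = redis.Redis(host="redis")   # connection details assumed

def first_delivery(key: str) -> bool:
    """SET idempotency:<key> 1 NX EX 86400 — True exactly once per key per 24 h."""
    return bool(r.set(f"idempotency:{key}", 1, nx=True, ex=86400))

# A duplicate producer retry short-circuits before any channel is touched:
# if not first_delivery(key): return {"skipped": True}
```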

FastAPI built-in docs are disabled on purpose.

`docs_url=None, redoc_url=None, openapi_url=None`. The repo was briefly public and the endpoint shape was enumerable from `/openapi.json`. The fix is in `app/main.py` and not negotiable — security posture, not preference.
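The whole fix is three keyword arguments on the app constructor:

```python
from fastapi import FastAPI

# Removes /docs, /redoc, and /openapi.json entirely.
app = FastAPI(docs_url=None, redoc_url=None, openapi_url=None)
```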

Who this is for

Teams whose platforms have a pulse.

Alerts earns its keep when the cost of losing a signal — or paying someone else to read it for you — starts to exceed the cost of running a small Python service.

Platforms with three or more services, where "how did the deploy go" deserves a row, not a glance at a Slack channel.
Security teams who refuse to let a third party read every critical-finding event before they do.
Operators who want one envelope every producer learns, instead of six per-vendor SDKs and six different retry shapes.
Tenants on per-customer boxes who want their own alerts hub — not a multi-tenant queue someone else can throttle.
Anyone whose monthly PagerDuty bill has crossed the threshold where running their own urgent-channel routing is now the cheaper option.

FAQ

Final friction, reduced.

Is this a PagerDuty replacement?

It's the substrate that makes PagerDuty optional. Today the urgent lane goes to SMS over Twilio, signed webhooks to anything you already run, and Telegram to a phone you carry. A PagerDuty channel class is one file behind `BaseChannel` if you want it — but most teams find that a routed signed webhook into their own on-call rotation is enough.

Why SQLite for event history?

Event volume is tens to thousands per day at the e11 fleet's scale. SQLite in WAL mode handles that without breathing hard, the file lives on a Docker volume, backup is a copy. The driver is SQLAlchemy — the day Postgres becomes the right answer is one DSN change and an `alembic upgrade`.
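That migration path, in code terms — both DSNs illustrative:

```python
from sqlalchemy import create_engine

engine = create_engine("sqlite:////data/alerts.db")                  # today
# engine = create_engine("postgresql+psycopg://alerts@db/alerts")    # when volume demands it
# ...then: alembic upgrade head
```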

How do I add a channel type that isn't shipped?

Subclass `BaseChannel`, implement `deliver(event, severity, payload)`, register the type in `channels_factory.py`. About forty lines for Slack, fifty for Discord, twenty for a signed webhook variant. The contract is one method — the rest is rendering.
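A sketch of what that one file can look like. The `deliver(event, severity, payload)` contract is from above; the Teams rendering and the factory registration shape are assumptions:

```python
import json
import urllib.request
from abc import ABC, abstractmethod

class BaseChannel(ABC):
    """One abstract method; instance settings arrive via config_json."""
    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def deliver(self, event: str, severity: str, payload: dict) -> None: ...

class TeamsChannel(BaseChannel):
    """Hypothetical Teams channel — the contract is one method, the rest is rendering."""
    def deliver(self, event: str, severity: str, payload: dict) -> None:
        body = {"text": f"[{severity}] {event}: {json.dumps(payload)}"}
        req = urllib.request.Request(
            self.config["webhook_url"],          # from this instance's config_json
            data=json.dumps(body).encode(),
            headers={"content-type": "application/json"},
        )
        urllib.request.urlopen(req)

# Registration in channels_factory.py, assumed to be a mapping entry:
# CHANNEL_TYPES["teams"] = TeamsChannel
```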

What about producers that don't speak HTTP?

Every producer in the fleet that emits today is doing it with twenty lines of stdlib HTTP. There's no SDK to install, no Celery worker to run on the producer side, no shared library to version-pin. If your service can `POST` JSON, it can be a producer this afternoon.

Discuss Alerts

Bring your platform events home.

Alerts is partner-deployed today — bundled with the e11 fleet or run standalone on a tenant box. Talk to us about producer wiring, custom channel types, or how to migrate off your current paging vendor.

Direct line

Consultation requests stay owned. We reply from e11 after reviewing fit and timing.