Webhook Operations Handbook
This page is the operations counterpart to /sources/client-api-webhooks.
Use it to run webhook ingestion with Stripe-grade operational discipline.
Minimum production posture
Your receiver must be:
- secure: signature + replay-window checks when signing is enabled
- durable: enqueue before ACK
- idempotent: dedupe on event.id
- observable: track ingest, processing, and retry outcomes
- recoverable: DLQ + replay tooling
Reference architecture
- ingress endpoint receives raw body + headers
- signature/timestamp checks run before parse
- payload is durably enqueued
- endpoint ACKs immediately (2xx)
- worker processes idempotently
- failures route to retry or DLQ
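The flow above can be sketched as two small functions: an ingress handler that verifies and enqueues before it ACKs, and a worker that parses and dedupes on event.id. This is a minimal sketch, not a reference implementation. The names `handleIngress`, `processDelivery`, `HeaderMap`, and `QueuedDelivery`, the 401/503 status choices, and the in-memory `Set` standing in for a durable dedupe store are all assumptions made for illustration.

```typescript
// Hypothetical names and types for illustration; none are part of a documented API.
type HeaderMap = Record<string, string | undefined>;

interface QueuedDelivery {
  rawBody: string;
  headers: HeaderMap;
  enqueuedAt: number;
}

interface IngressDeps {
  // Returns true when signature and timestamp checks pass on the raw body.
  verify: (rawBody: string, headers: HeaderMap) => boolean;
  // Must resolve only after the delivery is durably stored.
  enqueue: (delivery: QueuedDelivery) => Promise<void>;
}

// Ingress: check the raw body, enqueue durably, then ACK. No JSON parsing here.
async function handleIngress(
  rawBody: string,
  headers: HeaderMap,
  deps: IngressDeps,
): Promise<number> {
  if (!deps.verify(rawBody, headers)) return 401; // assumed rejection status
  try {
    await deps.enqueue({ rawBody, headers, enqueuedAt: Date.now() });
  } catch {
    return 503; // not durably stored, so do not ACK; a 5xx lets the sender retry
  }
  return 202; // any 2xx acknowledges the delivery
}

// Worker: parse only after dequeue, and skip events whose id was already handled.
// A real deployment would back `seen` with a durable store and a unique constraint.
async function processDelivery(
  delivery: QueuedDelivery,
  seen: Set<string>,
  handle: (event: { id: string; type: string }) => Promise<void>,
): Promise<"processed" | "duplicate"> {
  const event = JSON.parse(delivery.rawBody) as { id: string; type: string };
  if (seen.has(event.id)) return "duplicate";
  await handle(event);
  seen.add(event.id);
  return "processed";
}
```

Keeping the ingress path free of parsing and business logic is what keeps ACK latency low; everything that can fail slowly lives behind the queue.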
Required persisted fields
- event.id
- event.type
- event.created
- event.requestId
- ingress X-Request-Id
- signature verification result and reason
- attempt number and retry-exhausted state
- enqueue timestamp and completion timestamp
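One way to make this list concrete is a record type written at enqueue time and updated by the worker. The sketch below is illustrative only: the type name `DeliveryRecord`, the helper `newDeliveryRecord`, the camelCase column names, and the choice to store timestamps as ISO 8601 strings are assumptions, and the type of event.created is left open because this page does not specify it.

```typescript
// Hypothetical storage shape; adapt names and types to your own schema.
interface DeliveryRecord {
  eventId: string;                  // event.id
  eventType: string;                // event.type
  eventCreated: string | number;    // event.created, stored as received
  eventRequestId: string | null;    // event.requestId
  ingressRequestId: string | null;  // X-Request-Id header seen at ingress
  signature: { verified: boolean; reason: string };
  attempt: number;
  retryExhausted: boolean;
  enqueuedAt: string;               // ISO 8601
  completedAt: string | null;       // set by the worker when processing finishes
}

// Builds the record for a first attempt at enqueue time.
function newDeliveryRecord(
  event: { id: string; type: string; created: string | number; requestId?: string },
  ingressRequestId: string | undefined,
  signature: { verified: boolean; reason: string },
  now: Date = new Date(),
): DeliveryRecord {
  return {
    eventId: event.id,
    eventType: event.type,
    eventCreated: event.created,
    eventRequestId: event.requestId ?? null,
    ingressRequestId: ingressRequestId ?? null,
    signature,
    attempt: 1,
    retryExhausted: false,
    enqueuedAt: now.toISOString(),
    completedAt: null,
  };
}
```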
Signature verification contract
Signing payload:
- HMAC-SHA256
- lowercase hex digest
- reject outside replay tolerance (default 300s)
- strict raw-body verification
- constant-time compare
- dual-secret validation during secret rotation
TypeScript snippet
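The following is a minimal sketch of a verifier that follows the contract above, using Node's built-in `crypto` module. Two things are assumptions rather than documented facts, because this page does not show them: the header layout `t=<unix seconds>,v1=<hex digest>` and the signed string `<t>.<raw body>`. Substitute the real format from /sources/client-api-webhooks. The function names and reason strings are also hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface VerifyResult { ok: boolean; reason: string }

// ASSUMPTION: header looks like "t=<unix seconds>,v1=<lowercase hex digest>".
function parseSignatureHeader(header: string): { t: number; v1: string } | null {
  const parts = new Map<string, string>();
  for (const piece of header.split(",")) {
    const idx = piece.indexOf("=");
    if (idx > 0) parts.set(piece.slice(0, idx).trim(), piece.slice(idx + 1).trim());
  }
  const t = Number(parts.get("t"));
  const v1 = parts.get("v1");
  if (!Number.isInteger(t) || !v1) return null;
  return { t, v1 };
}

// ASSUMPTION: the signed payload is "<t>.<raw body>". digest("hex") is lowercase.
function computeSignature(secret: string, t: number, rawBody: string | Buffer): string {
  return createHmac("sha256", secret).update(`${t}.`).update(rawBody).digest("hex");
}

function verifySignature(
  rawBody: string | Buffer,     // exact bytes received, before any JSON parsing
  header: string,
  secrets: string[],            // current secret plus the previous one during rotation
  nowSeconds: number = Math.floor(Date.now() / 1000),
  toleranceSeconds = 300,
): VerifyResult {
  const parsed = parseSignatureHeader(header);
  if (!parsed) return { ok: false, reason: "malformed_header" };
  if (Math.abs(nowSeconds - parsed.t) > toleranceSeconds) {
    return { ok: false, reason: "outside_replay_window" };
  }
  const received = Buffer.from(parsed.v1, "utf8");
  let matched = false;
  for (const secret of secrets) {
    const expected = Buffer.from(computeSignature(secret, parsed.t, rawBody), "utf8");
    // timingSafeEqual throws on unequal lengths, so compare lengths first.
    if (expected.length === received.length && timingSafeEqual(expected, received)) {
      matched = true;
    }
  }
  return matched
    ? { ok: true, reason: "verified" }
    : { ok: false, reason: "signature_mismatch" };
}
```

During a rotation, pass both secrets, for example `verifySignature(rawBody, header, [newSecret, oldSecret])`, and drop the old one once no deliveries verify against it. The returned `reason` maps directly onto the "signature verification result and reason" field listed under required persisted fields.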
Unsigned delivery handling (current beta reality)
If Omni-Signature is absent:
- treat delivery as unsigned, not malformed
- require unguessable endpoint path
- restrict source ingress with network controls where possible
- emit security alert on unsigned delivery
- continue durable enqueue + idempotent processing
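A rough sketch of that triage step follows. It routes signed deliveries to verification and, for unsigned ones, accepts only requests that arrive on the secret endpoint path while always flagging an alert. The function names, the `Triage` shape, and the idea of comparing the request path against a configured secret path in constant time are assumptions for illustration; network-level source restrictions would sit outside this code.

```typescript
import { timingSafeEqual } from "node:crypto";

// Hypothetical result shape: signed deliveries go on to signature verification,
// unsigned ones carry an accept decision and always raise a security alert.
type Triage =
  | { mode: "signed"; signature: string }
  | { mode: "unsigned"; accept: boolean; alert: true };

// Case-insensitive header lookup.
function findHeader(
  headers: Record<string, string | undefined>,
  name: string,
): string | undefined {
  const wanted = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === wanted) return value;
  }
  return undefined;
}

// Constant-time comparison of the request path against the unguessable path.
function pathsMatch(a: string, b: string): boolean {
  const left = Buffer.from(a, "utf8");
  const right = Buffer.from(b, "utf8");
  return left.length === right.length && timingSafeEqual(left, right);
}

function triageDelivery(
  headers: Record<string, string | undefined>,
  requestPath: string,
  secretPath: string, // e.g. a path containing a long random token (assumption)
): Triage {
  const signature = findHeader(headers, "Omni-Signature");
  if (signature) return { mode: "signed", signature };
  // Absent (or empty) header: unsigned, not malformed.
  return { mode: "unsigned", accept: pathsMatch(requestPath, secretPath), alert: true };
}
```

An accepted unsigned delivery then continues through the same durable enqueue and idempotent processing path as a signed one.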
Retry policy
Current internal-beta behavior (live)
Retryable classes:
- network failure (status=0)
- 408, 429, 5xx

Retry parameters:
- maxAttempts=3
- baseDelayMs=500
- exponential backoff + jitter

Delivery fields that record retry state:
- attempts
- retryExhausted
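To reason about what a receiver will see under this policy, it can help to model the delivery loop. The sketch below is a model under stated assumptions, not the sender's actual code: the jitter strategy ("full jitter", a uniform delay between 0 and the exponential ceiling) is assumed because this page names only "exponential backoff + jitter", and the function names and `lastStatus` field are hypothetical.

```typescript
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 500;

// Retryable classes: network failure (status 0), 408, 429, and any 5xx.
function isRetryable(status: number): boolean {
  return status === 0 || status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// ASSUMPTION: full jitter, a uniform delay in [0, base * 2^(attempt - 1)).
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const ceiling = BASE_DELAY_MS * Math.pow(2, attempt - 1);
  return Math.floor(random() * ceiling);
}

interface DeliveryState {
  attempts: number;
  retryExhausted: boolean;
  lastStatus: number; // hypothetical field, kept for debugging
}

async function deliverWithRetry(
  send: () => Promise<number>, // resolves to an HTTP status; rejects on network failure
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  random: () => number = Math.random,
): Promise<DeliveryState> {
  let status = 0;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      status = await send();
    } catch {
      status = 0; // network failure
    }
    if (!isRetryable(status)) {
      return { attempts: attempt, retryExhausted: false, lastStatus: status };
    }
    if (attempt < MAX_ATTEMPTS) await sleep(backoffDelayMs(attempt, random));
  }
  return { attempts: MAX_ATTEMPTS, retryExhausted: true, lastStatus: status };
}
```

With these parameters the whole retry window is short (at most roughly 1.5 seconds of backoff under the assumed jitter), so a receiver outage longer than that will surface as retryExhausted deliveries.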
Public-beta-ready target policy
| Attempt | Delay target |
|---|---|
| 1 | immediate |
| 2 | 1 minute |
| 3 | 5 minutes |
| 4 | 30 minutes |
| 5 | 2 hours |
| 6 | 12 hours |
- route to DLQ with full context
- require explicit replay decision
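The target schedule can be expressed as a lookup table plus a function that decides the next step after a failed attempt. This is a sketch only: `nextStep` and its result shape are hypothetical, and treating each "delay target" as the wait before that attempt is an interpretation the table does not spell out.

```typescript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// Delay before attempt N (index N - 1), taken from the table above.
const TARGET_DELAYS_MS: readonly number[] = [
  0,              // attempt 1: immediate
  1 * MINUTE_MS,  // attempt 2
  5 * MINUTE_MS,  // attempt 3
  30 * MINUTE_MS, // attempt 4
  2 * HOUR_MS,    // attempt 5
  12 * HOUR_MS,   // attempt 6
];

type NextStep =
  | { action: "retry"; attempt: number; delayMs: number }
  | { action: "dlq"; replayApproved: false }; // replay needs an explicit decision

// failedAttempt is the 1-based number of the attempt that just failed.
function nextStep(failedAttempt: number): NextStep {
  const next = failedAttempt + 1;
  const delayMs = TARGET_DELAYS_MS[next - 1];
  if (delayMs === undefined) return { action: "dlq", replayApproved: false };
  return { action: "retry", attempt: next, delayMs };
}
```

For example, `nextStep(1)` schedules attempt 2 after one minute, while `nextStep(6)` routes the delivery to the DLQ.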
SLOs and alerting
Recommended SLOs:
- ingress ACK p95 < 500ms
- successful processing ratio >= 99.9%
- replay success >= 99%
- sustained DLQ growth = 0

Recommended alerts:
- signature failures > 0.5% over 5m
- enqueue failures > 0 over 5m
- retry-exhausted deliveries > 0 over 15m
- DLQ backlog growth over 30m
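As an illustration, the alert thresholds can be evaluated against windowed counters like this. The metric names and the `firingAlerts` function are hypothetical, and measuring signature failures as a share of all deliveries in the same 5-minute window is an assumption; in practice these rules would usually live in your monitoring system rather than in application code.

```typescript
// Hypothetical windowed counters, e.g. read from a metrics store.
interface WebhookMetrics {
  deliveries5m: number;
  signatureFailures5m: number;
  enqueueFailures5m: number;
  retryExhausted15m: number;
  dlqDepth30mAgo: number;
  dlqDepthNow: number;
}

// Returns the names of the alerts that should fire for the given counters.
function firingAlerts(m: WebhookMetrics): string[] {
  const alerts: string[] = [];
  // ASSUMPTION: the 0.5% threshold is relative to all deliveries in the window.
  const signatureFailureRate =
    m.deliveries5m > 0 ? m.signatureFailures5m / m.deliveries5m : 0;
  if (signatureFailureRate > 0.005) alerts.push("signature_failures");
  if (m.enqueueFailures5m > 0) alerts.push("enqueue_failures");
  if (m.retryExhausted15m > 0) alerts.push("retry_exhausted");
  if (m.dlqDepthNow > m.dlqDepth30mAgo) alerts.push("dlq_backlog_growth");
  return alerts;
}
```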
Incident response matrix
| Symptom | Likely cause | Immediate action | Follow-up |
|---|---|---|---|
| Signature failures spike | secret mismatch, raw-body mutation, clock skew | fail closed for signed events; verify secret + clock | add dual-secret rollout tests |
| Duplicate business side effects | weak dedupe or non-idempotent consumer | pause processors; enforce event.id lock | add unique constraints and side-effect idempotency keys |
| High ACK latency | heavy logic in ingress path | shift work to queue-first pattern | enforce handler timeout budgets |
| Rising retryExhausted | persistent downstream dependency failure | isolate failing dependency and throttle replay | add dependency health checks and circuit breaker |
Replay runbook
- identify target event.id and failure class
- patch the root cause
- replay event with original correlation metadata
- verify no duplicate side effects
- close with preventive action and monitor window
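The replay step itself can be sketched as a function that re-applies a DLQ entry with its original correlation metadata and refuses to apply side effects twice. The `DlqEntry` shape, the `alreadyApplied` lookup, and the `isReplay` flag are assumptions for illustration; how you detect previously applied side effects depends on your own idempotency keys.

```typescript
// Hypothetical DLQ entry shape.
interface DlqEntry {
  eventId: string;
  rawBody: string;
  requestId: string | null; // original correlation metadata
  failureClass: string;
}

interface ReplayContext {
  requestId: string | null;
  isReplay: true;
}

type ReplayResult = "applied" | "skipped_duplicate";

async function replayFromDlq(
  entry: DlqEntry,
  // Resolves to true when the event's side effects were already applied.
  alreadyApplied: (eventId: string) => Promise<boolean>,
  apply: (rawBody: string, context: ReplayContext) => Promise<void>,
): Promise<ReplayResult> {
  if (await alreadyApplied(entry.eventId)) return "skipped_duplicate";
  // Pass the original request id through so logs and traces stay correlated.
  await apply(entry.rawBody, { requestId: entry.requestId, isReplay: true });
  return "applied";
}
```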
Gameday checklist
- signature failure simulation
- clock-skew simulation
- queue outage simulation
- downstream timeout simulation
- replay exercise from DLQ
- postmortem template completion
Related docs
- /sources/client-api-webhooks
- /sources/client-api-errors
- /sources/client-api-retries