Studio webhooks delivery partial outage
Incident Report for Onfido
Postmortem

Summary

On 10 February 2025 at around 17:34 UTC, a dependency upgrade scoped to the component responsible for sourcing the webhooks system was deployed to production. It caused a partial outage in the delivery of webhooks for Studio, affecting ~38% of traffic in the CA and US regions for the duration of the incident. There was no impact in the EU region.

Root Causes

A third-party dependency upgrade in a component that buffers events for internal broadcasting caused a fraction of those events to be dropped, so the corresponding webhooks were never sent. The investigation was unusually hard because the impact differed across regions and the failures were intermittent.

Timeline

10/2 17:38: change with the dependency upgrade was deployed to the CA region

10/2 18:19: change with the dependency upgrade was deployed to the US region

10/2 18:58: first instance of webhooks not delivered in the US region

10/2 20:38: first instance of webhooks not delivered in the CA region

11/2 18:27: incident was reported and investigation started

12/2 02:14: change was reverted in the CA region

12/2 02:48: change was reverted in the US region

Remedies

  • Additional per-region monitoring, such as more aggressive per-region alert thresholds, will be introduced to identify partial outages of critical services such as webhooks in a more timely manner. Because EU, the region with the largest volume, was unaffected, the global measurement was diluted.
  • New standard operating procedure: only upgrade a single dependency per deployment on this system.
  • End-to-end testing of webhook delivery will be expanded to validate this additional scenario.
  • Reliance on the third-party dependency that caused the issue will be phased out.
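
The dilution effect described in the first remedy can be illustrated with a minimal sketch. The region volumes, the 10% threshold, and every name below (`RegionStats`, `regions_over_threshold`, `global_missing_rate`) are hypothetical and chosen only for illustration; they are not taken from the production monitoring. Only the ~38% missing rate in CA and US and the unaffected EU region come from this report.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    expected: int   # webhooks that should have been sent (hypothetical count)
    delivered: int  # webhooks actually delivered (hypothetical count)

    @property
    def missing_rate(self) -> float:
        """Fraction of expected webhooks that were not delivered."""
        if self.expected == 0:
            return 0.0
        return 1 - self.delivered / self.expected


def global_missing_rate(stats: dict[str, RegionStats]) -> float:
    """Missing rate computed over all regions combined."""
    expected = sum(s.expected for s in stats.values())
    delivered = sum(s.delivered for s in stats.values())
    return 0.0 if expected == 0 else 1 - delivered / expected


def regions_over_threshold(stats: dict[str, RegionStats], threshold: float) -> list[str]:
    """Regions whose own missing rate exceeds the threshold, sorted by name."""
    return sorted(r for r, s in stats.items() if s.missing_rate > threshold)


# Hypothetical volumes: EU is the largest region and is unaffected,
# while CA and US each lose 38% of their webhooks.
stats = {
    "EU": RegionStats(expected=10_000, delivered=10_000),
    "US": RegionStats(expected=1_000, delivered=620),
    "CA": RegionStats(expected=500, delivered=310),
}
THRESHOLD = 0.10  # assumed alert threshold, for illustration only

print(f"global missing rate: {global_missing_rate(stats):.3f}")
print(f"regions over threshold: {regions_over_threshold(stats, THRESHOLD)}")
```

With these assumed volumes, the combined missing rate stays below the threshold even though two regions are far above it, which is why a per-region check would flag the outage where a single global check would not.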
Posted Feb 18, 2025 - 17:53 UTC

Resolved
On 10 February 2025 at around 17:34 UTC, a dependency upgrade scoped to the component responsible for sourcing the webhooks system was deployed to production. It caused a partial outage in the delivery of webhooks for Studio, affecting ~38% of traffic in the CA and US regions for the duration of the incident. There was no impact in the EU region.
Posted Feb 11, 2025 - 18:30 UTC