Webhook event deliveries in the EU region were delayed by up to 22 minutes between 18:44 and 20:44 UTC, following an extreme spike in webhook notification request timeouts. Mitigation required scaling up the relevant service.
An unusually high number of webhook notification request timeouts saturated the resources available to process webhooks. Our scale-up strategy was not set up correctly to handle that particular situation.
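To illustrate the failure mode, the sketch below shows a minimal, hypothetical webhook delivery call with a bounded per-request timeout, so a stalled customer endpoint releases its worker slot after a fixed deadline instead of holding it indefinitely. All names, URLs, and timeout values are illustrative assumptions, not a description of our production service.

```go
// Hypothetical sketch: one webhook delivery attempt with a hard
// per-request deadline. Values are illustrative only.
package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// deliver posts one webhook payload to a customer endpoint and gives up
// after a fixed timeout, so slow endpoints cannot tie up delivery workers.
func deliver(client *http.Client, url string, payload []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req)
	if err != nil {
		// Timeouts land here; the caller can requeue the event with backoff.
		return fmt.Errorf("delivery to %s failed: %w", url, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		return fmt.Errorf("delivery to %s returned %d", url, resp.StatusCode)
	}
	return nil
}

func main() {
	client := &http.Client{}
	_ = deliver(client, "https://customer.example.com/webhooks", []byte(`{"event":"example"}`))
}
```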
17:34 UTC: Request timeouts to specific customer webhook endpoints began to increase sharply
17:46 UTC: We were alerted to moderately abnormal webhook latency for some events
18:04 UTC: We scaled up the service’s resources and latency stabilised
18:44 UTC: Webhook latency began to degrade significantly for all events
20:20 - 20:40 UTC: Further adjustments were made to the service's autoscaling configuration
20:44 UTC: Incident fully recovered, including clearing of any queued requests
We have updated our auto-scaling configuration. Follow-up work is planned to improve handling of abnormal request timeouts.
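As a rough illustration of the direction of that follow-up work, the sketch below derives a desired worker count from the delivery backlog rather than from CPU alone, so a timeout-induced queue build-up still triggers a scale-up. The function name, thresholds, and numbers are hypothetical assumptions, not our actual configuration.

```go
// Hypothetical sketch of a backlog-based scaling rule: desired worker
// count follows queued deliveries per worker, clamped to a min/max.
package main

import "fmt"

// desiredReplicas returns the worker count wanted for the current
// backlog, using ceiling division and clamping to [min, max].
func desiredReplicas(queuedDeliveries, perWorkerTarget, min, max int) int {
	n := (queuedDeliveries + perWorkerTarget - 1) / perWorkerTarget
	if n < min {
		return min
	}
	if n > max {
		return max
	}
	return n
}

func main() {
	// e.g. 12,000 queued deliveries at a target of 500 per worker -> 24 workers.
	fmt.Println(desiredReplicas(12000, 500, 4, 50))
}
```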