Summary

On Monday 4 March 2024 we ran a routine infrastructure component upgrade on the compute layer of our EU region. This upgrade is regularly performed with a Blue/Green strategy: a secondary compute cluster is created, all services are deployed to it, and traffic is then progressively switched from the primary cluster to the secondary one. This process was validated on multiple test environments and had already been applied successfully to our other production regions. However, the networking configuration of the EU installation was inconsistent with that of our other environments, so traffic to our API authorization service remained on the primary cluster. When the primary cluster was scaled down, our API became unavailable for 15 minutes.
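The failure mode described above, where one service's traffic stays behind on the primary cluster, can be illustrated with a pre-scale-down check. This is a minimal sketch under stated assumptions, not our actual tooling: the `ServiceTraffic` type, the service names, and the traffic shares are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceTraffic:
    """Share of a service's requests still served by the primary cluster (0.0 to 1.0)."""
    name: str
    primary_share: float


def services_still_on_primary(traffic, tolerance=0.0):
    """Return the names of services whose traffic has not fully left the primary cluster."""
    return [s.name for s in traffic if s.primary_share > tolerance]


def safe_to_scale_down_primary(traffic, tolerance=0.0):
    """Allow scale-down only when every service has moved to the secondary cluster."""
    return not services_still_on_primary(traffic, tolerance)


# Hypothetical snapshot: the traffic switch missed one service.
snapshot = [
    ServiceTraffic("api-gateway", 0.0),
    ServiceTraffic("api-authorization", 1.0),
    ServiceTraffic("document-processing", 0.0),
]
blocked_by = services_still_on_primary(snapshot)
can_scale_down = safe_to_scale_down_primary(snapshot)
```

With this snapshot, `blocked_by` contains only the authorization service and `can_scale_down` is `False`, so a scale-down gated on this check would not proceed.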

During the outage, many customers accumulated requests on their side and retried them once we restored the service. As a result, we faced a particularly high surge of traffic after the API was restored. Because the new cluster had been without traffic for 15 minutes, automated scaling procedures had started to scale it down, and a key document processing service could not handle the surge. Scaling it back up took longer than we aim for, and document reports were unavailable in the meantime.
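For readers who retry requests after an outage, spreading retries over time reduces the kind of surge described above. The following is a generic sketch of exponential backoff with full jitter, not a description of any client library we ship; the function name and default values are assumptions chosen for illustration.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=None):
    """Full-jitter exponential backoff.

    Each delay (in seconds) is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so retries from many clients
    do not all arrive at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        upper = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, upper))
    return delays


# A seeded generator is used here only to make the example reproducible.
delays = backoff_delays(max_retries=8, rng=random.Random(42))
```

A client would sleep for `delays[i]` seconds before its retry number `i`, giving up once the list is exhausted.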

Overall, this resulted in 15 minutes of report creation downtime, followed by a further 50 minutes of disruption to document report creation.

Root Causes

Inconsistent network configuration of our EU environment compared with all others.
Autoscaling up isn’t fast enough for some services when faced with extreme demand.
Downscaling of the primary cluster was too aggressive.

Timeline

10:20 UTC: traffic switch to secondary cluster triggered.
10:37 UTC: primary cluster is scaled down automatically in response to lower traffic.
10:37 UTC: our authentication component fails to authorize API requests.
10:52 UTC: traffic is properly routed to the secondary cluster and the API can handle authorization requests again.
10:58 UTC: alerts for high error rate on the API are triggered.
11:06 UTC: the key document service is identified as unable to handle the load.
11:27 UTC: scale-up of the key document service is manually accelerated on both clusters.
11:29 UTC: the primary cluster upgrade is done. Traffic is progressively moved back to it.
11:42 UTC: all services are upscaled and stable. Full functionality is restored.
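The durations quoted in this report follow directly from the timeline. As a small worked check, using only timestamps listed above (the helper name is ours):

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes between two same-day 'HH:MM' UTC timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# API authorization failing until traffic is routed to the secondary cluster.
api_outage = minutes_between("10:37", "10:52")
# Document service disruption until full functionality is restored.
document_disruption = minutes_between("10:52", "11:42")
# Time from the traffic switch to the automatic scale-down of the primary.
switch_to_scale_down = minutes_between("10:20", "10:37")
```

This gives 15 minutes of API outage, 50 minutes of document disruption, and 17 minutes between the traffic switch and the primary cluster's scale-down.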

Remedies

Fix traffic switching network configuration of our EU region.
Change upgrade process such that any downscaling of the primary cluster is done conservatively.
Improve the responsiveness of the failing document service to handle surge demand.
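One way to read "conservatively" in the second remedy is to shrink the primary cluster in small, bounded steps and never below a floor, and only once its traffic has drained. The sketch below illustrates that idea only; the function, its parameters, and the 10% step size are hypothetical and do not describe our final process.

```python
import math


def next_replica_count(current, minimum, traffic_drained, max_step_fraction=0.1):
    """Return the replica count for the next scale-down step.

    Holds the current size while the cluster still serves traffic.
    Otherwise removes at most max_step_fraction of the replicas
    (and at least one) per step, never going below the minimum.
    """
    if not traffic_drained:
        return current
    step = max(1, math.floor(current * max_step_fraction))
    return max(minimum, current - step)


held = next_replica_count(40, 4, traffic_drained=False)
reduced = next_replica_count(40, 4, traffic_drained=True)
floored = next_replica_count(4, 4, traffic_drained=True)
```

Here `held` stays at 40 because traffic has not drained, `reduced` drops by one 10% step to 36, and `floored` stays at the minimum of 4.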

Posted Mar 08, 2024 - 08:36 UTC

Resolved

Following further monitoring, our systems have remained stable since services were fully restored at 11:42 UTC, and this incident is now closed. A very small backlog of impacted reports remains and will be completed in the next few hours.

This incident was caused by the failure of a routine maintenance operation. The failure led to our API being unavailable, resulting in all EU customers facing downtime from 10:37 UTC to 10:52 UTC.

A further problem during the subsequent recovery period prevented documents from being uploaded from 10:52 UTC to 11:42 UTC. This blocked the creation of checks that depended on document capture. For Studio customers, workflows were created and advanced up to the document capture step. Once document upload recovered, applicants were able to resume from that step.

A more detailed postmortem will follow.

We pride ourselves on the reliability of our service, and apologise for the disruption caused by this incident.

Posted Mar 04, 2024 - 18:12 UTC

Monitoring

The API is now stable and correctly accepting checks and documents. We are processing all reports.
We will keep monitoring.

Posted Mar 04, 2024 - 11:47 UTC

Update

A fix for the issue has been built, and we are in the process of deploying it to all infrastructure.
The Dashboard and API are available, except for check creation.

Posted Mar 04, 2024 - 11:25 UTC

Identified

The issue has been identified and a fix is being implemented.

The API should already be partly available.

Posted Mar 04, 2024 - 11:05 UTC

Update

We are continuing to investigate this issue.

Posted Mar 04, 2024 - 11:04 UTC

Investigating

We are currently investigating this issue.

Posted Mar 04, 2024 - 10:49 UTC

This incident affected: Europe (onfido.com) (API, Dashboard, Document Verification).