Dashboard and check creation unavailable

Incident Report for Onfido

Postmortem

Summary

We were in the process of a routine upgrade to our network ingress controller from the “prior” version to a “next” version. This component is packaged and applied to API-related services as and when those services are deployed. A problem occurred when we packaged the “next” version while one of those services was in the middle of a deployment.

This problem rendered our external API, as well as our Dashboard, inaccessible in the EU region for 25 minutes.

The issue was as follows:

  1. At the start of the service deployment pipeline, the system set the network component dependency to the “prior” version (current at the time), and dynamically created configuration settings based on that version.
  2. Meanwhile, the “next” version was released while the service deployment pipeline was still in progress.
  3. Later in the pipeline, another step overrode the dependency version to “latest” (which now pointed to “next”). However, the configuration settings created earlier in the pipeline were incompatible with this new version, resulting in a corrupted final network configuration.
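The race described in the steps above can be sketched as follows. This is a minimal, hypothetical illustration of the failure mode; the names (`registry`, `generate_config`, `resolve_dependency`) are invented for this sketch and are not Onfido's actual pipeline code.

```python
# A mutable "latest" pointer, as in a version registry at pipeline start.
registry = {"latest": "prior"}

def generate_config(version):
    """Step 1: build configuration settings for a specific controller version."""
    return {"controller_version": version, "settings": f"settings-for-{version}"}

def resolve_dependency():
    """A later step re-resolves 'latest' instead of reusing step 1's version."""
    return registry["latest"]

# Step 1: the pipeline resolves "latest" and generates config for "prior".
config = generate_config(registry["latest"])

# Meanwhile, a new controller version is released mid-pipeline.
registry["latest"] = "next"

# A later step re-resolves the dependency, now getting "next".
deployed_version = resolve_dependency()

# The config was generated for "prior" but is applied alongside "next":
# the two disagree, producing a corrupted final network configuration.
assert config["controller_version"] != deployed_version
```

Because "latest" is resolved twice with a mutation in between, the two resolutions can disagree, which is exactly the inconsistency the pipeline hit.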

This situation had not previously occurred because the network ingress controller had never changed while an API service was mid-deployment.

Root Causes

The root causes were:

  • The final network configuration depended on the output of multiple pipeline steps.
  • The dependency version was not pinned for the entire deployment process, allowing a mid-pipeline version change.

Timeline

12:39 UTC: An API service deployment pipeline starts

12:56 UTC: The new network ingress controller is released

13:04 UTC: The API service pipeline triggers an EU Production deployment

13:05 UTC: A corrupted ingress configuration is installed, preventing traffic from being routed to the external API

13:10 UTC: We receive monitoring alerts for the loss of API traffic

13:15 UTC: Incident response is mobilized and begins investigating the problem

13:22 UTC: API service rollback is initiated

13:25 UTC: The API rollback completes, restoring a valid ingress configuration and allowing traffic again

Remedies

The following improvements will be made to the standard deployment pipeline to prevent similar network configuration inconsistencies from occurring:

  1. Use explicit version numbers for dependencies instead of “latest” during deployments.
  2. Consolidate the steps required to build the network configuration.
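Remedy 1 can be sketched as resolving the dependency version exactly once at the start of the pipeline and threading that pinned value through every subsequent step. This is a hypothetical illustration under the same invented names as before, not Onfido's actual tooling.

```python
registry = {"latest": "prior"}

def run_pipeline():
    # Resolve "latest" exactly once, at pipeline start, and pin the result.
    pinned = registry["latest"]
    config = {"controller_version": pinned}

    # A release mid-pipeline mutates the registry...
    registry["latest"] = "next"

    # ...but every later step uses the pinned value, never re-resolving "latest".
    deployed_version = pinned
    return config, deployed_version

config, deployed = run_pipeline()
# The generated config and the deployed version can no longer diverge.
assert config["controller_version"] == deployed
```

With the version pinned up front, a release landing mid-pipeline affects only the next deployment, never one already in flight.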
Posted Dec 04, 2025 - 20:29 UTC

Resolved

We experienced delays and errors in check creation in the EU region, as well as an inability to log into the Dashboard, from 13:05 to 13:25 UTC. The incident is now resolved and we are processing traffic normally.

We will publish a postmortem with additional information.
Posted Dec 02, 2025 - 13:00 UTC