On Monday, Jan 9th, a configuration error resulted in approximately 10 minutes of service unavailability, between 13:37 and 13:47 UTC, for the Onfido API and dashboard. Most requests made to *.onfido.com failed during this time, returning a 503 status.
We work hard to deliver a stable and reliable service to our customers. When failures occur, we put together a timeline of the events, consider what we've learned, and decide on what remedies we can take to avoid similar issues in the future. We want to be transparent with you about what happened and what we've learned from this incident.
On Monday, Jan 9th at 13:29 UTC, we began to roll out a new production release. This release had previously been tested in our staging environment. At 13:37 UTC, our uptime monitoring began to report that service was unavailable on our API and dashboard.
At 13:38 UTC, a site reliability engineer was notified; the wider engineering team was also alerted at this time. On investigation, the engineer on point determined that new server instances were failing to start up correctly, but were being erroneously registered with load balancers.
By 13:47 UTC, we had cancelled the rollout of the new version and restored baseline capacity on the previous version. By 14:11 UTC, we had brought our service back to full capacity. Since then, we have seen no further availability issues.
Our deploy process is intended to work on a rolling basis: our tooling progressively brings new instances up, runs health checks, and removes old instances if those health checks pass successfully. This allows us to maintain a stable service whilst supporting multiple releases per day.
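To illustrate the rolling approach described above, here is a minimal sketch in Python. It is not our actual tooling: the `Instance` type, the `rolling_deploy` function, and the `start_instance` and `is_healthy` callbacks are hypothetical names chosen for this example. It shows the intended invariant, which is that an old instance is only replaced once its replacement has passed a health check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Instance:
    # Hypothetical model of a server instance, for illustration only.
    name: str
    version: str


def rolling_deploy(
    in_service: List[Instance],
    new_version: str,
    start_instance: Callable[[str, str], Instance],
    is_healthy: Callable[[Instance], bool],
) -> Tuple[List[Instance], bool]:
    """Replace instances one at a time, stopping at the first failed check.

    Returns the resulting fleet and whether the rollout completed.
    """
    fleet = list(in_service)
    for index, old in enumerate(in_service):
        candidate = start_instance(f"{old.name}-next", new_version)
        if not is_healthy(candidate):
            # The candidate is never put into service; the old instance
            # and all remaining old instances keep serving traffic.
            return fleet, False
        # Only now does the new instance take the old one's place.
        fleet[index] = candidate
    return fleet, True
```

Under this scheme, a release whose instances fail to start should leave the previous version fully in service, provided the health check reports the failure accurately.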
In this case, our new instances failed to start correctly due to incompatible configuration in a proxy service. The health-checking process did not work correctly and unhealthy instances were put into service.
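This indicated two issues in our deployment process: our health checks did not detect instances that had failed to start, and a dependency between two services relied on manual coordination during releases.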
Over the last two days, we have adjusted our health checks to effectively identify this issue, which will prevent unhealthy instances being added to service in future.
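As a rough illustration of a stricter check, the sketch below treats an instance as healthy only if several consecutive probes return a 2xx status. This is an assumption-laden example rather than a description of our real health checks: `passes_health_check`, the `probe` callback, and the default of three consecutive successes are all hypothetical.

```python
from typing import Callable


def passes_health_check(
    probe: Callable[[], int],
    required_successes: int = 3,
) -> bool:
    """Return True only if every probe in a row reports a 2xx status.

    `probe` is a hypothetical callback that performs one request against
    the instance and returns the HTTP status code it received.
    """
    for _ in range(required_successes):
        try:
            status = probe()
        except OSError:
            # Connection refused, timeout, and similar failures mean the
            # instance is not serving, so it must not be put into service.
            return False
        if not 200 <= status < 300:
            # A 503 (or any other non-2xx response) fails the check.
            return False
    return True
```

The design choice here is to fail closed: anything other than a clear success keeps the instance out of the load balancer.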
Over the next week we intend to restructure our deployment process to make the dependency between these two services explicit, thus removing the need for manual coordination.