API and Dashboard Availability
Postmortem

Summary

On Monday, Jan 9th UTC, a configuration error resulted in approximately 10 minutes of service unavailability between 13:37 and 13:47 UTC for the Onfido API and dashboard. Most requests made to *.onfido.com failed during this time, returning a 503 status.

We work hard to deliver a stable and reliable service to our customers. When failures occur, we put together a timeline of the events, consider what we've learned, and decide on what remedies we can take to avoid similar issues in the future. We want to be transparent with you about what happened and what we've learned from this incident.

Timeline

On Monday, Jan 9th at 13:29 UTC, we began to roll out a new production release. This release had previously been tested in our staging environment. At 13:37 UTC, our uptime monitoring began to report that service was unavailable on our API and dashboard.

At 13:38 UTC, a site reliability engineer was notified; the wider engineering team was also alerted at this time. On investigation, the engineer on point determined that new server instances were failing to start up correctly, but were being erroneously registered with load balancers.

By 13:47 UTC, we had cancelled the rollout of the new version and restored baseline capacity of the previous version. By 14:11 UTC we had brought our service back to full capacity. Since then, we have seen no further availability issues.

Contributing factors

Our deploy process is intended to work on a rolling basis: our tooling progressively brings new instances up, runs health checks, and removes old instances if those health checks pass successfully. This allows us to maintain a stable service whilst supporting multiple releases per day.
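
To make the intended flow concrete, here is a minimal sketch of a health-check-gated rolling deploy. It is illustrative only: the "instances" are simulated as plain dictionaries, and the helper names are hypothetical stand-ins for our real orchestration and load-balancer tooling.

    # Illustrative sketch of a health-check-gated rolling deploy. The "instances"
    # here are plain dicts simulating real servers; in production these helpers
    # would talk to an orchestration API and a load balancer.

    def launch_instance(release: str) -> dict:
        """Start a new instance running `release` (simulated)."""
        return {"release": release, "healthy": True, "in_service": False}

    def passes_health_checks(instance: dict) -> bool:
        """A real check would poll the instance's health endpoint with a timeout."""
        return instance["healthy"]

    def register_with_load_balancer(instance: dict) -> None:
        instance["in_service"] = True

    def deregister_and_terminate(instance: dict) -> None:
        instance["in_service"] = False

    def rolling_deploy(release: str, old_instances: list, batch_size: int = 1) -> list:
        """Replace old instances one batch at a time, gating each step on health checks."""
        new_fleet = []
        for i in range(0, len(old_instances), batch_size):
            batch = old_instances[i:i + batch_size]
            replacements = [launch_instance(release) for _ in batch]

            # New instances must pass health checks before they take traffic;
            # if any fail, the rollout stops and old capacity stays in service.
            for instance in replacements:
                if not passes_health_checks(instance):
                    raise RuntimeError("new instance failed health checks; aborting rollout")
                register_with_load_balancer(instance)

            # Old capacity is removed only once the replacements are serving traffic.
            for instance in batch:
                deregister_and_terminate(instance)
            new_fleet.extend(replacements)
        return new_fleet

    if __name__ == "__main__":
        current = [{"release": "v1", "healthy": True, "in_service": True} for _ in range(4)]
        print(rolling_deploy("v2", current))

The key property is that registration with the load balancer happens only after the health check passes; that is the gate that did not hold in this incident.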

In this case, our new instances failed to start correctly due to an incompatible configuration in a proxy service. The health-checking process did not work as intended, and unhealthy instances were put into service.

This exposed two issues in our deployment process:

  1. Manual coordination was required to merge implicitly dependent changes. While both the backend changes and the proxy service change had been tested in our staging environment, merging them to our deployment branch had to be coordinated manually; unfortunately, the proxy service change was not merged ahead of the release.
  2. Our health checks were not sufficient to detect this issue on a new instance.

Remediations

Over the last two days, we have adjusted our health checks so that they detect this class of failure, which will prevent unhealthy instances from being put into service in future.
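
As an illustration of the kind of check involved (a sketch, not our production code), a health endpoint can probe the dependencies a new instance actually needs, such as the proxy it routes through, rather than only reporting that the process is up. The ports, URL and endpoint path below are hypothetical.

    # Minimal sketch of a "deep" health check endpoint: it reports healthy only if
    # the instance can reach the dependencies it needs (here, a local proxy),
    # rather than merely confirming that the process is up.

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    PROXY_HEALTH_URL = "http://127.0.0.1:8081/status"  # hypothetical downstream dependency

    def dependencies_ok(timeout: float = 2.0) -> bool:
        """Return True only if the proxy this instance depends on answers successfully."""
        try:
            with urlopen(PROXY_HEALTH_URL, timeout=timeout) as response:
                return response.status == 200
        except OSError:
            return False

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health" and dependencies_ok():
                self.send_response(200)
            else:
                # A 503 keeps the load balancer from putting this instance into service.
                self.send_response(503)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()

With the load balancer pointed at an endpoint like this, a misconfigured dependency keeps the instance out of rotation instead of letting it serve errors.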

Over the next week we intend to restructure our deployment process to make the dependency between these two services explicit, thus removing the need for manual coordination.

Posted Jan 13, 2017 - 14:09 UTC

Resolved
This issue has now been resolved.

We'll follow this update with a postmortem in the next few days, after we've finished analysing the impact of this issue and determining the best course of action to avoid any future occurrence.
Posted Jan 09, 2017 - 14:11 UTC
Monitoring
Service has resumed and we are working on returning to full capacity.
Posted Jan 09, 2017 - 13:51 UTC
Identified
We're currently experiencing issues with API, dashboard and applicant form availability.

The issue has been identified and we're currently working on a resolution; update expected at 14:00 GMT.
Posted Jan 09, 2017 - 13:45 UTC