Elevated API Error rates
Incident Report for Onfido
Postmortem

Summary

From 07:00 to 07:47 BST we experienced an outage of the API. During this time, all requests to the API returned a 500 status, and no checks could be submitted or retrieved. The outage was caused by a database failover event at 07:00 BST. We returned to normal operation by 07:50 BST.

After recovery, facial checks were delayed because some checks were missing metadata. Without this data, the checks could not be loaded in the tool our analysts use to process them. This was corrected by 09:00 BST; all outstanding checks were processed by 10:35 BST and we returned to normal operation.

Timeline

API Incident

  • 07:00 BST - Failover of database from master to slave
  • 07:01 BST - On-call engineers alerted to a service issue
  • 07:10 BST - Issue identified as related to the database failover
  • 07:10-07:30 BST - Investigation of the impact of the database failover
  • 07:24 BST - Status page update (Investigating - elevated API error rates)
  • 07:30-07:45 BST - Fix implemented: restart of affected applications
  • 07:47 BST - Status page update (Identified)
  • 07:50 BST - All applications back to normal operation

Facial Processing Incident

  • 08:00 BST - Internal agents reported being unable to process facial checks
  • 08:00-08:57 BST - Investigation into why facial checks were failing
  • 08:57-09:30 BST - Fix applied to all affected checks
  • 09:30-10:35 BST - Processing of the facial check backlog
  • 10:35 BST - Return to normal operation of facial check processing

Root Cause

After the failover, certain Ruby applications connected to Postgres did not switch their connections to the new instance; instead they stayed connected to the old master. Because the old instance had become a read-only replica, any write operations failed. This failure to reconnect is highly unusual: we have had other failovers where recovery happened quickly.
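For illustration, a minimal writer health check along the following lines can detect that a connection is still pointed at a now read-only instance and force a reconnect. This is a sketch only, using the pg gem; the writer endpoint name (db-primary.internal), database name and user are placeholders, not our actual configuration.

    require 'pg'

    # Sketch of a writer health check (assumed names, not production code).
    # Assumes the writer endpoint is a DNS name that is repointed to the new
    # primary after a failover; 'db-primary.internal' is a placeholder.
    def ensure_writable(conn)
      in_recovery = conn.exec('SELECT pg_is_in_recovery()').getvalue(0, 0)
      return conn if in_recovery == 'f' # still on the writable primary

      # We are talking to a read-only replica (e.g. the old master after a
      # failover), so drop the connection and reconnect via the writer endpoint.
      conn.close
      PG.connect(host: 'db-primary.internal', dbname: 'onfido_api', user: 'api')
    end

Run at connection checkout or on a timer, a check of this kind would surface stale connections to the old master much sooner than waiting for write errors.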

The API performs some write operations very early in request handling. As a result, every call to the API failed with a 500 because those writes could not be carried out. Downstream systems stopped processing checks because the API was no longer receiving or processing them.

The database failed over because the temporary storage volume ran out of disk space. This was caused by a long-running bad query, which has since been fixed.
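As an illustration of the kind of guard that helps here, the sketch below flags active queries that have been running past a threshold, since long-running queries are a common cause of temporary storage filling with spilled sort and hash files. The five-minute threshold and connection details are assumptions for the example; Postgres also offers server-side limits such as statement_timeout and temp_file_limit.

    require 'pg'

    # Illustrative monitoring check (thresholds and names are assumptions):
    # list active queries running longer than five minutes, since such queries
    # can fill the temporary storage volume with spilled sort/hash files.
    conn = PG.connect(host: 'db-primary.internal', dbname: 'onfido_api', user: 'monitor')
    long_running = conn.exec(<<~SQL)
      SELECT pid, now() - query_start AS duration, left(query, 80) AS query
      FROM pg_stat_activity
      WHERE state = 'active'
        AND now() - query_start > interval '5 minutes'
      ORDER BY duration DESC
    SQL

    long_running.each do |row|
      warn "long-running query pid=#{row['pid']} (#{row['duration']}): #{row['query']}"
    end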

Facial Checks
Facial check processing was delayed because the affected checks lacked the metadata required to load them in our check processing tool. The metadata was missing because an automated recovery process, triggered after the earlier database failure, did not add it.

These incomplete checks meant analysts were unable to process facial checks, and processing was delayed until a fix added the missing metadata.
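For illustration, a simple validation like the sketch below could flag a check with incomplete metadata at the point it is created or recovered, rather than letting analysts discover the problem later. The field names used here are hypothetical stand-ins for whatever the processing tool actually requires.

    # Illustrative check for required metadata; the field names are
    # hypothetical stand-ins for what the analyst tool actually needs.
    REQUIRED_METADATA = %i[check_type applicant_id capture_source].freeze

    def missing_metadata(check)
      REQUIRED_METADATA.reject { |key| check.fetch(:metadata, {}).key?(key) }
    end

    # A recovery or ingestion path could flag an unloadable check up front
    # instead of surfacing the failure when an analyst tries to open it.
    check = { id: 'chk_123', metadata: { check_type: :facial_similarity } }
    missing = missing_metadata(check)
    warn "check #{check[:id]} is missing metadata: #{missing.join(', ')}" unless missing.empty?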

Remedies

Immediate:

  • Restarting all applications connected to the database server corrected the application failure and the elevated API error rate.
  • The unusually long-running query that caused the initial failover has been fixed.
  • Adding the required metadata allowed facial checks to be processed correctly and the backlog to be cleared.
  • We have released a fix to ensure this metadata is always added to these checks, improving resilience in this scenario.

Future:

  • We will investigate why our current failover mechanisms did not work in this scenario, and implement a recovery mechanism for this particular type of failure to prevent it from happening again.
  • We will run simulations of various failover types to test this scenario and others, to ensure that our recovery mechanisms are adequate.
Posted Jul 06, 2020 - 13:34 UTC

Resolved
The issue is now resolved and the API is back to normal.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Oct 20, 2019 - 08:42 UTC
Monitoring
The API is now fully operational. We will continue to monitor over the next hour. Next update at 09:30 BST.
Posted Oct 20, 2019 - 07:36 UTC
Identified
We have identified the cause of the issue and are in the process of implementing a fix. API calls should return to normal over the next couple of minutes. Next update 08:30 BST.
Posted Oct 20, 2019 - 06:47 UTC
Investigating
We're currently experiencing elevated error rates impacting the API, affecting all clients.
Posted Oct 20, 2019 - 06:24 UTC
This incident affected: Europe (onfido.com) (API).