Elevated API Error rates
Incident Report for Onfido
Postmortem

Summary

From 07:00 to 07:47 BST we experienced an outage of the API. During this time, all requests to the API returned a 500 status, and no checks could be submitted or retrieved. The outage was caused by a database failover event at 07:00 BST. We returned to normal operation by 07:50 BST.

After recovery, facial checks were delayed because some checks were missing metadata. Without this data, the checks could not be loaded in the tool our analysts use to process them. This was corrected by 09:00 BST; all outstanding checks were processed by 10:35 BST and we returned to normal operation.

Timeline

API Incident

  • 07:00 BST - Failover of database from master to slave
  • 07:01 BST - On-call engineers alerted to a service issue
  • 07:10 BST - Issue identified as related to the database failover
  • 07:10-07:30 BST - Investigation of the impact of the database failover
  • 07:24 BST - Status page update (Investigating - elevated API error rates)
  • 07:30-07:45 BST - Fix implemented: restart of affected applications
  • 07:47 BST - Status page update (Identified)
  • 07:50 BST - All applications back to normal operation

Facial Processing Incident

  • 08:00 BST - Internal agents reported being unable to process facial checks
  • 08:00-08:57 BST - Investigation into why facial checks were failing
  • 08:57-09:30 BST - Fix applied to all affected checks
  • 09:30-10:35 BST - Processing of the facial check backlog
  • 10:35 BST - Return to normal operation of facial check processing

Root Cause

After the failover, certain Ruby applications connected to Postgres did not switch their connections to the new instance; instead they stayed connected to the old master. Because the old instance had become a read-only replica, any write operations failed. This failure to reconnect is highly unusual: we have had other failovers where recovery happened quickly.
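For illustration, a minimal writer health check along the following lines can detect that a connection is still pointed at a now read-only instance and force a reconnect. This is a sketch only, using the pg gem; the writer endpoint name (db-primary.internal), database name and user are placeholders, not our actual configuration.

    require 'pg'

    # Sketch of a writer health check (assumed names, not production code).
    # Assumes the writer endpoint is a DNS name that is repointed to the new
    # primary after a failover; 'db-primary.internal' is a placeholder.
    def ensure_writable(conn)
      in_recovery = conn.exec('SELECT pg_is_in_recovery()').getvalue(0, 0)
      return conn if in_recovery == 'f' # still on the writable primary

      # We are talking to a read-only replica (e.g. the old master after a
      # failover), so drop the connection and reconnect via the writer endpoint.
      conn.close
      PG.connect(host: 'db-primary.internal', dbname: 'onfido_api', user: 'api')
    end

Run at connection checkout or on a timer, a check of this kind would surface stale connections to the old master much sooner than waiting for write errors.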

The API performs some write operations very early in request handling. As a result, every call to the API failed with a 500 because those writes could not be carried out. Downstream systems stopped processing checks because the API was no longer receiving or processing them.

The database failed over because the temporary storage volume ran out of disk space. This was caused by a long-running bad query, which has since been fixed.
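As an illustration of the kind of guard that helps here, the sketch below flags active queries that have been running past a threshold, since long-running queries are a common cause of temporary storage filling with spilled sort and hash files. The five-minute threshold and connection details are assumptions for the example; Postgres also offers server-side limits such as statement_timeout and temp_file_limit.

    require 'pg'

    # Illustrative monitoring check (thresholds and names are assumptions):
    # list active queries running longer than five minutes, since such queries
    # can fill the temporary storage volume with spilled sort/hash files.
    conn = PG.connect(host: 'db-primary.internal', dbname: 'onfido_api', user: 'monitor')
    long_running = conn.exec(<<~SQL)
      SELECT pid, now() - query_start AS duration, left(query, 80) AS query
      FROM pg_stat_activity
      WHERE state = 'active'
        AND now() - query_start > interval '5 minutes'
      ORDER BY duration DESC
    SQL

    long_running.each do |row|
      warn "long-running query pid=#{row['pid']} (#{row['duration']}): #{row['query']}"
    end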

Facial Checks
Facial check processing was delayed because the affected checks lacked the metadata required to load them in our check processing tool. The metadata was missing because an automated recovery process, triggered after the earlier database failure, did not add it.

These incomplete checks meant analysts were unable to process facial checks, and processing was delayed until a fix added the missing metadata.
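For illustration, a simple validation like the sketch below could flag a check with incomplete metadata at the point it is created or recovered, rather than letting analysts discover the problem later. The field names used here are hypothetical stand-ins for whatever the processing tool actually requires.

    # Illustrative check for required metadata; the field names are
    # hypothetical stand-ins for what the analyst tool actually needs.
    REQUIRED_METADATA = %i[check_type applicant_id capture_source].freeze

    def missing_metadata(check)
      REQUIRED_METADATA.reject { |key| check.fetch(:metadata, {}).key?(key) }
    end

    # A recovery or ingestion path could flag an unloadable check up front
    # instead of surfacing the failure when an analyst tries to open it.
    check = { id: 'chk_123', metadata: { check_type: :facial_similarity } }
    missing = missing_metadata(check)
    warn "check #{check[:id]} is missing metadata: #{missing.join(', ')}" unless missing.empty?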

Remedies

Immediate:

  • Restarting all applications connected to the database server corrected the application failure and the elevated API error rate.
  • The unusually long-running query that caused the initial failover has been fixed.
  • Adding the required metadata allowed facial checks to be processed correctly and the backlog to be cleared.
  • We have released a fix to ensure this metadata is always added to these checks, improving resilience in this scenario.

Future:

  • We will investigate why our current failover mechanisms did not work in this scenario, and implement a recovery mechanism for this particular type of failure to prevent it from happening again.
  • We will run simulations of various failover types to test this scenario and others, to ensure that our recovery mechanisms are adequate.
Posted Jul 06, 2020 - 13:34 UTC

Resolved
The issue is now resolved and the API is back to normal.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Oct 20, 2019 - 08:42 UTC
Monitoring
The API is now fully operational. We will continue to monitor over the next hour. Next update at 09:30 BST.
Posted Oct 20, 2019 - 07:36 UTC
Identified
We have identified the cause of the issue and are in the process of implementing a fix. API calls should return to normal over the next couple of minutes. Next update 08:30 BST.
Posted Oct 20, 2019 - 06:47 UTC
Investigating
We're currently experiencing elevated error rates impacting the API, affecting all clients.
Posted Oct 20, 2019 - 06:24 UTC
This incident affected: Europe (onfido.com) (API).