From 07:00 to 07:47 BST we experienced an outage on the API. During this time every request to the API returned a 500 status, and no checks could be submitted or retrieved. The outage was caused by a database failover event at 07:00 BST. We returned to normal operation by 07:50 BST.
After recovery, facial checks were delayed because some checks were missing metadata, which meant they could not be loaded in the tool our analysts use to process them. The missing metadata was corrected by 09:00 BST, the backlog of checks was processed by 10:35 BST, and we returned to normal operation.
API Incident
After the failover, certain Ruby applications connected to Postgres did not switch their connections to the new primary instance; instead they stayed connected to the old primary. Because the old instance had been demoted to a read replica, any write operations on those connections failed. This failure to reconnect is highly unusual: in previous failovers these applications recovered quickly.
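One common safeguard against this failure mode is to validate a connection before using it. The sketch below is illustrative only, not our actual client code: it checks whether the connected node is still a writable primary (using Postgres's `pg_is_in_recovery()`) and discards the connection if the node has been demoted.

```ruby
require "pg"

# Minimal sketch, assuming the `pg` gem and a conninfo string that
# resolves (e.g. via DNS) to whichever node is currently the primary.
def checkout_writable_connection(conninfo)
  conn = PG.connect(conninfo)
  # pg_is_in_recovery() returns true on a replica or a demoted primary.
  if conn.exec("SELECT pg_is_in_recovery()").getvalue(0, 0) == "t"
    conn.close
    raise PG::ConnectionBad, "connected to a read-only replica; retry"
  end
  conn
end
```

Running a check like this on every pool checkout trades a small round trip for the guarantee that a stale connection to a demoted primary is dropped rather than reused.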
The API performs write operations very early in the request lifecycle, so every API call hit a failed write and returned a 500. Downstream systems stopped processing checks because the API was no longer receiving or processing them.
The database failed over because a long-running bad query filled the temporary storage volume, exhausting its disk space. That query has since been fixed.
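Queries like this can be spotted before they exhaust temp storage by watching `pg_stat_activity`. The snippet below is an illustrative monitoring sketch, not part of our tooling; the five-minute threshold and database name are assumptions.

```ruby
require "pg"

# Sketch: list active queries running longer than five minutes, the
# kind of long-running query that can fill the temporary volume.
conn = PG.connect(dbname: "postgres") # assumed connection target
long_running = conn.exec(<<~SQL)
  SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 minutes'
  ORDER BY runtime DESC
SQL
long_running.each do |row|
  puts "#{row['pid']}  #{row['runtime']}  #{row['query']}"
end
```

Setting a `statement_timeout` on application roles is a complementary guard, cancelling runaway queries automatically instead of relying on detection.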
Facial Processing Incident
Facial check processing was delayed because some checks lacked the metadata required to load them in our check processing tool. These checks were created by an automated recovery process, triggered after the earlier database failure, which did not add the necessary metadata.
Analysts were unable to process these checks, which delayed all facial checks until a fix was deployed to add the missing metadata.
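A fix of this shape is typically a one-off backfill. The sketch below is hypothetical: the table, column, and metadata values are illustrative stand-ins, not our real schema.

```ruby
require "pg"

# Hypothetical backfill: stamp default metadata onto facial checks that
# the recovery process created without it, so the tool can load them.
conn = PG.connect(dbname: "checks_db") # assumed database name
result = conn.exec(<<~SQL)
  UPDATE checks
  SET metadata = '{"recovered": true}'::jsonb
  WHERE check_type = 'facial'
    AND metadata IS NULL
SQL
puts "backfilled #{result.cmd_tuples} checks"
```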
Remedies
Immediate:
Future: