Facial Similarity and Known Faces service degradation

Incident Report for Onfido

Postmortem

At around 6pm UTC on 13th March 2025, we were alerted to higher turnaround times (and consequent delays) in processing Facial Similarity and Known Faces reports in the EU region.

This affected all clients running these reports during the roughly 15-minute period. The reports did not fail; they were only delayed.

Summary

Higher turnaround times (and consequent delays) in processing Facial Similarity and Known Faces reports.

Root Causes

  • Known Faces and Facial Similarity reports took longer than expected to be processed
  • because a database was struggling (heavy CPU usage)
  • because an ongoing query was monopolising the database
  • because the query was not optimised (and not configured to time out)
  • because the service that issued the query is an internal operational tool for report drill-down and investigation
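
As an illustration of the missing timeout in the chain above, here is a minimal sketch of how a client could cap statement run time, assuming a Python tool that connects with libpq-style parameters. The function name, host name, database name and the 5-second limit are hypothetical and not taken from the actual setup; only the PostgreSQL `statement_timeout` setting and the libpq `options` parameter are real.

```python
# Hypothetical sketch: connection settings that cap how long any one
# statement may run. Names and values are illustrative assumptions.

def search_connection_params(host: str, timeout_ms: int = 5000) -> dict:
    """Build libpq-style connection parameters with a per-statement timeout."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return {
        "host": host,
        "dbname": "reports",  # hypothetical database name
        # libpq forwards "-c name=value" to the server as a session setting;
        # a unitless statement_timeout is interpreted as milliseconds.
        "options": f"-c statement_timeout={timeout_ms}",
    }


params = search_connection_params("db.internal.example")
print(params["options"])
```

With a setting like this, PostgreSQL cancels any statement in the session that runs longer than the limit, so a single unoptimised query cannot hold the database's CPU indefinitely.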

Timeline

17:56 UTC: We are alerted to a high number of pending reports, caused by higher turnaround times in processing

18:03 UTC: A feature suspected of being the culprit is turned off, but nothing changes, so it is not the root cause

18:07 UTC: The problem stops and ongoing reports are processed normally again (further investigation shows this was unrelated to the feature that was turned off)

18:10 UTC: Investigation shows high CPU usage in the database

18:34 UTC: A query originating in the internal operational tool is identified as the culprit

18:35 UTC: Pending reports appear to be dropping, which would indicate that graceful recovery is under way. However, a quirk in the metric misleads us, and we realise that the pending reports are stuck

18:36 UTC: Pending reports are stuck and are not being recovered automatically, so we resort to re-running them manually

18:37 UTC: The search feature in the internal operational tool that causes the bad query is disabled (functionality removed)

19:00 UTC: We retrieve all of the affected reports from our logging platform

19:12 UTC: We have re-run all affected reports and the incident is over

Remedies

In order to make sure this doesn’t happen again:

  • We will remove the search feature from the internal operational tool for report drill down whilst we optimise the query powering it
  • We will only reinstate the search feature after the query is optimised and set to use a read replica instead of the primary (writer) instance of our PostgreSQL database
  • We will only reinstate the search feature after the query is optimised and adequate query timeout is set
  • We will fix the cron job for automatic and graceful recovery of pending reports
  • We have fixed the operational dashboards to use the right metric for pending reports monitoring
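
To illustrate the kind of check a recovery job for pending reports might perform, here is a minimal sketch in Python. It is not the actual cron job: the `Report` type, the `"pending"` status value, the 10-minute threshold and the `rerun` callback are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Report:  # hypothetical shape of a report record
    id: str
    status: str
    updated_at: datetime


def find_stuck_reports(reports, now, max_pending=timedelta(minutes=10)):
    """Return reports that have been pending for longer than max_pending."""
    return [
        r for r in reports
        if r.status == "pending" and now - r.updated_at > max_pending
    ]


def recover(reports, now, rerun):
    """Re-run every stuck report via the callback and return the ids handled."""
    recovered = []
    for report in find_stuck_reports(reports, now):
        rerun(report)  # hypothetical hook that re-queues the report
        recovered.append(report.id)
    return recovered


now = datetime(2025, 3, 13, 18, 36, tzinfo=timezone.utc)
reports = [
    Report("a", "pending", now - timedelta(minutes=30)),   # stuck
    Report("b", "pending", now - timedelta(minutes=2)),    # still in flight
    Report("c", "complete", now - timedelta(minutes=45)),  # already done
]
rerun_log = []
recovered_ids = recover(reports, now, rerun_log.append)
print(recovered_ids)
```

Selecting by age rather than by a count of pending reports means the job does not depend on a dashboard metric to decide which reports need to be re-run.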
Posted Mar 21, 2025 - 11:32 UTC

Resolved

All reports have been recovered. We're now back to normal processing and the incident is over.
Posted Mar 13, 2025 - 19:21 UTC

Monitoring

We're monitoring the re-run of pending reports, which is almost complete. As previously stated, ongoing processing is back to normal. We'll update again once all pending reports affected during the incident have been recovered.
Posted Mar 13, 2025 - 19:07 UTC

Identified

A bad query has been identified as the main culprit. We continue to investigate the issue.
Posted Mar 13, 2025 - 18:37 UTC

Update

Processing times are back to normal for ongoing reports. There are some pending reports being automatically re-run by our graceful handling of errors as we update this incident page. We are continuing to investigate the issue.
Posted Mar 13, 2025 - 18:24 UTC

Investigating

We are currently investigating higher processing times for Facial Similarity and Known Faces reports in the EU region.
Posted Mar 13, 2025 - 18:06 UTC
This incident affected: Europe (onfido.com) (Facial Similarity, Known faces).