At around 6pm UTC on 13th March 2025, we were alerted to higher turnaround times, and consequent delays, in processing Facial Similarity and Known Faces reports in the EU region.
This affected all clients running reports during a 15-minute period. No reports failed; they were only delayed.
Higher turnaround times (and consequent delays) in processing Facial Similarity and Known Faces reports.
17:56 UTC: We are alerted to a high number of pending reports, caused by higher turnaround times in processing
18:03 UTC: A feature suspected of being the culprit is turned off, but nothing changes, so it is ruled out as the root cause
18:07 UTC: The problem stops and ongoing reports are processed normally again (further investigation later shows this was unrelated to the feature that was turned off)
18:10 UTC: Investigation shows high CPU usage in the database
18:34 UTC: A query originating in an internal operational tool is identified as the culprit
18:35 UTC: The number of pending reports appears to be dropping, which would indicate that graceful recovery is under way. However, a quirk in the metric misleads us, and we realise the pending reports are stuck
18:36 UTC: Because the stuck reports are not being recovered automatically, we resort to manual action to re-run them
18:37 UTC: The search feature in the internal operational tool that caused the bad query is disabled (functionality removed)
19:00 UTC: We retrieve all of the affected reports from our logging platform
19:12 UTC: We have re-run all affected reports and the incident is over
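
The 18:35 and 18:36 entries show how a falling pending count can look like recovery while individual reports remain stuck. As a minimal sketch of one way to cross-check such a metric, the example below flags reports by how long they have been pending rather than by how many are pending. The `Report` record, its fields, the `find_stuck_reports` helper and the 10-minute threshold are all hypothetical and chosen for illustration; they do not describe our actual systems or the exact quirk in the metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Report:
    report_id: str        # hypothetical identifier
    status: str           # "pending" or "complete" (assumed states)
    created_at: datetime  # when the report was submitted


def find_stuck_reports(
    reports: List[Report],
    now: datetime,
    max_pending_age: timedelta = timedelta(minutes=10),  # arbitrary threshold
) -> List[Report]:
    """Return reports that have been pending for longer than max_pending_age."""
    return [
        r for r in reports
        if r.status == "pending" and now - r.created_at > max_pending_age
    ]


# Example data: one long-pending report, one recent one, one already complete.
now = datetime(2025, 3, 13, 18, 36, tzinfo=timezone.utc)
reports = [
    Report("r1", "pending", now - timedelta(minutes=40)),   # stuck
    Report("r2", "pending", now - timedelta(minutes=1)),    # within normal range
    Report("r3", "complete", now - timedelta(minutes=45)),  # finished
]
stuck_ids = [r.report_id for r in find_stuck_reports(reports, now)]
print(stuck_ids)
```

An age-based check of this kind does not depend on the pending count moving in the right direction, so it could be used alongside a count metric to decide when a manual re-run is needed.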
In order to make sure this doesn’t happen again: