Increased latency on check processing
Incident Report for Onfido
Postmortem

Summary

From Jan 2nd, 2024, 14:28 UTC, until 16:06 UTC the same day, Turn Around Time (TaT) increased for a subset of document checks being processed. The percentage of document reports impacted in each region was as follows:

  • EU: ~50% of reports;
  • US: ~40% of reports;
  • CA: not impacted.

The majority of these reports were completed immediately after the problem was resolved.

The remaining reports, which needed manual review, were progressively cleared, with all reports completed by Jan 3rd, 2024, 06:00 UTC.

Root Causes

The release of a service responsible for processing part of the document check introduced a functional bug, which resulted in incomplete processing of some document reports.

Timeline

Times are displayed in UTC

Jan 2nd 2024

  • 14:28: Our automated monitoring raises an alert for an elevated timeout rate
  • 14:53: Incident is declared once impact has been assessed
  • 15:00: Public status page is created
  • 15:08: As a precaution, we start rolling back some services that were released shortly before the issue started, to remove potentially impactful changes while the investigation continues
  • 15:34: The rollback finishes, but no improvement is observed, so the investigation continues
  • 15:52: The root cause is identified, and a revert of the faulty code is started
  • 16:04: The revert is complete, and we start processing newly submitted live document reports normally
  • 16:06: Incomplete reports submitted during the incident start to complete

Jan 3rd 2024

  • 06:00: All affected reports are complete
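
As an illustration only, the intervals implied by the timeline above can be worked out with a short Python sketch. The event names and helper functions are hypothetical labels chosen for this example, not Onfido terminology; the timestamps are the ones listed in the timeline.

```python
from datetime import datetime, timezone


def utc(day, hhmm):
    """Build a UTC timestamp in January 2024 from a day number and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 1, day, hour, minute, tzinfo=timezone.utc)


# Timestamps copied from the timeline; the keys are illustrative names.
TIMELINE = {
    "alert_raised": utc(2, "14:28"),
    "incident_declared": utc(2, "14:53"),
    "root_cause_identified": utc(2, "15:52"),
    "revert_complete": utc(2, "16:04"),
    "incomplete_reports_start_completing": utc(2, "16:06"),
    "all_reports_complete": utc(3, "06:00"),
}


def minutes_since_alert(event):
    """Whole minutes elapsed between the first alert and the given event."""
    delta = TIMELINE[event] - TIMELINE["alert_raised"]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    for event in TIMELINE:
        print(f"{event}: +{minutes_since_alert(event)} min")
```

Under these assumptions, the sketch gives 25 minutes from the first alert to incident declaration, 84 minutes to root cause identification, 96 minutes to completion of the revert, and 932 minutes (15 hours 32 minutes) until all affected reports were complete.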

Remedies

Additional E2E tests are being added to cover relevant edge cases (ETA: Q1 2024).

Our observability procedures are being reviewed to improve the recovery time for document report processing incidents (ETA: Q2 2024).

Posted Jan 08, 2024 - 18:33 UTC

Resolved
This incident is resolved. Most of the reports that were queued as a result of this issue have been cleared and turnaround times are returning to normal. Any remaining queued reports are expected to complete within the next five hours. We apologize for the degraded service.
Posted Jan 02, 2024 - 19:26 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 02, 2024 - 16:26 UTC
Monitoring
We have implemented a fix for this issue in the EU region.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.
Posted Jan 02, 2024 - 16:23 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 02, 2024 - 16:05 UTC
Investigating
We are currently investigating this issue.
Posted Jan 02, 2024 - 15:00 UTC
This incident affected: USA (us.onfido.com) (Document Verification), Canada (ca.onfido.com) (Document Verification), and Europe (onfido.com) (Document Verification).