Delays in Document checks processing times in all regions
Incident Report for Onfido
Postmortem

Summary

On Wednesday the 12th of May, we applied an infrastructure change to provision resources for a new service. This change unintentionally caused Document Check processing to fail for all checks created in the EU, US and CA regions. As a result, between 15:22 UTC and 16:16 UTC, all Flash checks were completed as Failed and all Boost checks were sent to Manual processing.

Root Causes

The infrastructure change described above removed database access for a core component responsible for processing Document Checks. Queries to this database had been gated behind a feature flag, but we later discovered that the flag had been removed and that the database was being queried for all checks created at Onfido.

Incoming traffic in our pre-production environment was too low to trigger the alerts that would have allowed us to detect the issue. The issue would only have been visible in the pre-production environment through our end-to-end tests, which are not triggered automatically for infrastructure changes.

Timeline

  • 15:22 UTC: Infrastructure change was applied to the CA region, together with additional, previously pending changes that were not intended to be included.
  • 15:39 UTC: Engineering team was alerted to issues in Document Check processing in CA.
  • 15:44 UTC: Infrastructure changes were automatically applied to the US and EU regions.
  • 15:49 UTC: US and EU alerts fired for Document Check processing.
  • 15:49 UTC: Infrastructure changes started being reverted.
  • 16:16 UTC: Issue resolved when the reversion of the infrastructure changes completed.
  • 16:19 UTC: Document Check processing was restored to normal.

Remedies

Immediately:

  • Improve our alerting infrastructure to alert the team about abnormal error rates earlier and more clearly.
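
As an illustration of what alerting on abnormal error rates can look like, the following is a minimal sketch and not a description of Onfido's actual alerting stack. The class name, window size, threshold and minimum sample count are all hypothetical. It tracks the outcomes of the most recent checks and signals an alert once the failure rate over that window exceeds a threshold. The minimum sample count is a common guard against noisy alerts, and it also shows why a low-traffic environment may never reach the point of alerting.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks recent check outcomes and flags abnormal failure rates.

    All parameter values are hypothetical and chosen for illustration only.
    """

    def __init__(self, window_size=50, threshold=0.2, min_samples=10):
        self.threshold = threshold        # alert when the failure rate exceeds this
        self.min_samples = min_samples    # do not alert on fewer outcomes than this
        self._outcomes = deque(maxlen=window_size)  # True means the check failed

    def record(self, failed):
        """Record the outcome of one processed check."""
        self._outcomes.append(bool(failed))

    def error_rate(self):
        """Fraction of failed checks in the current window (0.0 if empty)."""
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self):
        """True when enough outcomes were seen and too many of them failed."""
        enough_data = len(self._outcomes) >= self.min_samples
        return enough_data and self.error_rate() > self.threshold
```

With these assumed values, nine consecutive failures do not raise an alert because the window holds fewer than `min_samples` outcomes; the tenth failure does. Lowering `min_samples` makes the alert fire earlier at the cost of more false positives.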

In addition, we will:

  • Review infrastructure change release process to prevent batching of unintended pending changes.
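
One way to guard against unintended pending changes being batched into a release is to compare what is about to be applied with what the release was meant to contain. The sketch below is an assumption about how such a check could look, not Onfido's release tooling. The function names and resource identifiers are hypothetical.

```python
def unexpected_changes(planned, expected):
    """Return planned resource changes that the release was not meant to include."""
    return sorted(set(planned) - set(expected))


def can_apply(planned, expected):
    """Allow the release only when every planned change was expected."""
    return not unexpected_changes(planned, expected)


# Hypothetical example: the release is meant to provision one new resource,
# but a previously pending change is also queued in the plan.
planned = ["new_service.queue", "document_processor.db_access"]
expected = ["new_service.queue"]

if not can_apply(planned, expected):
    print("Blocked, unexpected changes:", unexpected_changes(planned, expected))
```

A check like this would run before the first region is changed, so that a batched change is caught before it reaches any production traffic.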
Posted Jun 24, 2021 - 13:38 UTC

Resolved
This issue is now resolved and our backlog has been cleared.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted May 12, 2021 - 19:58 UTC
Monitoring
We have applied the fix to all regions. We are currently monitoring and working through our check backlog.
Posted May 12, 2021 - 16:19 UTC
Identified
We have identified the root cause and are applying a fix to all regions now.

We will provide another update in 15 minutes.
Posted May 12, 2021 - 16:16 UTC
Investigating
We're currently experiencing delays in document check processing times.

We will provide an update in 15 minutes.
Posted May 12, 2021 - 16:01 UTC