Delays in check processing in the US
Incident Report for Onfido
Postmortem

Summary

Substantial spikes in check creation resulted in increased API latency and delays in check processing for api.us.onfido.com between 19:30 and 21:00 UTC.

Root Causes

A significant increase in volume resulted in a backlog of checks to be processed. This queue started to exceed our allowable threshold for check turnaround time. The mitigating actions for this event led to a short-term increase in contention for a critical database, degrading API latency, particularly for the create check operation.

Timeline

  • 19:40 UTC: Received on-call alarm regarding high number of queued requests
  • 20:22 UTC: Scaled up queue workers to reduce the outstanding queue size. Because the database was already under heavy load, the additional workers increased contention further and degraded API latency
  • 20:24 UTC: Began operation to increase the size of the database to reduce contention and provide greater capacity to account for the surge
  • 20:58 UTC: Shed load from high-traffic customers to reduce traffic backlog
  • 20:59 UTC: Backlog stopped increasing and recovery began; latency gradually returned to normal values
  • 21:58 UTC: Re-executed the backlog of outstanding reports; at 22:10 UTC the incoming queue was cleared

Remedies

  • We will review our rate limits and our approach to maintaining fairness across tenants. Our current implementation is relatively simple and could be improved to provide more flexible load management tools for surge events, both for customers and for operations within Onfido
  • We've adjusted our runbooks to account for the saturation scenario during scale-up events
  • We have re-forecast capacity within the region and proactively adjusted, notably for our database tier
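
To illustrate the kind of per-tenant fairness mentioned in the first remedy, the following is a minimal sketch of a token-bucket rate limiter keyed by tenant. It is an assumption for illustration only and does not describe Onfido's actual implementation; the names `TokenBucket`, `TenantRateLimiter`, `rate`, and `capacity` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical bucket for one tenant: refills at `rate` tokens per second, up to `capacity`."""
    rate: float
    capacity: float
    tokens: float
    updated_at: float

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Keeps an independent bucket per tenant so one tenant's surge cannot use another's budget."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, now: float) -> bool:
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            # New tenants start with a full bucket (an assumption of this sketch).
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[tenant_id] = bucket
        return bucket.allow(now)


# Example: tenant "a" exhausts its burst, while tenant "b" is unaffected.
limiter = TenantRateLimiter(rate=1.0, capacity=2.0)
results_a = [limiter.allow("a", now=0.0) for _ in range(3)]
result_b = limiter.allow("b", now=0.0)
```

In this sketch the clock is passed in as `now` rather than read inside the limiter, which keeps the behaviour deterministic and easy to test. A production limiter would more likely use a monotonic clock and shared storage across API instances.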
Posted Jun 01, 2021 - 12:07 UTC

Resolved
This issue is now resolved.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted May 06, 2021 - 07:01 UTC
Monitoring
We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.
Posted May 05, 2021 - 21:30 UTC
Investigating
We are currently experiencing check processing delays in the US region and are investigating.
Posted May 05, 2021 - 20:01 UTC