Increased latency on check processing in the EU region
Incident Report for Onfido
Postmortem

Summary

On Friday 17th March at 19:30 UTC, we were alerted to a sharp increase in check volumes. This resulted in extended processing times over a 3-hour period, with the highest impact being delays of 10 minutes before verification completion for automatically processed checks. Document checks requiring human-assisted review experienced extended processing times due to the large backlog of checks created. We were unable to process a limited set of checks during this period and had to re-submit and process them at a later date. A limited set of Facial similarity checks experienced extended processing times and a degradation in clear rates.

Root Causes

  • Our rate limiting capabilities were not enough to prevent the sharp increase in check volumes
  • Parts of our system were slow to scale to accommodate the sharp increase in check volumes

Remedies

  • We have since reviewed our rate limiting approach and capabilities and devised a plan for improvements
  • We are reviewing all parts of our system that failed to scale adequately and will address the identified bottlenecks immediately
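
For readers unfamiliar with rate limiting thresholds, the idea can be illustrated with a token bucket. This is a minimal sketch only: the report does not describe how Onfido's rate limiting is implemented, and the class name, parameters and numbers below are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Hypothetical per-client limiter; not Onfido's implementation."""
    rate: float      # tokens refilled per second (sustained request rate)
    capacity: float  # maximum number of tokens (largest allowed burst)
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self):
        # Start with a full bucket so an idle client can burst immediately.
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, now=None):
        """Return True if one request may proceed, False if it should be rejected."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In this sketch, lowering `rate` or `capacity` for a client corresponds to tightening a threshold: requests beyond the allowance are rejected at the edge and never reach the processing pipeline.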

Timeline

March 17th 2023

  • 19:30 UTC: Our on-call team were alerted to a sharp increase in check volumes
  • 19:40 UTC: We identified the source of the traffic and attempted to manually scale parts of the system, as the automatic scaling rules were not sufficient.
  • 21:49 UTC: Rate limiting thresholds were adjusted, which immediately reduced the load on our system.
  • 22:08 UTC: Processing times for automatically processed checks went back to normal.

March 18th 2023

  • 06:00 UTC: The backlog of document checks requiring manual review was cleared.

March 22nd 2023

  • 10:00 UTC: We were alerted to a set of checks that had failed to process during the incident period.
  • 14:21 UTC: We re-submitted and successfully processed the set of checks that had failed to process during the incident period.
Posted Mar 24, 2023 - 14:57 UTC

Resolved
This issue is now resolved: Increased latency on check processing in the EU region.

Although the issue is resolved, we expect a higher turnaround time (TaT) for checks that need manual review, as our backlog grew during this incident. Please bear with us while we clear the backlog and return to the typical TaT.

We take a lot of pride in running a robust, reliable service and are working hard to ensure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Mar 17, 2023 - 22:45 UTC
Update
We are continuing to monitor for any further issues.
Posted Mar 17, 2023 - 22:20 UTC
Monitoring
We have identified the source of the increased latency and implemented a fix for this issue.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.
Posted Mar 17, 2023 - 22:19 UTC
Investigating
We're currently experiencing issues that are negatively impacting latency on check processing in the EU region.

Our next update will be in 15 minutes. Thank you for your patience.
Posted Mar 17, 2023 - 22:01 UTC
This incident affected: Europe (onfido.com) (API, Applicant Form, Document Verification).