For the EU region, one critical service struggled to reprocess the traffic affected by a previous faulty release, which led to a higher Turnaround Time (TaT) for all Document Reports created between 10:32 and 10:50 UTC.
All impacted Document Reports were completed successfully with an average TaT of ~6 minutes.
The reprocessing batch caused a spike in traffic, and auto-scaling did not work as expected for one critical service. While waiting for a downstream ML inference service to scale up, the service accepted an unbounded number of in-flight requests, which led to memory exhaustion and an unresponsive I/O event loop. The service entered a crash loop and had to be manually scaled up to recover.
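The direct mitigation for this failure mode is to cap the number of in-flight requests so that excess load queues (or is shed) instead of accumulating in memory while the downstream service scales. The sketch below is a minimal illustration assuming an asyncio-based service; `handle_report`, `call_ml_inference`, and `MAX_IN_FLIGHT` are hypothetical names for this example, not the actual service code.

```python
import asyncio

# Assumed cap on concurrent in-flight requests, tuned to available memory.
MAX_IN_FLIGHT = 200

# Requests beyond the cap wait here instead of piling up in memory
# while the downstream ML inference service scales up.
semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

async def call_ml_inference(report_id: str) -> None:
    # Placeholder for the real downstream ML inference call.
    await asyncio.sleep(0.1)

async def handle_report(report_id: str) -> None:
    # Bound concurrency: at most MAX_IN_FLIGHT reports are processed at once.
    async with semaphore:
        await call_ml_inference(report_id)

async def main() -> None:
    # Reprocess a backlog of reports concurrently, but never exceed the cap.
    await asyncio.gather(*(handle_report(f"report-{i}") for i in range(1000)))

if __name__ == "__main__":
    asyncio.run(main())
```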
10:32 UTC: The critical service's error rate rose to 100%
10:33 UTC: Engineers who initiated the report backlog reprocessing became aware of the issue through our monitoring and started investigating
10:48 UTC: We manually scaled up the critical service
10:50 UTC: The service returned to normal and errors stopped