For the EU region, one critical service struggled to reprocess the traffic affected by a previous faulty release, which led to a higher Turnaround Time (TaT) for all Document Reports created between 10:32 and 10:50 UTC.
All impacted Document Reports were completed successfully with an average TaT of ~6 minutes.
The reprocessing batch caused a spike in traffic, and auto-scaling did not work as expected for one critical service. While waiting for a downstream ML inference service to scale up, the service accepted an unbounded number of in-flight requests, which led to memory exhaustion and an unresponsive I/O event loop. The service entered a crash loop and had to be manually scaled up to recover.
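The direct mitigation for this failure mode is to cap the number of in-flight requests so that excess load queues (or is shed) instead of accumulating in memory while the downstream service scales. The sketch below is a minimal illustration assuming an asyncio-based service; `handle_report`, `call_ml_inference`, and `MAX_IN_FLIGHT` are hypothetical names for this example, not the actual service code.

```python
import asyncio

# Assumed cap on concurrent in-flight requests, tuned to available memory.
MAX_IN_FLIGHT = 200

# Requests beyond the cap wait here instead of piling up in memory
# while the downstream ML inference service scales up.
semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

async def call_ml_inference(report_id: str) -> None:
    # Placeholder for the real downstream ML inference call.
    await asyncio.sleep(0.1)

async def handle_report(report_id: str) -> None:
    # Bound concurrency: at most MAX_IN_FLIGHT reports are processed at once.
    async with semaphore:
        await call_ml_inference(report_id)

async def main() -> None:
    # Reprocess a backlog of reports concurrently, but never exceed the cap.
    await asyncio.gather(*(handle_report(f"report-{i}") for i in range(1000)))

if __name__ == "__main__":
    asyncio.run(main())
```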
10:32 UTC: The critical service's error rate rose to 100%
10:33 UTC: Engineers who initiated the report backlog reprocessing became aware of the issue through our monitoring and started investigating
10:48 UTC: We manually scaled up the critical service
10:50 UTC: The service returned to normal and errors stopped