Document Reports not being processed in US

Incident Report for Onfido

Postmortem

Summary

At around 10:29 UTC, roughly 10 minutes after service was restored in the EU and CA regions following the 10:05-10:17 UTC outage, we were alerted to a high error rate in the US region and found that the new ML model version was back online there. This impacted Turnaround Time (TaT) for all Document Reports and Studio Autofill tasks until 12:11 UTC.

Autofill Classic had a 100% error rate during the same time frame.

All impacted Document Reports and Studio Autofill tasks were completed successfully with an average TaT of ~40 minutes.

Root Causes

A previously rolled back version of an ML model went back live automatically in the US region. The automated canary reversal did not succeed in that region, leaving the deployment in an inconsistent state. As a result, the model routing labels were not updated as expected during the next forced deployment, and traffic continued to be served by pods running the incorrect model version until we manually rolled back the model in the cluster. This was the result of a bug in our version of Helm.
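The failure mode described above, where pods keep receiving live traffic while running a model version other than the intended one, can be illustrated with a small post-deployment check. This is a hypothetical sketch, not the actual deployment tooling: the `Pod` fields, the `routing_label` values ("live", "canary"), and the version strings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    model_version: str  # model version the pod actually runs (hypothetical field)
    routing_label: str  # label value the router selects on (hypothetical field)


def misrouted_pods(pods, live_label, expected_version):
    """Return pods that receive live traffic but run an unexpected model version."""
    return [
        p for p in pods
        if p.routing_label == live_label and p.model_version != expected_version
    ]


# Illustrative state after a rollback that failed to update one pod's label.
pods = [
    Pod("model-a", "v1", "live"),
    Pod("model-b", "v2", "live"),    # stale label: still routed, wrong version
    Pod("model-c", "v2", "canary"),  # not selected for live traffic
]

for p in misrouted_pods(pods, live_label="live", expected_version="v1"):
    print(f"{p.name} serves live traffic with model {p.model_version}")
```

Running a check of this kind after every deployment or rollback would flag a mismatch between routing labels and model versions even when the deployment step itself reports success.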

Timeline

10:29 UTC: The new model release went back live in the US region only, after the previous automated canary rollback

11:44 UTC: We manually re-deployed a previous version via the CI/CD pipeline. This was not successful and errors continued.

12:00 UTC: We manually rolled back the version of the model directly in the cluster.

12:11 UTC: All services went back to normal.

Remedies

  • We will fix our deployment tooling to reliably apply model-routing label changes to these resources by upgrading Helm.
Posted Jan 30, 2026 - 11:39 UTC

Resolved

This issue is now resolved:

Between 10:30 and 12:00 UTC, document reports could not be processed and autofill requests were failing in the US region. Starting from 12:00, live traffic was unaffected. From 12:00 to 12:30 UTC, pending reports were processed.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Jan 21, 2026 - 12:34 UTC

Monitoring

We have deployed a fix. New document reports and autofill requests are being processed normally. We will rerun the earlier impacted reports.
Posted Jan 21, 2026 - 12:07 UTC

Identified

We have identified the issue as a configuration issue between services, and are working on a fix.
Posted Jan 21, 2026 - 11:49 UTC

Investigating

We are seeing a high error rate in document report processing and autofill. Our team is investigating.
Posted Jan 21, 2026 - 11:05 UTC
This incident affected: USA (us.onfido.com) (Document Verification, Autofill).