Document report processing disrupted in EU

Incident Report for Onfido

Postmortem

Summary

The combined deployment of a new Document Extraction ML model and its accompanying service, which together support Document Verification and Autofill, replaced the previous model version instantly while rolling out the application layer via canary.

When we were alerted to errors during the canary release, we triggered a manual rollback of both the application layer and the ML model, which failed due to a bug in the model deployment framework.

Subsequently, the canary release for the application layer failed to progress due to the high error rate and automatically reverted both the application and model layers, restoring the service to a healthy state.

The service unavailability resulted in increased Turnaround Time (TaT) for all Document reports and Studio Autofill tasks created between 10:05 and 10:17 UTC. Autofill Classic had an average error rate of 70% during that time frame.

All impacted Document Reports and Studio Autofill tasks were completed successfully with an average TaT of ~6 minutes.

Root Causes

Our model deployment system doesn’t support replacing a model version without downtime. Application and model pods are deployed without synchronization or ordering guarantees, which requires a 3-step release (add the new version, update the application layer, remove the old version).
The lack of automated guardrails preventing releases that replace a model version in a single step, combined with human error, led to the service unavailability.
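
For illustration, the safe sequence looks roughly like the sketch below. The release object, its methods, and the function name are placeholders for our deployment tooling, not real APIs.

    # Conceptual sketch of the 3-step, zero-downtime model replacement
    # described above. All names here are illustrative assumptions.

    def replace_model_version(release, old_version: str, new_version: str) -> None:
        # Step 1: deploy the new model version alongside the old one, so both
        # versions are available while traffic still targets the old version.
        release.deploy_model(new_version)

        # Step 2: roll out the application layer via canary, switching its
        # model reference to the new version once the canary is healthy.
        release.update_application(model_version=new_version)

        # Step 3: remove the old model version only after the application
        # rollout has fully completed.
        release.remove_model(old_version)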

Our rollback mechanism for models didn’t correctly restore the model config map in the Kubernetes cluster, which caused the rollback procedure to fail for both the application and the models when triggered manually. It succeeded during the automated canary reversal.
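
As a rough illustration of the fix direction, the sketch below snapshots the model config map before a release and writes it back on rollback, using the official Kubernetes Python client. The namespace and config map name are placeholders, not our real resource names, and our deployment framework wraps this logic differently.

    from kubernetes import client, config

    NAMESPACE = "extraction"        # placeholder namespace
    CONFIG_MAP = "model-versions"   # placeholder config map name

    def _api() -> client.CoreV1Api:
        # Use config.load_incluster_config() when running inside the cluster.
        config.load_kube_config()
        return client.CoreV1Api()

    def snapshot_model_config() -> dict:
        # Capture the config map data before the release starts.
        cm = _api().read_namespaced_config_map(CONFIG_MAP, NAMESPACE)
        return dict(cm.data or {})

    def restore_model_config(snapshot: dict) -> None:
        # Write the pre-release contents back so pods resolve the old model list.
        cm = _api().read_namespaced_config_map(CONFIG_MAP, NAMESPACE)
        cm.data = snapshot
        _api().replace_namespaced_config_map(CONFIG_MAP, NAMESPACE, cm)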

Timeline

10:05 UTC: Production release of the extraction service updating an ML model

10:08 UTC: We unsuccessfully attempted a manual rollback after noticing errors on a metrics dashboard

10:10 UTC: Automatic monitors alerted the on-call team due to the high error rate

10:16 UTC: The canary release was aborted automatically due to the high error rate

10:18 UTC: All services went back to normal

Remedies

  1. Add guardrails to the release pipeline in order to enforce a safe 3-step release process (new ML model deployment, application layer update, old ML model removal); a sketch of such a check follows this list

    1. We will separate the CI/CD pipelines for ML model deployment and the application layer
    2. Model switching will be done exclusively on the application layer
  2. Fix the manual rollback mechanism for models, in particular for releases where we remove a model version
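
A minimal sketch of the kind of guardrail check we have in mind for the release pipeline: it compares the model versions in the current and proposed releases and rejects any change that adds and removes versions in the same step. The data shapes, function name, and version names are illustrative assumptions.

    # Illustrative pipeline check: fail the release if it replaces a model
    # version in a single step (i.e. adds and removes versions at once).

    def check_model_release(current_versions: set[str], proposed_versions: set[str]) -> None:
        added = proposed_versions - current_versions
        removed = current_versions - proposed_versions
        if added and removed:
            raise ValueError(
                "Release adds %s and removes %s in one step; "
                "split it into add -> application update -> remove."
                % (sorted(added), sorted(removed))
            )

    # Example: this would fail and block the pipeline.
    # check_model_release({"doc-extraction-v7"}, {"doc-extraction-v8"})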

Posted Jan 30, 2026 - 11:40 UTC

Resolved

Between 12:15 UTC and 12:40 UTC there was disruption to document report processing in EU. Autofill requests also saw a disruption between 12:16 UTC and 12:27 UTC. A postmortem will follow.
Posted Jan 26, 2026 - 12:00 UTC