We deployed a new Document Extraction ML model together with its service, which supports Document Verification and Autofill. The release replaced the previous model version instantly, while the application layer was rolled out via canary.
When we were alerted to errors during the canary release, we triggered a manual rollback of both the application layer and the ML model, which failed due to a bug in the model deployment framework.
Subsequently, the canary release for the application layer failed to progress due to a high error rate and automatically reverted both the application and model layers, returning the service to a healthy state.
The service unavailability resulted in increased Turnaround Time (TaT) for all Document Reports and Studio Autofill tasks created between 10:05 and 10:17 UTC. Autofill Classic had an average error rate of 70% during that time frame.
All impacted Document Reports and Studio Autofill tasks were completed successfully with an average TaT of ~6 minutes.
Our model deployment system doesn’t support model version replacement without downtime. Application and model pods are deployed without synchronization or any specific order, so a safe release requires three steps (add the new model version, update the application layer, remove the old model version).
The lack of automated guardrails preventing a release from replacing a model version in a single step allowed human error to cause the service unavailability.
Our rollback mechanism for models didn’t correctly restore the model config map in the Kubernetes cluster, which caused the manually triggered rollback to fail for both the application and the models. The automated canary reversal succeeded.
10:05 UTC: Production release of extraction service updating an ML model
10:08 UTC: We unsuccessfully tried to roll back manually after noticing errors on a metrics dashboard
10:10 UTC: Automatic monitors alerted the On-call team due to a high error rate
10:16 UTC: The canary release was aborted automatically due to a high error rate
10:18 UTC: All services went back to normal
Add guardrails to the release pipeline in order to enforce a 3-step safe release process (new ML model deployment, application layer update, ML model removal)
Fix the manual rollback mechanism for models, in particular for releases where we remove a model version
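
As an illustration of the first action item, the following is a minimal sketch of what a pipeline guardrail for the 3-step release could check. It is not our actual pipeline code: the `State` type, the `release_violations` function, and the version labels are hypothetical names introduced for this example. The sketch assumes the pipeline can see which model versions are served and which model version the application layer calls, both before and after the proposed release.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical snapshot of a deployment (not a real pipeline type)."""
    model_versions: frozenset[str]  # model versions being served
    app_model_version: str          # model version the application layer calls


def release_violations(current: State, proposed: State) -> list[str]:
    """Return the reasons why moving from `current` to `proposed` is unsafe."""
    violations = []
    # The running application keeps serving traffic during the canary,
    # so the model version it calls must survive this release.
    if current.app_model_version not in proposed.model_versions:
        violations.append(
            f"removes model {current.app_model_version} still used by the running application"
        )
    # Pods start in no guaranteed order, so the application may only be
    # pointed at a model version that was already live before this release.
    if proposed.app_model_version not in current.model_versions:
        violations.append(
            f"application targets model {proposed.app_model_version} that is not yet deployed"
        )
    return violations


# The three safe steps, using illustrative version labels "v1" and "v2".
start = State(frozenset({"v1"}), "v1")
step1 = State(frozenset({"v1", "v2"}), "v1")  # add new model version
step2 = State(frozenset({"v1", "v2"}), "v2")  # update application layer
step3 = State(frozenset({"v2"}), "v2")        # remove old model version

# A single-step replacement, the pattern this guardrail should block.
single_step = State(frozenset({"v2"}), "v2")
```

With these rules, each of the three steps yields no violations, while the single-step replacement from `start` to `single_step` is rejected on both checks. The pipeline would fail the release whenever the returned list is non-empty.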