Studio Degradation

Incident Report for Onfido

Postmortem

Summary

On 19 November 2024 at 18:45 UTC, a change to a database index in the EU and US regions severely impacted the creation and management of workflow runs via the Studio API and the SDK.

Root Causes

The addition of a column to an existing index in a core database table, aimed at improving the performance of a specific combination of filters in the Dashboard results page, was performed by first dropping the existing index, and then recreating it with the additional column. The first operation resulted in a spike in CPU overhead in all database operations involving that table, which deprioritized the second operation. The instability of the system continued until the new index was force-created.
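The drop-then-recreate ordering described above left the table without the index for the duration of the rebuild. A minimal sketch of the opposite ordering (build the widened index first, then swap) is shown below. The report does not name the database engine, so PostgreSQL syntax is assumed here, and the table, index, and column names are hypothetical.

```python
def safe_index_swap(table, old_index, columns, new_index=None):
    """Return SQL statements, in order, that widen an index without a gap.

    Assumes PostgreSQL syntax; all identifiers passed in are hypothetical.
    """
    new_index = new_index or f"{old_index}_v2"
    cols = ", ".join(columns)
    return [
        # 1. Build the wider index while the old one keeps serving queries.
        f"CREATE INDEX CONCURRENTLY {new_index} ON {table} ({cols});",
        # 2. Only once the new index exists, drop the old one.
        f"DROP INDEX CONCURRENTLY {old_index};",
        # 3. Optionally keep the original name for existing tooling.
        f"ALTER INDEX {new_index} RENAME TO {old_index};",
    ]


# Hypothetical example: add "created_at" to an index on "status".
statements = safe_index_swap(
    "workflow_runs", "idx_runs_status", ["status", "created_at"]
)
for statement in statements:
    print(statement)
```

With this ordering, queries that depend on the index always have one available, at the cost of temporarily storing both indexes.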

Timeline

The timeline below covers 19 November 2024; all timestamps are in UTC:

18:44:25 - operation dropping the index was started
18:45:37 - first request fails due to statement timeout
18:49:00 - alarm triggers for high surge of 5xx HTTP errors for the Studio API
19:18:02 - incident was reported
19:47:00 - manual force-creation of the index started in the US
19:57:00 - US region recovered
19:59:00 - manual force-creation of the index started in the EU
20:09:00 - EU region recovered
20:47:09 - incident resolved
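
As an illustration only, the intervals implied by the timestamps above can be derived with a short script. The labels are descriptive names chosen here, not terms from the report, and the first failed request at 18:45:37 is taken as the start of impact.

```python
from datetime import datetime


def elapsed(start, end):
    """Return the time between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


FIRST_FAILURE = "18:45:37"  # first request failed due to statement timeout

# Milestones copied from the timeline; the keys are illustrative labels.
milestones = {
    "alarm triggered": "18:49:00",
    "incident reported": "19:18:02",
    "US region recovered": "19:57:00",
    "EU region recovered": "20:09:00",
}

durations = {
    label: str(elapsed(FIRST_FAILURE, stamp))
    for label, stamp in milestones.items()
}

for label, duration in durations.items():
    print(f"{label}: {duration} after first failure")
```

Under that assumption, the US region was impacted for roughly 1 hour 11 minutes and the EU region for roughly 1 hour 23 minutes.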

Remedies

Integrate database migration acceptance rules and broaden list of reviewers;
Introduce a kill switch to enforce Studio API “maintenance mode” in order to be able to prioritize recovery actions and reduce overall Mean Time To Recovery;
Full split of data migration pipeline from code deployment pipeline;
Split Dashboard read-only post-execution traffic from critical path.
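
The kill switch listed among the remedies could take many forms; the report does not describe its design. A minimal sketch, assuming a flag read from the environment and a request handler that returns a (status, headers, body) tuple, might look like the following. The flag name, handler, and response shape are all hypothetical.

```python
import os


def maintenance_guard(handler, flag_name="STUDIO_MAINTENANCE_MODE", env=os.environ):
    """Wrap a handler so it short-circuits while the kill switch is on.

    The flag name and the (status, headers, body) response shape are
    assumptions made for illustration.
    """
    def guarded(request):
        if env.get(flag_name) == "1":
            # Reject early so the database is left free for recovery work.
            return 503, {"Retry-After": "60"}, "Studio API is in maintenance mode"
        return handler(request)
    return guarded


def create_workflow_run(request):
    """Hypothetical handler standing in for a Studio API endpoint."""
    return 201, {}, "created"


# Flag on: requests are rejected before reaching the handler.
guarded_on = maintenance_guard(create_workflow_run, env={"STUDIO_MAINTENANCE_MODE": "1"})
# Flag off: requests pass through unchanged.
guarded_off = maintenance_guard(create_workflow_run, env={})

print(guarded_on({"path": "/workflow_runs"}))
print(guarded_off({"path": "/workflow_runs"}))
```

Rejecting requests at the edge keeps client traffic from competing with recovery operations such as an index rebuild, which is the stated aim of reducing Mean Time To Recovery.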

Posted Nov 28, 2024 - 10:27 UTC

Resolved

This incident is resolved. A postmortem with more details will be provided soon.

Posted Nov 19, 2024 - 20:43 UTC

Monitoring

A fix was deployed 15 minutes ago; clients should see all systems back to normal. We're still monitoring.

Posted Nov 19, 2024 - 20:24 UTC

Identified

Clients using the Studio feature are seeing 5xx errors. We're restoring the service.

Posted Nov 19, 2024 - 19:37 UTC

This incident affected: Europe (onfido.com) (API, Dashboard) and USA (us.onfido.com) (API, Dashboard).