Summary
At 18:45 UTC on 19 November 2024, a change to a database index in the EU and US regions severely disrupted the creation and management of workflow runs via the Studio API and the SDK.
Root Causes
To improve the performance of a specific combination of filters on the Dashboard results page, a column was added to an existing index on a core database table. The change was performed by first dropping the existing index and then recreating it with the additional column. Dropping the index caused a spike in CPU usage across all database operations involving that table, which in turn deprioritized the recreation of the index. The system remained unstable until the new index was force-created.
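The ordering problem described above can be illustrated with a minimal sketch. The production database engine is not named in this report, so the sketch uses Python's built-in sqlite3 module as a stand-in. The table, column, and index names (workflow_runs, status, created_at, idx_runs_status, idx_runs_status_created) are hypothetical. It shows one possible safer ordering: create the wider index under a new name first, and drop the old index only once the replacement exists, so the table is never left without a usable index.

```python
import sqlite3


def index_names(conn, table):
    """Return the set of index names currently defined on `table`."""
    rows = conn.execute(f"PRAGMA index_list({table})").fetchall()
    return {row[1] for row in rows}  # column 1 of index_list is the index name


def swap_index_safely(conn, table, old_index, new_index, columns):
    """Create the replacement index first, then drop the old one."""
    cols = ", ".join(columns)
    conn.execute(f"CREATE INDEX {new_index} ON {table} ({cols})")
    # Refuse to drop the old index unless the replacement is really there.
    if new_index not in index_names(conn, table):
        raise RuntimeError("replacement index missing; keeping old index")
    conn.execute(f"DROP INDEX {old_index}")


# Hypothetical schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_runs (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_runs_status ON workflow_runs (status)")

swap_index_safely(
    conn,
    "workflow_runs",
    old_index="idx_runs_status",
    new_index="idx_runs_status_created",
    columns=["status", "created_at"],
)
print(index_names(conn, "workflow_runs"))
```

On engines that support online index builds (for example PostgreSQL's CREATE INDEX CONCURRENTLY), the create step would typically use that option. Whether that applies to the database involved here is an assumption.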
Timeline
All timestamps below are in UTC and refer to 19 November 2024:
- 18:44:25 - operation dropping the index was started
- 18:45:37 - first request failed due to statement timeout
- 18:49:00 - alarm triggered for a surge of HTTP 5xx errors on the Studio API
- 19:18:02 - incident was reported
- 19:47:00 - manual force-creation of the index started in the US
- 19:57:00 - US region recovered
- 19:59:00 - manual force-creation of the index started in the EU
- 20:09:00 - EU region recovered
- 20:47:09 - incident resolved
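The intervals implied by the timeline can be worked out with a few lines of Python. This is only a sketch: the variable names are illustrative, and measuring the impact window from the first failed request to each region's recovery is an assumption made here, not a definition taken from the report.

```python
from datetime import datetime, timezone


def at(hms):
    """Build a UTC timestamp on 19 November 2024 from an HH:MM:SS string."""
    parsed = datetime.strptime(f"2024-11-19 {hms}", "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


drop_started = at("18:44:25")
first_failure = at("18:45:37")
alarm = at("18:49:00")
us_recovered = at("19:57:00")
eu_recovered = at("20:09:00")

# Delay between starting the index drop and the first failed request.
time_to_first_failure = first_failure - drop_started
# Delay between the first failed request and the 5xx alarm.
time_to_alarm = alarm - first_failure
# Assumed impact window per region: first failure until region recovery.
us_impact = us_recovered - first_failure
eu_impact = eu_recovered - first_failure

print(time_to_first_failure)  # 0:01:12
print(time_to_alarm)          # 0:03:23
print(us_impact)              # 1:11:23
print(eu_impact)              # 1:23:23
```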
Remedies
- Integrate database migration acceptance rules and broaden the list of reviewers;
- Introduce a kill switch to enforce Studio API “maintenance mode” in order to be able to prioritize recovery actions and reduce overall Mean Time To Recovery;
- Fully split the data migration pipeline from the code deployment pipeline;
- Split Dashboard read-only post-execution traffic from the critical path.
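As an illustration of what a migration acceptance rule might look like, the sketch below flags any migration that drops an index before creating a replacement. It is a deliberately simplistic, hypothetical check (the function name check_migration, the rule itself, and the sample SQL are all assumptions), not the rule set actually adopted.

```python
import re

# Match statements by their leading keywords only (hypothetical, simplistic).
DROP_INDEX = re.compile(r"^\s*DROP\s+INDEX\b", re.IGNORECASE)
CREATE_INDEX = re.compile(r"^\s*CREATE\s+(UNIQUE\s+)?INDEX\b", re.IGNORECASE)


def check_migration(statements):
    """Return violations for a migration given as an ordered list of SQL statements."""
    violations = []
    replacement_created = False
    for position, statement in enumerate(statements, start=1):
        if CREATE_INDEX.match(statement):
            replacement_created = True
        elif DROP_INDEX.match(statement) and not replacement_created:
            violations.append(
                f"statement {position}: DROP INDEX before any replacement index is created"
            )
    return violations


# Hypothetical migrations, used only for illustration.
risky = [
    "DROP INDEX idx_runs_status",
    "CREATE INDEX idx_runs_status ON workflow_runs (status, created_at)",
]
safe = [
    "CREATE INDEX idx_runs_status_created ON workflow_runs (status, created_at)",
    "DROP INDEX idx_runs_status",
]

print(check_migration(risky))
print(check_migration(safe))
```

A check of this kind could run in the review pipeline so that a drop-then-recreate migration is rejected before it reaches a reviewer. A real rule would also need to confirm that the created index actually replaces the dropped one.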