We were performing a routine upgrade of our network ingress controller from the “prior” version to a “next” version. This component is packaged and applied to API-related services whenever those services are deployed. A problem occurred when we packaged the “next” version while one of those services was in the middle of a deployment.
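To make the timing concrete, here is a minimal sketch of how this kind of race can arise, assuming (purely for illustration; the package channel, `resolve_ingress_package`, and the stage names are hypothetical and not our actual pipeline) that each deployment stage resolves whichever controller package is current at the moment it applies:

```python
# Purely illustrative: a deploy stage that resolves the "current" ingress
# controller package at apply time can observe a release that happened
# after the pipeline started, so later stages may apply a different
# controller version than earlier ones.

# Simulated package channel; in a real system this would be a registry.
INGRESS_CHANNEL = {"current": "ingress-controller:prior"}


def resolve_ingress_package() -> str:
    """Return whichever controller package is marked 'current' right now."""
    return INGRESS_CHANNEL["current"]


def deploy_stage(region: str) -> str:
    """Apply the API service plus its ingress controller to one region."""
    package = resolve_ingress_package()  # resolved per stage, not per pipeline run
    print(f"{region}: applying {package}")
    return package


def release_next_version() -> None:
    """A controller release that lands while a pipeline is still in flight."""
    INGRESS_CHANNEL["current"] = "ingress-controller:next"


if __name__ == "__main__":
    first = deploy_stage("staging")          # picks up "prior"
    release_next_version()                   # release happens mid-pipeline
    second = deploy_stage("eu-production")   # picks up "next"
    assert first != second                   # one pipeline run, two controller versions
```

Under that assumption, a release landing between two stages means a single pipeline run can apply two different controller versions.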
The problem rendered our external API and our Dashboard inaccessible in the EU region for 25 minutes.
The issue was as follows:
This situation had not previously occurred because the network ingress controller had never changed while an API service was mid-deployment.
The root causes were:
12:39 UTC: An API service deployment pipeline starts
12:56 UTC: The new network ingress controller is released
13:04 UTC: The API service pipeline triggers an EU Production deployment
13:05 UTC: A corrupted ingress configuration is installed, preventing traffic from being routed to the external API
13:10 UTC: We receive monitoring alerts for the loss of API traffic
13:15 UTC: Incident response is mobilized and begins investigating the problem
13:22 UTC: API service rollback is initiated
13:25 UTC: The API rollback completes, restoring a valid ingress configuration and allowing traffic to flow again
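The rollback restored service by reverting to a known-good configuration; the improvements below aim to catch an inconsistent configuration before it is ever applied. As a rough illustration only (the names `RenderedIngressConfig`, `guard_ingress_apply`, and the version-pinning scheme are hypothetical, not our actual tooling), a pre-apply guard in the pipeline might look like this:

```python
# Hypothetical pre-apply guard for a deployment pipeline: pin the ingress
# controller version when the pipeline run starts and refuse to apply a
# configuration rendered against a different version, so a mismatched
# apply fails fast instead of affecting traffic.

from dataclasses import dataclass


@dataclass
class RenderedIngressConfig:
    """A rendered ingress configuration, ready to be applied."""
    controller_version: str   # controller version the config was rendered for
    routes: dict[str, str]    # hostname -> backend service mapping


class IngressVersionMismatch(Exception):
    """Raised when a config was rendered against the wrong controller version."""


def guard_ingress_apply(pinned_version: str, config: RenderedIngressConfig) -> None:
    """Fail the pipeline before apply if the rendered config looks inconsistent."""
    if config.controller_version != pinned_version:
        raise IngressVersionMismatch(
            f"config rendered for {config.controller_version}, "
            f"but this pipeline run pinned {pinned_version}"
        )
    if not config.routes:
        raise ValueError("rendered ingress config has no routes; refusing to apply")


if __name__ == "__main__":
    pinned = "ingress-controller:prior"  # captured once, at pipeline start
    config = RenderedIngressConfig(
        controller_version="ingress-controller:next",  # a release landed mid-pipeline
        routes={"api.example.com": "api-service"},
    )
    try:
        guard_ingress_apply(pinned, config)
    except IngressVersionMismatch as exc:
        print(f"blocking deployment: {exc}")
```

The key idea is simply to decide the controller version once per pipeline run and refuse to apply anything rendered against a different version, rather than discovering the inconsistency after traffic is already being dropped.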
The following improvements will be made to the standard deployment pipeline to prevent similar network configuration inconsistencies from occurring: