On Monday 05/02/18 we experienced degraded processing times for document and facial similarity checks, which resulted in slower-than-usual check completion times between 8pm and 11pm GMT.
At 8pm the on-call team was notified of an increase in DNS resolution error rates across multiple services and began investigating.
At 10:30pm we reduced the load on our Kubernetes cluster, which lowered the error rate but did not fully mitigate the issue, and we continued investigating. At around 10:45pm we identified that networking within the cluster was degraded because of elevated error rates in etcd.
At 10:55pm we restarted the components within our cluster that manage etcd and networking, and the issue was fully remediated by 11pm.
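For illustration, the remediation was of the rough shape below. The component names (the CNI daemonset, the etcd endpoint) are placeholders, not our actual configuration; the script defaults to printing the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical sketch of the remediation steps: verify etcd health, then
# restart the networking and etcd-managing components. With DRY_RUN=1
# (the default) each command is printed instead of executed.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Check etcd member health first (assumes etcdctl v3; endpoint is a placeholder).
run etcdctl --endpoints=https://10.0.0.1:2379 endpoint health

# Restart the CNI daemonset (name is an assumption, e.g. flannel or calico).
run kubectl -n kube-system rollout restart daemonset kube-flannel-ds

# Recreate the apiserver pods so they re-establish connections to etcd
# (kubeadm-style label selector; adjust to your control-plane layout).
run kubectl -n kube-system delete pod -l component=kube-apiserver
```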
In the past few weeks we migrated several microservices to our Kubernetes cluster, and they have scaled in line with demand. The resulting increase in the number of containers and nodes stressed our etcd configuration to the point where it maxed out disk operations, which had a cascading negative effect on the availability of services within the cluster.
We will increase the depth and thoroughness of our etcd monitoring, in particular to identify anomalous load on disk operations before it degrades the cluster.
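As a sketch of the kind of alert this implies, assuming a Prometheus-based setup: etcd exports standard disk-latency histograms, and a sustained rise in WAL fsync latency is an early sign of disk saturation. The threshold and durations below are illustrative, not our production values.

```yaml
groups:
  - name: etcd-disk
    rules:
      - alert: EtcdSlowWALFsync
        # p99 WAL fsync latency above 10ms for 10 minutes suggests the
        # disk backing etcd is saturating. Threshold is an assumption.
        expr: histogram_quantile(0.99,
                rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "etcd WAL fsync p99 latency is high (possible disk saturation)"
```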
We have changed our resource configuration to ensure that etcd and our cluster masters have adequate headroom for future scale; we are continuing to review this configuration to minimise the risk of future issues.
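In concrete terms, giving etcd headroom means reserving resources for it explicitly so it never competes with workloads for CPU or memory. The fragment below is an illustrative kubeadm-style static-pod excerpt; the values are assumptions, not our actual sizing.

```yaml
# Illustrative etcd static-pod fragment. Equal requests and limits give the
# pod Guaranteed QoS, so it is last in line for eviction under node pressure.
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
    - name: etcd
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          cpu: "2"
          memory: 8Gi
```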