Degraded document and facial similarity processing times

Incident Report for Onfido

Postmortem

Summary

On Monday 5 February 2018, we experienced degraded document and facial similarity processing times, which resulted in slower-than-usual check completion times between 8pm and 11pm GMT.

Timeline

At 8pm the on-call team was alerted to an increase in DNS resolution error rates across multiple services and started investigating.

At 10:30pm we reduced the load on our Kubernetes cluster, which reduced the number of errors but did not fully mitigate the issue. We continued our investigation and, at around 10:45pm, identified that networking within the cluster was degraded due to increased error rates in etcd.

At 10:55pm we restarted the components within our cluster that manage etcd and networking, and the issue was fully remediated by 11pm.

Contributing Factors

Over the past few weeks, we migrated several microservices to our Kubernetes cluster, and these have scaled in line with demand. The resulting increase in the number of containers and nodes stressed our etcd configuration to the point where it saturated disk operations, which had a cascading negative effect on the availability of services within the cluster.

Remediations

We will increase the depth and thoroughness of etcd monitoring, particularly to identify anomalous load on disk operations.
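
As an illustration of the kind of disk-level check this monitoring could include, here is a minimal sketch in Python. It is not our production tooling. It assumes the text of etcd's Prometheus-format /metrics endpoint has already been fetched, reads the etcd_disk_wal_fsync_duration_seconds histogram, and flags the member when the bucket containing the 99th percentile exceeds a threshold. The function names and the 10 ms default threshold are illustrative assumptions and should be tuned per cluster.

```python
import math
from typing import List, Tuple

# Histogram exposed by etcd on its /metrics endpoint (Prometheus text format).
METRIC = "etcd_disk_wal_fsync_duration_seconds"


def parse_buckets(metrics_text: str, metric: str = METRIC) -> List[Tuple[float, float]]:
    """Return (upper_bound_seconds, cumulative_count) pairs sorted by bound.

    Assumes each bucket line carries only the `le` label, for example:
    etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 900
    """
    prefix = metric + '_bucket{le="'
    buckets = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip comments and unrelated metrics
        bound_text, _, rest = line[len(prefix):].partition('"}')
        buckets.append((float(bound_text), float(rest.split()[0])))
    return sorted(buckets)


def quantile_upper_bound(buckets: List[Tuple[float, float]], q: float) -> float:
    """Upper bound of the first bucket whose cumulative count covers quantile q."""
    if not buckets:
        raise ValueError("no histogram buckets found")
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return 0.0
    target = q * total
    for bound, count in buckets:
        if count >= target:
            return bound
    return math.inf


def fsync_is_slow(metrics_text: str, threshold_seconds: float = 0.010, q: float = 0.99) -> bool:
    """True when the bucket holding quantile q exceeds the (assumed) threshold."""
    return quantile_upper_bound(parse_buckets(metrics_text), q) > threshold_seconds
```

Because histogram buckets only give an upper bound for the quantile, this check errs on the side of alerting early; a monitoring system with native histogram quantile support would give a finer estimate.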

We have changed our resource configuration to ensure that etcd and our cluster masters have adequate headroom for future scale; we are continuing to review this configuration to minimise the risk of future issues.

Posted Feb 07, 2018 - 15:39 UTC

Resolved

This incident has been resolved and our monitoring has observed no further issues.

A post-mortem will follow after our investigation is complete.

Posted Feb 06, 2018 - 08:02 UTC

Monitoring

We've resolved the underlying issue and processing times are returning to normal for the affected checks. We will continue to monitor new requests.

Posted Feb 05, 2018 - 23:13 UTC

Investigating

We're currently experiencing issues that are negatively impacting processing times for document and facial similarity checks.

Investigation is pending. Next update 23:30

Posted Feb 05, 2018 - 22:35 UTC