API and Dashboard Availability
Incident Report for Onfido
Postmortem

Summary

On Wednesday, March 22nd UTC, our Redis cluster experienced a higher-than-normal rate of key evictions, resulting in approximately four hours of service instability between 07:42 and 11:44 UTC for the Onfido dashboard as well as our internally-used applications.

In keeping with our core value of transparency, the following is a report of the issues encountered, the factors that contributed to them, and what we have done and will do to ensure we don't find ourselves in this situation again.

Timeline

On Wednesday, March 22nd UTC at 07:42, we started receiving notifications and error messages regarding reduced memory availability and high rates of key evictions for our Redis cluster. This affected the persistence of session tokens and the processing of some backend queue jobs.

Around the same time we began receiving internal support tickets: users of our internal applications found that they could not access them, or, when they could, were sometimes logged out of their sessions for no apparent reason.

Our incident response team began gathering symptoms, identifying causes and notifying the relevant internal and external parties. By 10:00 UTC we had identified the root cause: one of our systems had gradually generated enough Redis keys over the course of several months to take up 77% of our Redis server's capacity, which had by then led our services to become unstable.

At 11:40 UTC, after evaluating that the impact of doing so would be negligible, we removed these keys. Once this was done, our applications began responding and functioning normally.

Contributing Factors

We use Resque queues in a number of areas of our infrastructure to queue up small portions of work, or jobs, to be handled at a later time. This decouples the applications that request these jobs from their execution, so they can carry on with what they do without being concerned with whether the jobs complete quickly, or at all. One feature of Resque queues is that if an error is encountered during the execution of a job, a failure job is created and added to a failure queue, so that the failed job can be acted on (say, with a retry) at a later time.
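To make that mechanism concrete, here is a minimal sketch of the failure-queue pattern. It is written in Python against a generic Redis-backed queue rather than Resque itself, and the key names and the run_job helper are illustrative assumptions, not our production code.

    import json
    import traceback

    import redis  # assumes the redis-py client is available


    def run_job(job: dict) -> None:
        # Placeholder for real job execution; raises to simulate a failing job.
        raise RuntimeError("simulated job failure")


    def work_off(r: redis.Redis, queue_name: str) -> None:
        """Pop one job from a Redis-backed queue and run it; on error, keep a
        record of the failure in a dedicated 'failed' list (Resque-style)."""
        raw = r.lpop(f"queue:{queue_name}")
        if raw is None:
            return  # nothing to do
        job = json.loads(raw)
        try:
            run_job(job)
        except Exception:
            # The failed payload is kept until something retries or removes it,
            # which is how a failure queue can grow without bound.
            failure = {
                "queue": queue_name,
                "payload": job,
                "error": traceback.format_exc(),
            }
            r.rpush("failed", json.dumps(failure))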

On top of this, queues are commonly made to be persistent: queued jobs are stored so that if the process running the queue fails and exits unexpectedly, it can be restarted and the queued jobs will still be there waiting to be handled.

Together, these two features led to a scenario in which a very gradual increase in the number of jobs held in the failure queue caused the Redis cluster to reach the limits of its capacity.
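For anyone diagnosing a similar build-up, a quick check compares the length of the failure queue against overall Redis memory use. The sketch below assumes the redis-py client and Resque's conventional 'resque:failed' key; adjust the names to your own setup.

    import redis  # assumes the redis-py client

    r = redis.Redis()

    # 'resque:failed' is Resque's conventional failure list; an assumption here.
    failed_jobs = r.llen("resque:failed")

    mem = r.info("memory")
    used_mib = mem["used_memory"] / 1024 ** 2
    peak_mib = mem["used_memory_peak"] / 1024 ** 2

    print(f"failed jobs queued:  {failed_jobs}")
    print(f"redis memory in use: {used_mib:.1f} MiB (peak {peak_mib:.1f} MiB)")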

Remediations

Our immediate course of action was to remove the contents of the failure queue. Further to this, we took measures to reduce the number of ways in which failure jobs are generated by our code. We also improved the monitoring and alerting around the Redis cluster to ensure we are alerted early about rising memory usage.
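As an illustration of the kind of check involved (a simplified sketch, not our actual monitoring stack; the threshold and the notify hook are placeholders), an alert on memory usage and eviction counts might look like:

    import redis  # assumes the redis-py client

    # Illustrative threshold: alert well before Redis starts evicting keys.
    MEMORY_ALERT_RATIO = 0.75


    def notify(message: str) -> None:
        # Placeholder for a real pager/alerting integration.
        print("ALERT:", message)


    r = redis.Redis()
    info = r.info()  # includes the 'memory' and 'stats' sections by default

    used = info["used_memory"]
    maxmemory = info.get("maxmemory", 0)
    evicted = info["evicted_keys"]

    if maxmemory and used / maxmemory > MEMORY_ALERT_RATIO:
        notify(f"Redis memory at {used / maxmemory:.0%} of maxmemory")
    if evicted > 0:
        notify(f"Redis has evicted {evicted} keys since last restart")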

Over the next few weeks we intend to review our Redis configuration in order to devise ways of limiting and containing the impact that Redis issues have on dependent services.
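One of the levers under review is Redis's own memory configuration. The snippet below is purely illustrative (the values shown are placeholders, not our chosen settings) of how the memory ceiling and eviction policy can be inspected and adjusted:

    import redis  # assumes the redis-py client

    r = redis.Redis()

    # Inspect the current memory ceiling and eviction policy.
    print(r.config_get("maxmemory"))
    print(r.config_get("maxmemory-policy"))

    # Placeholder values only: cap memory usage and choose an eviction
    # policy appropriate to the data stored.
    r.config_set("maxmemory", "2gb")
    r.config_set("maxmemory-policy", "volatile-lru")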

Posted Mar 30, 2017 - 13:39 UTC

Resolved
Our services remain in normal operation after our monitoring period.

We will follow up with a post-mortem once our investigation has been completed.
Posted Mar 22, 2017 - 14:31 UTC
Monitoring
Normal service has been restored to the API, dashboard and applicant form.

We are continuing to monitor the underlying cause.
Posted Mar 22, 2017 - 12:23 UTC
Identified
We're currently experiencing issues with API, dashboard and applicant form availability.

The issue has been identified and we're currently working on a resolution.
Posted Mar 22, 2017 - 11:22 UTC