Elevated error rates affecting check processing
Incident Report for Onfido
Postmortem

Summary

On Monday, 11th September 2017, during a scheduled end-of-sprint release, error rates in two of our internal applications increased, resulting in slower-than-usual check completion times between 12:21 BST and 12:37 BST.

In keeping with our commitment to transparency, the following is a report of the issue we encountered, the factors that contributed to that issue, and ultimately, what we’ve done and plan to do to ensure we don't find ourselves in this situation again.

Timeline

On Monday, 11th September at 12:21 BST, the engineering team were alerted to elevated 500 errors during applicant check processing. We began our investigation immediately. Our error logs indicated that the new instances being brought online were having intermittent problems marshalling cached user authorisation data into application domain objects.

As this only affected new servers, not all checks were impacted; however, given that the error was in a critical system component, at 12:30 BST we decided that the best option was to roll back to the previous version. At 12:37 BST the rollback was complete and our applications resumed responding and functioning normally.

Contributing Factors

We make extensive use of authorisation throughout check processing in order to restrict data and service access on a need-to-know and need-to-use basis.

When an internal service receives a request to access a resource, it fetches the relevant claims and permissions from a separate permissions service. We use short-term Rails caching to reduce the load on this permissions service. The caching system serialises the Ruby permission objects and stores them in our cache cluster; subsequent requests for this data first try to marshal the cached, serialised data, if available, back into Ruby objects.
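As an illustration, here is a minimal sketch of that read-through caching pattern. The PermissionsClient name, the cache key format and the five-minute TTL are assumptions for the example, not our actual implementation.

    # Hypothetical sketch of the short-term permissions caching described above.
    # PermissionsClient, the cache key and the 5-minute TTL are illustrative only.
    class CachedPermissions
      CACHE_TTL = 5.minutes

      def self.for_applicant(applicant_id)
        # Rails.cache serialises the block's return value (via Marshal for the
        # default cache stores) and writes the bytes to the cache cluster; a
        # cache hit deserialises those bytes back into Ruby objects.
        Rails.cache.fetch("permissions/applicant/#{applicant_id}", expires_in: CACHE_TTL) do
          PermissionsClient.fetch(applicant_id) # plain Ruby permission objects
        end
      end
    end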

The issue arose because the new servers ran a newer version of Ruby, which used new classes to generate a subset of the permissions data. New instances therefore cached permissions data containing those new objects. If a subsequent request was routed to an old server, that server would try to rehydrate a serialisation containing classes it did not know about (classes that had not been loaded), and the rehydration would fail.
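The failure mode can be reproduced in isolation with plain Marshal. The class name below is made up for illustration and is not one of the classes involved in the incident.

    # Illustrative reproduction: a payload dumped by a process that knows about
    # a class cannot be loaded by a process that has never defined that class.
    class NewPermissionClaim
      def initialize(scope)
        @scope = scope
      end
    end

    payload = Marshal.dump(NewPermissionClaim.new("reports:read"))

    # Simulate the "old" server, where the class has never been loaded:
    Object.send(:remove_const, :NewPermissionClaim)

    begin
      Marshal.load(payload)
    rescue ArgumentError => e
      puts e.message  # => undefined class/module NewPermissionClaim
    end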

We did not spot this in our staging environment for two reasons. Firstly, our staging environment uses a shorter cache TTL, which eases some automated testing scenarios. Secondly, our automated regression tests, as well as exploratory testing, are initiated on completion of a deploy: this meant that the incompatibility between the old and new cache values did not present itself in our staging tests.

Remediations

Our immediate course of action was to roll back to the previous production deploy. This gave us more time to investigate the issue and to apply a fix before we re-deployed the new release. To avoid this happening in the future, we intend to:

  1. Improve the way cache failures are handled and ensure these fail gracefully (see the sketch after this list)
  2. Investigate running stability tests during staging deployments to more closely mimic and validate the system’s state under change
  3. Ensure this scenario is included in our test plan checklist for future Ruby/Rails upgrades
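
For (1), the rough direction is to treat an unreadable cache entry as a cache miss rather than an error. The sketch below reuses the hypothetical names from the earlier example and assumes the deserialisation error propagates out of Rails.cache.read, as it did during the incident; the exact behaviour depends on the cache store and Rails version, so this is a sketch rather than the final implementation.

    # Hypothetical sketch of remediation (1): fail gracefully when a cached
    # entry cannot be rehydrated, falling back to the permissions service
    # instead of returning a 500. Names and key format are illustrative.
    def permissions_for(applicant_id)
      key = "permissions/applicant/#{applicant_id}"

      begin
        cached = Rails.cache.read(key)
        return cached unless cached.nil?
      rescue ArgumentError, TypeError => e
        # Marshal raises "undefined class/module ..." when the payload references
        # classes this process has not loaded; treat it as a cache miss.
        Rails.logger.warn("Dropping incompatible cache entry #{key}: #{e.message}")
        Rails.cache.delete(key)
      end

      fresh = PermissionsClient.fetch(applicant_id)
      Rails.cache.write(key, fresh, expires_in: 5.minutes)
      fresh
    end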
Posted Sep 20, 2017 - 14:14 UTC

Resolved
Check processing has returned to a stable state and shown no further issues during monitoring.

A post-mortem will follow once we have completed our investigation.
Posted Sep 11, 2017 - 14:34 UTC
Monitoring
We're no longer experiencing elevated error rates and are continuing to monitor as processing speed returns to normal.
Posted Sep 11, 2017 - 11:37 UTC
Investigating
We're currently experiencing elevated error rates impacting check processing speed for document verification.

Our team is investigating and will update at 12:45 BST.
Posted Sep 11, 2017 - 11:21 UTC