On Monday, 11 September 2017, during a scheduled end-of-sprint release, error rates in two of our internal applications rose, resulting in slower-than-usual check completion times between 12:21 BST and 12:37 BST.
In keeping with our commitment to transparency, the following is a report of the issue we encountered, the factors that contributed to that issue, and ultimately, what we’ve done and plan to do to ensure we don't find ourselves in this situation again.
On Monday, 11 September at 12:21 BST, the engineering team were alerted to elevated 500 errors during applicant check processing. We began our investigation immediately. Our error logs indicated that the new instances being brought online were having intermittent problems marshalling cached user authorisation data into application domain objects.
As this only affected new servers, not all checks were impacted; however, given that the error was in a critical system component, at 12:30 BST we decided that the best option was to roll back to the previous version. By 12:37 BST the rollback was complete and our applications were once again responding and functioning normally.
We make extensive use of authorisation throughout check processing in order to restrict data and service access on a need-to-know and need-to-use basis.
Claims and permissions are fetched from a separate service by internal services upon a request to access a resource. We use short-term Rails caching to reduce the load on this permissions service. The caching system serialises the Ruby objects and stores them in our cache cluster. Subsequent requests for this data first attempt to rehydrate the cached, serialised data, if present, back into Ruby objects.
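To make the mechanism concrete, here is a minimal sketch of that caching path in plain Ruby. Rails.cache performs the Marshal round-trip transparently; the names here (`PermissionSet`, `cached_permissions`, the `CACHE` hash standing in for the cache cluster) are illustrative, not our actual code.

```ruby
# Hypothetical stand-in for a cached permissions lookup.
PermissionSet = Struct.new(:user_id, :claims)

CACHE = {} # stand-in for the shared cache cluster

def cached_permissions(user_id)
  key = "permissions/#{user_id}"
  if (raw = CACHE[key])
    # Subsequent requests rehydrate the serialised bytes back into Ruby objects.
    Marshal.load(raw)
  else
    # In reality this would call the separate permissions service.
    perms = PermissionSet.new(user_id, ["checks:read"])
    CACHE[key] = Marshal.dump(perms) # serialise before storing
    perms
  end
end
```

The important detail is that what sits in the cache is a byte string produced by `Marshal.dump`, which encodes the names of the Ruby classes involved.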
The issue arose because the new servers ran a newer version of Ruby that used new classes to represent a subset of the permissions data. New instances therefore cached permissions data serialised with those new classes. When a subsequent request was routed to an old server, it tried to rehydrate a serialisation containing classes it did not know about (that had not been loaded), and failed.
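This failure mode can be reproduced in a few lines. The class below (`NewPermission`) is hypothetical, standing in for the classes the newer Ruby version introduced; removing the constant simulates an old server that never loaded it.

```ruby
# An object serialised with a class the reading process has not loaded
# cannot be rehydrated.
class NewPermission
  attr_reader :claim
  def initialize(claim)
    @claim = claim
  end
end

# Written to the cache by a "new" server:
blob = Marshal.dump(NewPermission.new("checks:read"))

# Simulate an "old" server on which the class was never defined:
Object.send(:remove_const, :NewPermission)

begin
  Marshal.load(blob) # the rehydration attempt on the old server
rescue ArgumentError => e
  puts e.message # undefined class/module NewPermission
end
```

`Marshal.load` resolves class names at load time, so the old process raises an `ArgumentError` rather than returning a degraded object, which is why the errors surfaced as 500s.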
We did not spot this in our staging environment for two reasons. Firstly, our staging environment had a smaller cache TTL window, which eased some automated testing scenarios. Secondly, our automated regression tests, as well as exploratory testing, are initiated on completion of a deploy: this meant that the incompatibility between the old and new cache values did not present itself in our staging tests.
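One common guard against this class of failure, offered here as a general technique rather than as the fix we applied, is to namespace cache keys by release, so entries written by one version of the code are never read by another. The constant and helper below are illustrative.

```ruby
# Hypothetical version-namespaced cache key. Entries written under a
# different release resolve to a different key, so an old server never
# attempts to rehydrate bytes serialised by a new one.
RELEASE_VERSION = "2017-09-11.1"

def permissions_cache_key(user_id)
  "permissions/v#{RELEASE_VERSION}/#{user_id}"
end
```

The trade-off is a cold cache on each deploy, which briefly increases load on the permissions service.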
Our immediate course of action was to roll back to the previous production deploy. This gave us time to investigate the issue and apply a fix before re-deploying the new release. To avoid this happening in the future, we intend to: