Summary
On July 10th our customers experienced increased latency for Proof-of-Address (POA) tasks.
Root Causes
A routine infrastructure change was introduced on July 10th at 10:41 UTC. It was slowly rolled out to all our systems over the course of the day.
The change was incompatible with an internal service that serves Proof-of-Address document images. As a result, image loading would sporadically fail for our analysts performing manual Proof-of-Address task verification.
Because of the slow rollout, it took some time before the impact on our analysts became noticeable, which also made the root cause harder to identify.
Timeline (UTC)
- 10:41 - The infrastructure change is published and the progressive rollout starts
- 18:20 - 67% of our systems are updated. Growing and significant analyst impact leads to an incident being opened
- 18:22 - Investigation starts. Some analysts are still able to process some tasks, but the backlog slowly increases
- 19:42 - The root cause is identified and a fix (rollback) is in preparation
- 20:07 - The rollback is ready and the release is being prepared
- 20:30 - The fix is fully rolled out and analysts are no longer experiencing problems
- 20:50 - After 20 minutes of stability with no further errors reported, we close the incident. Analysts are able to process the backlog quickly.
Remedies
- The incompatible application will be updated prior to reapplying the infrastructure upgrade
- A new monitor and alert have been added to the internal service that serves POA document images
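
For illustration only, a monitor of this kind could track the share of failed image loads over a sliding window of recent requests and raise an alert once that share crosses a threshold. The sketch below is a minimal example under assumed values: the class name, window size, threshold, and minimum sample count are all hypothetical and do not describe the monitor that was actually deployed.

```python
from collections import deque


class FailureRateMonitor:
    """Hypothetical sliding-window monitor for image-load failures."""

    def __init__(self, window: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        # Keep only the most recent `window` outcomes (True = failed load).
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold      # assumed alerting threshold
        self.min_samples = min_samples  # avoid alerting on too little data

    def record(self, failed: bool) -> None:
        """Record the outcome of one image-load request."""
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        """Fraction of failed loads among the recorded outcomes."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        """Alert only with enough samples and a rate above the threshold."""
        return (len(self.outcomes) >= self.min_samples
                and self.failure_rate() > self.threshold)
```

A rate-based check like this is one way to surface sporadic failures during a progressive rollout, when only part of the fleet is affected and individual errors are easy to miss.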