SDK Crashes
Incident Report for Onfido
Postmortem

Summary

On 18-02-2020, from 16:55 to 17:58 UTC, approximately 90% of applicants using mobile applications built with the Onfido Android SDK were unable to complete the media upload flow. The other SDKs (iOS and Web) were not affected.
A small percentage of the affected users continued to see errors until 20-02-2020 12:20 UTC because a cache on their devices contained an invalid value.

Timeline

  • 18-02-2020 16:55 UTC - Onfido deployed a change to the analytics endpoint used by the mobile SDKs, causing the Android SDK to crash when it received the new response.
  • 18-02-2020 17:53 UTC - Onfido identified the issue.
  • 18-02-2020 17:58 UTC - Onfido reverted the change.
  • 20-02-2020 12:20 UTC - Onfido implemented a fix to stop errors for a small subset of users who had an invalid value in their cache and were therefore unable to complete the flow.

Root Cause

A change to a public Onfido endpoint caused it to return values in a format that the Android SDK client could not parse.
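
As an illustration of this failure mode, the sketch below shows one general way a client can tolerate an unexpected response shape by falling back to a default instead of crashing. It is not the Onfido SDK's code: the report does not describe the response schema, so the `enabled` field, the `AnalyticsConfig` type, and the function name are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticsConfig:
    # Hypothetical field; the real response schema is not given in this report.
    enabled: bool = False


DEFAULT_CONFIG = AnalyticsConfig()


def parse_analytics_response(body: str) -> AnalyticsConfig:
    """Parse the response body, returning DEFAULT_CONFIG for any unexpected shape."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Not valid JSON (json.JSONDecodeError is a ValueError subclass).
        return DEFAULT_CONFIG
    if not isinstance(payload, dict):
        # Valid JSON, but not the object the client expects.
        return DEFAULT_CONFIG
    enabled = payload.get("enabled")
    if not isinstance(enabled, bool):
        # Field missing or of the wrong type.
        return DEFAULT_CONFIG
    return AnalyticsConfig(enabled=enabled)
```

With this pattern, a format change degrades the feature that depends on the response rather than stopping the whole flow.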

Contributing Factors

We didn’t catch this error in our testing pipeline because of local device caching. Our tests for this change used the value cached by previous test runs, so they were not affected by the error that the backend change introduced.
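
The masking effect described above can be reproduced in a small simulation. This is a sketch under assumptions, not our test pipeline: the cache key, the "digits only" validity rule, and the function names are hypothetical stand-ins for the real SDK behaviour.

```python
from typing import Callable, Dict


def load_value(cache: Dict[str, str], fetch: Callable[[], str]) -> str:
    """Return the cached value if present; otherwise fetch, validate and cache it."""
    if "analytics" in cache:
        return cache["analytics"]
    value = fetch()
    if not value.isdigit():  # hypothetical validity rule
        raise ValueError(f"unparseable response: {value!r}")
    cache["analytics"] = value
    return value


def run_flow(cache_state: Dict[str, str], fetch: Callable[[], str]) -> bool:
    """Run the flow against a copy of the given cache state; True if it completes."""
    try:
        load_value(dict(cache_state), fetch)
        return True
    except ValueError:
        return False


# Every backend change is exercised against both a cold and a warm cache.
CACHE_STATES: Dict[str, Dict[str, str]] = {
    "cold": {},
    "warm": {"analytics": "42"},
}


def check_backend_change(fetch: Callable[[], str]) -> Dict[str, bool]:
    """Return, per cache state, whether the flow completes with this backend."""
    return {name: run_flow(state, fetch) for name, state in CACHE_STATES.items()}
```

Calling `check_backend_change` with a backend that returns an unparseable value passes for the warm cache and fails for the cold one, which is why a test run that only ever sees a warm cache cannot detect this kind of change.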

Remedies

  1. Once the errors in the Android SDK were detected, the API change was reverted at 18-02-2020 17:58 UTC.
  2. To resolve the errors for the small percentage of users with an invalid cached value, a fix that refreshes the cache with a valid value was deployed at 20-02-2020 12:20 UTC.
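
This report does not describe how the cache refresh was delivered. As a generic illustration only, one common pattern is to validate a cached value on read and replace it when it is invalid. In the sketch below, the validity rule, the cache key, and the function names are hypothetical.

```python
from typing import Callable, Dict


def is_valid(value: str) -> bool:
    # Hypothetical validity rule standing in for the client's real parser.
    return value.isdigit()


def read_through(cache: Dict[str, str], key: str, fetch: Callable[[], str]) -> str:
    """Return a valid cached value, refreshing the entry if it is missing or invalid."""
    cached = cache.get(key)
    if cached is not None and is_valid(cached):
        return cached
    # Drop the missing-or-invalid entry so it cannot be served again.
    cache.pop(key, None)
    fresh = fetch()
    if not is_valid(fresh):
        raise ValueError(f"backend returned an invalid value: {fresh!r}")
    cache[key] = fresh
    return fresh
```

A device holding an invalid entry recovers on its next read, provided the backend is returning valid values again.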

Future improvements

  1. Extend our test flow to cover both applications using the Android SDK that have previously completed an applicant media upload flow and applications with no previous data. This removes the risk of cached values causing tests to run against an incorrect system state.
  2. Improve monitoring of the media upload flow by segmenting error-rate and traffic monitors by individual SDK.

    1. None of our automatic alarms triggered immediately after the issue started, because both the iOS and Web SDKs were working as expected. Because aggregate volume across all platforms was high, the drop in traffic caused by the Android SDK problem was too small to trigger the existing alarms.
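
The difference between an aggregate alarm and per-SDK alarms can be sketched as follows. The counts and the threshold are illustrative values chosen for the example, not measurements from this incident, and the function names are hypothetical. The sketch uses error rate; the same segmentation applies to traffic monitors.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int]  # (failed uploads, total uploads) in one time window


def error_rate(failed: int, total: int) -> float:
    return failed / total if total else 0.0


def aggregate_breaches(per_sdk: Dict[str, Counts], threshold: float) -> bool:
    """Single alarm over all SDKs combined."""
    failed = sum(f for f, _ in per_sdk.values())
    total = sum(t for _, t in per_sdk.values())
    return error_rate(failed, total) > threshold


def breaching_sdks(per_sdk: Dict[str, Counts], threshold: float) -> List[str]:
    """One alarm per SDK; returns the SDKs whose error rate exceeds the threshold."""
    return sorted(
        sdk for sdk, (f, t) in per_sdk.items() if error_rate(f, t) > threshold
    )


# Illustrative window: one SDK failing heavily, the other two healthy.
window = {"android": (90, 100), "ios": (5, 500), "web": (4, 400)}
THRESHOLD = 0.25  # illustrative alarm threshold

print(aggregate_breaches(window, THRESHOLD))  # False: 99 / 1000 is below 0.25
print(breaching_sdks(window, THRESHOLD))      # ['android']
```

In this window the combined error rate stays under the threshold even though one SDK is failing for most of its users, so only the segmented check raises an alarm.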
Posted Mar 30, 2020 - 14:53 UTC

Resolved
Our monitoring indicates that the SDK crashes have stopped and that the issue is now resolved. We apologise for the disruption caused by this issue.
Posted Feb 18, 2020 - 19:30 UTC
Monitoring
A fix has been implemented and we are monitoring the situation. The issue only affected the Android SDK.

We apologise for any inconvenience this may have caused.

Next update: 19:30
Posted Feb 18, 2020 - 18:40 UTC
Investigating
We are currently investigating an issue that is causing crashes in the Onfido Android SDK. This is preventing users on Android devices from completing the flow and starting a check.

Next update at: 18:45
Posted Feb 18, 2020 - 18:08 UTC
This incident affected: Europe (onfido.com) (API) and USA (us.onfido.com) (API).