Apology for Rate Limiter outage & fixes implemented
Yesterday at 10:49 a.m., we had an issue that affected 2.7% of our customers. For these customers, Assemblies were falsely rejected with a RATE_LIMIT_REACHED error for a period of two hours.
The issue was caused by a minor improvement (or so we thought) to our rate limiter. Normally, our test suite would have caught any invalid behavior before it reached production, but this was an edge case that our tests did not cover, so the problem made it into production after all.
Needless to say, we are very sorry about this.
As soon as we were notified about this, we took the following steps to ensure this can never happen again:
- We have raised the account-based rate limits of all affected accounts to 50,000 as a hotfix to prevent further Assemblies from being rejected.
- We have fixed the bug in the rate limiter and deployed the fix within thirty minutes of the report.
- We have improved our automated tests to ensure that invalid behavior cannot pass into production in the future.
- We have improved our monitoring, so that we will receive an automated phone call as soon as two or more customers get rate limited at the same time. In the unlikely event that this ever happens again, we will know about it before anyone else does and be able to address it in the early stages.
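To make the monitoring rule above concrete, here is a minimal sketch of how such a check could work. It is not our production code: the class name `RateLimitAlert`, the 60-second window used to approximate "at the same time", and the customer IDs are all assumptions made for illustration. It also assumes rejections are recorded in chronological order.

```python
from collections import deque


class RateLimitAlert:
    """Hypothetical check: trips when enough distinct customers are
    rate limited within a short time window."""

    def __init__(self, min_customers=2, window_seconds=60.0):
        # Both defaults are assumptions; "at the same time" is
        # approximated here as "within the same 60 seconds".
        self.min_customers = min_customers
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, customer_id), oldest first

    def record(self, customer_id, timestamp):
        """Record one rejection; return True if the threshold is met."""
        self._events.append((timestamp, customer_id))
        # Drop rejections that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        # Count distinct customers, not rejections, so one busy
        # customer hitting their own limit does not page anyone.
        distinct = {cid for _, cid in self._events}
        return len(distinct) >= self.min_customers


alert = RateLimitAlert()
first = alert.record("customer-a", timestamp=0.0)
repeat = alert.record("customer-a", timestamp=5.0)
second = alert.record("customer-b", timestamp=10.0)
later = alert.record("customer-c", timestamp=500.0)
```

In this sketch only the third call returns True, because it is the first moment two different customers are rate limited inside the same window. A real setup would hand that result to a paging service rather than return a boolean.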
We have sent a personal email to all affected customers, informing them about this issue. We have also added a 25% discount to the upcoming invoices for all affected customers.
We are very sorry that this happened, and we believe we have taken the necessary steps to ensure it never happens again.