Addressing elevated queue times & ensuring reliability
Last night we experienced an outage that we promised to shed more light on. At 22.35 CET we noticed that queue times were elevated.
These were primarily caused by 20k outstanding /http/import jobs. Our autoscaler did not factor those in when scaling up, because compared to video encoding these jobs are very cheap and usually few in number.
It was clear that this queue wasn't going down fast enough, so we scaled up manually.
At around 23.00 we noticed the queue still wasn't shrinking very quickly, and it looked like some boxes were having problems coming online (we run preflight tests before taking a box into production, and some of those were timing out).
It turned out Redis was unreachable from many (but not all) machines, even though the service itself was still up & running.
We ran some tests and noticed what we can only explain as an EC2 LAN outage between our boxes:
# LAN chan -> Redis
root@chan:~# telnet redis.transloadit.com 6379
Trying 10.192.211.175...
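# (hangs, never connects)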
# LAN Redis -> chan (one of our drones)
root@redis2:# telnet chan.transloadit.com 80
Trying 10.241.127.187...
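# (hangs, never connects)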
# WAN chan -> Redis
root@chan:~# telnet 107.21.80.77 6379
Trying 107.21.80.77...
Connected to 107.21.80.77.
Escape character is '^]'.
# WAN Redis -> chan (one of our drones)
root@redis2:# telnet 50.17.94.169 80
Trying 50.17.94.169...
Connected to 50.17.94.169.
Escape character is '^]'.
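For reference, a check like this can also be scripted so it runs continuously instead of by hand. Below is a minimal sketch in TypeScript on Node.js (not our production code), probing the same two addresses as the telnet sessions above, with a timeout:

import * as net from 'net';

// Try to open a TCP connection; resolve to true if it succeeds within the timeout.
function probe(host: string, port: number, timeoutMs = 5000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const done = (ok: boolean) => { socket.destroy(); resolve(ok); };
    socket.setTimeout(timeoutMs, () => done(false));
    socket.once('connect', () => done(true));
    socket.once('error', () => done(false));
  });
}

// Probe Redis over both the private (LAN) and the public (WAN) address.
async function main() {
  console.log('LAN reachable:', await probe('10.192.211.175', 6379)); // hung during the outage
  console.log('WAN reachable:', await probe('107.21.80.77', 6379));   // connected fine
}

main();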
This explained why the /http/import queue didn't shrink much: jobs would error out and automatically get re-queued.
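Roughly speaking, that loop looks like this. It's a simplified sketch with made-up names (Job, importFile, an in-memory array standing in for the queue), not our actual worker code, but it shows why the queue stays the same size when every attempt fails:

type Job = { assemblyId: string; url: string; attempts: number };

// Simplified worker loop: take a job, try it, re-queue it on failure.
async function workQueue(queue: Job[], importFile: (url: string) => Promise<void>) {
  while (queue.length > 0) {
    const job = queue.shift() as Job;
    try {
      // during the outage these jobs errored out (the drone could not reach Redis over the LAN)
      await importFile(job.url);
    } catch (err) {
      job.attempts += 1;
      queue.push(job); // straight back onto the queue, so the queue never shrinks
    }
  }
}

// An importFile that always fails keeps the queue the same size forever,
// which is exactly what we were seeing with ~20k jobs:
// workQueue(queue, async () => { throw new Error('redis unreachable over LAN'); });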
At 23.10, we deployed a fix:
- Connect to Redis by its public IP
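The fix itself was little more than a configuration change. As a hedged sketch of the idea, here it is with the current node-redis client API standing in for our real client code (our setup differs in the details):

import { createClient } from 'redis';

// Before: connect via the internal hostname, which resolves to the private LAN IP (10.192.211.175)
// const client = createClient({ url: 'redis://redis.transloadit.com:6379' });

// After: connect via the public IP, which was still routable from every drone
const client = createClient({ url: 'redis://107.21.80.77:6379' });

client.on('error', (err) => console.error('redis error', err));

async function main() {
  await client.connect();
  console.log('connected to redis');
}

main();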
At 23.15 there were 18,115 /http/import jobs in the queue; 15 minutes later they were all processed. Still, to make sure we don't scale down while numbers like that are in the queue, we deployed another fix:
- Have the autoscaler factor in the queue size of /http/import (and of every bot except /file/filter, no matter how insignificant it may seem), as sketched below
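To give you an idea of what that change amounts to, here is a rough TypeScript sketch. The weights, capacity figure, and names (WEIGHTS, JOBS_PER_DRONE, dronesNeeded) are made up for illustration; the point is that every queue now contributes to the scaling decision, weighted by how expensive its jobs are:

// Illustrative per-job weights: encoding jobs are expensive, imports are cheap.
const WEIGHTS: Record<string, number> = {
  '/video/encode': 1.0,
  '/http/import': 0.05,
  '/file/filter': 0.0, // the one bot we still leave out of the count
};

const JOBS_PER_DRONE = 100; // made-up capacity figure

// Decide how many drones the current queues call for.
function dronesNeeded(queueSizes: Record<string, number>): number {
  let weighted = 0;
  for (const [bot, size] of Object.entries(queueSizes)) {
    weighted += size * (WEIGHTS[bot] ?? 0.05); // unknown bots count as cheap, but they count
  }
  return Math.ceil(weighted / JOBS_PER_DRONE);
}

// Before the fix, 20k /http/import jobs contributed nothing to this number.
// Now they do: dronesNeeded({ '/http/import': 20000 }) === 10
console.log(dronesNeeded({ '/video/encode': 300, '/http/import': 20000 }));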
Going forward we'll deploy more improvements:
- Cycle through LAN & WAN Redis IPs, so that if there are routing issues on one network we automatically switch to the other (see the sketch after this list)
- Improve logging so we spot errors like these at an earlier stage
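For the first item, the idea is a connect-with-fallback loop: try the LAN address, and if it doesn't answer within a timeout, move on to the WAN address. A minimal sketch, again with the current node-redis API standing in for our real client code:

import { createClient } from 'redis';

// Candidate addresses for the same Redis instance: private LAN IP first, public WAN IP second
// (the IPs from the telnet sessions above).
const REDIS_ADDRESSES = ['10.192.211.175', '107.21.80.77'];
const REDIS_PORT = 6379;

// Try each address in turn; whichever connects first wins.
async function connectWithFallback() {
  let lastError: unknown;
  for (const host of REDIS_ADDRESSES) {
    const client = createClient({
      url: `redis://${host}:${REDIS_PORT}`,
      socket: { connectTimeout: 5000 }, // don't hang forever like the LAN telnet did
    });
    client.on('error', () => {}); // failures surface via the rejected connect() below
    try {
      await client.connect();
      return client; // this network path works, keep using it
    } catch (err) {
      lastError = err;
      await client.disconnect().catch(() => {});
    }
  }
  throw lastError;
}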
We've asked Amazon to confirm this outage and we'll update you with more information when they do.
We're very sorry for the inconvenience this outage caused. As you can see, we take these matters seriously and are constantly working towards an ever more robust platform for our customers to build their businesses on.
If you have questions, feel free to ask me on Twitter or create a support ticket.