Quick Summary
On Thursday 15 April 2021 at 07:12 am AEST (Wednesday 14 April 2021 at 21:12:00 UTC), we received ping alerts for the Switzerland Production P4 website indicating that the page was not responding in a timely fashion. At about 07:23 am we also received a Slack message from New Zealand reporting that their site was showing as offline.
Mitigation
Upon investigation, we found that the Swiss website had been inundated with traffic; however, the traffic had already stopped by the time we were alerted to the issue. Cloudflare reported stopping a DDoS (distributed denial-of-service) attack at the same time. The impact on the Swiss site spilled over into the cluster, affecting most of the staging sites and about 4 of the production sites for about 4 minutes, until the attack stopped of its own accord.
Investigation
A follow-up investigation was conducted to determine the cause of the outage and to identify longer-term mitigation strategies.
The attack was a series of requests to the root of the site, each with a random character string. Although the pods scaled up at the time, they could not keep pace with the rate of incoming requests. At one point the load-balancing gateway also became overwhelmed, consuming four times the normal load requirements of the production cluster. This further impacted the remaining sites in the cluster for a short time.
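As a rough illustration of the request pattern described above, the following Python sketch summarises a list of request targets by path and flags a path that receives many requests, almost all with a unique string attached. It is not the tooling used during the incident. The assumption that the random string appeared as a query string (for example `/?a8Fk2`), the function names, and the threshold values are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_targets(targets):
    """Count requests and distinct query strings per path.

    `targets` is an iterable of request targets such as "/?a8Fk2"
    (assumed format; the real log format is not given in this report).
    Returns {path: (request_count, distinct_query_count)}.
    """
    requests = Counter()
    queries = {}
    for target in targets:
        parts = urlsplit(target)
        path = parts.path or "/"
        requests[path] += 1
        queries.setdefault(path, set()).add(parts.query)
    return {path: (requests[path], len(queries[path])) for path in requests}


def looks_like_random_string_flood(count, distinct, min_requests=1000, ratio=0.9):
    """Hypothetical heuristic: many requests, nearly all with a unique query."""
    return count >= min_requests and distinct / count >= ratio
```

Run against an access log export, a path such as `/` with a request count and a distinct-query count that are both very high would match the pattern described here, whereas ordinary traffic to a page tends to repeat the same target many times.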
Actions
- We essentially rely on Cloudflare to provide protection from these kinds of attacks, which to a great extent it did. Had this protection not been in place, the fallout could have been a lot worse.
- As this was our first DDoS attack, a review of what constitutes an acceptable outage will need to be conducted. Systems can ultimately be designed to counter such attacks; however, the cost of doing so is often prohibitive.