Quick Summary
On Sunday 9 August (20:00 AEST, 12:00 CEST), we received ping alerts for the GPI International P4 website indicating that the page was not responding in a timely fashion.
The only site affected by the outage was GP International; however, other sites have reported similar attacks.
Mitigation
When the alerts arrived, we checked the site in case something was broken that we could fix. Instead, we found the site under attack: something was sending approximately 45k search requests over about 20 minutes, causing intermittent 404 errors on the page.
Investigation
A follow-up investigation has been conducted to determine the cause of the outage and longer-term mitigation strategies. It appears that a bot originating from the Asia region was running search requests against the websites for malicious purposes.
The attack was a series of search requests carrying SQL injection strings. Because they contained a GET query string, they bypassed the full page cache and hit the rest of the application stack. Further, because they were ?s search queries, they primarily affected the CloudSQL instance and were barely mitigated by the Redis cache. Remarkably, ElasticSearch was not significantly affected by this attack, which deserves further investigation: why would searches containing these query strings not also affect ElasticSearch?
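The cache bypass described above can be illustrated with a minimal sketch. The rule shown (any non-empty query string skips the full page cache) is an assumption made for illustration, not a copy of the production configuration, and the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def bypasses_page_cache(url: str) -> bool:
    """Return True when a request would skip the full page cache.

    Assumed rule (not taken from the real configuration): any request
    with a non-empty query string goes straight to the backend.
    """
    return urlsplit(url).query != ""


# A plain page view can be served from the cache.
print(bypasses_page_cache("https://example.org/about/"))       # False
# A search request carries ?s=..., so it reaches PHP and the database.
print(bypasses_page_cache("https://example.org/?s=climate"))   # True
```

Under a rule like this, every distinct search string costs a full trip through the application stack, which is consistent with the load pattern seen during the attack.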
CloudSQL quickly reached 100% utilisation, and the very high response latency from CloudSQL caused the PHP pods to keep forking child processes to handle new requests, each of which then stalled while awaiting a response from CloudSQL. Services returned to normal when the attack ceased of its own accord. The autoscaling worked both to respond to load and to limit CPU impact, as it only scaled within its configured parameters. Additional pods would have exacerbated the problem by lengthening the request queue to CloudSQL and potentially increasing CPU load on-cluster.
It has since been discovered that this has happened on a number of sites. The intensity has not been enough to cause notifiable outages, but damage to the analytics and search results has been noted, and details can be found here.
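A rough back-of-the-envelope calculation shows why slow database responses translate into a pile-up of PHP processes. The request count and duration come from this report; the two latency values are hypothetical, chosen only to illustrate the effect, and the calculation assumes the requests arrived at a steady average rate.

```python
def in_flight_requests(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_s * latency_s


# From this report: roughly 45,000 requests over roughly 20 minutes.
attack_rate = 45_000 / (20 * 60)  # 37.5 requests per second on average

# Hypothetical latencies (assumptions, not measured values).
healthy = in_flight_requests(attack_rate, 0.25)  # 9.375 requests in flight
stalled = in_flight_requests(attack_rate, 8.0)   # 300 requests in flight

print(attack_rate, healthy, stalled)
```

The arrival rate is the same in both cases; only the latency changes. Each in-flight request holds a PHP child process, so a stalled database multiplies the number of processes needed without any increase in traffic.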
Actions
- Enable Cloudflare Rate Limiting and configure it to rate limit requests including query strings, to ensure that any particular query that exceeds certain thresholds can be blocked from further execution.
- Review the Wordfence settings to ensure this product is also configured to protect against these types of attacks.
- Review the search page implementation to ensure that queries are not disproportionately affecting MySQL when they could instead be directed to the dedicated ElasticSearch instance.
- Investigate implementing micro-caching in OpenResty for GET requests that contain a query_string.
- Load test these changes.
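The micro-caching action above can be sketched as follows. This is a minimal illustration in Python rather than an OpenResty configuration; the class name, the one-second TTL, and the key normalisation are all assumptions, not part of the existing stack.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit


class MicroCache:
    """Tiny TTL cache for GET responses, keyed by path plus query string."""

    def __init__(self, ttl_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s          # hypothetical TTL; tune under load testing
        self.clock = clock          # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def key(url: str) -> str:
        parts = urlsplit(url)
        # Sort parameters so ?a=1&b=2 and ?b=2&a=1 share one cache entry.
        pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
        return f"{parts.path}?{urlencode(pairs)}"

    def get(self, url: str, fetch: Callable[[str], str]) -> str:
        k = self.key(url)
        now = self.clock()
        hit = self._store.get(k)
        if hit is not None and now - hit[0] < self.ttl_s:
            return hit[1]           # fresh entry: backend is not touched
        body = fetch(url)           # miss or expired entry: call the backend
        self._store[k] = (now, body)
        return body


backend_calls = []

def backend(url: str) -> str:
    """Stand-in for the PHP/database stack (hypothetical)."""
    backend_calls.append(url)
    return f"results for {url}"

cache = MicroCache(ttl_s=1.0)
cache.get("/?s=forest", backend)
cache.get("/?s=forest", backend)    # repeated within the TTL
print(len(backend_calls))           # 1
```

A cache like this only helps when the same query repeats within the TTL. If every attack request carries a different string, each one is still a miss, which is why rate limiting remains a separate action.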