On Monday 12th of July all Planet 4 sites went down due to an unusual traffic load.
Forensic:
Ingress controller (Traefik) was not set to scale
Strengthen protection against automated queries in Cloudflare (set by Loïc)
Longer term action to be taken to prevent it to happen again :
- Process to switch from Traefik to Nginx as ingress controller is already planned
- this is pending some action on planification
- Set auto scaling on ingress controller
Post Incident Actions – Sprint 87
Incident description
P4 went down due to an unusual traffic load.
Ingress Control process (Traefik) runs out of memory due to load.
Expected behavior when fixed
P4 up and running.
Traffic load back to normal.
SLO
Not currently applicable for this incident
Communication
Audience : IT-Announcement, web-folks
Last communication : 12 July 2021 – 12:40 UTC
Issue resolved
Minutes
12 July 2021 – 9:20 am UTC
Communication : Currently experiencing problems with our P4 cluster, we are working to resolve this issue ASAP. This affects all our P4 sites. We are upping resources now to our Traefik deployment to sort this out.
12 July 2021 – 9:30 UTC
Action : upped the pods running the traefik service from 3 to 5, will change the max RAM per pod to 1GB from 512mb and continue to monitor the problem.
12 July 2021 – 12:40 UTC
Communication : P4 is back up.
It was down from 11:00 to11:08 for a total of 8 minutes.
We upped the pods running the traefik service from 3 to 5, will change the max RAM per pod to 1GB from 512mb and continue to monitor the problem.
We received unusually high traffic. We have scaled resources to cope but still have a large number of requests and are looking at mitigation.