On the night (CET) of Friday, 29 November, we received alerts and reports from editors that specific P4 websites were unresponsive. The cause was excessive load on our Redis cache instances, which made the affected instances unresponsive.
The affected sites were GP Colombia, GP Greece, GP International and GP Norway. The team responded quickly, and the total outage lasted slightly over an hour.
As a quick mitigation, we restarted the Kubernetes pods running the Redis cache instance for each of the affected websites. This took only a few minutes, and the websites recovered quickly afterwards.
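For illustration, a restart of this kind could be scripted roughly as follows. This is a minimal sketch, not the commands we actually ran: the namespace names and the `app=redis` label selector are hypothetical, and it assumes `kubectl` is installed and configured for the cluster. When a pod is managed by a controller such as a Deployment or StatefulSet, deleting it causes a fresh replacement to be created.

```python
import subprocess

# Hypothetical namespaces, one per affected website; real names will differ.
AFFECTED_NAMESPACES = ["colombia", "greece", "international", "norway"]


def build_restart_command(namespace, selector="app=redis"):
    """Build the kubectl command that deletes the Redis pods in one namespace.

    The label selector is an assumption; use whatever labels the Redis pods carry.
    """
    return ["kubectl", "delete", "pod", "--namespace", namespace, "--selector", selector]


def restart_redis(namespaces, dry_run=True):
    """Print (dry run) or execute the restart command for each namespace."""
    commands = [build_restart_command(ns) for ns in namespaces]
    for command in commands:
        if dry_run:
            # Only show what would be run, without touching the cluster.
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if kubectl fails.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    restart_redis(AFFECTED_NAMESPACES)
```

The sketch defaults to a dry run so the commands can be reviewed before anything is deleted; passing `dry_run=False` would execute them.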
The immediate source of the outage was obvious from the very beginning, so we started an investigation to find the root cause of this abnormal behavior. So far we do not have a definite answer on what caused the excessive load on the Redis instances. Since the outage affected 4 different environments, we excluded the possibility of increased traffic and an auto-scaling failure. In the meantime, we are closely monitoring Redis behavior.
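As a rough idea of what such monitoring can look like, the sketch below parses the text returned by the Redis `INFO` command (for example from `redis-cli INFO`) and flags high memory usage or an unusual number of client connections. It is an illustration under stated assumptions, not our actual monitoring setup: the function names, the thresholds and the sample values are all hypothetical. Only the field names (`used_memory`, `maxmemory`, `connected_clients`) come from Redis itself.

```python
def parse_redis_info(raw):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats


def find_warnings(stats, max_memory_ratio=0.9, max_clients=1000):
    """Return human-readable warnings; both thresholds are illustrative assumptions."""
    warnings = []
    used = int(stats.get("used_memory", 0))
    limit = int(stats.get("maxmemory", 0))
    # maxmemory is 0 when no limit is configured, so the ratio is skipped then.
    if limit > 0 and used / limit > max_memory_ratio:
        warnings.append(f"memory usage at {used / limit:.0%} of maxmemory")
    clients = int(stats.get("connected_clients", 0))
    if clients > max_clients:
        warnings.append(f"{clients} connected clients")
    return warnings


# Made-up sample in the INFO format, not data from the incident.
SAMPLE_INFO = (
    "# Clients\r\nconnected_clients:1500\r\n"
    "# Memory\r\nused_memory:950\r\nmaxmemory:1000\r\n"
)

if __name__ == "__main__":
    for warning in find_warnings(parse_redis_info(SAMPLE_INFO)):
        print("WARNING:", warning)
```

A check like this could run periodically against each environment, so that rising load is noticed before an instance stops responding.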
We also connected all of our P4 environments to Sentry, a real-time error monitoring system. This will help us discover and triage any application errors that may cause similar problems in the future.