Following the release of v1.23, three of our planet4 websites experienced a problem. This is a report of what caused it, what went wrong, and the steps we are taking to ensure that it does not happen again.

(Editor's note: I can't promise you that we will never have an outage again. But we will do what we can to not have the same outage again.)

The problem showed up as three websites (GP Netherlands, GP India, GP Canada) being served without CSS for 4 hours, until the issue was resolved. During that time the sites did not look right.

As is usually the case, there were multiple factors and errors, all contributing to the single result. Getting any one of them right could have prevented the outage.

a) We are still using two environments for hosting (VMs and k8s), which differ from each other.
One change was tested on only one environment, and not on the other.

b) One such difference is how composer scripts are run.
The difference was introduced on the 20th of June in the Jira issue PLANET-2346 (Make custom scripts on planet4-base-fork be picked up).

When Konstantinos introduced the developers to the release process, he did not tell them about this difference between the two environments. So when v1.23 included such a change, it was made in the composer file that handles one hosting environment, but not in the corresponding file that controls k8s, where the k8s environment expected it.

c) The tests we have written are not run automatically on each release (neither in the VM nor in the k8s environment). Tests are run manually, and only against one website.

d) Because the automated tests we have are not integrated into the release cycle, one step of the process was a visual confirmation that everything works on the staging site before pushing a release to production.
This was part of the documentation, but its importance was not stressed to the developers.

" 3) When the process is finished, check that the sites work, and if all is ok, then run the following process for their master: "

So, while doing the release, the developer did not do a manual inspection of every site.

e) The rollback process is not clearly described for the k8s environment.
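
To make cause (b) more concrete, here is a minimal sketch of how the two composer files could be compared before a release. It is not part of our tooling. It assumes only that both files keep their commands under Composer's standard "scripts" key; the function name and the sample data are hypothetical. Some differences between the VM and k8s files may be intentional, so the output is meant as a review aid, not a pass/fail check.

```python
import json


def load_composer(path):
    """Read a composer.json file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def script_differences(vm_composer, k8s_composer):
    """Return the script names whose definitions differ between the two files.

    Both arguments are parsed composer.json dictionaries. A script that is
    missing from one side is reported with None for that side.
    """
    vm_scripts = vm_composer.get("scripts", {})
    k8s_scripts = k8s_composer.get("scripts", {})
    differences = {}
    for name in sorted(set(vm_scripts) | set(k8s_scripts)):
        if vm_scripts.get(name) != k8s_scripts.get(name):
            differences[name] = {
                "vm": vm_scripts.get(name),
                "k8s": k8s_scripts.get(name),
            }
    return differences


if __name__ == "__main__":
    # Hypothetical data standing in for planet4-base (VM) and
    # planet4-base-fork (k8s); the real script names are not shown here.
    vm = {"scripts": {"site-install": ["step-a", "step-b"]}}
    k8s = {"scripts": {"site-install": ["step-a"]}}
    print(json.dumps(script_differences(vm, k8s), indent=2))
```

With real files, the two dictionaries would come from load_composer() pointed at a local checkout of each repository.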

What we are doing to improve the process:

1) Sprint 50 (starting on July 25th) will have a single goal: automated testing. In detail, “Introduce automated testing in the release cycle and write more automated tests”. The whole team will have this as its primary goal for the whole sprint.

2) The importance of manual testing (and of visually checking the staging sites) before going live has been stressed to the whole team. Manual checks will continue until we are certain that automated testing is live and has enough tests to make sure that nothing breaks.

3) After automated testing is introduced, the next step will be the migration of the sites from the VM stack to the k8s stack, so that we don’t have two different setups that we need to test (in other words, we simplify the whole setup).

4) All developers are now aware of the difference between the two composer.json files and of what needs to happen in planet4-base (VM) and planet4-base-fork (k8s) whenever such a change happens.

5) The rollback process for both environments will be documented before the next release.
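
As an illustration of the kind of automated check that points 1 and 2 are aiming for, here is a minimal sketch of a smoke test that would have caught this particular outage: it loads a page, collects the stylesheets it links to, and reports any that cannot be fetched or come back empty. This is not the test suite we will ship. The function names are hypothetical, and the page fetcher is passed in as a parameter so the logic can be exercised without network access.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen


class StylesheetCollector(HTMLParser):
    """Collect the href of every <link rel="stylesheet"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def stylesheet_urls(page_url, html):
    """Return absolute URLs of the stylesheets referenced by the page."""
    collector = StylesheetCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]


def fetch(url):
    """Return (status, body); HTTP error responses become (code, '')."""
    try:
        with urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except HTTPError as error:
        return error.code, ""


def css_problems(page_url, fetcher=fetch):
    """Return a list of human-readable problems; empty means CSS looks fine."""
    status, html = fetcher(page_url)
    if status != 200:
        return [f"{page_url}: page returned HTTP {status}"]
    urls = stylesheet_urls(page_url, html)
    if not urls:
        return [f"{page_url}: no stylesheet links found"]
    problems = []
    for css_url in urls:
        css_status, css_body = fetcher(css_url)
        if css_status != 200:
            problems.append(f"{css_url}: stylesheet returned HTTP {css_status}")
        elif not css_body.strip():
            problems.append(f"{css_url}: stylesheet is empty")
    return problems
```

In a release pipeline, css_problems() would be called once per staging site, and the release would stop if any call returned a non-empty list. It only checks that the stylesheets are reachable, so it complements the visual check rather than replacing it.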