We have two properly sized dedicated machines for live and standby operation. Both servers run several virtual machines in an OpenVZ environment. In the past we repeatedly had issues with their RAID controllers: under heavy load, the system would simply freeze. This problem was reported to the server vendor. One of the servers (the standby machine) was taken down on November 16 for a firmware update, which seemed to resolve the issues that we had. Yesterday, the machine was brought back into the data center at punkt.de so we could reinstall it and put it into operation again.
While one machine was (hopefully) fixed by the firmware update, the same problem started to hit the live machine as well. We decided to put the fixed spare machine back into operation and to migrate the virtual servers to that hardware. In general this is no longer a big issue, as we have gained the necessary experience and our infrastructure is quite automated thanks to Chef and Steffen Gebert's strong commitment to improving this setup.
Unfortunately, after the machines were migrated and waiting to be started, some weird problems with the network setup started to hit us around 1:00 pm. For some unknown reason all network interfaces were unavailable, and on top of that, the remote serial console was not properly configured yet. We contacted punkt.de, as one of their admins had to go to the data center to get the serial console working for us again. Being one step further, we then tried to tackle the issue with the failing network interfaces. However, this took much longer than expected, and in the late afternoon we finally decided to move typo3.org to another machine for the time being.
Around 10:30 pm, typo3.org went live again at the temporary location. However, we were aware that the hardware was not meant for such a high-traffic site on top of the existing machines that were already running on it. That's why typo3.org was working, but performance was pretty bad (even worse than we expected). This was made worse by the fact that the association website got updated at the same time (which is great of course, but in this situation it would have been good for us to know, so we could have postponed it)...
After we reset the firmware of the Ethernet controllers late in the evening, the last remaining issue with the network setup was solved the next morning.
Finally, we could move typo3.org back to the machine that is properly dimensioned to host it. This was finished on December 4 at 2:00 pm. Since then, typo3.org has been served at proper speed again without any problems and we are back at "regular operation". There is still no spare machine available right now, but we are hopeful that the other server will be fixed before we need it again...
Of course this is just a summary, and as every sysadmin probably knows, situations like this can happen no matter how well you have planned for the worst case. Last Monday was just a classic "Murphy" day for us and we are very sorry for every minute of failed operation.
We have asked the admins at punkt.de to repair the spare machine as well, so we'll have more capacity again. We have also automated the setup of the remote/serial console, so chances are lower that we miss that important part of the server setup in the future...
In order to prevent further outages, punkt.de will even offer us a third server, allowing us to be more flexible during hardware failures. We would really like to thank their admins, Wolfgang Zenker and Patrick Hausen, for supporting us whenever we needed their assistance. Thank you!
Thanks to everyone for your patience and understanding!
Peter, Steffen and Michael on behalf of the server team