buzz.typo3.org: Information about the recent typo3.org issues

December 4, 2012

By: Peter, Steffen and Michael on behalf of the server team

During the last two weeks we had severe trouble operating the TYPO3.org website. While we think the problems have now been fixed completely, we'd like to give you a brief insight into what happened.

Known problems with the RAID controller

We have two properly sized dedicated machines for live and standby operation. Both servers run several virtual machines in an OpenVZ environment. In the past we repeatedly had issues with their RAID controllers: under heavy load, the system simply froze. This problem was reported to the server vendor. One of the servers (the standby machine) was taken down on November 16 for a firmware update, which seemed to resolve the issues we had. Yesterday, the machine was brought back into the data center at punkt.de so we could reinstall it and put it into operation again.

Problems on the live machine on December 3

While one machine was (hopefully) fixed by the firmware update, the same problem started to hit the live machine as well. We decided to put the fixed spare machine back into operation and to migrate the virtual servers to that hardware. In general this is no longer a big issue, as we have plenty of experience and our infrastructure is largely automated thanks to Chef and Steffen Gebert's strong commitment to improving this setup.

Unfortunately, after the machines had been migrated and were waiting to be started, some weird problems with the network setup started to hit us around 1:00 pm. For some unknown reason all network interfaces were unavailable, and on top of that, the remote serial console had not been properly configured yet. We contacted punkt.de, as one of their admins had to go to the data center to get the serial console working for us again. With that step done, we tried to tackle the issue with the failing network interfaces. However, this took far longer than expected, and in the late afternoon we finally decided to move typo3.org to another machine for the time being.

Going Live on spare machine

Around 10:30 pm, typo3.org went live again at the temporary location. However, we were aware that the hardware was not meant for such a high-traffic site on top of the existing machines that were already running on it. That's why typo3.org was working, but performance was pretty bad (even worse than we expected). This was made even worse by the fact that the association website was updated at the very same time (which is great of course, but in this situation it would have been good for us to know, so we could have postponed it)...

Recovering dedicated hardware on December 4

After we reset the firmware of the Ethernet controllers late in the evening, the last remaining issue with the network setup was solved the next morning.

Finally, we could move typo3.org back to the machine that is properly dimensioned to host it. This was finished on December 4 at 2:00 pm. Since then, typo3.org has been served at proper speed again without any problems, and we are back at "regular operation". There is still no spare machine available right now, but we are hopeful that the other server will be fixed before we need it again...

Final Remarks

Of course this is just a summary, and as every sysadmin probably knows, situations like this can happen no matter how well you have planned for the worst case. Last Monday was just a classic "Murphy" day for us, and we are very sorry for every minute of downtime.

We have asked the admins at punkt.de to repair the spare machine as well, so we'll have more capacity again. We have also automated the setup of the remote/serial console, so chances are lower that we miss that important part of the server setup in the future...

In order to prevent further outages, punkt.de will even offer us a third server, allowing us to be even more flexible during hardware failures. We would really like to thank their admins, Wolfgang Zenker and Patrick Hausen, for supporting us whenever we needed their assistance. Thank you!

Thanks to everyone for your patience and understanding!
Peter, Steffen and Michael on behalf of the server team

comments

comment #1

Peter December 5, 2012 08:39

A big thank you to the whole team. Your work is much appreciated!

comment #2

François December 5, 2012 08:46

This must have been a pretty rough ride. Thanks for your efforts!

comment #3

Peter P. December 5, 2012 09:08

Thanks a lot to everybody involved!! You are server rodeo heroes! Everybody appreciates your efforts, and especially that you had to drop your day-to-day business and jump in.

comment #4

stbc December 5, 2012 09:23

Hey, heads up, thx for the work.

comment #5

Steffen Gebert December 5, 2012 09:35

Just for clarification: Daniel Lienert asked me whether I would be available if trouble occurred while updating the Association website (and I gave my OK). So I knew about it, but didn't inform the whole team before the other trouble started. So this bad coordination came from my side, not from Daniel.

comment #6