We started getting spurious temperature fluctuations in our server room a while ago. Each night, the temperature would rise above 30 °C for about half an hour and then drop again.
Over the past couple of days, it grew increasingly erratic and threatened the whole server room. Outages became more and more frequent, and finally we had to put a second A/C unit in there, with the doors open (because of the vent pipe), to try to contain it.
It’s incredible how much heat a room with something like 15-20 servers can generate.
The problem is that once the temperature starts rising, some servers start malfunctioning. We had one server stop responding over TCP/IP, so it had to be restarted. And hard disks soon start crashing if the temperature doesn’t go down again. One colleague of mine told me he once had backup tapes literally melt inside the backup unit. So we’ve been watching the temperature go up and down this past week in a mild state of panic.
Of course, we tried to isolate the fault and repair it. Is it a malfunctioning control board inside the unit? The compressor? Is current reaching the contactor in the external unit? … Digging into a 380V external appliance to measure voltages was a first for me :)
In the end, it turned out to be a faulty current protector. It’s designed to shut off if the current grows too high, to protect the compressor, and it was misbehaving. It was replaced, and voilà, everything ran perfectly again.
It’s amazing how such a small, cheap part can affect so much.
Of course, all this talk about “current protection” is just to cover up for the *real* culprit we found.