Saturday, September 27, 2008

Improvements, and a real weasel

I have doubled the amount of RAM on the server, going from 1 to 2 GB. This should definitely help with the hangs. Also, I have added management again, by using a Real Weasel PCI card. If there is any trouble with the machine now, I should have a fair chance of fixing it from home or wherever I am.






I took some pictures of the current server setup. On top you can see the backup server for DNS/MX, which also controls the 4-port USB serial adapter, for management.








Serial cables:

Maintenance today

I am adding more RAM, and fixing management of the PC so that I can do out-of-band management. This means I can fix it more easily from home or another remote location if something is not working.

Friday, September 26, 2008

Update

OK, so we still have hangs. The blog neppe.no (which uses a lot of memory for PHP) and ZFS are the culprits. I had to undo some "fixes" to have neppe.no working at all. Until we have neppe.no using less memory, or more memory in the machine, we are going to have some occasional hangs. When the hangs happen, processes sometimes die. Because of this I have extended the monitoring on Totem to also check things like the mail queue length, and that Amavis/Postgrey are up and running. This should help to keep things up and running, and mail flowing.
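
As a rough illustration only (this is not the actual check running on Totem), a queue-length and process check in the usual Nagios plugin style could look something like the sketch below. The function names and the thresholds of 50 and 200 messages are made up for the example; the exit codes 0, 1 and 2 are the standard Nagios plugin convention for OK, WARNING and CRITICAL.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def queue_status(length, warn=50, crit=200):
    """Map a mail queue length to a Nagios-style status code.

    The default thresholds are hypothetical, not values from the real setup.
    """
    if length >= crit:
        return CRITICAL
    if length >= warn:
        return WARNING
    return OK


def process_running(name):
    """Return True if pgrep finds at least one process matching `name`.

    Assumes a system where the `pgrep` command is available.
    """
    result = subprocess.run(["pgrep", "-f", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0
```

A real check would feed the queue length from the mail server's own queue listing into `queue_status`, and call `process_running` once per daemon (for example with "amavis" and "postgrey" as the names to look for).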

Tuesday, September 23, 2008

Getting closer on fixing the hangs

It seems the recent server hangs are related to rewrite rules in Apache and the generation of one big PHP page several times in parallel. This makes the server use a lot of swap within a short time, and for the users it looks like a hang. I have made some changes to the setup so that it hopefully doesn't happen again.

Sunday, September 14, 2008

Server hangs

Arrgh. This morning it happened again. The whole server froze from 07:50 to 08:15. I'll be looking into why this is happening, but I'm afraid it might take a little while before I get any results.

Wednesday, September 10, 2008

Downtime today

Over the last few weeks we've had some hangs, lasting up to an hour or two, which I still have not been able to track down. It could be related to the fact that we are using ZFS now. In an attempt to avoid the hangs, I upgraded the mother OS (on the physical server) today. Unfortunately it did not come up after a reboot, which happened during a busy day at work. It took some time before I could look at it, and also to find out what was wrong. According to Nagios, we were down from 12:46 to 20:52. And yes, I do get an SMS message if the server goes down.
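
For the record, the length of that outage can be worked out from the two Nagios timestamps. The small helper below is just a sketch for the arithmetic (the function name is made up), and it assumes both times fall on the same day.

```python
from datetime import datetime


def downtime_minutes(start, end):
    """Return the minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# 12:46 to 20:52 is 486 minutes, i.e. 8 hours and 6 minutes.
print(downtime_minutes("12:46", "20:52"))
```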