Dear Server Hosting Helpdesk staff,
I’ve had this exact problem many times before. One of your other customers has software that has gone wild and is consuming all of our shared CPU.
(I used to believe you when you said you monitored for this, but I learnt that you did nothing when you discovered it, so now I monitor for it myself.)
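(For the record, my monitoring is nothing exotic. A minimal sketch of it in Python, assuming a Unix host where `os.getloadavg()` is available; the threshold and polling interval here are arbitrary, not what I actually run:)

```python
import os
import time

def load_snapshot():
    """Return the 1-, 5- and 15-minute load averages, as `uptime` reports them."""
    return os.getloadavg()

def watch(threshold=4.0, interval=60):
    """Warn whenever the 15-minute load average exceeds `threshold`."""
    while True:
        one, five, fifteen = os.getloadavg()
        if fifteen > threshold:
            print(f"load alert: 1m={one:.2f} 5m={five:.2f} 15m={fifteen:.2f}")
        time.sleep(interval)
```

(Anything that reports the 15-minute figure, `uptime` included, would do just as well.)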
No, it isn’t me.
Yes, I know I have a single process consuming 20-30% of my allocated physical RAM. That’s why I am paying you for your server – to run that process for me. It’s okay, though. I am not running out of RAM. Please locate the problem user’s processes and kill them, like your colleagues have always done in the past.
No, it isn’t me.
Yes, I know I have a fair number of processes, but they are all sitting idle. They may be using RAM, but I am not running out of RAM. Please locate the problem user.
No, it isn’t me.
Yes, I know I have 4 users logged into my machine. They are all me, logged in as different accounts, sitting idle. There, look: I've logged out of all but one. That should clear up some of the RAM I am not running out of.
Please locate the problem user.
No, it isn’t me.
Yes, I know my big process sometimes wakes up and consumes some CPU. That's why I want you to fix the problem: so that it is responsive when it does. But it doesn't wake up often, and when it does, it doesn't stay awake for long. Seriously, it accounts for a couple of percent of CPU over a 5-minute period. That doesn't explain why the 15-minute average system load is about 500% higher than normal.
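(That "couple of percent" isn't a guess. A Linux-only sketch of how one might measure a single process's CPU share, reading cumulative user and system ticks from `/proc/<pid>/stat` with field positions per the proc(5) man page; the 5-second interval is just for illustration:)

```python
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")  # kernel clock ticks per second

def cpu_ticks(pid):
    """Cumulative user+system CPU ticks for a process (Linux /proc only)."""
    with open(f"/proc/{pid}/stat") as f:
        # Split after the ")" that closes the comm field, so a process
        # name containing spaces or parentheses can't shift the fields.
        fields = f.read().rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 of stat
    return utime + stime

def cpu_percent(pid, interval=5.0):
    """Average CPU share of one process over `interval` seconds, as a percent."""
    before = cpu_ticks(pid)
    time.sleep(interval)
    after = cpu_ticks(pid)
    return 100.0 * (after - before) / (CLK_TCK * interval)
```

(A mostly idle daemon comes out near zero; it takes many processes near 100% to push a 15-minute load average up 500%.)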
Please locate the problem user.
No, it isn’t me.
Oh look, after spending 30 minutes trying to prove it isn’t me, the problem went away. Either the user fixed their own problem or maybe you did and didn’t want to tell me.
No, it wasn’t me.
Sigh.
Comment by Code Incantation on February 14, 2011
You should ask the server: “Are you a good Helpdesk staff?”
Helpdesk: “No, it isn’t me.”
Better to quit their product.
Comment by Julian on February 14, 2011
In an unrelated incident, the provider caused a 4-hour outage today. It caused me stress (they would give no pro-active indication of progress and refused to answer direct questions about approximate ETAs) and it also cost me a fair whack of money – more than the hosting costs of their competitor’s dedicated server plan for a few months – so I’ve decided to migrate over during the next few weeks.
Almost as a parting gift, they moved my Virtual Private Server to another hardware node, promising “zero downtime”. I was impressed, figuring they had magic virtualisation tricks to do that.
Nope – they rebooted. My non-daemon processes were killed and didn’t restart until I got back to my desk.
Meanwhile, they claimed that "All services of your vps are back online", which was both wrong and something they couldn't know, and that my "websites are working fine", another bold claim they couldn't verify. (Port 80 refuses connections. My web-sites aren't running on standard ports. They have no idea where to look, or what my web-sites should look like if they found them.)
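(Their "working fine" claim fails the most basic reachability check. A minimal sketch of that check in Python; the host and port below are placeholders, not my actual setup:)

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False

# Example (hypothetical host and port):
# port_open("example.com", 80)
```

(Even this says nothing about whether the sites *look* right, which is the part they really couldn't know.)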
Sigh.
Comment by Chris on February 14, 2011
I think all SLAs are written with a dash of wishful thinking. Even the ones you get with dedicated hardware. You might not get the contention issues, but with one node (if that’s what you’re gunning for) you might still be a victim of a freakish blemish on an immaculate record of perfect availability.
Sounds like you need redundancy of some kind. Duplication, or something like a dormant EC2 instance.
Comment by Julian on February 19, 2011
[I wrote this comment on Feb 14th, but got distracted and didn’t hit submit.]
I have some network latency requirements that preclude EC2, or else I would have jumped at it on day 1; it sounds ideal. Even the idea of having separate – but identical – testing, staging and live servers appeals to me, rather than merely separate accounts on the same machine.
I thought a VPS offered good hope of quick failover – just boot the same file-system on new hardware – but it wasn’t to be.
Most of the problems I had could be attributed to the VPS system, which is why I am migrating away. (I was warned that their VPS system, Virtuozzo, wasn’t the greatest, so I may be overreacting.)
I am currently executing my failover plan, which involves migrating to a new server within 24 hours (including finding a new ISP). Actually, it is more a rehearsal than an execution; I'm not really in a rush, because the old server is running fine at the moment.
My biggest gripe is the help-desk’s behaviour during the last few weeks.
They failed to identify the true problem with the CPU hog, which is just frustrating.
There has been a persistent problem for over two weeks with the VPS Control Panel which they keep giving overconfident estimates for. Now, they are asking for my root password, and given what I have seen them do recently, I don’t wish to give it to them.
This four-hour outage was the final straw.
Rather than give optimistic estimates, they flatly refused to give any at all. After two hours of down time, with no communication from them, I had no idea whether to expect the system to be back in 10 minutes or 10 days. I had to assume the worst, and to start deciding which of their competitors was getting my business. By the time the system did come back, I had lost all confidence in them, and decided, with some sense of relief, to proceed with the migration, just at a more relaxed pace.
Comment by Chris on November 27, 2012
I just remembered this post, and wanted to note that the latency issues might be solved now that Amazon have Australian EC2 instances on offer. Also, optimising spot pricing might be entertainingly recursive.