Many years ago, I was testing and debugging some software for a set of machines on a rack. The system was designed to have high availability – the four nodes were in constant communication across eight cables, mirroring their databases and transparently recovering if there was a failure in the system. During my testing and debugging, each component had low reliability: some part of the system crashed or otherwise required rebooting frequently, and the MTBF for each component was about 2 hours.
I was pleasantly surprised one day to notice that the system had been up for a whole week! No single node had lasted more than a half-day, but the system had been robust in the face of that. Woohoo!
I was reminded of this experience this week – without the pleasant surprise – as my PC repeatedly froze up at random intervals during extended, intensive data recovery operations. It froze up about 15 times – once it survived 8 hours before locking up, once it lasted 33 seconds.
The difference between this experience and the earlier one was that there was no synchronisation with other machines and no recovery points. Every time it froze, I lost everything since my last (manual) save. As a result, I couldn’t just walk away and ignore the computer while the data recovery software was running – I had to come back every half-hour and interrupt it to save the progress that had been made.
Sigh.
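For what it's worth, the kind of recovery point I was missing can be sketched in a few lines of Python. This is only an illustration, not how the data recovery software worked: the names `save_checkpoint`, `load_checkpoint` and `run_job`, the JSON file format and the `checkpoint_every` parameter are all assumptions of mine. The idea is to write progress to disk automatically every so often, atomically, so that a freeze costs at most the work done since the last checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before the rename
    os.replace(tmp_path, path)  # readers see either the old or the new file


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "results": []}


def run_job(items, path, process, checkpoint_every=100):
    """Process items in order, resuming from the last checkpoint if present."""
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        state["results"].append(process(items[i]))
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save once everything is done
    return state["results"]
```

After a crash, calling `run_job` again with the same checkpoint path picks up from the last saved item rather than from the start. Writing to a temporary file and renaming it means a freeze in the middle of a save should not corrupt the previous checkpoint.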
Comment by Alan Green on October 6, 2005
I’m working on software that has a personal working philosophy of “what’s past is past, let’s look to the future”. It’s not that it ignores errors, it’s just that it works around them whenever possible. I have to take care to examine the logs regularly, because bad things can happen – like the database falling over, or whole flocks of webservices becoming uncontactable – without impacting the immediate user experience.
Comment by Julian on October 6, 2005
Alan,
An interesting coding philosophy: I am still working out what it might mean in a practical sense – especially given whatever low-level driver or hardware errors were causing my entire machine to completely freeze.
Certainly, one of the risks of RAID is that if failover is too transparent, you may never fix the problems until the last disk fails.