NO MORE UPDATES TO THIS PAGE PLEASE. SUBMIT ALL FUTURE COMMENTS TO |
Just a comment about the whole chapter.
In large companies (e.g. banks), "Disaster Recovery" is an outmoded concept. Instead they perform "contingency planning" or "business continuity planning". Typically this involves a second site with fast WAN links to the primary site(s), with critical applications (e.g. trading systems) replicated there; the servers at the contingency site run in "warm standby" mode. How that is implemented is application-specific: they may mount files from a SAN or NAS that has WAN replication, or they may have DBMS log files regularly shipped and applied, and so on. This means that the decision to switch to contingency can be made on an application-by-application basis. Typical infrastructure at the contingency site includes DNS servers, SMTP gateways and so on. Such a location has another advantage: overnight the WAN links are unused, so it makes sense to place your backup infrastructure there. The primary site servers are backed up to tapes at the contingency site, which are therefore automatically off-site if the primary site becomes inaccessible, and also readily available at the contingency site if a restore is needed.
Anecdote: A major investment bank has such a contingency environment. At 9pm the on-call SA was notified by the automated monitoring system that a production server running a critical trading application was down. He logged in and determined that there was a hardware fault. He escalated to the hardware vendor, who came on site and decided that a power supply had failed in such a way that the n+1 redundancy wasn't working. This surprised the SA, but he had to trust the vendor. A new power supply was shipped, but after it was installed the server still did not reboot. The vendor then did some more diagnosis and proclaimed that the server backplane had failed. It would take 4 hours to get a new backplane shipped. Including installation and testing, it was questionable whether the server would be running before the traders arrived. At this point business management made the decision to switch to contingency. The application team was woken up; they switched the contingency server from warm standby to live and ran their tests, and the application was live and running an hour before the traders needed it. The next night, after the production server had been repaired, the application was switched back to the primary server. Through all this the traders didn't notice that anything was wrong. Proper planning had saved the company from actual financial loss through inability to trade.
[Tom's reply: Yes, this chapter is showing its age. Maybe the 3rd edition needs to rewrite it from scratch. I've made a note about this in the chapter, and saved this paragraph in our "for edition 3" file. Status: DONE]
--
StephenHarris - 15 Aug 2006