The digest of current topics on Continuous Availability. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CA tells you how to avoid the effects of downtime.
Thanks to This Month's Availability Digest Sponsor
In this issue:
Browse through our Useful Links.
Check our article archive for complete articles.
Sign up for your free subscription.
Join us on our Continuous Availability Forum.
Check out our seminars.
Check out our technical writing services.
Where are the Fallback Plans?
Just after completing this issue’s article on the continuing failures plaguing Australia’s four major banks, I found myself the victim of a U.S. banking outage. I had to complete a critical transaction before the weekend, so on Friday I went directly to my branch of PNC Bank to ask for assistance. “No problem,” the bank officer said as he turned to his terminal and entered the pertinent information.
Then, silence. Hitting the Enter key did nothing. After two or three more tries, he threw up his hands in disgust. He explained that PNC had just acquired the U.S. branches of the Royal Bank of Canada and was in the process of consolidating the IT services of the two banks. Something must have gone wrong. His advice: come back later and try again.
The Australian banking problems are being caused by the modernization of their aging infrastructures. The PNC problem was (presumably) caused by a major migration of RBC applications to the PNC environment.
In my seminars on availability, I talk extensively about the risks involved with major upgrades. The ultimate protection is a solid fallback plan so that services can be restored if the upgrade goes bad. It seems that many enterprises do not want to attend to this important defense. Please don’t fall into the same trap.
Postscript: My transaction was a week late! The system was down that long.
Dr. Bill Highleyman, Managing Editor
Shades of Y2K! Microsoft’s Windows Azure Cloud went down for over a day on Wednesday, February 29, 2012. Starting around midnight as the clock ticked to Leap Day, various subsystems of the Azure Cloud started to fail one-by-one. Soon, applications for many customers became unresponsive. By 8 AM Thursday morning, thirty-two hours later, Microsoft reported that recovery efforts were complete but that "a small number of customers may face long delays during service management operations."
Smells like a Leap-Year bug.
It is troubling that after the Y2K hysteria, we should once again be experiencing a calendar-related failure. A raft of date-simulation products was developed back then to allow systems to simulate dates without changing the system clock, thereby permitting the Y2K transition to be tested while the system remained in production. Many of these products are still around today. If the Azure cloud had been tested for the Leap-Year problem to the extent that most systems were checked for the Y2K problem, Microsoft might have avoided this disaster.
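To see how easy this class of bug is to write, consider the classic failure mode: computing a date one year ahead (for a certificate expiry, a lease renewal, or the like) by simply bumping the year while keeping the month and day. On any of the other 365 days this works; on February 29 it produces a date that does not exist. The sketch below is illustrative only; the function names are our own, and we do not know whether this is the specific defect that hit Azure.

```python
from datetime import date

def naive_one_year_later(d: date) -> date:
    # Naive approach: keep month and day, bump the year.
    # Valid for every date except Feb 29, where the target
    # year usually has no such day and ValueError is raised.
    return d.replace(year=d.year + 1)

def safe_one_year_later(d: date) -> date:
    # Defensive version: when the bumped date does not exist
    # (i.e., the source date was Feb 29 and the next year is
    # not a leap year), fall back to Feb 28.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

leap_day = date(2012, 2, 29)
print(safe_one_year_later(leap_day))   # 2013-02-28
try:
    naive_one_year_later(leap_day)
except ValueError:
    print("naive year bump fails on Feb 29")
```

The point is not the three-line fix but the testing discipline: a date-simulation tool run against Feb 29 would have flushed this bug out long before midnight on Leap Day.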
A recent online banking outage suffered by the National Australia Bank continued a series of such outages at Australia’s four largest banks over the last two years. The National Australia Bank (NAB), Commonwealth Bank, the Australia and New Zealand Bank (ANZ), and Westpac all have had their shares of outages affecting ATMs, retailers’ POS devices, and online banking. The outages have occurred as these historic banks engage in multi-year replacements of their aging core legacy systems, some dating back to the 1980s. Apparently, these systems have become quite fragile in their old age.
It has been suggested by some that Australians can expect regular outages of key banking services for the next decade as progress is made in replacing the banks’ legacy systems. However, in today’s high-technology world, there is an expectation of high availability and high resilience for critical services such as banking. Institutions can no longer cover up IT failures. There is no place to hide from Twitter and Facebook.
In this article, we look at the string of online banking failures and the response of Australia’s financial regulatory authorities to the consequent loss of confidence by Australians in their rickety banking system.
HP’s Enterprise Servers, Storage and Networking (ESSN) Business Unit markets two lines of servers – ProLiant servers (acquired from Compaq) and Integrity servers. ProLiant servers are based on Intel x86 Xeon processors and support the Windows and Linux operating systems. Integrity servers are Itanium-based and support HP’s mission-critical operating systems – HP-UX, NonStop, and OpenVMS.
On November 22, 2011, HP announced a major new initiative dubbed “Project Odyssey.” It is intended to extend the mission-critical features of HP-UX from Itanium blades to Windows and Linux x86 blades over the next two years. Project Odyssey raises many questions for those involved with HP’s current, highly available operating systems – HP-UX, NonStop, and OpenVMS. In this article, these concerns are explored.
If HP customers embrace the move to highly reliable standard operating systems, HP-UX may be the first to go since migrating Unix applications to Linux is a reasonable task. But achieving the fault tolerance provided by NonStop systems and OpenVMS Split-Site Clusters is probably not in the cards. Sadly, if the reliability provided by hardened Linux and Windows systems is good enough, the market may see a declining need for great, continuously available systems. Let’s hope that great triumphs over good enough!
In many respects, a company’s data center is part of its lifeblood. Significant investments are made to ensure that corporate data centers never fail. Unfortunately, they do.
Industry studies have shown that the human factor plays a role in about 70% of data-center failures. In some cases, the cause is a careless error on the part of an operator. In others, it is out-and-out malfeasance. Not only can staff errors directly cause outages, but even worse, they can escalate a controllable problem into a major crisis. One would think that staff problems are the one area we can effectively control. Evidently, this is not the case.
In our previous articles on data-center failures, we focused on failures due to power, storage subsystems, network faults, and upgrades gone wrong. In this article, we look at some human contributions to data-center outages.
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman email@example.com.
© 2011 Sombers Associates, Inc., and W. H. Highleyman