The digest of current topics on Continuous Availability. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CA tells you how to avoid the effects of downtime.
In this issue:
Browse through our Useful Links.
Check our article archive for complete articles.
Sign up for your free subscription.
Join us on our Continuous Availability Forum.
The Availability Barrier is Failover Time
Let’s say that your company is running a very critical application on a top-of-the-line Linux server. Experience has shown that it fails about twice a year and requires about four hours to restore to service. Your manager wants the application to have no downtime.
To achieve this, you configure an identical backup server to take over should the production system fail. Tests show that you can fail over in an hour. Congratulations? Perhaps not. You have reduced the annual downtime from eight hours to two hours; but at twice the cost, you have improved the application's availability by only a factor of four. Not very good!
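The arithmetic behind this comparison can be sketched in a few lines. This is a minimal illustration using the figures from the scenario above (two failures per year; a four-hour restore without a backup, a one-hour failover with one); the function name is ours, not from any particular library.

```python
HOURS_PER_YEAR = 24 * 365

def availability(failures_per_year: float, hours_down_per_failure: float) -> float:
    """Fraction of the year the application is in service."""
    annual_downtime = failures_per_year * hours_down_per_failure
    return 1 - annual_downtime / HOURS_PER_YEAR

single = availability(2, 4)       # eight hours of downtime per year
with_backup = availability(2, 1)  # two hours of downtime per year

# "Improving availability by a factor of four" means the unavailability
# (the downtime fraction) shrinks by a factor of four.
improvement = (1 - single) / (1 - with_backup)
print(f"single server:   {single:.5f}")
print(f"with failover:   {with_backup:.5f}")
print(f"improvement:     {improvement:.1f}x")
```

Note that doubling the hardware bought only a fourfold reduction in downtime; shrinking the failover time itself is what moves availability into the "minutes per year" range.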
The secret to achieving very high availabilities (minutes or even seconds of downtime per year) is fast failover. Though most critical applications today run with a passive backup that can take over in an hour or more should the production system fail, a recent study by the Standish Group found that only 2% of critical applications run active/active with failover times measured in seconds – and this despite the fact that moving from active/passive to active/active does not entail much additional cost.
In our high-availability seminars, we demonstrate the impact of failover on availability with a bit of theory and many actual examples; and we show you how to reduce failover times from hours to minutes or seconds. Check out our seminar synopses on our web site, and give us a call to schedule an educational experience.
Dr. Bill Highleyman, Managing Editor
Royal Bank of Canada Goes Active/Active for ATM/POS
The Royal Bank of Canada (RBC) has taken a major step towards providing improved service to its customers by modernizing its active/backup data-center architecture and reengineering it into an active/active network. The end result? Planned outages for system upgrades have been reduced from hours to minutes, and recovery from an unplanned outage resulting from a system failure or a data-center disaster has been reduced more than 95%, from hours or even days to a few minutes. Most importantly, should an outage occur, the bank's ATM/POS application services are restored to customers much faster, in many cases without the customer even realizing that an outage has occurred.
VMware's Cloud Foundry Suffers Early Outages
VMware is the new kid on the block when it comes to cloud computing. Its offering, the Cloud Foundry, is aptly named since it is not simply another cloud service. Rather, it is intended to provide a platform for developers to build their own clouds.
Unfortunately, just two weeks after its launch on April 12, 2011, the Cloud Foundry suffered a sequence of major problems that took it down for about twelve hours over a day-and-a-half period. Fortunately, the outage did not seriously affect users, since the Cloud Foundry is a beta release that is expected to have problems – that is why we have beta releases. Users have free access to it and were penalized only by inconvenience. Nevertheless, there are important lessons to learn from this experience.
To its credit, VMware published a timely description of the events that led up to the outages and of their root causes. In fact, VMware seemed to beat the press – accounts describing these incidents didn't start appearing until about the first of May, a week later.
Choosing a Business Continuity Solution: Part 1 – Availability Fundamentals
Business continuity encompasses those activities that an enterprise performs to maintain consistency and recoverability of its operations and services. The availability of application services provided by an enterprise’s IT infrastructure is only one of many facets of business continuity, albeit an extremely important one. Application availability depends upon the ability of IT services to survive any fault, whether it is a server failure, a network fault, or a data-center disaster. An enabling technology for achieving high availability and even continuous availability for application services is data replication.
This series of articles provides management with an understanding of data-replication technologies. Management can then make informed decisions concerning the availability approach to be applied to each application for maximum return on the company’s investment. Selecting the right data-replication technology to achieve your business-continuity goals is the focus of this four-part series.
In this first part, we review many of the fundamental concepts that help us define availability. Fundamental to highly available systems is data replication. Data-replication technologies are reviewed in Part 2. The various highly available architectures in use today are described in Part 3. Finally, in Part 4, we show how to choose the architecture that will meet your business requirements.
Most highly available systems in use today are complex assemblies of redundant components. We may know the reliability characteristics of each component in the system, but how can we use this information to calculate the availability of the entire system?
The reliability diagram is an important tool to achieve this end. In this article, we look at reliability diagrams and give an example of how to use them to calculate the availability of complex systems.
Most complex IT systems can be represented as a set of redundant nodes in series with other nodes. To calculate the availability of such a system, the first step is to draw a reliability diagram of the system. The next step is to resolve each redundant node into a single node. Then each series of serial nodes is resolved into a single node. These two steps are executed iteratively until the system is reduced to a single node, thereby giving its availability.
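The two reduction rules can be sketched as code. This is a minimal illustration, not from any particular tool, assuming each component's availability is already known: redundant (parallel) nodes fail only if every branch fails, while serial nodes are up only if every node is up. The example component availabilities at the end are hypothetical.

```python
def parallel(*availabilities: float) -> float:
    """Resolve redundant nodes into one: the group is down
    only if all branches are down simultaneously."""
    group_unavailability = 1.0
    for a in availabilities:
        group_unavailability *= (1 - a)
    return 1 - group_unavailability

def serial(*availabilities: float) -> float:
    """Resolve a chain of serial nodes into one: the chain is up
    only if every node in it is up."""
    chain_availability = 1.0
    for a in availabilities:
        chain_availability *= a
    return chain_availability

# Hypothetical system: a redundant server pair (0.99 each) in series with
# a network (0.999) and a mirrored disk pair (0.999 each).
servers = parallel(0.99, 0.99)    # 1 - 0.01 * 0.01  = 0.9999
disks = parallel(0.999, 0.999)    # 1 - 0.001 * 0.001 = 0.999999
system = serial(servers, 0.999, disks)
print(f"system availability: {system:.6f}")
```

Applying `parallel` to each redundant group and then `serial` across the chain is exactly the iterative reduction described above: each call collapses one node of the diagram until a single availability figure remains.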
The analysis in this article focuses on system downtime due only to node failures. However, in reality, a redundant system is down if it is in the process of failing over. The extension of reliability diagrams to include failover is discussed in our companion two-part series, Simplifying Failover Analysis, Parts 1 and 2 (Availability Digest; October 2010 and June 2011).
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman firstname.lastname@example.org.
© 2011 Sombers Associates, Inc., and W. H. Highleyman