The digest of current topics on Continuous Processing Architectures. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CPA tells you how to avoid the effects of downtime.
In this issue:
Complete articles may be found at http://www.availabilitydigest.com/articles.
If you are going to the European ITUG conference in Brighton, England, this October, please attend one or more of the active/active presentations to be given by Dr. Bill Highleyman, our managing editor. He will compare active/active architectures with clusters, present a variety of case studies of real active/active systems in production, and host an active/active panel comprising European users, HP representatives, and the data replication vendors.
We direct your attention especially to this month’s article on “Adding Availability to Performance Benchmarks.” After all, a system that is down has zero performance.
Also, check out what happened when one of the world’s leading web site hosting services went down for days; and travel through a world governed by three 9s.
Please feel free to contact us for help if your data processing world of three or four 9s is just not doing the job you need.
QEI Provides Active/Active SCADA with OpenVMS
For almost 50 years, QEI of Springfield, New Jersey, has been supplying highly available SCADA systems to electric, transit, gas, water, and other utilities. A SCADA (Supervisory Control and Data Acquisition) system provides controllers with the facilities required to monitor and control the field devices upon which these utilities depend. It automatically generates alarms should conditions in the field demand immediate controller attention and provides a raft of historical data for trend analysis, root cause analysis, and many other functions important to the utilities.
QEI’s current SCADA system, TDMS-PLUS (Total Distribution Management System), focuses on the monitoring and control of electrical power substations used for the distribution of power to electric utility customers and transit systems. Built on the highly reliable and secure HP OpenVMS platform, TDMS-PLUS provides extreme availabilities through the use of dual, triple, or quadruple active/active redundancy in disaster-tolerant configurations.
Hostway’s Web Hosting Service Down for Days
Hostway is one of the largest web site hosting services in the world and serves over 400,000 customers. When Hostway attempted to move 3,700 servers from the Miami data center of a recently acquired company to its own facilities in Tampa, Florida, the servers suffered multiple failures. The result was that the web sites of 500 customers, many of them small online stores, were down for days. A week later, several customers were still down.
What went wrong? Where were the proper data and application migration techniques? What could Hostway have done to avoid the problem? We look at this horribly failed migration and attempt to answer these questions.
Is three 9s of availability sufficient for today’s data processing systems? The answer is “maybe.” It all depends upon the application.
Three 9s availability can be put in a more understandable context by considering what our lives would be like if they were governed by three 9s. For instance, more than 15,000 babies would be accidentally dropped by doctors or nurses each year. We would be without telephone service for more than ten minutes each week.
Three 9s might be acceptable for a large number of data processing applications, but we don’t think that people would like to live in a three 9s world. In this article, we list some of the consequences of such an environment.
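The telephone figure above follows from simple arithmetic: three 9s means a system is unavailable 0.1% of the time. The short sketch below is our own illustration of that conversion, not code from the article; the function names are hypothetical.

```python
# Illustrative sketch: convert a "number of 9s" into expected downtime.
# Function names are hypothetical and chosen only for this example.

MINUTES_PER_WEEK = 7 * 24 * 60   # 10,080 minutes
HOURS_PER_YEAR = 365 * 24        # 8,760 hours (non-leap year)


def downtime_fraction(nines: int) -> float:
    """Fraction of time a system is down at the given number of 9s."""
    return 10.0 ** (-nines)


def downtime_minutes_per_week(nines: int) -> float:
    """Expected downtime in minutes per week."""
    return downtime_fraction(nines) * MINUTES_PER_WEEK


def downtime_hours_per_year(nines: int) -> float:
    """Expected downtime in hours per year."""
    return downtime_fraction(nines) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "nines:",
              round(downtime_minutes_per_week(n), 3), "min/week,",
              round(downtime_hours_per_year(n), 3), "hours/year")
```

For three 9s, this works out to slightly more than ten minutes of downtime per week, which matches the telephone example.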
Adding Availability to Performance Benchmarks
Current performance benchmarks tell only part of the comparative story between systems. This is because a system has zero performance when it is down.
The cost of downtime in many applications is quite significant. The cost of a system as determined by the current TPC-C benchmark may be quite impressive. However, once the cost of downtime is considered, its level of availability could well make it more expensive than other systems with a higher initial cost but higher availability.
Systems are so reliable today that a measure of the availability of a system might take many system-years to accurately determine. This is infeasible. However, the recovery time of a system is fairly easy to measure. Recovery time as a performance benchmark attribute would allow a comparison of system costs to be made based on the number of failures required for one system to be less expensive than another.
Recovery time is not an expensive measurement to make in the context of the cost of a transaction processing benchmark. Isn’t it time to add this attribute to our performance benchmarks so that users can make a truly informed decision?
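To make the break-even idea concrete, here is a minimal sketch under our own simplifying assumptions: both systems suffer the same number of failures, each failure costs its recovery time multiplied by a fixed hourly downtime cost, and all prices and times shown are invented for illustration. None of these figures comes from the TPC-C benchmark or from the article.

```python
# Illustrative sketch: how many failures before the cheaper system
# becomes the more expensive one? All inputs are hypothetical.


def total_cost(initial_cost: float, recovery_hours: float,
               downtime_cost_per_hour: float, failures: float) -> float:
    """Initial cost plus the cost of downtime accumulated over the failures."""
    return initial_cost + failures * recovery_hours * downtime_cost_per_hour


def break_even_failures(cheap_cost: float, cheap_recovery_hours: float,
                        costly_cost: float, costly_recovery_hours: float,
                        downtime_cost_per_hour: float) -> float:
    """Number of failures at which the two total costs are equal."""
    saved_hours_per_failure = cheap_recovery_hours - costly_recovery_hours
    if saved_hours_per_failure <= 0:
        raise ValueError("the costlier system must recover faster")
    return (costly_cost - cheap_cost) / (
        downtime_cost_per_hour * saved_hours_per_failure)


if __name__ == "__main__":
    # Hypothetical systems: A is cheaper but recovers slowly.
    n = break_even_failures(cheap_cost=1_000_000, cheap_recovery_hours=4.0,
                            costly_cost=1_500_000, costly_recovery_hours=1.5,
                            downtime_cost_per_hour=100_000)
    print("Break-even after", n, "failures")
```

With these made-up numbers, the cheaper system loses its price advantage after two failures. Beyond that point, the system with the higher initial cost is the less expensive one.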
Fault-Tolerant Windows and Linux from Stratus
ftServers from Stratus Technologies provide plug-and-play fault tolerance for Windows and Red Hat Linux applications. Using Intel Xeon chips in a dual modular redundancy architecture, ftServers bring extremely high availability – five 9s and beyond – to the industry standard marketplace at affordable prices.
Average annual unplanned downtime for industry-standard servers has been measured to be about thirteen hours for Windows systems, sixteen hours for Red Hat Linux systems, and ten hours for UNIX systems. The performance of ftServers is continually monitored by Stratus, and a running availability is posted on its home page. This Uptime Meter indicates an average downtime due to faults in the Stratus hardware or operating systems of about two minutes per year.
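The downtime figures quoted above can be restated as availabilities and as a count of 9s. The sketch below is our own illustration of that conversion, assuming a 365-day year; the function names are hypothetical.

```python
# Illustrative sketch: convert annual downtime into availability and 9s.
import math

HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def availability_from_downtime(downtime_hours_per_year: float) -> float:
    """Fraction of the year that the system is up."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def nines_from_downtime(downtime_hours_per_year: float) -> float:
    """Number of 9s, defined here as -log10 of the downtime fraction."""
    return -math.log10(downtime_hours_per_year / HOURS_PER_YEAR)


if __name__ == "__main__":
    quoted = {
        "Windows (13 hours)": 13.0,
        "Red Hat Linux (16 hours)": 16.0,
        "UNIX (10 hours)": 10.0,
        "ftServer (2 minutes)": 2.0 / 60.0,
    }
    for name, hours in quoted.items():
        print(name,
              "availability", round(availability_from_downtime(hours), 6),
              "nines", round(nines_from_downtime(hours), 2))
```

By this measure, the industry-standard figures all fall just short of three 9s, while two minutes per year is a little better than five 9s.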
Stratus has extended its product line to support virtualization by incorporating VMware’s ESX virtualization server. ftServers can now run multiple virtual machines on a single physical ftServer.
Calculating Availability – Failure State Diagrams
In previous articles in our Calculating Availability series, we have studied a variety of topics. These have included failure probabilities, repair and recovery strategies, failover and failover faults, environmental faults, and the interdependence of hardware and software faults. In all of this work, we have derived a series of relations based on intuitive reasoning.
However, intuition is not always accurate. How realistic are these relationships? Have we been led astray by inaccurate reasoning? It turns out that there is a very formal way to derive these same relationships. That is through failure state diagrams, which we discuss in this article.
All of the relationships that we have presented in previous Geek Corner articles are based on formal results achieved by analyzing failure state diagrams. Basically, having determined the correct answers through a fairly laborious procedure (the state diagram), we then formulated intuitive approaches to arrive at the same conclusions, and these are what we have been presenting in our Geek Corner articles.
In this article, we explain the use of failure state diagrams. They are actually simple in concept but sometimes can take some messy algebra to solve.
In later articles, we will use failure state diagrams to prove more significant results.
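As a taste of the technique, here is a minimal sketch of one such diagram. It is our own example, not necessarily the one worked in the article. It assumes a two-node system with three states (both nodes up, one node down, both nodes down), independent failures at a rate of 1/MTBF per node, and parallel repair at a rate of 1/MTR per failed node. In steady state, the flow out of each state must equal the flow into it, which gives the state probabilities directly.

```python
# Illustrative sketch: steady-state solution of a three-state failure
# state diagram for a two-node system. The model and its parameters are
# assumptions made for this example only.


def dual_node_state_probabilities(mtbf_hours: float, mtr_hours: float):
    """Return (p_both_up, p_one_down, p_both_down) in steady state."""
    lam = 1.0 / mtbf_hours   # failure rate of one node
    mu = 1.0 / mtr_hours     # repair rate of one node

    # Balance across each cut of the chain, relative to state 0:
    #   state 0 -> 1 at rate 2*lam, state 1 -> 0 at rate mu
    #   state 1 -> 2 at rate lam,   state 2 -> 1 at rate 2*mu
    w0 = 1.0
    w1 = w0 * (2.0 * lam) / mu
    w2 = w1 * lam / (2.0 * mu)

    total = w0 + w1 + w2     # normalize so the probabilities sum to 1
    return (w0 / total, w1 / total, w2 / total)


if __name__ == "__main__":
    p_up2, p_up1, p_down = dual_node_state_probabilities(1000.0, 4.0)
    print("P(both up)   =", p_up2)
    print("P(one down)  =", p_up1)
    print("P(both down) =", p_down)
```

Under these assumptions, the probability that both nodes are down equals the square of a single node's unavailability, MTR/(MTBF + MTR). This agrees with the intuitive argument that two independent nodes must both be down at once for the system to fail.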
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest may be distributed freely. Please pass it on to an associate.
To be a reporter, visit http://www.availabilitydigest.com/reporter.htm.
Managing Editor - Dr. Bill Highleyman firstname.lastname@example.org.
© 2007 Sombers Associates, Inc., and W. H. Highleyman