The digest of current topics on Continuous Processing Architectures. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CPA tells you how to avoid the effects of downtime.
In this issue:
Complete articles may be found at http://www.availabilitydigest.com/articles.
Availability Math is Fun – for Some
In this issue, we continue our discussion on how to configure your servers to meet a performance SLA that might say, for instance, “95% of all transactions will complete in less than 2 seconds.” After all, if a system is not sufficiently responsive, it is effectively down.
Though the math behind this solution may be complex, for some of us it is surprisingly entertaining. We are the “math nuts.” However, we always try to leave those who consider themselves math-challenged with simple-to-use results that can be put to use without understanding any of the mathematical complexity.
If you are a math nut and have an issue with one of our analyses, please pass your comments by us. We are always learning; and, believe it or not, we are not always right.
If you are struggling with an availability calculation and need some help, this is even more of a reason to get in touch with us. We’d be glad (even excited) to assist.
Dr. Bill Highleyman, Managing Editor
Named for Edwin Hubble, an American astronomer, the Hubble Space Telescope is perhaps one of NASA’s most successful missions since the Apollo project put men on the moon. Launched in 1990, Hubble has helped to pin down the age of the universe; and it has pointed to the existence of dark energy that apparently makes up the bulk of the mass-energy of the universe.
However, late last year, we almost lost Hubble. A module in the instrument controller failed, and the failover to the backup unit did not work. It took harried engineers over four weeks to get Hubble operational again.
The failure could not have happened at a better time. The last Space Shuttle mission to service Hubble was scheduled to launch two weeks after the failure. NASA has managed to reschedule the mission for next May so that the failed part can be replaced, giving Hubble a redundant backup once again.
A “notwork” is a network that does not work.
In an active/active system, the communication links are every bit as important to system availability as are the processing nodes. Therefore, they must be redundant. Redundancy implies that the communication links are totally independent of each other and that they are used in such a way as to ensure that there are no failover faults should a channel fail.
In addition, there are many other parameters that are important. They include bandwidth, channel latency, and error rates. Each of these parameters impacts the time that it takes to get a message from a source node to a destination node. As this time increases, replication latency is extended in asynchronously replicated systems; and application latency is extended in synchronously replicated systems. Replication latency in asynchronous systems increases the chance for data collisions and for data loss in the event of the failure of a node. Application latency in synchronous systems extends the transaction-response time.
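As a rough illustration of how these parameters combine, the sketch below estimates the expected time to deliver one message. The model is our own simplifying assumption, not a formula from the article: each attempt costs the channel latency plus the serialization time (message size divided by bandwidth), and a lost or corrupted message is simply resent until it gets through. The function name and its parameters are hypothetical.

```python
def message_transit_time(size_bytes, bandwidth_bps, latency_s, loss_probability=0.0):
    """Estimate the expected one-way delivery time of a message, in seconds.

    Assumed model (illustrative only): every attempt takes the channel
    latency plus the time to clock the bits onto the link, and a failed
    attempt is retried until one succeeds.
    """
    if not 0.0 <= loss_probability < 1.0:
        raise ValueError("loss_probability must be in [0, 1)")
    serialization_s = size_bytes * 8 / bandwidth_bps   # time to put the bits on the wire
    one_attempt_s = latency_s + serialization_s
    expected_attempts = 1.0 / (1.0 - loss_probability)  # mean of a geometric distribution
    return one_attempt_s * expected_attempts


if __name__ == "__main__":
    # Hypothetical link: 1,000-byte message, 1 Mb/s, 20 ms latency.
    for loss in (0.0, 0.01, 0.10):
        t = message_transit_time(1000, 1_000_000, 0.020, loss)
        print(f"loss={loss:.2f}  transit time={t * 1000:.1f} ms")
```

Under this model, lower bandwidth, higher latency, or a higher error rate each lengthens the transit time, which in turn extends replication latency or application latency as described above.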
To avoid notworks, all of these parameters should be clearly specified in the Service Level Agreements with the communication carriers.
The virtualization of time services is a powerful tool for consolidating worldwide backup resources for disaster recovery as well as for data-center consolidation and time-sensitive test environments. By providing a simulated time, TANDsoft’s OPTA2000 time simulator allows multiple processing environments to run on the same system while using different clocks.
OPTA2000’s time simulation capabilities can be used in two ways. They can provide virtual time zones in which applications can perform their processing. Alternatively, OPTA2000 offers a virtual date/time at any point in the past or future for development and test purposes.
Without time simulation, an application must rely on the system clock; and all applications running on that system must use the same date and time. OPTA2000 breaks this barrier and allows each application on a single system to run under its own virtualized clock.
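The general idea of a per-application virtual clock can be sketched in a few lines. This is a generic illustration only and says nothing about how OPTA2000 is actually implemented; the `VirtualClock` class, its parameters, and the injected `source` function are all hypothetical names introduced here.

```python
from datetime import datetime, timedelta, timezone


class VirtualClock:
    """A clock that reports system time shifted by an offset, in a chosen time zone."""

    def __init__(self, offset=timedelta(0), tz=timezone.utc, source=None):
        self._offset = offset   # shift into the past or future
        self._tz = tz           # virtual time zone for this application
        # The real time source; it can be replaced with a fixed value for testing.
        self._source = source or (lambda: datetime.now(timezone.utc))

    def now(self):
        """Return the virtual date/time as a timezone-aware datetime."""
        return (self._source() + self._offset).astimezone(self._tz)


if __name__ == "__main__":
    # Two "applications" on one system, each with its own clock.
    tokyo_app = VirtualClock(tz=timezone(timedelta(hours=9)))
    year_end_test = VirtualClock(offset=timedelta(days=365))
    print("Tokyo application sees:", tokyo_app.now().isoformat())
    print("Test environment sees: ", year_end_test.now().isoformat())
```

The first clock corresponds to the virtual-time-zone use described above; the second corresponds to running a test environment at a date in the future.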
Many applications carry with them a performance Service Level Agreement (SLA) that specifies the response times they must achieve. The performance requirement is often expressed as a probability that the system’s transaction response time will be less than a given interval. For instance, “98% of all transactions must complete within 500 milliseconds.”
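To make this kind of requirement concrete, the sketch below checks a percentile SLA for a single server. It assumes the textbook M/M/1 queue (random arrivals, exponentially distributed service times, first-come first-served), which may or may not be the model used in this series. Under that assumption, the response time is exponentially distributed with rate equal to the service rate minus the arrival rate. The function names are hypothetical.

```python
import math


def fraction_within(deadline_s, service_rate, arrival_rate):
    """Fraction of transactions completing within deadline_s for an M/M/1 queue.

    Rates are in transactions per second. Assumes the M/M/1 model, in which
    the response time is exponential with rate (service_rate - arrival_rate).
    """
    if arrival_rate >= service_rate:
        raise ValueError("the server is overloaded: arrival_rate must be below service_rate")
    return 1.0 - math.exp(-(service_rate - arrival_rate) * deadline_s)


def meets_sla(target_fraction, deadline_s, service_rate, arrival_rate):
    """True if at least target_fraction of transactions finish within deadline_s."""
    return fraction_within(deadline_s, service_rate, arrival_rate) >= target_fraction


if __name__ == "__main__":
    # Hypothetical server that can process 20 transactions per second.
    for load in (10, 12, 15):
        ok = meets_sla(0.98, 0.5, service_rate=20, arrival_rate=load)
        print(f"arrival rate {load}/s: 98% within 500 ms? {ok}")
```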
In Part 1 of this series, we derived the basic response-time expression for a single-server system. Here in Part 2, we extend that result to a multiserver system in which multiple servers work off a common work queue. In Part 3, we will further extend our results to answer the SLA question posed above.
The use of multiserver systems can greatly improve transaction-response times. In addition, availability is significantly increased because, should a server fail, the other servers in the system will simply continue to process the common transaction queue. However, care must be taken not to run the multiserver system at too high a load. Though the transaction-response time may be acceptable at the current load, it may increase rapidly with a small increase in load, perhaps even to the point of bringing the system to its knees.
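The sensitivity to load can be seen with a small calculation. The sketch below assumes the standard M/M/c model (random arrivals, exponentially distributed service times, identical servers sharing one queue) and uses the Erlang C formula for the probability that a transaction must wait. It is an illustration under those assumptions, not necessarily the derivation given in the article, and the function names are hypothetical.

```python
import math


def erlang_c(servers, offered_load):
    """Probability that an arriving transaction must queue (Erlang C formula).

    offered_load is arrival_rate / service_rate, in erlangs.
    """
    utilization = offered_load / servers
    if utilization >= 1.0:
        raise ValueError("offered load must be less than the number of servers")
    all_busy = offered_load ** servers / math.factorial(servers) / (1.0 - utilization)
    some_idle = sum(offered_load ** k / math.factorial(k) for k in range(servers))
    return all_busy / (some_idle + all_busy)


def mean_response_time(servers, arrival_rate, service_rate):
    """Mean response time (queue wait plus service time) for an M/M/c system."""
    wait_probability = erlang_c(servers, arrival_rate / service_rate)
    mean_wait = wait_probability / (servers * service_rate - arrival_rate)
    return mean_wait + 1.0 / service_rate


if __name__ == "__main__":
    # Hypothetical system: 4 servers, each averaging 1 second of service per transaction.
    for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
        t = mean_response_time(4, arrival_rate=4 * utilization, service_rate=1.0)
        print(f"load {utilization:.0%}: mean response time {t:.2f} s")
```

Running the loop shows the mean response time staying close to the service time at moderate loads and then climbing steeply as the load approaches 100%.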
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest may be distributed freely. Please pass it on to an associate.
To be a reporter, visit http://www.availabilitydigest.com/reporter.htm.
Managing Editor - Dr. Bill Highleyman firstname.lastname@example.org.
© 2009 Sombers Associates, Inc., and W. H. Highleyman