The Availability Digest Article Archive

 

All of the articles that have appeared in the Availability Digest, as well as other information, are provided here. Just click on the category in which you are interested and browse the selection. The volume and issue, if any, of the Availability Digest in which the article appeared are noted just after the title as (volume, issue) (e.g., 1,3 for Volume 1, Issue 3).

 


In order to read the articles in the archive, you need the free Adobe Reader.


           

 

 

  Case Studies 

Active/Active Payment Processing at Swedbank  (3,1)  Swedbank uses active/active Base24 to support credit cards and POS terminals.

Asymmetric Active/Active at Banco de Credito  (2,11)  Using an asymmetric configuration saves programming changes.

Bank-Verlag - the Active/Active Pioneer  (1,3)  Bank-Verlag went active/active two decades ago with IBM/Tandem. 

BANKSERV Goes Active/Active  (2,4)  A banking switching service in South Africa moves Base24 into active/active.

Community College Learns From SAN Disaster  (2,2)  A disastrous SAN failure leads to dual redundancy.

CPA at Aqueduct, Belmont, and Saratoga  (2,1)  Race track wagering can never fail, or else riots start.

Do You Know Where Your Train Is?  (1,1)  A transit authority goes active/active for train tracking.

How Does Google Do It?  (3,2)  Google processes tens of gigabytes of data in minutes on their massive clusters.

HP's Active/Active Home Location Register  (1,2)  The brains of a cellular network can never go down.

HP's OpenCall INS Goes Active/Active  (2,6)  Replication lets OpenCall INS run active/active with collision detection and resolution.

Major Bank Uses Active/Active to Avoid Hurricanes  (2,10)  Fast failover is used to switch users out of hurricane path.

Payment Authorization - A Journey from DR to Active/Active   (2,12)  A start with DR leads this company to active/active and application integration.

QEI Provides Active/Active SCADA with OpenVMS  (2,9)  Electrical substation monitoring that never goes down.

Tackling Switchover Times  (1,1)  If active/active is too big a step to take now, work on reducing your switchover times.

Telecom Italia's Active/Active Mobile Service (2,3)  Italy's biggest cell phone network is supported by active/active.

  Never Again

Active/Active Save #1 - Coffee Pot Takes Down Node  (1,2)  When the coffee pot was plugged in - Surprise!

BlackBerry Gets Juiced  (2,5)  Poor testing leads to no service for North American subscribers.

BlackBerry Takes Another Dive  (3,3)  Deja Vu. Poor testing once again leads to no North American service.

Console Command Takes Down Active/Active System (1,3)  Stop applications on one node, stop other node. Oops!

Don't Wait for the Other Shoe to Drop  (2,2)  When a spare component fails, fix it fast. Don't tempt Murphy.

Hostway's Web Hosting Service Goes Down for Days  (2,9)  Small online stores offline for up to a week.

How Many 9s in Amazon?  (3,7)  Even giants fall. Amazon's S3 and EC2 services and online retail store go offline for hours.

IRS Goof Costs U.S. Taxpayers $300m +  (2,1)  Turning off the old system before testing the new one is dumb.

On-Demand Software Utility Hits Availability Bump  (2,10)  A utility is expected to be always up, but this one didn't make it.

PayPal Services Downgrade with Upgrade  (3,6)  Attempting an upgrade with no fallback plan takes PayPal services down for weeks.

Rackspace - Another Hosting Service Bites the Dust  (2,12)  A truck driver wipes out web sites for a day or more.

So You Think Your System is Robust?  (2,8)  So did these major enterprises, all of which went down in the first six months of 2007.

So You Think Your System is Reliable  (3,1)  Horror stories from the second half of 2007 focus on power and branch failures.

Software Bug Causes Train Wreck  (1,1)  A software bug, controller diversion, and engineer inattention combine to cause a train collision.

The Alaska Permanent Fund and the $38 Billion Keystroke  (2,4)  What do you do when your active and backup disks are wiped out and your tapes won't read?

The Case of the Flying Cable  (1,1)  A technician loses control of an under-floor cable and lets it hit a power strip.

The Great 2003 Northeast Blackout and the $6 Billion Software Bug (2,3)  A hot day, an untrimmed tree, and a monitoring system bug cost power customers $6 billion.

Triple Redundancy Failure on the Space Station  (2,11)  A single point of failure takes down a triplexed critical computer.

VoIP PBX Succumbs to Overconfiguration  (2,6)  Why extra processing power made this PBX less reliable.

What? No Internet?  (3,2)  A multiple cable break isolates North Africa, the Middle East, and India.

  Best Practices

Availability Best Practices  (2,1)  Tips from those who have achieved near-continuous availability.    

Can 10,000 Chickens Replace Your Tractor?  (1,3) Save money by replacing your mainframe with clusters - Not!

Document Your System  (1,2) Documentation is a necessary evil. Let's focus on the "necessary" and not the "evil."

HP Blows Up Data Center  (2,8) An explosive demonstration of fast recovery.

Humanizing Three 9s  (2,9)  What if we lived in a world of three 9s?

Interview with Ron LaPedis on NonStop with XP Storage  (2,5)  How to improve NonStop reliability by using a SAN.

Katrina - The Harsh Teacher  (2,6)  The most powerful Gulf storm in 200 years showed us how unprepared we were for such a disaster.

On Blogs and Discussion Groups  (2,10) Online forums can be a big boost to your professional growth.

Reliable Multicasting  (3,1)  How to get messages over LAN and WAN multicast networks without message loss.

Recovery-Oriented Computing  (2,2)  If recovery time can be made small enough, users will perceive a faultless system.

Microrebooting for Fast Recovery (2,3)  An application of Recovery-Oriented Computing.

Rules of Availability - Part 1  (3,3)  The first set of common rules of availability from our books, Breaking the Availability Barrier.

Rules of Availability - Part 2  (3,5)  More common rules of availability from our books, Breaking the Availability Barrier.

Rules of Availability - Part 3  (3,7)  Concluding the common rules of availability from our books, Breaking the Availability Barrier.

Transaction-Oriented Computing  (2,4) Old art to some, new to others, transaction processing is the foundation for high availability.

With 100% Uptime, Do I Need a Business Continuity Plan?  (1,1)  You'd better believe it.

  Availability Topics

Active/Active Versus Clusters  (2,5)  For high availability, clusters are mature; but active/active systems provide greater reliability.

Adding Availability to Performance Benchmarks  (2,9)  A system that is down has zero performance.

All About Continuous Processing Architectures  (1,1)  CPA can get you arbitrarily close to 100% uptime.

Asynchronous Replication Engines  (1,2) These engines power most of today's active/active systems.

Availability versus Performance  (2,8) Is it time to trade higher availability for reduced performance?

Collision Detection and Resolution  (2,4)  What do you do if you can't avoid collisions when using bidirectional replication?

Fault Tolerance for Virtual Environments - Part 1  (3,3)  How virtualization can significantly reduce data center capital and operating costs.

Fault Tolerance for Virtual Environments - Part 2  (3,4)  Operating system and bare metal hypervisors.  

Fault Tolerance for Virtual Environments - Part 3  (3,6)  Hardening virtual environments with failover and fault-tolerance.

Hardware Replication  (2,1)  Replicating at the hardware level does not maintain database consistency.

Jim Gray - In Memoriam  (3,7)  The database pioneer who set the stage for active/active systems is lost at sea.

Let's Get an Availability Benchmark  (2,6)  Great performance is meaningless if the system is unavailable.

Migrating Your Application to Active/Active (2,3)  What must you do to prepare your application for an active/active environment?

Synchronous Replication  (1,3)  Avoid data collisions and data loss following a node failure.

Time Synchronization for Distributed Systems - Part 1  (2,11)  How does NTP calculate the time offset from a time server?

Time Synchronization for Distributed Systems - Part 2  (2,12)  How does NTP minimize time offset errors?

Time Synchronization for Distributed Systems - Part 3  (3,2)  Logical clocks offer an option for synchronizing systems.

Transaction Replication  (2,2)  A simple approach to active/active systems has scalability issues.

The History of Fault Tolerance  (1,2) The fault-tolerant marketplace was hot in 1984.

What is Active/Active?  (1,1) Active/active architectures can give subsecond recovery following a failure. 

  Recommended Reading

Aberdeen's 2008 Business Continuity Survey  (3,4)  A look at 150 small to large companies and their BC/DR plans and processes.

Blueprints for High Availability: Designing Resilient Distributed Systems  (2,5)  All you ever wanted to know about clusters.

Breaking the Availability Barrier  (3,5)   Everything you ever wanted to know about active/active systems - theory, implementation, and practice.

Business Continuity Planning: IT Examination Handbook  (1,1)  What better way to learn about BCP than from the auditor's handbook.

Continuous Availability Systems Design Guide  (2,1)  What to do if you want to move to CPA.

Distributed Systems: Principles and Paradigms  (3,1)  A thorough treatment of requirements for distributed system transparency.

Fire in the Computer Room, What Now?  (2,6)  Are you prepared for a total loss of your data center because of a fire or other disaster?

Migrating Legacy Systems: Gateways, Interfaces, & the Incremental Approach (2,3)  Legacy systems must be decomposed to migrate to active/active.

Multiple Processor Systems for Real-Time Applications  (2,10) A classic treatise on distributed systems that is still pertinent two decades later.

The Business and Economics of Linux and Open Source  (1,3)  Open source demystified for the reluctant manager.

The Unified Modeling Language User Guide  (1,2) UML is now the accepted standard for fast and easy documentation of systems and procedures.

Towards Zero Downtime: High Availability Blueprints  (2,8) A close look at installing Microsoft clusters and cluster-aware applications.

Transaction Processing: Concepts and Techniques  (2,4) The classic book on transaction processing systems, by Jim Gray and Andreas Reuter.

Unix Backup and Recovery  (2,2)  Backing up is a pain, but it is the restore that counts.

  Product Reviews

Fault-Tolerant Windows and Linux from Stratus  (2,9)  ftServers provide transparent fault-tolerant operation.

Flexible Availability Options with GoldenGate's TDM  (2,2)  Implement a variety of data-sharing topologies with TDM's data replication facilities.

GRIDSCALE - A Virtualized Distributed Database  (3,7)  Like presentation and application servers, pooled database servers for the three-tier architecture.

How Much Will Active/Active Cost Me?  (1,1)  The cost of downtime can swing your decision.

HP's ServiceGuard Clustering Facility  (2,5)  Managing HP-UX and Linux clusters.

MySQL Clusters Go Active/Active  (1,3)  Clusters of storage nodes are kept in sync by synchronous replication.

OpenVMS Active/Active Split-Site Clusters  (3,6)  OpenVMS Clusters provide active/active operation with synchronous replication.

Parallel Sysplex - Fault Tolerance from IBM  (3,4) IBM's Parallel Sysplex offers localized active/active availability.

Penguin Computing Offers Beowulf Clustering on Linux  (2,1)  Beowulf clustering, developed by NASA, is available on Linux along with Penguin's HPC servers.

Shadowbase - The Active/Active Solution (2,3)  Shadowbase provides fast data replication as well as online copy and database resynchronization.

solidDB - a Five 9s Memory-Resident Database  (3,5)   Server memory is getting so large, why not keep your database in high speed memory?

Time Synchronization for NonStop Servers  (2,11)  NTP products from Bowden Systems and HP for NonStop servers.

Virtual Tape - Getting Rid of a Troublesome Medium  (1,2) The backup paradigm is changing.  Goodbye, tape.

Virtual Tape for NonStop Servers with ETI-NET's EZX-BackBox  (2,6)  Virtual tape made super-fast with deduplication.

Virtual Transactions with NonStop AutoTMF  (2,4) Converting nontransactional applications to transactional applications.

  The Geek Corner

Calculating Availability - Redundant Systems  (1,1)  Some useful rules come out of the derivation of the availability equation.

Calculating Availability - Repair Strategies  (1,2) Your repair policy can have a significant impact on your system availability.

Calculating Availability - The Three Rs  (1,3)  Node repair, node recovery, and system restore are all required.

Calculating Availability - Hardware/Software Faults  (2,1)  Most faults don't need a repair.

Calculating Availability - Failover  (2,2)  When a system is failing over, it is often effectively down, thus reducing availability.

Calculating Availability - Failover Faults  (2,3)  Failovers can fail also.

Calculating Availability - Environmental Faults  (2,4) How to handle hurricanes, power failures, and riots when calculating availability.

Calculating Availability - Nodes, Subsystems, and Systems  (2,6)  When is a node a system, and when is it a subsystem?

Calculating Availability - Failure State Diagrams  (2,9)  Formalizing our intuitive derivations.

Calculating Availability - Heterogeneous Systems - Part 1  (3,3)  Probability 101 in preparation for analyzing systems with heterogeneous nodes.

Calculating Availability - Heterogeneous Systems - Part 2  (3,5)  The availability of redundant systems with different nodal availabilities.

Calculating Availability - Heterogeneous Systems - Part 3  (3,6)  Analyzing complex configurations of system components.

Failure State Diagrams - Repair Strategies  (2,10) The real story behind sequential repair and parallel repair.

Failure State Diagrams - Recovery Following Repair  (2,12)  The formal analysis of the impact of having to recover a node after its repair.

Failure State Diagrams - Hardware/Software Faults Revisited  (3,2)  Our intuitive results were a little simplistic.

Cluster Availability  (2,5)  How does the availability of a cluster compare to that of an active/active system?

Estimating Data Collision Rates  (2,8) Can you go active/active with a tolerable level of data collisions?

Is Parallel Repair Really Better Than Sequential Repair?  (3,4) A Digest reader points out that the answer depends upon the repair time distribution.

What's That Nerd Logo?  (1,1)  Our logo, ff2, really has a meaning. Find out why it describes active/active architectures.

 

 

© 2006 Sombers Associates, Inc., and W. H. Highleyman