The articles you read in the Availability Digest result from years of experience in researching and writing a variety of technical documents and marketing content. It’s what we do best, and we provide our services to others who value high-quality content created by IT specialists. Ask us about

• articles  • white papers  • case studies  • web content  • manuals  • specifications  • patent disclosures

 

The Availability Digest Article Archive

All of the articles which have appeared in the Availability Digest are provided here. Just click on the category in which you are interested and browse the selection. The volume and issue, if any, of the Availability Digest in which the article appeared is noted just after the title as (volume, issue) (e.g., 1,3 for Volume 1, Issue 3).



In order to read the articles in the archive, you need the free Adobe Reader.

Case Studies | Never Again | Best Practices | Availability Topics | Reading | Product Reviews | Geek Corner

           

 

 

  Case Studies 

Active/Active Payment Processing at Swedbank  (3,1)  Swedbank uses active/active Base24 to support credit cards and POS terminals.

Apollo 11 - Continuous Availability, 1960s Style  (4,9)  NASA's safety-critical computer systems put men on the moon four decades ago.

Asymmetric Active/Active at Banco de Credito  (2,11)  Using an asymmetric configuration saves programming changes.

Bank-Verlag - the Active/Active Pioneer  (1,3)  Bank-Verlag went active/active two decades ago with IBM/Tandem. 

Bank-Verlag - An Update  (5,8)  Bank-Verlag replicates processed transactions in its active/active system.

BANKSERV Goes Active/Active  (2,4)  A banking switching service in South Africa moves Base24 into active/active.

Banks Use Synchronous Replication for Zero RPO  (5,2)  Triplexed data centers give fast recovery time with zero data loss.

Casa Ley Upgrades to Active/Active OmniPayments  (8,1)  One of Mexico's largest grocery chains installs active/active financial transaction switch.

Cellular Provider Goes Active/Active for Prepaid Calls  (3,9)  NonStop active/active system keeps prepaid calls moving in Africa.

Commerzbank Survives 9/11 with OpenVMS Clusters  (4,7)  With an active/active backup 30 miles away, getting their people there did it.

Community College Learns From SAN Disaster  (2,2)  A disastrous SAN failure leads to dual redundancy.

CPA at Aqueduct, Belmont, and Saratoga  (2,1)  Race track wagering can never fail, or else riots start.

Do You Know Where Your Train Is?  (1,1)  A transit authority goes active/active for train tracking.

European Bank's Active/Active ATM Network  (4,6)   In this active/active system, ATM failover via DNS rerouting has its problems.

Faster Payments - Bringing Payment Processing Into the 21st Century  (5,6)  VocaLink uses active/active to provide 24x7 real-time payments.

Handelsbanken Turns to Parallel Sysplex  (4,10)  Sweden's Handelsbanken goes active/active to protect its online banking and ATM network.

How Does Google Do It?  (3,2)  Google processes tens of gigabytes of data in minutes on their massive clusters.

How Does Google Do It (part 2)  (7,11)  Google exposes how it distributes massive applications across thousands of servers while saving energy.

HP's Active/Active Home Location Register  (1,2)   The brains of a cellular network can never go down.

HP's OpenCall INS Goes Active/Active  (2,6)  Replication lets OpenCall INS run active/active with collision detection and resolution.

Major Bank Uses Active/Active to Avoid Hurricanes  (2,10)  Fast failover is used to switch users out of hurricane path.

Major ISP Migrates from Sybase to NonStop with No Downtime  (3,11)  Hundreds of millions of accounts migrated and verified.

Major U.S. Bank Replaces BASE24  (7,6)  Opsol's OmniPayments keeps bank's NonStop systems running after ACI's sunset announcement.

Payment Authorization - A Journey from DR to Active/Active   (2,12)  A start with DR leads this company to active/active and application integration.

QEI Provides Active/Active SCADA with OpenVMS  (2,9)  Electrical substation  monitoring that never goes down. 

Real-Time Fraud Detection  (4,12)  Credit-card switching service catches fraud on-the-fly - a great example of real-time business information.

Ring-of-Fire Bank Beats Earthquakes with Active/Active  (7,7)  Moving from tape backup to active/active can be a series of controlled steps.

Royal Bank of Canada Goes Active/Active for ATM/POS  (6,7)  Eliminates planned outages and reduces unplanned downtime from hours to minutes.

Tackling Switchover Times  (1,1)  If active/active is too big a step to take now, work on reducing your switchover times.

Telecom Italia's Active/Active Mobile Service  (2,3)  Italy's biggest cell phone network is supported by active/active.

Tour Operator Optimizes Look-to-Book Ratio  (6,8)  An asymmetric active/active system unloads query processing from the OLTP master node.

UK National Health Service - Blood and Transplant  (3,10)  An OpenVMS split-site cluster guarantees the availability of the UK's blood supply.

U.S. Bank Critiques Active/Active  (4,5)  A NonStop active/active user shares experience and advice with those who would follow.

Wells Fargo's Pioneering Active/Active ATM Network  (5,9)  Dual networks with ATM collocation provide continuous availability.

  Never Again

A Personal Failover Fault  (8,3)  A failed PC forces a system recovery during a slide presentation. Failover failed. The BCP saved the day.

Active/Active Save #1 - Coffee Pot Takes Down Node  (1,2)  When the coffee pot was plugged in - Surprise!

Amazon's Cloud Downed by Fat Finger  (6,5)  A maintenance technician's error takes down an entire Availability Zone for four days.

Amazon Downed by Memory Leak  (7,11)  A memory leak in an innocuous program cascades into major systems, taking down the AWS cloud for hours.

American Eagle's Eight-Day Outage  (5,9)  Lack of recovery and failover testing takes down online sales for the $3 billion retailer for over a week.

Australia's Painful Banking Outages  (7,3)  Australia's four major banks suffer multiple outages as they upgrade their aging infrastructure.

BlackBerry Gets Juiced  (2,5)  Poor testing leads to no service for North American subscribers.

BlackBerry Messenger Down for Days  (6,10)  RIM's BBM texting service went down for over four days when it suffered a failover fault.

BlackBerry Takes Another Dive  (3,3)  Deja Vu. Poor testing once again leads to no North American service.

BlackBerry - OMG, it's Déjà Vu  (5,1)  BlackBerry has now accumulated seven major outages in five years, providing an availability of three 9s.

Commonwealth Bank of Australia - A Correction  (7,4)  We make a correction to our article entitled, "Australia's Painful Banking Outages."

Console Command Takes Down Active/Active System (1,3)  Stop applications on one node, stop other node. Oops!

DDoS Attacks on U.S. Banks Continue  (8,2)  Islamic hacktivists resume their attacks to get blasphemous video removed from the Internet.

Don't Wait for the Other Shoe to Drop  (2,2)  When a spare component fails, fix it fast. Don't tempt Murphy.

Fire Suppression Suppresses WestHost for Days  (5,5)  Never test a fire suppression system by triggering it.

First Stuxnet - Now the Flame Virus  (7,6)  Deemed more serious than Stuxnet, Flame takes over PCs and listens in on conversations.

Go Daddy Takes Down Millions of Web Sites  (7,9)   Any domain registered with Go Daddy was downed by a DNS network failure.

Google Troubles - A Case Study in Cloud Computing  (4,10)  Even the 900-pound Gorilla can have problems keeping its services up.

Haiti's Cell-Phone Network Costs Lives  (5,3)  Following Haiti's disastrous earthquake, many people couldn't call for help from beneath the rubble.

Has Gmail Become Gfail?  (4,3) Google's Gmail service has been down for hours six times over the last eight months.

History's Largest DDoS Attack?  (8,4)  Spam black-lister Spamhaus is taken down for days by a disgruntled spammer via a massive DDoS attack.

Hostway's Web Hosting Service Goes Down for Days  (2,9)  Small online stores offline for up to a week.

How Many 9s in Amazon?  (3,7)  Even giants fall. Amazon's S3 and EC2 services and online retail store go offline for hours.

Hubble Trouble  (4,1)  A failover fault when recovering from an instrument controller failure almost loses Hubble.

Hurricane Sandy  (7,12)  2012's Hurricane Sandy flooded lower Manhattan, taking out tens of thousands of web sites for weeks.

Innocuous Fault Leads to Weeks of Recovery  (3,12)  A simple disk mirror failure propagates into weeks of recovering lost data for a major bank.

IRS Goof Costs U.S. Taxpayers $300m +  (2,1)  Turning off the old system before testing the new one is dumb.

Islamic Hacktivists Attack U.S. Banks  (7,10)  Several banks taken down for a day in protest of YouTube video, "Innocence of Muslims."

JPMC Three-Day Outage Caused by Replication Corruption  (5,11)  Corruption of primary SAN by Oracle bug also takes down standby SAN.

Knight Capital Destroyed by Software Bug  (7,8)   A high-frequency trading bug costs Knight $440 million, forcing it to sell out to a consortium.

Lightning Downs Amazon - Not!  (6,9)  An Amazon European Availability Zone is taken down by hardware, software, and human faults.

London Stock Exchange PC-Trading System Down for a Day  (3,10)  Traders fume at commission loss on one of the most hectic trading days.

Medical Center's Multiday Outage  (6,2)  An attempt to achieve high availability on a limited budget leads to nonavailability.

Military GPS Disabled by Upgrade  (5,6)  A failed upgrade to support the next GPS generation takes down 10,000 military GPS receivers.

Mizuho Bank Down for Ten Days  (6,6)  A flood of earthquake donations by mobile phones overwhelmed the bank's evening batch runs.

More Never Agains  (3,8)   Over two dozen disastrous outages for the first half of 2008 are recounted.

More Never Agains II  (4,2)  System downtime problems have moved from the power lines to the networks.

More Never Agains III  (4,7)  Add the cloud to power and network problems, creating the more than two dozen outages on which we report.

More Never Agains IV  (5,2)   Network, hardware/software problems highlight outages for the last half of 2009.

More Never Agains V  (5,7)   Over half of our 30 horror stories took down hosting providers. A failover plan is a must.

More Never Agains VI  (7,4)  Software bugs and recovery faults highlighted the outages in the first quarter of 2012.

More Never Agains VII  (7,9)  Power outages were the main cause of these failures.

More Never Agains VIII  (8,2)  Security threats are becoming more prevalent, with the Chinese evidently leading the charge.

National Australia Bank Customers Down for Days  (5,12)  A bad batch update disables critical customer services for two weeks.

Northern Virginia's 911 Service Down for Four Days  (7,12)  90 mph winds and air in the generator's fuel lines take down a Verizon 911 hub.

On-Demand Software Utility Hits Availability Bump  (2,10)  A utility is expected to be always up, but this one didn't make it.

Oracle's Ticking Time Bomb  (7,2)  An obscure bug in the Oracle database could take down an entire data center if not patched immediately.

Orca - The Outage That May Change History  (7,11)  The Republican 2012 Get-Out-The-Vote system flopped from the beginning.

PayPal Fault Takes Merchants Offline  (4,9)  A network fault forces small online merchants to close shop for hours.

Poor Documentation Snags Google  (5,4)  A data center goes down, and failover to the backup data center goes awry.

Rackspace - Another Hosting Service Bites the Dust  (2,12)  A truck driver wipes out web sites for a day or more.

Royal Bank of Scotland Offline for Two Weeks  (7,7)  Falling back from a failed upgrade failed and took three U.K. banks down.

Sidekick: Your Data is in 'Danger'  (4,11)  A million smart-phone users lose all of their contacts, calendars, and photos.

Singapore Bank Downed by IBM Error  (5,8)  An undocumented new procedure takes down all DBS Bank systems for hours.

Skype Holiday Present - Down for a Day  (6,1)  Skype overload takes down its peer-to-peer network of hundreds of thousands of supernodes.

So You Think Your System is Robust?  (2,8)  So did these major enterprises, all of which went down in the first six months of 2007.

So You Think Your System is Reliable  (3,1)  Horror stories from the second half of 2007 focus on power and branch failures.

Software Bug Causes Train Wreck  (1,1)  A software bug, controller diversion, and engineer inattention combine to cause a train collision.

Sony PlayStation Taken Down for Weeks by Hackers  (6,5)  Hackers steal 100 million accounts from Sony, requiring weeks to repair security defenses.

Stuxnet - The World's First Cyberweapon  (6,3)  Stuxnet is the first worm to attack a control system and destroy machinery.

Sydney's M5 Tunnel Closed Again by Computer Glitch  (3,11)  Six times in six years is too much for New South Wales.

The Alaska Permanent Fund and the $38 Billion Keystroke  (2,4)  What do you do when your active and backup disks are wiped out and your tapes won't read?

The Case of the Flying Cable  (1,1)  A technician loses control of an under-floor cable and lets it hit a power strip.

The FAA's Availability Woes  (4,12)  Application and network failures plague air travelers. Where is NextGen - the next generation airspace system?

The Great 2003 Northeast Blackout and the $6 Billion Software Bug (2,3)  A hot day, an untrimmed tree, and a monitoring system bug cost power customers $6 billion.

The Planet Blows Up  (3,9)  A massive electrical explosion takes out thousands of hosting servers at a major dedicated hosting provider.

The State of Virginia - Down for Days  (5,10)  A maintenance error takes down 26 state agencies for up to a week.

Twitter Taken Down by DDoS Attack  (4,8)  The Twitter, Facebook, and LiveJournal social sites are taken down to silence a Georgian blogger.

Triple Redundancy Failure on the Space Station  (2,11)  A single point of failure takes down a triplexed critical computer.

Verizon 4G Network Down for Two Days  (6,6)  Verizon's "always reliable" 4G network brought down by software bug - no 3G backup.

VMware's Cloud Foundry Flounders  (6,7)  A storage fault caused by a power outage is followed by a bigger fault caused by a fat finger.

Vodafone Downed by Burglars  (6,4)  Thieves sledgehammer their way into a Vodafone exchange and steal computers and network equipment.

VoIP PBX Succumbs to Overconfiguration  (2,6)  Why extra processing power made this PBX less reliable.

What? No Internet?  (3,2)  A multiple cable break isolates North Africa, the Middle East, and India.

Why Back Up?  (4,4)   The malicious act of an IT manager deletes his company's database and forces the company to close its doors.

Will You Have Internet Access After July 9, 2012?  (7,5)  A recent FBI sting took down rogue DNS servers and substituted good servers until July 9th.

Windows Azure Cloud Succumbs to Leap Year  (7,3)  As the clock ticked to February 29th, the Azure Cloud went down for 32 hours.

  Best Practices

2010 NonStop Availability Award  (5,10)  This year's winner is Bank-Verlag with runners up Belgacom and VocaLink.

Achieving Fast Failover in Active/Active Systems - Part 1  (4,8) Using user and network redirection to failover in subseconds.

Achieving Fast Failover in Active/Active Systems - Part 2  (4,9)  Using server redirection to failover in subseconds.

Availability Best Practices  (2,1)  Tips from those who have achieved near-continuous availability.    

Avoiding Capacity Exhaustion  (7,7)   A strikingly simple graphic display forecasts capacity peaks by the hour over the year.

Avoiding Notworks  (4,1)  A network that doesn't work is a "notwork." Protect your network with a good SLA.

Backup Is More Than Backing Up  (4,5)   Backing up a database is an exercise in futility if you can't restore the database.

Can 10,000 Chickens Replace Your Tractor?  (1,3) Save money by replacing your mainframe with clusters - Not!

Can You Trust the Compute Cloud?  (3,8)  What will it take to make cloud computing the data utility of the future?

Chillerless Data Centers  (4,11)  Google and Yahoo! locate new data centers in the north country to take advantage of "free cooling."

Choosing a Business Continuity Solution - Part 1  (6,7)  What measures of availability are important to your organization?

Choosing a Business Continuity Solution - Part 2  (6,8)  Data replication is the fundamental force behind system availability.

Choosing a Business Continuity Solution - Part 3  (6,9)  Data replication leads to several highly available architectures.

Choosing a Business Continuity Solution - Part 4  (6,10)  Choosing a highly available architecture to meet your availability needs.

Continuous Availability Featured at HPTF 2009  (4,6)  Presentations include many continuous availability and high availability talks.

Cyber Threats Surpass Terrorism  (8,3)  The U.S. government says that in 2013, cyber threats surpassed terrorism as the top security concern.

Data Center Cooling Nature's Way  (5,5)  Data centers cut electric bills in half by replacing chillers with air economizers.

Data Center in a Box  (4,7)  Your next visit to a data center may be to the warehouse district.

Data Center Monitoring with Open-Source Nagios  (6,11)  Including NonStop systems in open-source "single pane of glass" monitoring.

Data Deduplication  (6,2)  Data deduplication can reduce backup storage and disaster-recovery bandwidth requirements by a factor of 20:1.

DDoS Attacks on the Rise  (8,4)  2012 saw a 53% rise in DDoS attacks with greatly increased malicious bandwidth.

Department of Homeland Security: Disable Java  (8,2)  A serious vulnerability in Java 7 means that it should be removed from browsers.

Document Your System  (1,2) Documentation is a necessary evil. Let's focus on the "necessary" and not the "evil."

Does Data Replication Eliminate the Need for Backups?  (5,11)  Data replication protects operations; data backup protects data.

DRJ's Fall World 2010 Business Continuity Conference  (5,8)  A week-long conference in September, 2010, focusing on Business Continuity.

DRJ's Spring World 2011 Business Continuity Conference  (6,2)  A week-long conference in March, 2011, focusing on Business Continuity.

DRJ's Fall World 2011 Business Continuity Conference  (6,8)  A week-long conference in September, 2011, focusing on Business Continuity.

DRJ's Spring World 2012 Business Continuity Conference  (7,1)  A week-long conference in March, 2012, focusing on Business Continuity.

FBI Warns Employees Are New Targets  (7,11)  The FBI warns that cybercriminals are moving from corporate IT systems to corporate employees.

Google's Extreme-Green Data Centers  (3,12)  Wave motion and seawater may power and cool data centers in the future.

Handling Data Collisions in Asynchronous Replication  (5,9)  An update on data collision avoidance, detection, and resolution.

High Availability Topics at HP Discover 2011  (6,5)  Over two dozen presentations on high-availability topics will be presented in Las Vegas in June, 2011.

How Do Your Readiness Plans Stack Up?  (6,1)  Compare your disaster recovery plans with those of 300 other companies.

HP CloudSystem  (7,2)  Companies can convert their current IT assets into a private cloud that can burst into public clouds.

HP Discover 2011  (6,3)  HP Discover 2011 is HP's major annual marketing and educational event, held in Las Vegas June 6th to June 10th, 2011.

HP's Project Odyssey - Migrating Mission Critical to x86  (7,3)  HP is moving HP-UX high availability features to Intel's Xeon x86 chip.

HP Blows Up Data Center  (2,8) An explosive demonstration of fast recovery.

HP's Cloud Recovery-as-a-Service  (7,6)  HP's cloud-based recovery service provides fast RTOs and short RPOs with no upfront capital expenditures.

Humanizing Three 9s  (2,9)  What if we lived in a world of three 9s?

Interview with Ron LaPedis on NonStop with XP Storage  (2,5)  How to improve NonStop reliability by using a SAN.

IPv6 Is Here - Like It or Not  (6,4)  Some tips from a father of the Internet on the simple ways to convert from IPv4 to IPv6.

Is Preventive Maintenance Preventive?  (7,10)  Major IT faults have been caused by preventive maintenance errors. Is PM worth it?

ISO 22301 - The New Business Continuity Management Standard  (7,10)  The first business continuity specification to be issued by ISO.

Katrina - The Harsh Teacher  (2,6)  The most powerful Gulf storm in 200 years showed us how unprepared we were for such a disaster.

Load Shedding  (7,12)  If your system begins to overload, how do you determine what load to shed?

Malware as a Service  (6,12)  Powerful hacking software is becoming just a click away.

Maximizing Availability in Everyday Systems  (5,7)   Even if you don't have a redundant system, there are things you can do to minimize outages.

Microrebooting for Fast Recovery (2,3)  An application of Recovery-Oriented Computing.

NonStop Boot Camp is Coming in October  (7,8)  The NonStop Community will gather in San Jose from October 14th through October 16th, 2012.

On Blogs and Discussion Groups  (2,10) Online forums can be a big boost to your professional growth.

OpenStack - The Open Cloud  (7,4)  A major open-source initiative may take us one step closer to a true worldwide compute utility.

OpenVMS Boot Camp Is Coming in March  (8,2)  It will be held for four days from March 18th through March 21st in Bedford, Massachusetts.

Recovery-Oriented Computing  (2,2)  If recovery time can be made small enough, users will perceive a faultless system.

Reliable Multicasting  (3,1)  How to get messages over LAN and WAN multicast networks without message loss.

Retail Web Sites Losing Millions to Poor Response Time  (7,1)  Slowness is worse than downtime - it makes people hate your site.

Roll-Your-Own Replication Engine - Part 1  (5,1)  What does it take to build your own replication engine? Lots!

Roll-Your-Own Replication Engines - Part 2  (5,2)  Issues with asynchronous and synchronous replication engines.

Rules of Availability - Part 1  (3,3)  The first set of common rules of availability from our books, Breaking the Availability Barrier.

Rules of Availability - Part 2  (3,5)  More common rules of availability from our books, Breaking the Availability Barrier.

Rules of Availability - Part 3  (3,7)  Concluding the common rules of availability from our books, Breaking the Availability Barrier.

Synchronous Replication Recovery Strategies  (5,3)  Bringing a failed database copy back on line under synchronous replication.

The 25 Most Exploitable Programming Errors  (8,2)  A detailed list of the programming errors that expose the most vulnerable security holes.

The Value of Availability  (6,6)  Downtime costs are based on the likelihood, duration, impact, and cost  of each risk factor taken individually.

Transaction-Oriented Computing  (2,4) Old art to some, new to others, transaction processing is the foundation for high availability.

Twitter Earthquake Detector  (5,4)  The U.S. Geological Survey is mining tweets to get instant notification of earthquakes.

VRRP - Virtual Router Redundancy Protocol  (3,10)  Adding transparent failure detection and failover at the first hop.

What Really Caused the Windows Azure Outage?  (7,5)  The Windows Azure cloud was taken down by a simple leap-year bug.

With 100% Uptime, Do I Need a Business Continuity Plan?  (1,1)  You'd better believe it.

  Availability Topics

Active/Active Full Day Seminar at HPTF  (4,4) Dr. Bill speaks on active/active theory and practice at the 2009 HPTF conference.

Active/Active Versus Clusters  (2,5)  For high availability, clusters are mature; but active/active systems provide greater reliability.

Active/Active Systems - A Taxonomy  (3,9)  Classifying the many ways to build an active/active system.

Adding Availability to Performance Benchmarks  (2,9)  Recovery time is the proper metric to use for an availability benchmark.

Amazon's Availability Zones  (6,11)  Critical applications can run reliably in the cloud by distributing them across Amazon Availability Zones.

All About Continuous Processing Architectures  (1,1)  CPA can get you arbitrarily close to 100% uptime.

Anatomy of a DDoS Attack  (8,4)  DDoS attacks take down web sites by aiming traffic at various levels in the Internet protocol.

Anti-Virus - A Single Point of Failure?  (5,5)  McAfee's faulty anti-virus update takes down millions of computers in a flash.

Asynchronous Replication Engines  (1,2) These engines power most of today's active/active systems.

Availability versus Performance  (2,8) Is it time to trade higher availability for reduced performance?

Choosing a Database of Record  (3,11)  Which database copy in an active/active network is the "single version of truth?"

Collision Detection and Resolution  (2,4)  What do you do if you can't avoid collisions when using bidirectional replication?

Court Decides - HP 1, Oracle 0  (7,8)  Judge finds Oracle arguments a Seinfeld sitcom, orders continued Oracle support of HP Itanium servers.

Defining Active/Active  (4,12)  Can we agree on what are active/active architectures? Add your comments to this ongoing effort.

Defining Active/Active - Revision 1  (5,1)  Revision 1 of our definition based on suggestions posted to our LinkedIn Continuous Availability Forum.

Eavesdropping on the Internet  (4,3)  A vulnerability in the Border Gateway Protocol allows nefarious sites to read your Internet traffic.

Fault Tolerance for Virtual Environments - Part 1  (3,3)  How virtualization can significantly reduce data center capital and operating costs.

Fault Tolerance for Virtual Environments - Part 2  (3,4)  Operating system and bare metal hypervisors.  

Fault Tolerance for Virtual Environments - Part 3  (3,6)  Hardening virtual environments with failover and fault-tolerance.

Fire Suppressant's Impact on Hard Disks  (6,2)  Fire alarm sirens in the data center are fingered as the culprit in hard-disk damage.

FS-ISAC: Financial Services - Information Sharing & Analysis Center  (7,10)  A member-owned industry forum for sharing security threats.

Hardware Replication  (2,1)  Replicating at the hardware level does not maintain database consistency.

Help! My Data Center is Down! - Part 1: Power Outages  (6,10)  Unusual data center outages caused by power failures.

Help! My Data Center is Down! - Part 2: Storage Outages  (6,11)  Unusual data center outages caused by storage system failures.

Help! My Data Center is Down! - Part 3: Internet Outages  (6,12) Unusual data center outages caused by Internet failures.

Help! My Data Center is Down! - Part 4: Intranet Outages  (7,1) Unusual data center outages caused by intranet failures.

Help! My Data Center is Down! - Part 5: Upgrades  (7,2)  Unusual data center outages caused by upgrades gone wrong.

Help! My Data Center is Down! - Part 6: The Human Factor  (7,3)  Unusual data center outages caused by fat fingers.

Help! My Data Center is Down! - Part 7: Lessons Learned  (7,4) The lessons we can learn from the data center failures of Parts 1 to 6.

Hypoxic Fire-Prevention Systems  (6,1)  Why drown your servers after a fire breaks out? Keep the fire from starting in the first place.

Is the Cost of Converting to Active/Active Worth It?  (4,11)  Offsetting the cost of conversion with the cost of downtime.

It's Official! Leap Day Caused the Windows Azure Outage  (7,5)  Incrementing the year by one to get next year's date took down the Azure cloud.

Jim Gray - In Memoriam  (3,7)  The database pioneer who set the stage for active/active systems is lost at sea.

Let's Get an Availability Benchmark  (2,6)  Great performance is meaningless if the system is unavailable.

Leveraging Virtualization for Availability  (5,12)  With many eggs in one basket, system availability becomes all the more important.

Linux Leap-Second Bug Takes Down Data Centers  (7,8)   A leap second added on June 30, 2012, takes down unpatched Linux systems worldwide.

Media Communication During a Crisis  (6,5)  Don't create a second crisis by letting the press publish erroneous and damaging stories.

Migrating Your Application to Active/Active (2,3)  What must you do to prepare your application for an active/active environment?

NonStop Symposium and the OpenVMS Bootcamp  (5,7)  After being gone for a year, the exclusive NonStop and OpenVMS venues are back for 2010.

Recovery-as-a-Service  (8,2)  RaaS provides backup service in the cloud for critical applications.

Reducing Pharmaceutical Pollution  (8,1)  Monitoring of pharmaceutical processing practices to reduce pollution requires high availability computing.

Remembering Ken Olsen - An IT Icon  (6,3)  The founder of Digital Equipment Corp., Ken (1926 - 2011) brought interactive computing to the individual.

Social Media Availability and Performance  (6,4)  Social media is becoming critical in our daily lives. It is time for it to grow up.

Spamalytics  (4,10)  How good are our spam filters, and why does spam still pay?

Stratus Bets $50,000 That You Won't Be Down  (5,1)  Buy an ftServer by February 26, 2010, and Stratus will give you $50K if it fails in the first six months.

Stratus Puts $50,000 Where its Mouth Is - Again  (6,12)  Stratus' offer to pay you $50,000 if your ftServer/vSphere application fails expires 12/31/11.

Stratus Puts $50,000 Where its Mouth Is - an Update  (7,2)  Stratus extends its $50,000 ftServer/vSphere availability offer for another year to 12/31/12.

Synchronous Replication  (1,3)  Avoid data collisions and data loss following a node failure.

The Availability Matrix  (6,1)  Simplify your data center availability configurations using the independence of RTO and RPO.

The Causes of Outages  (8,3)  250 Never Again stories tell us the proportion of outages due to hardware, software, humans, networks, and other faults.

The Fragile Cloud  (4,6)   This new computing paradigm might ultimately replace corporate data centers if it can ever be made reliable.

The Fragile Internet  (4,5)   Can you trust your mission-critical applications solely to the Internet? We think not.

The History of Fault Tolerance  (1,2) The fault-tolerant marketplace was hot in 1984.

The Malware Threat to Android  (7,9)   With its major market share and unvetted apps, Android is the prime smart phone target for hackers.

The IPv4 Doomsday  (4,8)   The Internet Protocol Version 4 is about to run out of its four billion addresses in two years. What now?

The Ubiquitous Internet  (4,7)  1.5 billion users, 200 million web sites, and one million viruses depend on the Internet.

Time Synchronization for Distributed Systems - Part 1  (2,11)  How does NTP calculate the time offset from a time server?

Time Synchronization for Distributed Systems - Part 2  (2,12)  How does NTP minimize time offset errors?

Time Synchronization for Distributed Systems - Part 3  (3,2)  Logical clocks offer an option for synchronizing systems.

Transaction Replication  (2,2)  A simple approach to active/active systems has scalability issues.

Tussling with the Word "Redundant"  (3,12)  "Redundant" doesn't always translate the same to those in different countries.

Unintended Acceleration and EMI  (5,4)  If testing doesn't show it, does that prove that EMI can't make an engine computer misbehave?

Using an Availability Benchmark  (3,10)  Making use of a recovery time benchmark to influence your system choice.

WestHost Fire Suppression Test Fiasco - An Update  (5,9)  Why did the accidental activation of the fire suppression system destroy so many disks?

What is Active/Active?  (1,1) Active/active architectures can give subsecond recovery following a failure. 

What is Reliability?  (5,6)  We can get rid of marketeering by quantifying highly reliable computer systems by their reliability parameters.

What is the Availability Barrier?  (5,3)  Mean time to recover. Let it fail but fix it fast.

What's Your Concern - MTR or MTBF?  (5,11)  Recovery time is for users, failure intervals are for system operators, availability is for management.

Worsing on Worsening  (4,2)  A 1967 chewing-out of IBM's Field Service staff resounds still today.

  Recommended Reading

Aberdeen's 2008 Business Continuity Survey  (3,4)  A look at 150 small to large companies and their BC/DR plans and processes.

Archive Storage - Disk or Tape?  (5,11)  Disk provides fast recovery from backups, and tape provides economical long-term archiving.

Beyond Redundancy  (7,5)  How geographic redundancy can improve service availability and reliability of computer-based systems.

Big Switch: Rewiring the World from Edison to Google  (6,9)  The cloud compute utility is following in the tracks of the electric utility.

Blueprints for High Availability: Designing Resilient Distributed Systems  (2,5)  All you ever wanted to know about clusters.

Breaking the Availability Barrier  (3,5)   Everything you ever wanted to know about active/active systems - theory, implementation, and practice.

Business Continuity from A to Z  (5,12)  The online book explores the responsibilities of the stakeholders in the business continuity plan.

Business Continuity Planning: IT Examination Handbook  (1,1)  What better way to learn about BCP than from the auditor's handbook.

Business Continuity Today  (4,3)  This freely available living eBook covers a broad range of business availability topics.

Continuous Availability Systems Design Guide  (2,1)  What to do if you want to move to CPA.

Distributed Systems: Principles and Paradigms  (3,1)  A thorough treatment of requirements for distributed system transparency.

Fire in the Computer Room, What Now?  (2,6)  Are you prepared for a total loss of your data center because of a fire or other disaster?

High Availability Network Fundamentals  (4,4)   A practical guide to predicting network availability (especially for the mathematically challenged).

Megaplex: An Odyssey of Innovation  (4,12)  Tandem is 35 years old. The Standish Group looks back on 35 years of availability innovation.

Megaplex Modeling: The Future of NonStop Demand  (5,10)  Standish Group envisions critical and non-critical applications sharing the same blades.

Migrating Legacy Systems: Gateways, Interfaces, & the Incremental Approach (2,3)  Legacy systems must be decomposed to migrate to active/active.

Mission-Critical Network Planning  (4,9)  A broad review of redundancy in servers, networks, storage, data centers, and power.

Multiple Processor Systems for Real-Time Applications  (2,10) A classic treatise on distributed systems that is still pertinent two decades later.

Pandemic Response Planning  (4,10)  How will your company continue operations if the Swine Flu hits with a vengeance?

Roadmap to the Megaplex  (5,7)   The six steps that will modernize your vertical NonStop applications for the open world of horizontal services.

Tandem Computers Unplugged: A People's History  (7,7) Tandem from 1975 till 1997 as seen through the eyes of its employees.

TCP/IP Illustrated, Volume 1: The Protocols  (4,11)  The "bible" of the TCP/IP Protocol Suite, the glue that binds active/active systems.

The Business and Economics of Linux and Open Source  (1,3)  Open source demystified for the reluctant manager.

The Disaster Recovery Journal  (5,2)  The resource for business continuity professionals.

The Unified Modeling Language User Guide  (1,2) UML is now the accepted standard for fast and easy documentation of systems and procedures.

Towards Zero Downtime: High Availability Blueprints  (2,8) A close look at installing Microsoft clusters and cluster-aware applications.

Transaction Processing: Concepts and Techniques  (2,4) The classic book on transaction processing systems, by Jim Gray and Andreas Reuter.

Unix Backup and Recovery  (2,2)  Backing up is a pain, but it is the restore that counts.

  Product Reviews

Attunity Integration Suite  (5,12)  Data access, data federation, and data movement combine to make data and services available across the enterprise.

Critical Date Testing - Leap Day and More  (7,5)  Many products exist to test applications for proper processing of critical dates.

EMC's SRDF Data-Replication Engine  (6,4)  Maintain a consistent asynchronous or synchronous target copy of a database with no server involvement.

FalconStor RecoverTrac - Automated Disaster Recovery  (7,6)  Build your own recovery cloud supporting heterogeneous environments.

Fault-Tolerant Windows and Linux from Stratus  (2,9)  ftServers provide transparent fault-tolerant operation.

FileSync and CSR Synchronize NonStop Systems: Part 1 - FileSync  (6,10)  FileSync replicates changed files or file changes between systems.

FileSync and CSR Synchronize NonStop Systems: Part 2 - CSR  (6,11)  Command Stream Replicator repeats operator actions on remote systems.

Flexible Availability Options with GoldenGate's TDM  (2,2)  Implement a variety of data-sharing topologies with TDM's data replication facilities.

GRIDSCALE - A Virtualized Distributed Database  (3,7)  Like presentation and application servers, pooled database servers for the three-tier architecture.

How Much Will Active/Active Cost Me?  (1,1)  The cost of downtime can swing your decision.

HP's NonStop Blades  (3,8)  NonStop fault-tolerant fundamentals come to HP's c-Class blades.

HP's NonStop Synchronous Gateway  (4,6)   Finally, NonStop synchronous data replication might be on its way.

HP's Reliable Transaction Router  (5,5)  Reliable transaction messaging services between Windows, Linux, OpenVMS, and HP-UX systems.

HP's ServiceGuard Clustering Facility  (2,5)  Managing HP-UX and Linux clusters.

Master/Slave Replication with Continuent's Tungsten  (4,5)  Asynchronous replication between MySQL and Oracle.

MySQL Clusters Go Active/Active  (1,3)  Clusters of storage nodes are kept in sync by synchronous replication.

Nagios Open-Source Monitoring for HP NonStop  (8,3)  Manage your NonStop systems along with your Windows, Linux, and Unix systems with Nagios.

Neverfail for Windows Applications  (5,6)  Automated failover is provided for popular Windows applications like Exchange, SharePoint, SQL Server, and IIS.

NonStop AutoSYNC - Eliminating Configuration Drift  (6,8)  Backup system configurations must be kept synchronized with their production systems.

OpenVMS Active/Active Split-Site Clusters  (3,6)  OpenVMS Clusters provide active/active operation with synchronous replication.

Oracle Data Replication  (6,9)  Data Guard, Streams, or GoldenGate - Which replication engine should be used when?

Parallel Sysplex - Fault Tolerance from IBM  (3,4) IBM's Parallel Sysplex offers localized active/active availability.

Penguin Computing Offers Beowulf Clustering on Linux  (2,1)  NASA's Beowulf clustering is available on Linux with Penguin's HPC servers.

Prolexic - A DDoS Mitigation Services Provider  (8,4)  Prolexic protects companies from DDoS attacks via a network of scrubbing centers.

Raima's High-Availability Embedded Database  (6,12)  A microprocessor embedded database with SQL capabilities offering five 9s availability.

Replicating Windows and Linux Environments with Double-Take  (4,8)  Replicate entire servers with incremental file-system updates.

Scaling MySQL with Continuent's uni/cluster  (3,11)  Synchronous replication of update queries and distribution of read queries.

SchoonerSQL Brings Five 9s to MySQL  (7,1)  Significant extensions to MySQL to improve its availability and replication performance.

Shadowbase - The Active/Active Solution  (2,3)  Shadowbase provides fast data replication as well as online copy and database resynchronization.

Shore Micro's 100-Microsecond Link Failover  (6,3)  Field-Programmable Gate Arrays protect redundant Ethernet links with 100-microsecond failover.

solidDB - a Five 9s Memory-Resident Database  (3,5)   Server memory is getting so large, why not keep your database in high-speed memory?

Stratus Avance Brings Availability to the Edge  (4,2)  If downtime in a branch office costs as little as $1,000 per hour, Avance can pay for itself in a year.

Stratus' ftServer Flexes Its Recovery Muscle  (5,8)  Independent testing measures scalability and demonstrates no impact due to catastrophic failure.

Time Synchronization for NonStop Servers  (2,11)  NTP products from Bowden Systems and HP for NonStop servers.

Virtual Tape - Getting Rid of a Troublesome Medium  (1,2) The backup paradigm is changing.  Goodbye, tape.

Virtual Tape for NonStop Servers with ETI-NET's EZX-BackBox  (2,6)  Virtual tape made super-fast with deduplication.

Virtual Transactions with NonStop AutoTMF  (2,4) Converting nontransactional applications to transactional applications.

Virtualized Time from TANDsoft  (4,1)  The OPTA2000 Time Simulator lets multiple applications run on the same NonStop system with different clocks.

Windows Server Failover Clustering  (5,4)  Microsoft's successor to MSCS adds simplified cluster management and improved geographical dispersion.

  The Geek Corner

Calculating Availability - Redundant Systems  (1,1)  Some useful rules come out of the derivation of the availability equation.

Calculating Availability - Repair Strategies  (1,2) Your repair policy can have a significant impact on your system availability.

Calculating Availability - The Three Rs  (1,3)  Node repair, node recovery, and system restore are all required.

Calculating Availability - Hardware/Software Faults  (2,1)  Most faults don't need a repair.

Calculating Availability - Failover  (2,2)  When a system is failing over, it is often effectively down, thus reducing availability.

Calculating Availability - Failover Faults  (2,3)  Failovers can fail also.

Calculating Availability - Environmental Faults  (2,4) How to handle hurricanes, power failures, and riots when calculating availability.

Calculating Availability - Cluster Availability  (2,5)  How does the availability of a cluster compare to that of an active/active system?

Calculating Availability - Nodes, Subsystems, and Systems  (2,6)  When is a node a system, and when is it a subsystem?

Calculating Availability - Failure State Diagrams  (2,9)  Formalizing our intuitive derivations.

Calculating Availability - Heterogeneous Systems - Part 1  (3,3)  Probability 101 in preparation for analyzing systems with heterogeneous nodes.

Calculating Availability - Heterogeneous Systems - Part 2  (3,5)  The availability of redundant systems with different nodal availabilities.

Calculating Availability - Heterogeneous Systems - Part 3  (3,6)  Analyzing complex configurations of system components.

Calculating Availability - Heterogeneous Systems - Part 4  (3,8)  Demonstrating that systems with century uptimes can be configured.

Calculating RPO  (5,3)  An RPO is the probability that data loss following a node failure is less than a specified amount. How can we verify this?

Configuring to Meet a Performance SLA - Part 1  (3,12)  What size server is needed to provide a response time of 200 msec. 98% of the time?

Configuring to Meet a Performance SLA - Part 2  (4,1)   Comparing the performance of single-server systems to multiserver systems.

Configuring to Meet a Performance SLA - Part 3  (4,2)  Answering the SLA specification for servers with exponential service times.

Configuring to Meet a Performance SLA - Part 4  (4,3)  Answering the SLA specification for servers with arbitrary service times.

Configuring to Meet a Performance SLA - Part 5  (4,4)  Answering the SLA specification for multiple servers in tandem.

Estimating Data Collision Rates  (2,8) Can you go active/active with a tolerable level of data collisions?

Failure State Diagrams - Repair Strategies  (2,10) The real story behind sequential repair and parallel repair.

Failure State Diagrams - Recovery Following Repair  (2,12)  The formal analysis of the impact of having to recover a node after its repair.

Failure State Diagrams - Hardware/Software Faults Revisited  (3,2)  Our intuitive results were a little simplistic.

Is Parallel Repair Really Better Than Sequential Repair?  (3,4) A Digest reader points out that the answer depends upon the repair time distribution.

Reliability Diagrams  (6,7) Complex systems can be analyzed via reliability diagrams as sets of parallel (redundant) and serial components.

SAP on VMware High Availability Analysis  (7,12)  Our availability analysis is used to predict the availability of VMware ESXi clusters.

Simplifying Failover Analysis - Part 1  (5,10)  Users aren't down just because two nodes fail. They are also down waiting for a backup system to take over.

Simplifying Failover Analysis - Part 2  (6,6)  Extending failover analysis to complex multinode systems.

The Cost of RPO and RTO  (7,9)  What is the optimum architecture for minimizing the costs of downtime and lost data?

What's That Nerd Logo?  (1,1)  Our logo, ff2, really has a meaning. Find out why it describes active/active architectures.

Why Are Active/Active Systems So Reliable?  (3,9)  Analyzing the impact of resubmitting transactions rather than bringing up a backup system.

 

 

Contact us        © 2010 Sombers Associates, Inc., and W. H. Highleyman