Case Studies
Active/Active Payment Processing at
Swedbank (3,1) Swedbank uses
active/active Base24 to support credit cards and POS terminals.
Apollo 11 -
Continuous Availability, 1960s Style
(4,9) NASA's safety-critical computer systems put men on
the moon four decades ago.
Asymmetric
Active/Active at Banco de Credito
(2,11) Using an asymmetric configuration saves programming
changes.
Bank-Verlag - the
Active/Active Pioneer (1,3)
Bank-Verlag went active/active two decades ago with IBM/Tandem.
BANKSERV Goes Active/Active
(2,4) A banking switching service in South
Africa moves Base24 into active/active.
Banks Use
Synchronous Replication for Zero RPO
(5,2) Triplexed data centers give fast recovery time with
zero data loss.
Cellular Provider
Goes Active/Active for Prepaid Calls
(3,9) NonStop active/active system keeps prepaid calls
moving in Africa.
Commerzbank
Survives 9/11 with OpenVMS Clusters
(4,7) With an active/active backup 30 miles away, getting
their people there did it.
Community College
Learns From SAN Disaster (2,2) A
disastrous SAN failure leads to dual redundancy.
CPA
at Aqueduct, Belmont, and Saratoga
(2,1) Race
track wagering can never fail, or else riots start.
Do
You Know Where Your Train Is?
(1,1) A transit authority goes active/active for train
tracking.
European Bank's
Active/Active ATM Network (4,6)
In this active/active system, ATM failover via DNS rerouting has
its problems.
Handelsbanken
Turns to Parallel Sysplex (4,10)
Sweden's Handelsbanken goes active/active to protect their
online banking and ATM network.
How Does Google Do It? (3,2)
Google processes tens of gigabytes of data in minutes on their
massive clusters.
HP's Active/Active Home Location Register
(1,2) The brains of a cellular network can never go down.
HP's
OpenCall INS Goes Active/Active (2,6)
Replication lets OpenCall INS run active/active with collision
detection and resolution.
Major Bank
Uses Active/Active to Avoid Hurricanes
(2,10) Fast failover is used to switch users out of
hurricane path.
Major ISP Migrates
from Sybase to NonStop with No Downtime
(3,11) Hundreds of millions of accounts migrated and
verified.
Payment
Authorization - A Journey from DR to Active/Active
(2,12) A start with DR leads this company to active/active
and application integration.
QEI
Provides Active/Active SCADA with OpenVMS
(2,9) Electrical substation monitoring that never
goes down.
Real-Time
Fraud Detection (4,12)
Credit-card switching service catches fraud on-the-fly - a great
example of real-time business information.
Tackling
Switchover Times
(1,1) If active/active is too big a step to take now, work
on reducing your switchover times.
Telecom
Italia's Active/Active Mobile Service
(2,3) Italy's biggest cell phone network is supported by
active/active.
UK National Health Service - Blood and Transplant
(3,10) An OpenVMS split-site cluster guarantees the
availability of the UK's blood supply.
U.S. Bank Critiques Active/Active (4,5)
A NonStop active/active user shares experience and advice to
those who would follow.
Never Again
Active/Active Save #1 - Coffee Pot Takes Down Node
(1,2) When the coffee pot was plugged in -
Surprise!
BlackBerry Gets Juiced (2,5) Poor
testing leads to no service for North American subscribers.
BlackBerry Takes
Another Dive (3,3) Deja Vu. Poor
testing once again leads to no North American service.
BlackBerry - OMG, it's Déjà
Vu
(5,1) BlackBerry has now accumulated seven
major outages in five years, providing an availability of three
9s.
Console Command Takes Down Active/Active System
(1,3) Stop applications on one node, stop
other node. Oops!
Don't
Wait for the Other Shoe to Drop (2,2)
When a spare component fails, fix it fast. Don't tempt Murphy.
Google
Troubles - A Case Study in Cloud Computing
(4,10) Even the 900-pound Gorilla can have problems
keeping its services up.
Has Gmail Become Gfail? (4,3) Google's
Gmail service has been down for hours six times over the last
eight months.
Hostway's Web Hosting
Service Goes Down for Days (2,9)
Small online stores offline for up to a week.
How Many 9s in Amazon? (3,7) Even
giants fall. Amazon's S3 and EC2 services and online retail
store go offline for hours.
Hubble Trouble (4,1) A failover
fault when recovering from an instrument controller failure
almost loses Hubble.
Innocuous Fault Leads to Weeks of Recovery
(3,12) A simple disk mirror failure
propagates into weeks of recovering lost data for a major bank.
IRS Goof Costs U.S.
Taxpayers $300m + (2,1) Turning
off the old system before testing the new one is dumb.
London
Stock Exchange PC-Trading System Down for a Day
(3,10) Traders fume at commission loss on one of the most
hectic trading days.
More Never Agains
(3,8) Over two dozen disastrous outages for the
first half of 2008 are recounted.
More
Never Agains II (4,2) System
downtime problems have moved from the power lines to the
networks.
More
Never Agains III (4,7) Add the
cloud to power and network problems, creating over two dozen
outages on which we report.
More
Never Agains IV (5,2)
Network, hardware/software problems highlight outages for the
last half of 2009.
On-Demand Software
Utility Hits Availability Bump (2,10) A
utility is expected to be always up, but this one didn't make
it.
PayPal Services Downgrade with Upgrade
(3,6) Attempting an upgrade with no fallback plan takes
PayPal services down for weeks.
PayPal Fault Takes Merchants Offline (4,9)
A network fault forces small online merchants to close shop for
hours.
Rackspace - Another
Hosting Service Bites the Dust (2,12)
A truck driver wipes out web sites for a day or more.
Sidekick: Your Data
is in 'Danger' (4,11) A million
smart-phone users lose all of their contacts, calendars, and
photos.
So You Think Your System is Robust?
(2,8) So did these major enterprises, all of which went
down in the first six months of 2007.
So You Think Your System is Reliable
(3,1) Horror stories from the second half of 2007 focus on
power and branch failures.
Software
Bug Causes Train Wreck (1,1)
A software bug, controller diversion, and engineer inattention
combine to cause a train collision.
Sydney's M5 Tunnel
Closed Again by Computer Glitch (3,11)
Six times in six years is too much for New South Wales.
The Alaska Permanent Fund and the $38 Billion Keystroke
(2,4) What do you do when your active and backup disks are wiped
out and your tapes won't read?
The
Case of the Flying Cable
(1,1) A technician loses control of an under-floor cable
and lets it hit a power strip.
The FAA's
Availability Woes (4,12)
Application and network failures plague air travelers. Where is
NextGen - the next generation airspace system?
The
Great 2003 Northeast Blackout and the $6 Billion Software Bug
(2,3) A hot day, an untrimmed tree, and a monitoring
system bug cost power customers $6 billion.
The Planet
Blows Up (3,9) A massive
electrical explosion takes out thousands of hosting servers at a
major dedicated hosting provider.
Twitter Taken Down by
DDoS Attack (4,8) The Twitter,
Facebook, and LiveJournal social sites are taken down to silence
a Georgian blogger.
Triple
Redundancy Failure on the Space Station
(2,11) A single point of failure takes down a triplexed
critical computer.
VoIP
PBX Succumbs to Overconfiguration (2,6)
Why extra processing power made this PBX less reliable.
What? No Internet?
(3,2) A multiple cable break isolates North Africa, the
Middle East, and India.
Why Back Up?
(4,4) The malicious act of an IT
manager deletes his company's database and forces the company to
close its doors.
Best Practices
Achieving
Fast Failover in Active/Active Systems - Part 1
(4,8) Using user and network redirection to
failover in subseconds.
Achieving
Fast Failover in Active/Active Systems - Part 2
(4,9) Using server redirection to failover
in subseconds.
Availability
Best Practices (2,1) Tips from
those who have achieved near-continuous availability.
Avoiding
Notworks (4,1) A network that
doesn't work is a "notwork." Protect your network with a good
SLA.
Backup Is More Than Backing Up (4,5)
Backing up a database is an exercise in futility if you can't
restore the database.
Can 10,000 Chickens Replace Your Tractor?
(1,3) Save money by replacing your mainframe with
clusters - Not!
Can You Trust the Compute Cloud?
(3,8) What will it take to make cloud
computing the data utility of the future?
Chillerless Data
Centers (4,11) Google and Yahoo!
locate new data centers in the north country to take advantage
of "free cooling."
Continuous Availability Featured at HPTF 2009
(4,6) Presentations include many
continuous availability and high availability talks.
Data Center in a Box (4,7) Your
next visit to a data center may be to the warehouse district.
Document Your
System (1,2) Documentation is a
necessary evil. Let's focus on the "necessary" and not the
"evil."
Google's
Extreme-Green Data Centers (3,12)
Wave motion and seawater may power and cool data centers in the
future.
HP
Blows Up Data Center (2,8) An explosive
demonstration of fast recovery.
Humanizing Three 9s (2,9) What if
we lived in a world of three 9s?
Interview
with Ron LaPedis on NonStop with XP Storage
(2,5) How to improve NonStop reliability by using a SAN.
Katrina -
The Harsh Teacher (2,6) The most
powerful Gulf storm in 200 years showed us how unprepared we
were for such a disaster.
On Blogs and Discussion Groups (2,10)
Online forums can be a big boost to your professional growth.
Reliable Multicasting
(3,1) How to get messages over LAN and WAN multicast
networks without message loss.
Recovery-Oriented Computing (2,2)
If recovery time can be made small enough, users will perceive a
faultless system.
Microrebooting for Fast Recovery
(2,3) An application of Recovery-Oriented Computing.
Roll-Your-Own Replication Engine - Part 1
(5,1) What does it take to build your own
replication engine? Lots!
Roll-Your-Own Replication Engines - Part 2
(5,2) Issues with asynchronous and
synchronous replication engines.
Rules
of Availability - Part 1 (3,3)
The first set of common rules of availability from our books,
Breaking the Availability Barrier.
Rules
of Availability - Part 2 (3,5)
More common rules of
availability from our books,
Breaking the Availability Barrier.
Rules
of Availability - Part 3 (3,7)
Concluding the common rules of availability from our books,
Breaking the Availability Barrier.
Transaction-Oriented Computing (2,4)
Old art to some, new to others, transaction processing is the
foundation for high availability.
VRRP - Virtual Router Redundancy Protocol
(3,10) Adding transparent failure detection and failover
at the first hop.
With
100% Uptime, Do I Need a Business Continuity Plan?
(1,1) You'd better believe it.
Availability Topics
Active/Active Full Day Seminar at HPTF
(4,4) Dr. Bill speaks on active/active theory and practice at
the 2009 HPTF conference.
Active/Active Versus
Clusters (2,5) For high
availability, clusters are mature; but active/active systems
provide greater reliability.
Active/Active
Systems - A Taxonomy (3,9)
Classifying the many ways to build an active/active system.
Adding
Availability to Performance Benchmarks
(2,9) Recovery time is the proper metric to use for an
availability benchmark.
All
About Continuous Processing Architectures
(1,1) CPA can get you
arbitrarily close to 100% uptime.
Asynchronous
Replication Engines (1,2) These engines
power most of today's active/active systems.
Availability versus Performance (2,8)
Is it time to trade higher availability for reduced performance?
Choosing a
Database of Record (3,11) Which
database copy in an active/active network is the "single version
of truth?"
Collision Detection and Resolution
(2,4) What do you do if you can't avoid collisions when
using bidirectional replication?
Defining Active/Active (4,12) Can
we agree on what active/active architectures are? Add your
comments to this ongoing effort.
Defining Active/Active - Revision 1
(5,1) Revision 1 of our definition based on suggestions
posted to our LinkedIn Continuous Availability Forum.
Eavesdropping on the Internet (4,3)
A vulnerability in the Border Gateway Protocol allows nefarious
sites to read your Internet traffic.
Fault
Tolerance for Virtual Environments - Part 1
(3,3) How virtualization can significantly reduce
data center capital and operating costs.
Fault
Tolerance for Virtual Environments - Part 2
(3,4) Operating system and bare metal hypervisors.
Fault
Tolerance for Virtual Environments - Part 3
(3,6) Hardening virtual environments
with failover and fault-tolerance.
Hardware
Replication (2,1) Replicating at
the hardware level does not maintain database consistency.
Is the Cost of
Converting to Active/Active Worth It?
(4,11) Offsetting the cost of conversion with the cost of
downtime.
Jim Gray - In
Memoriam (3,7) The database
pioneer who set the stage for active/active systems is lost at
sea.
Let's
Get an Availability Benchmark (2,6)
Great performance is meaningless if the system is unavailable.
Migrating
Your Application to Active/Active
(2,3) What must you do to prepare your application for an
active/active environment?
Spamalytics
(4,10) How good are our spam filters, and why does spam
still pay?
Synchronous Replication (1,3)
Avoid data collisions and data loss following a node failure.
The Fragile
Cloud (4,6) This new
computing paradigm might ultimately replace corporate data
centers if it can ever be made reliable.
The Fragile
Internet (4,5) Can you
trust your mission-critical applications solely to the Internet?
We think not.
The
History of Fault Tolerance (1,2) The
fault-tolerant marketplace was hot in 1984.
The IPv4
Doomsday (4,8) The Internet
Protocol Version 4 is about to run out of its four billion
addresses in two years. What now?
The
Ubiquitous Internet (4,7) 1.5
billion users, 200 million web sites, and one million viruses
depend on the Internet.
Time
Synchronization for Distributed Systems - Part 1
(2,11) How does NTP calculate the time offset from a time
server?
Time
Synchronization for Distributed Systems - Part 2
(2,12) How does NTP minimize time offset
errors?
Time Synchronization for Distributed
Systems - Part 3 (3,2) Logical clocks
offer an option for synchronizing systems.
Transaction
Replication (2,2) A simple
approach to active/active systems
has scalability issues.
Tussling with the
Word "Redundant" (3,12)
"Redundant" doesn't always translate the same to those in
different countries.
Using an
Availability Benchmark (3,10)
Making use of a recovery time benchmark to influence your system
choice.
What
is Active/Active?
(1,1) Active/active architectures
can give subsecond recovery following a failure.
Worsing on Worsening
(4,2) A 1967 chewing-out of IBM's Field
Service staff resounds still today.
Recommended Reading
Aberdeen's 2008
Business Continuity Survey (3,4)
A look at 150 small to large companies and their BC/DR plans and
processes.
Blueprints for High
Availability: Designing Resilient Distributed Systems
(2,5) All you ever wanted to know about
clusters.
Breaking the Availability Barrier
(3,5) Everything you ever wanted to know about
active/active systems - theory, implementation, and practice.
Business
Continuity Planning: IT Examination Handbook
(1,1)
What better way to learn about BCP than from the auditor's
handbook.
Business Continuity Today (4,3)
This freely available living eBook covers a broad range of business
availability topics.
Continuous Availability Systems Design Guide
(2,1) What to do if you want to move to CPA.
Distributed Systems: Principles and
Paradigms (3,1) A thorough treatment
of requirements for distributed system transparency.
Fire in
the Computer Room, What Now? (2,6)
Are you prepared for a total loss of your data center because of
a fire or other disaster?
High Availability
Network Fundamentals (4,4)
A practical guide to predicting network availability (especially
for the mathematically challenged).
Megaplex: An Odyssey
of Innovation (4,12) Tandem is
35 years old. The Standish Group looks back on 35 years of
availability innovation.
Migrating Legacy
Systems: Gateways, Interfaces, & the Incremental Approach
(2,3) Legacy
systems must be decomposed to migrate to active/active.
Mission-Critical Network Planning (4,9)
A broad review of redundancy in servers, networks, storage, data
centers, and power.
Multiple Processor Systems for Real-Time Applications
(2,10) A classic treatise on distributed systems
that is still pertinent two decades later.
Pandemic
Response Planning (4,10) How will
your company continue operations if the Swine Flu hits with a vengeance?
TCP/IP Illustrated, Volume 1: The Protocols
(4,11) The "bible" of the TCP/IP Protocol
Suite, the glue that binds active/active systems.
The
Business and Economics of Linux and Open Source
(1,3) Open source demystified for the
reluctant manager.
The
Disaster Recovery Journal (5,2)
The resource for business continuity professionals.
The Unified Modeling Language User Guide
(1,2) UML is now the accepted standard for fast
and easy documentation of systems and procedures.
Towards
Zero Downtime: High Availability Blueprints
(2,8) A close look at installing Microsoft clusters and
cluster-aware applications.
Transaction Processing: Concepts and Techniques
(2,4) The classic book on transaction processing
systems, by Jim Gray and Andreas Reuter.
Unix Backup and
Recovery (2,2) Backing up is a
pain, but it is the restore that counts.
Product Reviews
Fault-Tolerant
Windows and Linux from Stratus (2,9)
ftServers provide transparent fault-tolerant operation.
Flexible
Availability Options with GoldenGate's TDM
(2,2) Implement a variety of data-sharing
topologies with TDM's data replication facilities.
GRIDSCALE - A Virtualized Distributed Database
(3,7) Like presentation and application servers, pooled
database servers for the three-tier architecture.
How
Much Will Active/Active Cost Me?
(1,1) The cost of downtime can swing your
decision.
HP's NonStop Blades
(3,8) NonStop fault-tolerant fundamentals come to HP's
c-Class blades.
HP's NonStop Synchronous Gateway (4,6)
Finally, NonStop synchronous data replication might be on its
way.
HP's
ServiceGuard Clustering Facility (2,5)
Managing HP-UX and Linux clusters.
Master/Slave Replication with Continuent's Tungsten
(4,5) Asynchronous replication between MySQL and Oracle.
MySQL Clusters Go Active/Active (1,3)
Clusters of storage nodes are kept in sync by synchronous
replication.
OpenVMS Active/Active
Split-Site Clusters (3,6) OpenVMS
Clusters provide active/active operation with synchronous
replication.
Parallel Sysplex
- Fault Tolerance from IBM (3,4) IBM's
Parallel Sysplex offers localized active/active
availability.
Penguin
Computing Offers Beowulf
Clustering on Linux (2,1) Beowulf
clustering, developed by NASA, is available on Linux along with
Penguin's HPC servers.
Replicating
Windows and Linux Environments with Double-Take
(4,8) Replicate entire servers with
incremental file-system updates.
Scaling MySQL with
Continuent's uni/cluster (3,11)
Synchronous replication of update queries and distribution of
read queries.
Shadowbase - The
Active/Active Solution
(2,3) Shadowbase provides fast data replication as well as
online copy and database resynchronization.
solidDB - a Five 9s Memory-Resident
Database (3,5) Server
memory is getting so large, why not keep your database in high
speed memory?
Stratus Avance Brings Availability to the Edge
(4,2) If downtime in a branch office costs
as little as $1,000 per hour, Avance can pay for itself in a
year.
Stratus
Bets $50,000 That You Won't Be Down
(5,1) Buy an ftServer by February 26, 2010, and Stratus
will give you $50K if it fails in the first six months.
Time
Synchronization for NonStop Servers
(2,11) NTP products from Bowden Systems and HP for
NonStop servers.
Virtual
Tape - Getting Rid of a Troublesome Medium
(1,2) The backup paradigm is changing.
Goodbye, tape.
Virtual
Tape for NonStop Servers with ETI-NET's EZX-BackBox
(2,6) Virtual tape made super-fast with
deduplication.
Virtual Transactions with NonStop AutoTMF
(2,4) Converting nontransactional applications
to transactional applications.
Virtualized Time
from TANDsoft (4,1) The OPTA2000
Time Simulator lets multiple applications run on the same
NonStop system with different clocks.
The Geek Corner
Calculating
Availability - Redundant Systems (1,1)
Some useful
rules come out of the derivation of the availability equation.
Calculating Availability - Repair Strategies
(1,2) Your repair policy can have a significant
impact on your system availability.
Calculating Availability - The Three Rs (1,3)
Node repair, node recovery, and system restore are all required.
Calculating Availability - Hardware/Software
Faults (2,1) Most faults
don't need a repair.
Calculating Availability - Failover (2,2)
When a system is failing over, it is often effectively down,
thus reducing availability.
Calculating Availability - Failover Faults
(2,3) Failovers can fail also.
Calculating Availability - Environmental Faults
(2,4) How to handle hurricanes, power failures,
and riots when calculating availability.
Calculating Availability - Cluster Availability
(2,5) How does the availability of a cluster compare to
that of an active/active system?
Calculating Availability - Nodes, Subsystems, and Systems
(2,6) When is a node a system, and when is
it a subsystem?
Calculating Availability - Failure State Diagrams
(2,9) Formalizing our intuitive
derivations.
Calculating Availability - Heterogeneous Systems - Part 1
(3,3) Probability 101 in preparation for analyzing
systems with heterogeneous nodes.
Calculating Availability - Heterogeneous Systems - Part 2
(3,5) The availability of redundant systems with
different nodal availabilities.
Calculating Availability - Heterogeneous Systems - Part
3 (3,6) Analyzing complex
configurations of system components.
Calculating Availability - Heterogeneous
Systems - Part 4 (3,8) Demonstrating
that systems with century uptimes can be configured.
Configuring
to Meet a Performance SLA - Part 1
(3,12) What size server is needed to provide a response
time of 200 msec. 98% of the time?
Configuring
to Meet a Performance SLA - Part 2
(4,1) Comparing the performance of single-server systems to
multiserver systems.
Configuring
to Meet a Performance SLA - Part 3
(4,2) Answering the SLA specification for servers with
exponential service times.
Configuring to Meet a Performance SLA - Part 4
(4,3) Answering the SLA specification for
servers with arbitrary service times.
Configuring
to Meet a Performance SLA - Part 5
(4,4) Answering the SLA specification for multiple servers
in tandem.
Estimating Data Collision Rates (2,8)
Can you go active/active with a tolerable level of data
collisions?
Failure State Diagrams - Repair Strategies
(2,10) The real story behind sequential repair
and parallel repair.
Failure State Diagrams - Recovery Following Repair
(2,12) The formal analysis of the impact of having to
recover a node after its repair.
Failure State Diagrams - Hardware/Software
Faults Revisited (3,2) Our intuitive
results were a little simplistic.
Is Parallel
Repair Really Better Than Sequential Repair?
(3,4) A Digest reader points out that the answer depends upon the
repair time distribution.
What's
That Nerd Logo?
(1,1) Our logo,
ff2,
really has a meaning. Find out why it describes active/active
architectures.
Why Are
Active/Active Systems So Reliable?
(3,9) Analyzing the impact of resubmitting transactions
rather than bringing up a backup system.