Active/Active Payment Processing at
Swedbank (3,1) Swedbank uses
active/active Base24 to support credit cards and POS terminals.
Active/Active with Linux and MySQL
(8,10) Two data centers for VoIP are kept synchronized
with Tungsten data replication.
Apollo 11 -
Continuous Availability, 1960s Style
(4,9) NASA's safety-critical computer systems put men on
the moon four decades ago.
Active/Active at Banco de Credito
(2,11) Using an symmetric configuration saves programming
Bank-Verlag - the
Active/Active Pioneer (1,3)
Bank-Verlag went active/active two decades ago with IBM/Tandem.
Bank-Verlag - An Update (5,8)
Bank-Verlag replicates processed transactions in its
BANKSERV Goes Active/Active
(2,4) A banking switching service in South
Africa moves Base24 into active/active.
Synchronous Replication for Zero RPO
(5,2) Triplexed data centers give fast recovery time with
zero data loss.
Casa Ley Upgrades to
Active/Active OmniPayments (8,1)
One of Mexico's largest grocery chains installs active/active
financial transaction switch.
Goes Active/Active for Prepaid Calls
(3,9) NonStop active/active system keeps prepaid calls
moving in Africa.
Survives 9/11 with OpenVMS Clusters
(4,7) With an active/active backup 30 miles away, getting
their people there did it.
Learns From SAN Disaster (2,2) A
disastrous SAN failure leads to dual redundancy.
at Aqueduct, Belmont, and Saratoga
track wagering can never fail, or else riots start.
You Know Where Your Train Is?
(1,1) A transit authority goes active/active for train
Active/Active ATM Network (4,6)
In this active/active system, ATM failover via DNS rerouting has
Faster Payments -
Bringing Payment Processing Into the 21st Century
(5,6) VocaLink uses active/active to
provide 24x7 real-time payments.
Turns to Parallel Sysplex (4,10)
Sweden's Handelsbanken goes active/active to protect their
online banking and ATM network.
How Does Google Do It? (3,2)
Google processes tens of gigabytes of data in minutes on their
Google Do It (part 2) (7,11) Google
exposes how it distributes massive applications across thousands
of servers while saving energy.
HP's Active/Active Home Location Register
(1,2) The brains of a cellular network can never go down.
OpenCall INS Goes Active/Active (2,6)
Replication lets OpenCall INS run active/active with collision
detection and resolution.
Uses Active/Active to Avoid Hurricanes
(2,10) Fast failover is used to switch users out of
Major ISP Migrates
from Sybase to NonStop with No Downtime
(3,11) Hundreds of millions of accounts migrated and
Major U.S. Bank
Replaces BASE24 (7,6) Opsol's
OmniPayments keeps bank's NonStop systems running after ACI's
Authorization - A Journey from DR to Active/Active
(2,12) A start with DR leads this company to active/active
and application integration.
Provides Active/Active SCADA with OpenVMS
(2,9) Electrical substation monitoring that never
Fraud Detection (4,12)
Credit-card switching service catches fraud on-the-fly - a great
example of real-time business information.
Ring-of-Fire Bank Beats Earthquakes with Active/Active
(7,7) Moving from tape backup to
active/active can be a series of controlled steps.
Royal Bank of Canada Goes Active/Active for ATM/POS
(6,7) Eliminates planned outages and
reduces unplanned downtime from hours to minutes.
(1,1) If active/active is too big a step to take now, work
on reducing your switchover times.
Italia's Active/Active Mobile Service
(2,3) Italy's biggest cell phone network is supported by
Optimizes Look-to-Book Ratio (6,8)
An asymmetric active/active system unloads query processing from
the OLTP master node.
Try Doing This
Today (8,8) In the early days of
computing, we built major systems with kilobytes of memory and
megahertz processor speeds.
UK National Health Service - Blood and Transplant
(3,10) An OpenVMS split-site cluster guarantees the
availability of the UK's blood supply.
U.S. Bank Critiques Active/Active (4,5)
A NonStop active/active user shares experience and advice to
those who would follow.
Pioneering Active/Active ATM Network
(5,9) Dual networks with ATM collocation provide
Failover Fault (8,3) A failed PC
forces a system recovery during a slide presentation. Failover
failed. The BCP saved the day.
Active/Active Save #1 - Coffee Pot Takes Down Node
(1,2) When the coffee pot was plugged in -
Downed by Fat Finger (6,5) A
maintenance technician's error takes down an entire Availability
Zone for four days.
Downed by Memory Leak (7,11) A memory
leak in an innocuous program cascades into major systems,
taking down the AWS cloud for hours.
Eagle's Eight-Day Outage (5,9)
Lack of recovery and failover testing takes down online sales
for the $3 billion retailer for over a week.
Australia's Painful Banking Outages
(7,3) Australia's four major banks suffer multiple outages
as they upgrade their aging infrastructure.
BlackBerry Gets Juiced (2,5) Poor
testing leads to no service for North American subscribers.
BlackBerry Messenger Down for Days
(6,10) RIM's BBM texting service went down for over four
days when it suffered a failover fault.
Another Dive (3,3) Déjà Vu. Poor
testing once again leads to no North American service.
BlackBerry - OMG, it's Déjà
(5,1) BlackBerry has now accumulated seven
major outages in five years, providing an availability of three
Commonwealth Bank of Australia - A Correction
(7,4) We make a correction to our article
entitled, "Australia's Painful Banking Outages."
Console Command Takes Down Active/Active System
(1,3) Stop applications on one node, stop
other node. Oops!
Attacks on U.S. Banks Continue (8,2)
Islamic hacktivists resume their attacks to get blasphemous video
removed from the Internet.
Wait for the Other Shoe to Drop (2,2)
When a spare component fails, fix it fast. Don't tempt Murphy.
Suppresses WestHost for Days (5,5)
Never test a fire suppression system by triggering it.
First Stuxnet -
Now the Flame Virus (7,6) Deemed
more serious than Stuxnet, Flame takes over PCs and listens in
Go Daddy Takes Down
Millions of Web Sites (7,9)
Any domain registered with Go Daddy was downed by a DNS network
Troubles - A Case Study in Cloud Computing
(4,10) Even the 900-pound Gorilla can have problems
keeping its services up.
Hacked AP Tweet
Crashes Markets (8,5) Phony AP
tweet reporting that Obama was injured in an explosion crashes
markets in seconds.
Cell-Phone Network Costs Lives (5,3)
Following Haiti's disastrous earthquake, many people couldn't
call for help from beneath the rubble.
Has Gmail Become Gfail? (4,3) Google's
Gmail service has been down for hours six times over the last
DDoS Attack? (8,4) Spam black-lister
Spamhaus is taken down for days by a disgruntled spammer via a
massive DDoS attack.
Hostway's Web Hosting
Service Goes Down for Days (2,9)
Small online stores offline for up to a week.
How Many 9s in Amazon? (3,7) Even
giants fall. Amazon's S3 and EC2 services and online retail
store go offline for hours.
Hubble Trouble (4,1) A failover
fault when recovering from an instrument controller failure
almost loses Hubble.
Sandy (7,12) 2012's Hurricane
Sandy flooded lower Manhattan, taking out tens of thousands of
web sites for weeks.
Innocuous Fault Leads to Weeks of Recovery
(3,12) A simple disk mirror failure
propagates into weeks of recovering lost data for a major bank.
IRS Goof Costs U.S.
Taxpayers $300m + (2,1) Turning
off the old system before testing the new one is dumb.
Hacktivists Attack U.S. Banks (7,10)
Several banks taken down for a day in protest of YouTube video,
"Innocence of Muslims."
JPMC Three-Day Outage Caused by Replication Corruption
(5,11) Corruption of primary SAN by Oracle
bug also takes down standby SAN.
Knight Capital Destroyed by Software Bug
A high-frequency trading bug costs Knight $440 million, forcing it to sell out to a consortium.
Lightning Downs Amazon - Not! (6,9)
An Amazon European Availability Zone is taken down by hardware,
software, and human faults.
Stock Exchange PC-Trading System Down for a Day
(3,10) Traders fume at commission loss on one of the most
hectic trading days.
Chargers (8,8) Researchers find a
way to infect iPhones and iPads via the USB charging port.
Center's Multiday Outage (6,2) An
attempt to achieve high availability on a limited budget leads
Down by Redundant Power Failures (8,10)
The primary power cable fails while the other one is being
Disabled by Upgrade (5,6) A
failed upgrade to support the next GPS generation takes down
10,000 military GPS receivers.
Mizuho Bank Down
for Ten Days (6,6) A flood of
earthquake donations by mobile phones overwhelmed the bank's
evening batch runs.
More Never Agains
(3,8) Over two dozen disastrous outages for the
first half of 2008 are recounted.
Never Agains II (4,2) System
downtime problems have moved from the power lines to the
Never Agains III (4,7) Add the
cloud to power and network problems creating over two dozen
outages on which we report.
Never Agains IV (5,2)
Network, hardware/software problems highlight outages for the
last half of 2009.
Never Agains V (5,7) Over
half of our 30 horror stories took down hosting providers. A
failover plan is a must.
Never Agains VI (7,4) Software
bugs and recovery faults highlighted the outages in the first
quarter of 2012.
Never Agains VII (7,9) Power
outages were the main cause of these failures.
Never Agains VIII (8,2) Security
threats are becoming more prevalent, with the Chinese evidently
leading the charge.
Never Agains IX (8,7)
Environmental faults lead the list, followed by updates gone
Nasdaq Taken Down by Software Flaw
(8,9) A blast of messages from the NYSE disables Nasdaq's
quote reporting system and causes a failover fault.
National Australia Bank Customers Down for Days
(5,12) A bad batch update disables critical
customer services for two weeks.
New York City's New
911 System Goes Down Four Times (8,6)
After extensive testing, the city lost 911 service four times in
Virginia's 911 Service Down for Four Days
(7,12) 90 mph winds and air in the
generator's fuel lines take down a Verizon 911 hub.
Utility Hits Availability Bump (2,10) A
utility is expected to be always up, but this one didn't make
Time Bomb (7,2) An obscure bug in
the Oracle database could take down an entire data center if not patched
Orca - The Outage That May Change History
(7,11) The Republican 2012 Get-Out-The-Vote
system flopped from the beginning.
PayPal Fault Takes Merchants Offline
A network fault forces small online merchants to close shop for
Documentation Snags Google (5,4)
A data center goes down, and failover to the backup data center
Rackspace - Another
Hosting Service Bites the Dust (2,12)
A truck driver wipes out web sites for a day or more.
Royal Bank of Scotland Offline for Two Weeks
(7,7) Falling back from a failed upgrade
failed and took three U.K. banks down.
Sidekick: Your Data
is in 'Danger' (4,11) A million
smart-phone users lose all of their contacts, calendars, and
Singapore Bank Downed by IBM Error
(5,8) An undocumented new procedure takes down all DBS
Bank systems for hours.
Skype Holiday Present - Down for a Day
(6,1) Skype overload takes down its peer-to-peer network
of hundreds of thousands of supernodes.
So You Think Your System is Robust?
(2,8) So did these major enterprises, all of which went
down in the first six months of 2007.
So You Think Your System is Reliable
(3,1) Horror stories from the second half of 2007 focus on
power and branch failures.
Bug Causes Train Wreck (1,1)
A software bug, controller diversion, and engineer inattention
combine to cause a train collision.
Taken Down for Weeks by Hackers (6,5)
Hackers steal 100 million accounts from Sony, requiring weeks to
repair security defenses.
Spamhaus Attacker Caught (8,5) The
mastermind behind the ten-day 300 Gbps attack on Spamhaus
arrested in Spain and extradited to the Netherlands.
Stuxnet - The World's
First Cyberweapon (6,3) Stuxnet
is the first worm to attack a control system and destroy
Sydney's M5 Tunnel
Closed Again by Computer Glitch (3,11)
Six times in six years is too much for New South Wales.
Million ATM Heist (8,5) Hackers
clone prepaid debit cards and net $45 million in ten hours from
ATMs around the world.
The Alaska Permanent Fund and the $38 Billion Keystroke
(2,4) What do you do when your active and backup disks are wiped
out and your tapes won't read?
Case of the Flying Cable
(1,1) A technician loses control of an under-floor cable
and lets it hit a power strip.
Availability Woes (4,12)
Application and network failures plague air travelers. Where is
NextGen - the next generation airspace system?
Great 2003 Northeast Blackout and the $6 Billion Software Bug
(2,3) A hot day, an untrimmed tree, and a monitoring
system bug cost power customers $6 billion.
Blows Up (3,9) A massive
electrical explosion takes out thousands of hosting servers at a
major dedicated hosting provider.
The State of
Virginia - Down for Days (5,10) A
maintenance error takes down 26 state agencies for up to a week.
Twitter Taken Down by
DDoS Attack (4,8) The Twitter,
Facebook, and LiveJournal social sites are taken down to silence
a Georgian blogger.
Redundancy Failure on the Space Station
(2,11) A single point of failure takes down a triplexed
Verizon 4G Network
Down for Two Days (6,6) Verizon's
"always reliable" 4G network brought down by software bug - no
VMware's Cloud Foundry Flounders (6,7)
A storage fault caused by a power outage is followed by a bigger
fault caused by a fat finger.
Vodafone Downed by
Burglars (6,4) Thieves
sledgehammer their way into a Vodafone exchange and steal
computers and network equipment.
PBX Succumbs to Overconfiguration (2,6)
Why extra processing power made this PBX less reliable.
What? No Internet?
(3,2) A multiple cable break isolates North Africa, the
Middle East, and India.
Why Back Up?
(4,4) The malicious act of an IT
manager deletes his company's database and forces the company to
close its doors.
Will You Have
Internet Access After July 9, 2012?
(7,5) A recent FBI sting took down rogue DNS servers and
substituted good servers until July 9th.
Windows Azure Cloud Succumbs to Leap Year
(7,3) As the clock ticked to February 29th,
the Azure Cloud went down for 32 hours.
Windows Azure Downed by a Single Point of Failure
(8,11) Azure developers could not run new
applications after a failed Microsoft update.
Availability Award (5,10) This
year's winner is Bank-Verlag with runners up Belgacom and
NonStop Advanced Technical Bootcamp
(8,9) To be held in San Jose in November, the NonStop
Bootcamp is the premier NonStop annual meeting.
Fast Failover in Active/Active Systems - Part 1
(4,8) Using user and network redirection to
fail over in subseconds.
Fast Failover in Active/Active Systems - Part 2
(4,9) Using server redirection to fail over
Best Practices (2,1) Tips from
those who have achieved near-continuous availability.
Capacity Exhaustion (7,7) A
strikingly simple graphic display forecasts capacity peaks by
the hour over the year.
Notworks (4,1) A network that
doesn't work is a "notwork." Protect your network with a good
Backup Is More Than Backing Up (4,5)
Backing up a database is an exercise in futility if you can't
restore the database.
Can 10,000 Chickens Replace Your Tractor?
(1,3) Save money by replacing your mainframe with
clusters - Not!
Can You Trust the Compute Cloud?
(3,8) What will it take to make cloud
computing the data utility of the future?
Centers (4,11) Google and Yahoo!
locate new data centers in the north country to take advantage
of "free cooling."
Choosing a Business Continuity Solution - Part 1
(6,7) What measures of availability are
important to your organization?
Choosing a Business Continuity Solution - Part 2
(6,8) Data replication is the fundamental
force behind system availability.
Choosing a Business Continuity Solution - Part 3
(6,9) Data replication leads to several
highly available architectures.
Choosing a Business Continuity Solution - Part 4
(6,10) Choosing a highly available
architecture to meet your availability needs.
Continuous Availability Featured at HPTF 2009
(4,6) Presentations include many
continuous availability and high availability talks.
Destructive Ransomware (8,11)
CryptoLocker encrypts your files until you pay a ransom.
Surpass Terrorism (8,3) The U.S.
government says that in 2013, cyber threats surpassed terrorism
as the top security concern.
Data Center Cooling
Nature's Way (5,5) Data centers
cut electric bills in half by replacing chillers with air
Data Center in a Box (4,7) Your
next visit to a data center may be to the warehouse district.
Center Monitoring with Open-Source Nagios
(6,11) Including NonStop systems in
open-source "single pane of glass" monitoring.
Deduplication (6,2) Data
deduplication can reduce backup storage and disaster-recovery
bandwidth requirements by a factor of 20:1.
Attacks on the Rise (8,4) 2012
saw a 53% rise in DDoS attacks with greatly increased malicious
Homeland Security: Disable Java (8,2)
A serious vulnerability in Java 7 means that it should be
removed from browsers.
System (1,2) Documentation is a
necessary evil. Let's focus on the "necessary" and not the
Data Replication Eliminate the Need for Backups?
(5,11) Data replication protects
operations; data backup protects data.
Fall World 2010 Business Continuity Conference
(5,8) A week-long conference in September,
2010, focusing on Business Continuity.
Spring World 2011 Business Continuity Conference
(6,2) A week-long conference in March,
2011, focusing on Business Continuity.
Fall World 2011 Business Continuity Conference
(6,8) A week-long conference in September,
2011, focusing on Business Continuity.
Spring World 2012 Business Continuity Conference
(7,1) A week-long conference in March,
2012, focusing on Business Continuity.
Enterprise Availability Architectures for Business-Critical
Achieving the proper balance of availability and cost.
Employees Are New Targets (7,11)
The FBI warns that cybercriminals are moving from corporate IT
systems to corporate employees.
Extreme-Green Data Centers (3,12)
Wave motion and seawater may power and cool data centers in the
Collisions in Asynchronous Replication
(5,9) An update on data collision avoidance, detection,
Availability Topics at HP Discover 2011
(6,5) Over two dozen presentations on high-availability
topics will be presented in Las Vegas in June, 2011.
Your Readiness Plans Stack Up? (6,1)
Compare your disaster recovery plans with those of 300 other
CloudSystem (7,2) Companies can
convert their current IT assets into a private cloud that can
burst into public clouds.
2011 (6,3) HP Discover 2011 is
HP's major annual marketing and educational event, held in Las
Vegas June 6th to June 10th, 2011.
HP's Project Odyssey
- Migrating Mission Critical to x86
(7,3) HP is moving HP-UX high availability features to
Intel's Xeon x86 chip.
Blows Up Data Center (2,8) An explosive
demonstration of fast recovery.
Recovery-as-a-Service (7,6) HP's cloud-based recovery service provides fast RTOs and short
RPOs with no upfront capital expenditures.
Humanizing Three 9s (2,9) What if
we lived in a world of three 9s?
with Ron LaPedis on NonStop with XP Storage
(2,5) How to improve NonStop reliability by using a SAN.
IPv6 Is Here -
Like It or Not (6,4) Some tips
from a father of the Internet on the simple ways to convert from
IPv4 to IPv6.
Maintenance Preventive? (7,10)
Major IT faults have been caused by preventive maintenance
errors. Is PM worth it?
ISO 22301 - The New
Business Continuity Management Standard
(7,10) The first business continuity specification to be
issued by ISO.
The Harsh Teacher (2,6) The most
powerful Gulf storm in 200 years showed us how unprepared we
were for such a disaster.
(7,12) If your system begins to overload,
how do you determine what load to shed?
Malware as a Service
(6,12) Powerful hacking software is
becoming just a click away.
Maximizing Availability in Everyday Systems
(5,7) Even if you don't have a
redundant system, there are things you can do to minimize
Microrebooting for Fast Recovery
(2,3) An application of Recovery-Oriented Computing.
Threats to Corporate Networks (8,7)
Mobile devices are a convenience for employees but a security
threat for corporations.
Camp is Coming in October (7,8)
The NonStop Community will gather in San Jose from October 14th
through October 16th, 2012.
On Blogs and Discussion Groups (2,10)
Online forums can be a big boost to your professional growth.
OpenStack - The
Open Cloud (7,4) A major
open-source initiative may take us one step closer to a true
worldwide compute utility.
Boot Camp Is Coming in March (8,2)
It will be held for four days from March 18th through March 21st
in Bedford, Massachusetts.
Massive Security Patch for Java (8,5)
Following DHS recommendation to disable Java, Oracle releases 42
critical security updates.
Recovery-Oriented Computing (2,2)
If recovery time can be made small enough, users will perceive a
(3,1) How to get messages over LAN and WAN multicast
networks without message loss.
Retail Web Sites
Losing Millions to Poor Response Time
(7,1) Slowness is worse than downtime - it makes people
hate your site.
Roll-Your-Own Replication Engine - Part 1
(5,1) What does it take to build your own
replication engine? Lots!
Roll-Your-Own Replication Engines - Part 2
(5,2) Issues with asynchronous and
synchronous replication engines.
of Availability - Part 1 (3,3)
The first set of common rules of availability from our books,
Breaking the Availability Barrier.
of Availability - Part 2 (3,5)
More common rules of availability from our books,
Breaking the Availability Barrier.
of Availability - Part 3 (3,7)
Concluding the common rules of availability from our books,
Breaking the Availability Barrier.
Superstorm Sandy Survivors (8,6)
How three companies in the path of Sandy kept their systems and
Synchronous Replication Recovery Strategies
(5,3) Bringing a failed database copy back
on line under synchronous replication.
Most Exploitable Programming Errors
(8,2) A detailed list of the programming errors that
expose the most vulnerable security holes.
Value of Availability (6,6)
Downtime costs are based on the likelihood, duration, impact,
and cost of each risk factor taken individually.
Transaction-Oriented Computing (2,4)
Old art to some, new to others, transaction processing is the
foundation for high availability.
Detector (5,4) The U.S.
Geological Survey is mining tweets to get instant notification
VRRP - Virtual Router Redundancy Protocol
(3,10) Adding transparent failure detection and failover
at the first hop.
Really Caused the Windows Azure Outage?
(7,5) The Windows Azure cloud was taken down by a simple
Retirement a Hackers' Boon (8,10)
Hackers are holding zero-day attacks until Microsoft no longer
provides security fixes.
100% Uptime, Do I Need a Business Continuity Plan?
(1,1) You'd better believe it.
Active/Active Full Day Seminar at HPTF
(4,4) Dr. Bill speaks on active/active theory and practice at
the 2009 HPTF conference.
Active/Active on Commodity Servers
Why has active/active technology not made it into the commodity
Clusters (2,5) For high
availability, clusters are mature; but active/active systems
provide greater reliability.
Systems - A Taxonomy (3,9)
Classifying the many ways to build an active/active system.
Availability to Performance Benchmarks
(2,9) Recovery time is the proper metric to use for an
Amazon's Availability Zones (6,11)
Critical applications can run reliably in the cloud by
distributing them across Amazon Availability Zones.
About Continuous Processing Architectures
(1,1) CPA can get you
arbitrarily close to 100% uptime.
Anatomy of a
DDoS Attack (8,4) DDoS attacks
take down web sites by aiming traffic at various levels in the
Anti-Virus - A Single Point of Failure?
(5,5) McAfee's malicious anti-virus update takes down
millions of computers in a flash.
Replication Engines (1,2) These engines
power most of today's active/active systems.
Availability versus Performance (2,8)
Is it time to trade higher availability for reduced performance?
Database of Record (3,11) Which
database copy in an active/active network is the "single version
Collision Detection and Resolution
(2,4) What do you do if you can't avoid collisions when
using bidirectional replication?
Court Decides - HP
1, Oracle 0 (7,8) Judge finds
Oracle arguments a Seinfeld sitcom, orders continued Oracle
support of HP Itanium servers.
Defining Active/Active (4,12) Can
we agree on what active/active architectures are? Add your
comments to this ongoing effort.
Defining Active/Active - Revision 1
(5,1) Revision 1 of our definition based on suggestions
posted to our LinkedIn Continuous Availability Forum.
Eavesdropping on the Internet (4,3)
A vulnerability in the Border Gateway Protocol allows nefarious
sites to read your Internet traffic.
Tolerance for Virtual Environments - Part 1
(3,3) How virtualization can significantly reduce
data center capital and operating costs.
Tolerance for Virtual Environments - Part 2
(3,4) Operating system and bare metal hypervisors.
Tolerance for Virtual Environments - Part 3
(3,6) Hardening virtual environments
with failover and fault-tolerance.
Suppressant's Impact on Hard Disks
(6,2) Fire alarm sirens in the data center are fingered as
the culprit in hard-disk damage.
Services - Information Sharing & Analysis Center
(7,10) A member-owned industry forum for
sharing security threats.
Replication (2,1) Replicating at
the hardware level does not maintain database consistency.
Help! My Data Center is Down! - Part 1: Power Outages
(6,10) Unusual data center outages caused
by power failures.
Help! My Data Center is Down! - Part 2: Storage Outages
(6,11) Unusual data center outages caused
by storage system failures.
Help! My Data Center is Down! - Part 3: Internet Outages
(6,12) Unusual data center outages caused by Internet failures.
Help! My Data Center is Down! - Part
4: Intranet Outages
(7,1) Unusual data center outages caused by intranet failures.
Help! My Data Center is Down! - Part 5: Upgrades
(7,2) Unusual data center outages caused by
upgrades gone wrong.
Help! My Data Center is Down! - Part
6: The Human Factor (7,3) Unusual
data center outages caused by fat fingers.
Help! My Data Center is Down! - Part
7: Lessons Learned (7,4) The lessons we
can learn from the data center failures of Parts 1 to 6.
HP Clarifies the
Future of OpenVMS (8,7) OpenVMS
will be supported by HP for years to come.
Fire-Prevention Systems (6,1) Why
drown your servers after a fire breaks out? Keep the fire from
starting in the first place.
Is the Cost of
Converting to Active/Active Worth It?
(4,11) Offsetting the cost of conversion with the cost of
Official! Leap Day Caused the Windows Azure Outage
(7,5) Incrementing the year by one to get
next year's date took down the Azure cloud.
Jim Gray - In
Memoriam (3,7) The database
pioneer who set the stage for active/active systems is lost at
Get an Availability Benchmark (2,6)
Great performance is meaningless if the system is unavailable.
Leveraging Virtualization for Availability
(5,12) With many eggs in one basket, system
availability becomes all the more important.
Leap-Second Bug Takes Down Data Centers
(7,8) A leap second added on June 30, 2012, takes
down unpatched Linux systems worldwide.
Communication During a Crisis (6,5)
Don't create a second crisis by letting the press publish
erroneous and damaging stories.
Your Application to Active/Active
(2,3) What must you do to prepare your application for an
NonStop Symposium and
the OpenVMS Bootcamp (5,7) After
being gone for a year, the exclusive NonStop and OpenVMS venues
are back for 2010.
Ponemon on Live Threat Analysis (8,11)
Intelligence data about cyber threats happening right now is
crucial to stopping cyberattacks.
Recovery-as-a-Service (8,2) RaaS
provides backup service in the cloud for critical applications.
Reducing Pharmaceutical Pollution (8,1)
Monitoring of pharmaceutical processing practices to reduce
pollution requires high availability computing.
Remembering Ken Olsen - An IT Icon
(6,3) The founder of Digital Equipment Corp., Ken (1926 -
2011) brought interactive computing to the individual.
Availability and Performance (6,4)
Social media is becoming critical in our daily lives. It is time
for it to grow up.
Threat Report 2013 (8,6) The
major security threats of 2013 for businesses and individuals.
(4,10) How good are our spam filters, and why does spam
Bets $50,000 That You Won't Be Down
(5,1) Buy an ftServer by February 26, 2010, and Stratus
will give you $50K if it fails in the first six months.
Puts $50,000 Where its Mouth Is - Again
(6,12) Stratus' offer to pay you $50,000 if your
ftServer/vSphere application fails expires 12/31/11.
Stratus Puts $50,000 Where its Mouth Is - an Update
(7,2) Stratus extends its $50,000 ftServer/vSphere
availability offer for another year to 12/31/12.
Synchronous Replication (1,3)
Avoid data collisions and data loss following a node failure.
Availability Matrix (6,1)
Simplify your data center availability configurations using the
independence of RTO and RPO.
The Causes of
Outages (8,3) 250 Never Again
stories tell us the proportion of outages due to hardware,
software, humans, networks, and other faults.
Cloud (4,6) This new
computing paradigm might ultimately replace corporate data
centers if it can ever be made reliable.
Internet (4,5) Can you
trust your mission-critical applications solely to the Internet?
We think not.
History of Fault Tolerance (1,2) The
fault-tolerant marketplace was hot in 1984.
The Malware Threat to
Android (7,9) With its
major market share and unvetted apps, Android is the prime smart
phone target for hackers.
Doomsday (4,8) The Internet
Protocol Version 4 is about to run out of its four billion
addresses in two years. What now?
Ubiquitous Internet (4,7) 1.5
billion users, 200 million web sites, and one million viruses
depend on the Internet
Synchronization for Distributed Systems - Part 1
(2,11) How does NTP calculate the time offset from a time
Synchronization for Distributed Systems - Part 2
(2,12) How NTP minimizes time offset
Time Synchronization for Distributed
Systems - Part 3 (3,2) Logical clocks
offer an option for synchronizing systems.
Replication (2,2) A simple
approach to active/active systems
has scalability issues.
Tussling with the
Word "Redundant" (3,12)
"Redundant" doesn't always translate the same to those in
Acceleration and EMI (5,4) If
testing doesn't show it, does that prove that EMI can't make an
engine computer misbehave?
Availability Benchmark (3,10)
Making use of a recovery time benchmark to influence your system
Suppression Test Fiasco - An Update
(5,9) Why did the accidental activation of the fire
suppression system destroy so many disks?
(1,1) Active/active architectures
can give subsecond recovery following a failure.
What is Reliability?
(5,6) We can get rid of marketeering by
quantifying highly-reliable computer systems by their
the Availability Barrier? (5,3)
Mean time to recover. Let it fail but fix it
Concern - MTR or MTBF? (5,11)
Recovery time is for users, failure intervals are for system
operators, availability is for management.
Worsing on Worsening
(4,2) A 1967 chewing-out of IBM's Field
Service staff resounds still today.
Business Continuity Survey (3,4)
A look at 150 small to large companies and their BC/DR plans and
Archive Storage -
Disk or Tape? (5,11) Disk
provides fast recovery from backups, and tape provides
economical long-term archiving.
Redundancy (7,5) How geographic
redundancy can improve service availability and reliability of
Rewiring the World from Edison to Google
(6,9) The cloud compute utility is
following in the tracks of the electric utility.
Blueprints for High
Availability: Designing Resilient Distributed Systems
(2,5) All you ever wanted to know about
Breaking the Availability Barrier
(3,5) Everything you ever wanted to know about
active/active systems - theory, implementation, and practice.
Business Continuity from A to Z (5,12)
The online book explores the responsibilities of the
stakeholders in the business continuity plan.
Continuity Planning: IT Examination Handbook
What better way to learn about BCP than from the auditor's
Business Continuity Today (4,3)
This freely available living eBook covers a broad range of business
Continuous Availability Systems Design Guide
(2,1) What to do if you want to move to CPA.
Distributed Systems: Principles and
Paradigms (3,1) A thorough treatment
of requirements for distributed system transparency.
the Computer Room, What Now? (2,6)
Are you prepared for a total loss of your data center because of
a fire or other disaster?
Network Fundamentals (4,4)
A practical guide to predicting network availability (especially
for the mathematically challenged).
Megaplex: An Odyssey
of Innovation (4,12) Tandem is
35 years old. The Standish Group looks back on 35 years of
Modeling: The Future of NonStop Demand
(5,10) Standish Group envisions critical and non-critical
applications sharing the same blades.
Systems: Gateways, Interfaces, & the Incremental Approach
systems must be decomposed to migrate to active/active.
Mission-Critical Network Planning (4,9)
A broad review of redundancy in servers, networks, storage, data
centers, and power.
Multiple Processor Systems for Real-Time Applications
(2,10) A classic treatise on distributed systems
that is still pertinent two decades later.
Response Planning (4,10) How will
your company continue operations if the Swine Flu hits with a vengeance?
the Megaplex (5,7) The six
steps that will modernize your vertical NonStop applications for
the open world of horizontal services.
Tandem Computers Unplugged: A People's History
(7,7) Tandem from 1975 till 1997 as seen through
the eyes of its employees.
TCP/IP Illustrated, Volume 1: The Protocols
(4,11) The "bible" of the TCP/IP Protocol
Suite, the glue that binds active/active systems.
Business and Economics of Linux and Open Source
(1,3) Open source demystified for the
Disaster Recovery Journal (5,2)
The resource for business continuity professionals.
The Unified Modeling Language User Guide
(1,2) UML is now the accepted standard for fast
and easy documentation of systems and procedures.
Zero Downtime: High Availability Blueprints
(2,8) A close look at installing Microsoft clusters and
Transaction Processing: Concepts and Techniques
(2,4) The classic book on transaction processing
systems, by Jim Gray and Andreas Reuter.
Unix Backup and
Recovery (2,2) Backing up is a
pain, but it is the restore that counts.
Suite (5,12) Data access, data
federation, and data movement combine to make data and services
available across the enterprise.
Date Testing - Leap Day and More (7,5)
Many products exist to test applications for proper processing
of critical dates.
EMC's SRDF Data-Replication Engine
(6,4) Maintain a consistent asynchronous or synchronous
target copy of a database with no server involvement.
FalconStor RecoverTrac - Automated Disaster Recovery
(7,6) Build your own recovery cloud
supporting heterogeneous environments.
Windows and Linux from Stratus (2,9)
ftServers provide transparent fault-tolerant operation.
FileSync and CSR
Synchronize NonStop Systems: Part 1 - FileSync
(6,10) FileSync replicates changed files or
file changes between systems.
FileSync and CSR Synchronize NonStop Systems: Part 2 - CSR
(6,11) Command Stream Replicator repeats
operator actions on remote systems.
Availability Options with GoldenGate's TDM
(2,2) Implement a variety of data-sharing
topologies with TDM's data replication facilities.
GRIDSCALE - A Virtualized Distributed Database
(3,7) Like presentation and application servers, pooled
database servers for the three-tier architecture.
Much Will Active/Active Cost Me?
(1,1) The cost of downtime can swing your
HP's NonStop Blades
(3,8) NonStop fault-tolerant fundamentals come to HP's
HP's NonStop Synchronous Gateway (4,6)
Finally, NonStop synchronous data replication might be on its
HP's Reliable Transaction Router (5,5)
Reliable transaction messaging services between Windows, Linux,
OpenVMS, and HP-UX systems.
ServiceGuard Clustering Facility (2,5)
Managing HP-UX and Linux clusters.
Master/Slave Replication with Continuent's Tungsten
(4,5) Asynchronous replication between MySQL and Oracle.
MySQL Clusters Go Active/Active (1,3)
Clusters of storage nodes are kept in sync by synchronous
Open-Source Monitoring for HP NonStop
(8,3) Manage your NonStop systems along with your Windows,
Linux, and Unix systems with Nagios.
Windows Applications (5,6)
Automated failover is provided for popular Windows applications
like Exchange, SharePoint, SQL Server, and IIS.
NonStop AutoSYNC -
Eliminating Configuration Drift (6,8)
Backup system configurations must be kept synchronized with
their production systems.
Split-Site Clusters (3,6) OpenVMS
Clusters provide active/active operation with synchronous
OpenVMS Emulation on PCs (8,9)
vtAlpha and vtVAX emulate Alpha and VAX hardware on multicore
x86 PCs with no software changes.
Oracle Data Replication (6,9)
Data Guard, Streams, or GoldenGate - Which replication engine
should be used when?
- Fault Tolerance from IBM (3,4) IBM's
Parallel Sysplex offers localized active/active
Computing Offers Beowulf
Clustering on Linux (2,1) NASA's
Beowulf clustering is available on Linux with
Penguin's HPC servers.
Prolexic - A DDoS Mitigation Services
Provider (8,4) Prolexic protects
companies from DDoS attacks via a network of scrubbing centers.
mission-critical applications with HP Serviceguard Solutions for
Linux (8,10) A capability of HP's
Raima's High-Availability Embedded Database
(6,12) A microprocessor embedded database
with SQL capabilities offering five 9s availability.
Balancing for High Availability (8,7)
Loadbalancer.org's redundant load balancers eliminate a single
point of intranet failure.
Windows and Linux Environments with Double-Take
(4,8) Replicate entire servers with
incremental file-system updates.
Scaling MySQL with
Continuent's uni/cluster (3,11)
Synchronous replication of update queries and distribution of
Brings Five 9s to MySQL (7,1)
Significant extensions to MySQL to improve its availability and
Shadowbase - The
Active/Active Solution (2,3) Shadowbase provides fast data replication as well as
online copy and database resynchronization.
100-Microsecond Link Failover (6,3)
Field-Programmable Gate Arrays protect redundant Ethernet links
with 100 usec. failover.
solidDB - a Five 9s Memory-Resident
Database (3,5) Server
memory is getting so large, why not keep your database in high
Stratus Avance Brings Availability to the Edge
(4,2) If downtime in a branch office costs
as little as $1,000 per hour, Avance can pay for itself in a
ftServer Flexes Its Recovery Muscle
(5,8) Independent testing measures scalability and
demonstrates no impact due to catastrophic failure.
Surviving DNS DDoS
Attacks (8,11) The Secure64 DNS
Authority server detects and blocks DDoS traffic while
continuing to respond to DNS queries.
Synchronization for NonStop Servers
(0211) NTP products from Bowden Systems and HP for
Tape - Getting Rid of a Troublesome Medium
(1,2) The backup paradigm is changing.
Tape for NonStop Servers with ETI-NET's EZX-BackBox
(2,6) Virtual tape made super-fast with
Virtual Transactions with NonStop AutoTMF
(2,4) Converting nontransactional applications
to transactional applications.
from TANDsoft (4,1) The OPTA2000
Time Simulator lets multiple applications run on the same
NonStop system with different clocks.
Server Failover Clustering (5,4)
Microsoft's successor to MSCS adds simplified cluster management
and improved geographical dispersion.
Availability - Redundant Systems (1,1)
rules come out of the derivation of the availability equation.
Calculating Availability - Repair Strategies
(1,2) Your repair policy can have a significant
impact on your system availability.
Calculating Availability - The Three Rs (1,3)
Node repair, node recovery, and system restore are all required.
Calculating Availability - Hardware/Software
Faults (2,1) Most faults
don't need a repair.
Calculating Availability - Failover (2,2)
When a system is failing over, it is often effectively down,
thus reducing availability.
Calculating Availability - Failover Faults
(2,3) Failovers can fail also.
Calculating Availability - Environmental Faults
(2,4) How to handle hurricanes, power failures,
and riots when calculating availability.
Calculating Availability - Cluster Availability
(2,5) How does the availability of a cluster compare to
that of an active/active system?
Calculating Availability - Nodes, Subsystems, and Systems
(2,6) When is a node a system, and when is
it a subsystem?
Calculating Availability - Failure State Diagrams
(2,9) Formalizing our intuitive
Calculating Availability - Heterogeneous Systems - Part 1
(3,3) Probability 101 in preparation for analyzing
systems with heterogeneous nodes.
Calculating Availability - Heterogeneous Systems - Part 2
(3,5) The availability of redundant systems with
different nodal availabilities.
Calculating Availability - Heterogeneous Systems - Part
3 (3,6) Analyzing complex
configurations of system components.
Calculating Availability - Heterogeneous
Systems - Part 4 (3,8) Demonstrating
that systems with century uptimes can be configured.
RPO (5,3) An RPO is the
probability that data loss following a node failure is less
than a specified amount. How can we verify this?
to Meet a Performance SLA - Part 1
(3,12) What size server is needed to provide a response
time of 200 msec. 98% of the time?
to Meet a Performance SLA - Part 2
(4,1) Comparing the performance of single-server systems to
to Meet a Performance SLA - Part 3
(4,2) Answering the SLA specification for servers with
exponential service times.
Configuring to Meet a Performance SLA - Part 4
(4,3) Answering the SLA specification for
servers with arbitrary service times.
to Meet a Performance SLA - Part 5
(4,4) Answering the SLA specification for multiple servers
Estimating Data Collision Rates (2,8)
Can you go active/active with a tolerable level of data
Failure State Diagrams - Repair Strategies
(2,10) The real story behind sequential repair
and parallel repair.
Failure State Diagrams - Recovery Following Repair
(2,12) The formal analysis of the impact of having to
recover a node after its repair.
Failure State Diagrams - Hardware/Software
Faults Revisited (3,2) Our intuitive
results were a little simplistic.
Repair Really Better Than Sequential Repair?
(3,4) A Digest reader points out that the answer depends upon the
repair time distribution.
Reliability Diagrams (6,7) Complex
systems can be analyzed via reliability diagrams as sets of
parallel (redundant) and serial components.
SAP on VMware High
Availability Analysis (7,12) Our
availability analysis is used to predict the availability of
VMware ESXi clusters.
Analysis - Part 1 (5,10) Users
aren't down just because two nodes fail. They are also down
waiting for a backup system to take over.
Simplifying Failover Analysis - Part 2
(6,6) Extending failover analysis to complex multinode
Reliability Models (8,8) Modeling
software reliability is a complex task upon which there is not
The Cost of RPO
and RTO (7,9) What is the optimum
architecture for minimizing the costs of downtime and lost data?
That Nerd Logo?
(1,1) Our logo
really has a meaning. Find out why it describes active/active
Active/Active Systems So Reliable?
(3,9) Analyzing the impact of resubmitting transactions
rather than bringing up a backup system.
@availabilitydig - The
Twitter Feed of Outages (8,8)
@availabilitydig - The September, 2013,
Twitter Feed of Outages (8,9)
@availabilitydig - The
Twitter Feed of Outages (8,10)
@availabilitydig - The
Twitter Feed of Outages (8,11)