Icebergs Ahead – Disaster Recovery Plans and Executions

Original Article posted on The Analytical Scientist, June 21, 2016
Article #501  |  Issue #0616

Are you ready for the “unthinkable” – your laboratory computer systems suffering a titanic failure?

Disaster recovery plans are crucial insurance against the unexpected.

Google’s self-reported 99.984 percent uptime and the non-existent interruptions claimed by other major providers encourage us to take the availability of laboratory information management systems (LIMS) and other critical software tools for granted. The known risks – upgrades, planned downtime for system maintenance, and the occasional power outage that lasts long enough to test the effectiveness of a fossil-fuel-powered backup generator – may be included in a company’s disaster recovery (DR) plan. However, when was the last time you checked that plan? And perhaps more importantly – have you planned for the unexpected?

With record-retention mandates differing worldwide, organizations cannot be too risk averse in their backup retention policies. Falling short can have both legal and financial consequences. For example, laboratories that monitor air quality and waste output must be able to produce records and data readily whenever the Environmental Protection Agency (EPA), the Occupational Safety and Health Administration (OSHA), or another regulatory authority requests them. Failure to do so can result in fines, shutdowns, loss of revenue and even legal action.

Audits and virtual machines

I recently audited a small laboratory that was not using virtual machines. The audit revealed multiple deficiencies that left the lab vulnerable to data loss, whether from a power outage or, more commonly, operator error. Below are a few of the findings, with recommended action items to correct each one:

Unmaintained battery backups

A preventive measure against data loss is to test battery backups and verify that they will last long enough either for the backup generator to kick in or for a graceful server shutdown.
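
As a rough illustration, the sketch below polls a UPS for its estimated runtime and begins a graceful shutdown when too little remains. It assumes the open-source Network UPS Tools client (upsc) and a hypothetically named UPS; adapt the names and threshold to your own site.

```python
# A minimal sketch, assuming Network UPS Tools ("upsc") and a UPS
# configured under the hypothetical name "labups".
import subprocess

UPS = "labups@localhost"       # hypothetical UPS name in the NUT configuration
MIN_RUNTIME_SECONDS = 600      # enough for a generator start or a clean shutdown

def ups_vars(ups: str) -> dict:
    """Return the variables upsc reports, e.g. 'battery.runtime': '1320'."""
    out = subprocess.run(["upsc", ups], capture_output=True, text=True,
                         check=True).stdout
    return dict(line.split(": ", 1) for line in out.splitlines() if ": " in line)

status = ups_vars(UPS)
runtime = int(float(status.get("battery.runtime", "0")))  # estimated seconds left
on_battery = "OB" in status.get("ups.status", "").split()

if on_battery and runtime < MIN_RUNTIME_SECONDS:
    # On battery with too little runtime left: shut down gracefully.
    subprocess.run(["shutdown", "-h", "+1", "UPS battery low"], check=False)
```

Scheduled alongside periodic runtime tests under load, a script like this closes the gap between “the UPS works” and “the servers survive an outage.”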

Incorrect user permissions

Without correct permissions, the door is open to accidental data loss or unwanted updates. I found that users had the ability to edit data throughout the LIMS even though their work was focused on only a subset of its functionality. In my experience, “operator error” is the most common cause of record loss, so addressing this shortcoming can prevent the need for many last-minute record recoveries.
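
What “correct permissions” looks like varies by LIMS, but the principle is easy to sketch. The minimal example below, with invented role and module names, grants each role edit rights only in the modules its users actually work in:

```python
# A minimal least-privilege sketch; role and module names are invented.
from enum import Flag, auto

class Perm(Flag):
    NONE = 0
    READ = auto()
    EDIT = auto()

# Each role gets edit rights only in the modules its users work in.
GRANTS = {
    ("analyst", "results_entry"): Perm.READ | Perm.EDIT,
    ("analyst", "stability"): Perm.READ,               # read-only elsewhere
    ("reviewer", "results_entry"): Perm.READ,
}

def allowed(role: str, module: str, perm: Perm) -> bool:
    return perm in GRANTS.get((role, module), Perm.NONE)

assert allowed("analyst", "results_entry", Perm.EDIT)
assert not allowed("analyst", "stability", Perm.EDIT)  # edit blocked out of scope
```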

“With record-retention mandates differing worldwide, organizations cannot be too risk averse in their backup retention policies.”

Single points of recovery failure

When reviewing the DR process, I found multiple single points of failure in the backup strategy. A single method was used: a full daily backup of all files on the server. Although this allowed single-file recoveries, it did not allow full server recovery, because files that were open on the server were not being backed up; the backup would only have been viable if recovery scripts existed to rebuild the server, after which the files could be restored onto it. The corrective actions were to run a full backup on the weekend with incremental server backups daily during the week, to create recovery scripts for rebuilding the servers, and to establish tape rotations that provide multiple recovery points.
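
As a minimal sketch of the corrected schedule, the script below takes a full backup on Saturdays and incremental backups the rest of the week using GNU tar’s listed-incremental mode; the data paths and backup destination are placeholders.

```python
# A minimal sketch using GNU tar's incremental mode; paths are placeholders.
import datetime
import os
import subprocess

BACKUP_ROOT = "/backups"                     # hypothetical destination
DATA_DIRS = ["/opt/lims", "/var/lib/lims"]   # hypothetical server data paths
SNAPSHOT = os.path.join(BACKUP_ROOT, "snapshot.snar")  # tar's incremental state

def run_backup(today: datetime.date) -> None:
    """Full backup on Saturday; incrementals against it the rest of the week."""
    full = today.weekday() == 5                        # Saturday
    if full and os.path.exists(SNAPSHOT):
        os.remove(SNAPSHOT)       # a fresh state file forces a level-0 (full) dump
    name = f"{'full' if full else 'incr'}-{today.isoformat()}.tar.gz"
    subprocess.run(
        ["tar", "--create", "--gzip",
         f"--file={os.path.join(BACKUP_ROOT, name)}",
         f"--listed-incremental={SNAPSHOT}", *DATA_DIRS],
        check=True,
    )

run_backup(datetime.date.today())
```

Deleting the snapshot state file before the weekend run is what forces tar to take a full dump; weekday runs then capture only what changed since Saturday.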

Invalid backup schedules

There were scheduling conflicts between the database backups to disk and the server file-level backups. I found that the database hot or cold backups and exports were active during the backup of the server files, which meant that the database was never fully backed up to tape. The corrective action was to map out the start and end times of each database and server file backup, determine where they overlapped, and change the times to prevent conflicts. If a conflict could not be avoided, other approaches would have been needed, such as staggering the backup start times or backing the databases up directly to tape.
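
The mapping exercise is easy to automate. The sketch below, with invented job names and windows, represents each backup job as a start and end time (splitting any window that crosses midnight) and flags every pair that overlaps:

```python
# Invented job names and windows; real values come from your schedules.
from datetime import time
from itertools import combinations

jobs = {
    "db_hot_backup":    (time(22, 0),  time(23, 30)),
    "db_export":        (time(23, 0),  time(23, 45)),
    "server_file_tape": (time(23, 15), time(2, 0)),   # crosses midnight
}

def minutes(t: time) -> int:
    return t.hour * 60 + t.minute

def spans(start: time, end: time) -> list:
    """One or two same-day intervals, splitting windows that cross midnight."""
    s, e = minutes(start), minutes(end)
    return [(s, e)] if s <= e else [(s, 24 * 60), (0, e)]

def overlaps(a: list, b: list) -> bool:
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)

for (n1, w1), (n2, w2) in combinations(jobs.items(), 2):
    if overlaps(spans(*w1), spans(*w2)):
        print(f"conflict: {n1} overlaps {n2}")
```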

No tape rotations

I found that backup tape rotations were inconsistent. There are several reasons to rotate tapes: one is to save money by reusing them; another is to set aside weekly or monthly tapes for the remainder of the accepted retention period. The correction is to issue a policy under which certain tapes are set aside for an agreed period. In this case, one tape per week was stored for an entire month, and after six months only one tape per month was stored for the duration of the company’s retention period.
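
Under one reading of such a policy (the one-month window and seven-year retention period below are purely illustrative), a small function can decide whether a given tape is still held or released for reuse:

```python
# A minimal sketch of a tape-retention rule; thresholds are illustrative.
from datetime import date

WEEKLY_KEEP_DAYS = 31        # weekly tapes are held for a month
RETENTION_DAYS = 365 * 7     # hypothetical seven-year retention period

def keep_tape(made: date, today: date) -> bool:
    age = (today - made).days
    if age > RETENTION_DAYS:
        return False                         # past retention: release for reuse
    if made.day <= 7 and made.weekday() == 4:
        return True                          # first Friday = the monthly tape
    return made.weekday() == 4 and age <= WEEKLY_KEEP_DAYS  # weekly tapes

today = date(2016, 6, 21)
print(keep_tape(date(2016, 6, 17), today))   # last week's weekly tape: kept
print(keep_tape(date(2016, 5, 13), today))   # older weekly tape: released
print(keep_tape(date(2016, 1, 1), today))    # January's monthly tape: kept
```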

Missing “test and verify” backup plans

The disaster recovery plan had never actually been executed to verify that the backups could be recovered. What good is the effort and money spent on developing a DR plan if the backups can’t be used to recover a lost system months or years down the road? Make sure the plan is tested and verified to work properly.
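
Verification can be made routine. The sketch below restores an archive into a scratch directory and compares file checksums against the live copy; the archive name and paths are placeholders, and it assumes the check runs soon after the backup so the live data has not drifted:

```python
# A minimal "test and verify" sketch; archive and paths are hypothetical.
import hashlib
import pathlib
import subprocess
import tempfile

ARCHIVE = "/backups/full-2016-06-18.tar.gz"   # hypothetical backup to test
SOURCE = pathlib.Path("/opt/lims")            # hypothetical live data root

def digest(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as scratch:
    # Restore the archive into a throwaway directory, never over live data.
    subprocess.run(["tar", "--extract", f"--file={ARCHIVE}",
                    f"--directory={scratch}"], check=True)
    restored = pathlib.Path(scratch) / SOURCE.relative_to("/")

    def differs(live: pathlib.Path) -> bool:
        copy = restored / live.relative_to(SOURCE)
        return not copy.is_file() or digest(copy) != digest(live)

    bad = [f for f in SOURCE.rglob("*") if f.is_file() and differs(f)]
    print("recovery verified" if not bad else f"{len(bad)} files failed to verify")
```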

There really is no excuse for not putting some level of backup policy in place. Many new backup strategies have emerged in recent years, especially with the popularization of virtual machines (VMs) for emulating particular computer systems. With the advancement of today’s technology – as well as past developments in database technology that enable multi-master replication and other real-time backup solutions – it is now possible to take a live snapshot of your server. Though this doesn’t necessarily provide file-level recovery, it does provide another viable backup strategy.
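
For example, on a libvirt/KVM host a live snapshot can be scripted in a few lines; the guest name below is hypothetical, and other hypervisors offer equivalent commands or APIs.

```python
# A minimal sketch: live-snapshot a guest on a libvirt/KVM host.
# The guest name "lims-server" is hypothetical.
import datetime
import subprocess

guest = "lims-server"
name = "dr-" + datetime.datetime.now().strftime("%Y%m%d-%H%M")
subprocess.run(["virsh", "snapshot-create-as", guest, name,
                "--description", "live snapshot for disaster recovery"],
               check=True)
```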

Write a comprehensive disaster recovery plan

When a company runs its business on a paperless or near-paperless platform, its successful operation relies on the availability of its servers and systems. You should be asking yourself a number of serious questions:

  • What happens if servers or systems are unavailable?
  • How long can you run your business “in the dark”?
  • Are there adequate backup and recovery plans in place?
  • Is your business risk-averse enough or do you take uptime for granted?
  • Is working without a safety net acceptable?
  • Are there contingency plans in place for a total loss to the server room or a breakdown of legacy analytical equipment?
  • What about the loss of just one server?
  • Have you checked to make sure you can actually recover from your backups?
  • Have you identified where you will acquire the parts for equipment that is no longer supported by the OEM?

Answering these questions is the true value of a disaster recovery plan, which should provide documented steps to help you take action and limit your losses.

“There really is no excuse for not putting some level of backup policy in place.”

No one really appreciates the value of insurance until it needs to be used. Similarly, the benefit of a disaster recovery plan is only realized when something happens. Developing and putting in place a working DR plan won’t increase revenues, but it is great insurance for a company’s future. It can be hard to justify the effort needed to develop a working DR model, especially if nothing has ever gone wrong in the past. However, when the unexpected does happen and no DR model is in place, a business can lose customers, lose revenue, halt growth, downsize, shut down, and even face fines and legal action.

“There is no danger that Titanic will sink. The boat is unsinkable, and nothing but inconvenience will be suffered by the passengers.” – Philip Franklin, White Star Line vice-president, 1912

Author:

Tony Lisi
Laboratory Informatics Consultant
CSols Inc.