In my long tenure as a data warehousing provider, I often find parallels and analogies in the everyday world that I can use to connect with people. This week was no exception. As I sat in church for the Easter services, a lot of messages came through. Not to bang the religion drum, but Easter is a pretty big event for many. It is about suffering, rebirth, atonement, forgiveness, resurrection, and hope, to name just a few themes.
Editor’s note: Rob Armstrong is an employee of Teradata. Teradata is a sponsor of The Smart Data Collective.
It was this message of hope that resonated with me, as I had just had the following conversation with a customer regarding their disaster recovery strategy. I asked what their current plans were for disaster recovery and how quickly they could be back to full operations. The answer was that they had a secondary machine in a nearby building onto which data could be quickly reloaded. I questioned the logic of having the "disaster" system so close to the primary, because in a REAL disaster, chances are both buildings are going to be affected.
The answer I got was, "We hope that will not happen." OK, if hope is your strategy, then why bother with any type of failover at all? You can just "hope" that nothing bad will happen.
So what makes a good, solid plan for disaster recovery? The obvious answer is that it must reflect the business need and the business tolerance for downtime. It needs to balance the value of data availability against the cost of platform redundancy.
There is a bit of a Catch-22 here: most companies are afraid to put "mission-critical" applications on the data warehouse because it often lacks robust DR, but you will never get robust DR unless there is a business need for it in the first place.
The common plan today treats the data warehouse as something less than a "tier 1" platform, but one that certainly needs to be restored within "days" of any major disaster. Worst case, you need a secondary platform where the data can be loaded from backup tapes. As I alluded to above, that secondary system should be far enough away that no single disaster can take out both systems simultaneously.
More frequently, companies are seeing enough business value out of the data warehouse that the business demands any system be recoverable in "hours". In this case, you not only have a secondary system available but are also loading data to it, so worst case you may only have a week or so of data to catch up on the DR system.
The leading-edge companies now see the data warehouse as critical. They are running operational processes and driving interactive decisions from real-time and historical data. The data warehouse cannot be down, because the business is directly impacted. In these cases, you not only have a secondary system but keep it in sync with the primary. Ideally, you are also running workload on both platforms at all times.
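The continuum above can be sketched as a small lookup table. This is a minimal, hypothetical illustration: the tier names, RTO/RPO numbers, and the cost ordering are assumptions for the example, not vendor guidance.

```python
# Hypothetical sketch of the DR continuum: restore-from-backup ("days"),
# warm standby ("hours" of downtime, up to a week of catch-up data),
# and an in-sync active-active pair. All numbers are illustrative.
DR_TIERS = {
    "restore-from-backup": {
        "rto_hours": 72,   # days to rebuild on a distant secondary system
        "rpo_hours": 24,   # data as of the last backup
        "strategy": "reload data from backup tapes onto a remote secondary",
    },
    "warm-standby": {
        "rto_hours": 8,    # hours to cut over
        "rpo_hours": 168,  # up to a week of data to catch up after failover
        "strategy": "secondary kept loaded; apply recent data after cutover",
    },
    "active-active": {
        "rto_hours": 0,    # no visible outage
        "rpo_hours": 0,    # kept in sync with the primary
        "strategy": "both platforms in sync and running workload",
    },
}

def tier_for(max_downtime_hours: float) -> str:
    """Pick the most relaxed (and so cheapest) tier whose RTO still
    meets the business's stated downtime tolerance."""
    candidates = [(name, spec) for name, spec in DR_TIERS.items()
                  if spec["rto_hours"] <= max_downtime_hours]
    return max(candidates, key=lambda item: item[1]["rto_hours"])[0]

print(tier_for(72))  # restore-from-backup
print(tier_for(48))  # warm-standby
print(tier_for(0))   # active-active
```

The point of the selection rule is the business conversation itself: the tier is chosen from the downtime tolerance, not from what IT happens to have installed.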
The important point to keep in mind is that regardless of where you are on the continuum, you have to be ready to move forward. When the business starts moving decisions into more direct and operational processes, the data warehouse must already be positioned to add the higher level of DR need.
So your plan needs to not only reflect the current need but also drive the next demand. What else is required of a good plan?
The next point to discuss today is what data needs to be available. Not all data is equal, and this drives much of that Catch-22. Perhaps all you need available in a disaster is the last 3 months of data, not the last 5 years. Providing DR for 3 months of data changes the financial equation dramatically compared with providing DR for the "whole data warehouse".
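The shift in the financial equation is simple arithmetic. The sketch below uses made-up numbers (warehouse size, cost per terabyte, uniform data growth) purely to show the shape of the calculation.

```python
# Illustrative arithmetic: DR for a 3-month "hot" subset vs. the full
# 5-year history. Volumes and cost-per-TB are placeholder assumptions.
total_years = 5
hot_months = 3
warehouse_tb = 200.0     # hypothetical full warehouse size
cost_per_tb = 1_000.0    # hypothetical annual DR cost per TB

# Assume data volume is roughly uniform over time.
hot_fraction = hot_months / (total_years * 12)
hot_tb = warehouse_tb * hot_fraction

full_dr_cost = warehouse_tb * cost_per_tb
subset_dr_cost = hot_tb * cost_per_tb

print(f"Hot subset: {hot_tb:.0f} TB ({hot_fraction:.0%} of the warehouse)")
print(f"DR cost: ${subset_dr_cost:,.0f} vs ${full_dr_cost:,.0f} for everything")
```

Under these (invented) numbers, protecting only the 3-month window means covering 5% of the volume, which is why scoping the conversation to "which data, how much history" matters before any platform decision.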
This actually helps in getting the data warehouse into the world of operational decisions. Incorporating the operational data into the warehouse clearly has business value. If you can get the operational systems to access that data in the warehouse, then you gain additional value by reducing redundancy, reducing data latency, and eliminating extra platforms and their management overhead.
So again, business and IT need to have the conversation about which data sets, and how much history, must survive a disaster. The good part is that regardless of where you put the "critical" data, you were going to have to provide DR (with secondary platforms) anyway. Now you can start to provide the more frequent DR on the warehouse for only a subset of the data. In a sense, we get to attack the Catch-22: critical data can get DR rigor within the warehouse arena, and a process is set up so that as new data and applications are identified, it is "easy" (alright, easier) to move from a single to a dual environment.
The last attribute of a good plan to discuss today is fairly simple as well, though unfortunately not always incorporated: a good plan includes testing of its processes. If you are not actually executing the DR plan on a regular, periodic schedule, then you do not have DR. The worst time to test your DR plan is in the middle of a disaster. One of my customers would routinely cause a disaster once a month somewhere in the enterprise data flow, and divisions were measured, and compensated, based on their time to recover. Are you regularly testing your DR plans?
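That "measure the time to recover" discipline can be sketched in a few lines. This is a toy harness, not real failover tooling: the drill steps here are stand-in functions, and a real drill would invoke actual procedures (redirect loads, promote the standby, validate queries).

```python
# Minimal sketch of a timed DR drill: run the recovery steps, measure
# elapsed time, and compare against the recovery-time target.
import time

def run_drill(steps, rto_target_s: float):
    """Execute drill steps in order; return (elapsed_seconds, met_target)."""
    start = time.monotonic()
    for step in steps:
        step()  # each step is a callable standing in for a real procedure
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= rto_target_s

# Stand-in steps simulating work; real drills would do actual failover.
steps = [lambda: time.sleep(0.01),   # e.g. promote the standby
         lambda: time.sleep(0.01)]   # e.g. validate critical queries
elapsed, ok = run_drill(steps, rto_target_s=5.0)
print(f"recovered in {elapsed:.2f}s, target met: {ok}")
```

The useful part is not the code but the habit it encodes: every drill produces a number, and that number is what divisions in the customer's scheme were measured, and compensated, on.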
So, some questions to consider. Do you have reasonable DR plans in place? Are they tested regularly yet unexpectedly? Have the business and IT owners set out the roadmap for more "critical" data warehouse usage? What is standing in the way of moving forward?