In the consulting world, project estimation is a critical component required for the delivery of a successful project. If you estimate correctly, you will deliver a project on time and within budget; get it wrong and you could end up over budget, with an unhappy client and a burned-out team. Project estimation for business intelligence and data integration projects is especially difficult, given the number of stakeholders involved across the organization as well as the unknowns of data complexity and quality. Add to this mix a firm fixed price RFP (request for proposal) response for a client your organization has not done work for and you have the perfect climate for a poor estimate. In this article, I share my thoughts about the best way to approach a project estimate for an extract, transform, load (ETL) project.

For those of you not familiar with ETL, it is a common technique used in data warehousing to move data from one database (the source) to another (the target). In order to accomplish this data movement, the data first must be extracted out of the source system—the “E.” Once the data extract is complete, data transformation may need to occur. For example, it may be necessary to transform a state name to a two-letter state code (Virginia to VA)—the “T.” After the data have been extracted from the source and transformed to meet the target system requirements, they can then be loaded into the target database—the “L.”
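For readers who would like to see the pattern spelled out, the sketch below illustrates the three steps in Python, using the state-code example above. It is a minimal, hypothetical illustration: the table names, columns, and lookup values are invented for demonstration and are not drawn from any particular project.

```python
import sqlite3

# Illustrative subset of a state-name-to-code lookup (the "T" example from the text).
STATE_CODES = {"Virginia": "VA", "Maryland": "MD"}

def extract(source):
    """E: pull raw rows out of the source system."""
    return source.execute("SELECT customer_id, state_name FROM customers").fetchall()

def transform(rows):
    """T: convert the state name to its two-letter code (Virginia -> VA)."""
    return [(cid, STATE_CODES.get(name, name)) for cid, name in rows]

def load(target, rows):
    """L: write the transformed rows into the target table."""
    target.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)
    target.commit()

if __name__ == "__main__":
    # Hypothetical in-memory source and target databases for demonstration only.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (customer_id INTEGER, state_name TEXT)")
    source.execute("INSERT INTO customers VALUES (1, 'Virginia')")
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE dim_customer (customer_id INTEGER, state_code TEXT)")

    load(target, transform(extract(source)))
    print(target.execute("SELECT * FROM dim_customer").fetchall())  # [(1, 'VA')]
```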
Before starting your ETL estimation, you need to understand what type of estimate you are trying to produce. How precise does the estimate need to be? Will you be estimating effort, schedule, or both? Will you build your estimate top down or bottom up? Is the result being used for inclusion in an RFP response or will it be used in an unofficial capacity? By answering these questions, you can assess risk and produce an estimate that best mitigates that risk.
In many cases, the information you have to base your estimate on is high level, with only a few key data points to go on, and you do not have either the time or the ability to ask for more details. In these situations, the response I hear most often is that an estimate cannot be produced. I disagree! As long as the precision of the estimate produced is understood by the customer, there is value in the estimate and it should be done. The alternative to a high-level estimate is none at all, and as someone who has to deliver on the estimate, I would rather have a bad estimate with clear assumptions than no baseline at all. The key is being clear about how the estimate should be used and what its limitations are. I have found that one of the best ways to frame the accuracy of the estimate with the customer and project team is through the use of assumptions. Every estimate is built with many assumptions in mind, and having them clearly laid out almost always generates good discussion and eventually a more refined and accurate estimate.
A common question that comes up during the estimation process is effort versus schedule; in other words, how many hours the work will take versus how long it will take to complete. To simplify the estimating process, I start with a model that captures only the effort and completely ignores the schedule. Once the effort has been refined, it can be taken to the delivery team for a secondary discussion on overlaying the estimated effort across time.
Once you know what type of estimate you are trying to deliver and who your audience is, you can begin the process of effectively estimating the work. All too often, this up-front thinking is ignored and the resulting estimate does not meet expectations.
I’ve reviewed a number of the different ETL estimating techniques available and have found some to be extremely complex and others more straightforward. Then there is the general theory of estimation and the tried-and-true models of Wide Band Delphi and COCOMO. All of these approaches are interesting and have value, but they don’t easily produce the data to support the questions I am always asked in the consulting world: How much will this project cost? How many people will you need to deliver it? What does the delivery schedule look like? I have discovered that most models focus on one part of the effort (generally development) but neglect to include requirements, design, testing, data stewardship, production deployment, warranty support, and so forth. When estimating a project in the consulting world, we care about the total cost, not just how long it will take to develop the ETL code.
Estimating an ETL Project
In the ETL space, I use two models (top down and bottom up) for my estimation if I have been provided enough data to support both; this helps better ground the estimate and confirms that there are no major gaps in the model.
Estimating an ETL Project Using a Top Down Technique
To start a top down estimate I break down the project by phase and then add in key oversight roles that don’t pertain specifically to any single phase (e.g., project manager, technical lead, subject matter expert, operations). Once I have the phases that relate to the project I am estimating for, I estimate each phase vertically as a percentage of the development effort, as shown in the chart below. Everyone has a different idea about what percentage to use in the estimate and there is no one right answer. I start with the numbers below and tweak them accordingly, based on the project environment and resource experience.
Phase | Percentage of Development
Requirements | 50% of Development
Design | 25% of Development
Development | N/A
System Test | 25% of Development
Integration Test | 25% of Development
Once I have my verticals established, I break my estimate horizontally into low, medium, and high, using the percentages below:
Complexity | Percent of Medium
Low | 50% of Medium
Medium | N/A
High | 150% of Medium
Generally, when doing a high-level ETL estimate, I know the number of sources I am dealing with and, if I’m lucky, I also have some broad-stroke level of complexity information. Once I have my model built out, as described above, I work with my development team to understand the effort involved for a single source. I then take the number of sources and plug it into my model, as shown below (Figure 1, in yellow). If I don’t have complexity information, I simply record the same number of sources in the low, medium, and high columns to give me an estimate range of +/−50%.
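To make the mechanics concrete, here is a minimal sketch of the top down model in Python. The phase percentages and complexity multipliers come from the tables above; the per-source development hours and the source counts are hypothetical inputs that you would replace with your own figures.

```python
# Phase effort as a percentage of development effort (from the first table above).
PHASE_PCT_OF_DEV = {
    "Requirements": 0.50,
    "Design": 0.25,
    "Development": 1.00,   # development is the baseline
    "System Test": 0.25,
    "Integration Test": 0.25,
}

# Complexity multipliers relative to a medium source (from the second table above).
COMPLEXITY_PCT_OF_MEDIUM = {"Low": 0.5, "Medium": 1.0, "High": 1.5}

def top_down_estimate(dev_hours_per_medium_source, sources_by_complexity):
    """Return estimated hours per phase, summed across all sources."""
    estimate = {}
    for phase, phase_pct in PHASE_PCT_OF_DEV.items():
        hours = 0.0
        for complexity, count in sources_by_complexity.items():
            hours += (count * dev_hours_per_medium_source
                      * COMPLEXITY_PCT_OF_MEDIUM[complexity] * phase_pct)
        estimate[phase] = hours
    return estimate

if __name__ == "__main__":
    # Hypothetical inputs: 120 development hours for a medium source,
    # with 3 low, 4 medium, and 2 high complexity sources.
    est = top_down_estimate(120, {"Low": 3, "Medium": 4, "High": 2})
    for phase, hours in est.items():
        print(f"{phase}: {hours:.0f} hours")
    print(f"Total: {sum(est.values()):.0f} hours")
```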
I now have a framework I can share with my team to shape my estimate. After my initial cut, I meet with key team members to review the estimate, and I inevitably end up with a revised estimate and, more importantly, a comprehensive set of assumptions. There is no substitute for socializing your estimate with your team or with a group of subject matter experts; they are closest to the work and have input and ideas that help refine the estimate into something that is accurate and defendable when cost or hours are challenged by the client.
Estimating an ETL Project Using a Bottom Up Estimate
When enough data are available to construct a bottom up estimate, it can provide a powerful model that is highly defendable. To start a bottom up ETL estimate, a minimum of two key data elements is required: the number of data attributes involved and the number of target structures that exist. Understanding the target data structure is a critical input to ETL estimation, because data modeling is a time-consuming and specialized skill that can have a significant impact on the cost and schedule.
When starting a bottom up ETL estimate, it is important to break up the attributes into logical blocks of information. If a data warehouse is the target, subject areas work best as starting points for segmenting the estimation. A subject area is a logical grouping of data within the warehouse and is a great way to break down the project into smaller chunks that align with how you will deliver the work. Once you have a logical grouping of how the data will be stored, break down the number of attributes into the various groups, noting the percentages of attributes that do not have a target data structure.
Once you have defined the target data subject areas, attributes, and percentages of data modeled, the time spent per task, per attribute can be estimated. It is important to define all tasks that will be completed during the life cycle of the project. Clearly defining the assumptions around each task is also critical, because consumers of the model will interpret the tasks differently.
In the example shown, there is a calculation that adjusts the modeling hours based on the percentage of attributes that are not modeled, giving more modeling time as the percentage increases. This technique can be used for any task that has a large variance in effort based on an external factor.
To complete the effort estimate, the hours per task can be multiplied by the total number of attributes to get effort by task. In addition, the tasks can be broken out across the expected project resource roles, providing a jump start on how the effort should be scheduled. As with any estimate, I always add a contingency factor at the bottom to account for unforeseen risk.
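The sketch below shows one way the bottom up arithmetic could be wired together. The subject areas, attribute counts, hours-per-attribute rates, and the 15% contingency are hypothetical placeholders rather than recommended values; the data modeling adjustment mirrors the percentage-not-modeled technique described above.

```python
# Hypothetical hours of effort per attribute, by task.
HOURS_PER_ATTRIBUTE = {
    "Requirements": 0.5,
    "Data modeling": 1.0,   # applied only to attributes without a target structure
    "ETL development": 2.0,
    "Testing": 1.0,
}
CONTINGENCY = 0.15          # hypothetical 15% buffer for unforeseen risk

def bottom_up_estimate(subject_areas):
    """subject_areas: {name: {"attributes": int, "pct_not_modeled": float}}."""
    totals = {task: 0.0 for task in HOURS_PER_ATTRIBUTE}
    for area in subject_areas.values():
        for task, rate in HOURS_PER_ATTRIBUTE.items():
            attrs = area["attributes"]
            if task == "Data modeling":
                # Scale modeling hours by the share of attributes not yet modeled.
                attrs = attrs * area["pct_not_modeled"]
            totals[task] += attrs * rate
    subtotal = sum(totals.values())
    return totals, subtotal * (1 + CONTINGENCY)

if __name__ == "__main__":
    # Hypothetical subject areas and attribute counts.
    areas = {
        "Customer": {"attributes": 150, "pct_not_modeled": 0.20},
        "Claim": {"attributes": 300, "pct_not_modeled": 0.60},
    }
    by_task, total_with_contingency = bottom_up_estimate(areas)
    print(by_task)
    print(f"Total incl. contingency: {total_with_contingency:.0f} hours")
```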
Comparing a top down estimate with a bottom up estimate will provide two good data sets that can drive discussion about the quality of the estimate as well as uncover additional assumptions.
Scheduling the Work
Once the effort estimate is complete (regardless of the type), I can start thinking about how much time and how many resources are needed to complete the project. Generally, the requestor of the estimate has an expected delivery date in mind and I know the earliest time we can start the work. With those two data points, I can calculate the number of business days I have to deliver the project and get a rough order of magnitude estimate of the resources required.
The first thing I do is map the phases established in the effort estimate to the various project team roles (BSA, developer, tester, etc.). Once I break down the effort into roles, I can then divide the effort by the number of days available in the project to get the expected number of resources required. In the example below (Figure 2), I shorten the time that the BSA, developer, and tester will work, taking into account that each life cycle phase does not run for the duration of the project. At this stage, I also take into consideration the cost of each resource and add in a contingency factor. This method makes it possible to adjust the duration of the project without impacting the level of effort needed to complete the work.
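As a rough illustration of that calculation, the sketch below divides each role's estimated hours by the days that role is expected to be active. The effort figures, project length, and activity shares are hypothetical inputs chosen only to show the arithmetic.

```python
HOURS_PER_DAY = 8  # assumed working hours per business day

def resources_needed(effort_hours_by_role, project_business_days, active_share_by_role):
    """Estimate headcount per role: effort / (days the role is active * hours per day)."""
    resources = {}
    for role, hours in effort_hours_by_role.items():
        active_days = project_business_days * active_share_by_role.get(role, 1.0)
        resources[role] = hours / (active_days * HOURS_PER_DAY)
    return resources

if __name__ == "__main__":
    # Hypothetical effort by role, from the effort estimate.
    effort = {"BSA": 400, "Developer": 1200, "Tester": 500, "Project Manager": 300}
    # Hypothetical share of the project duration each role is actually active,
    # reflecting that no single life cycle phase runs for the whole project.
    active_share = {"BSA": 0.4, "Developer": 0.6, "Tester": 0.4, "Project Manager": 1.0}
    print(resources_needed(effort, project_business_days=90, active_share_by_role=active_share))
```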
Using the techniques described above provides you with the flexibility to easily answer the “what if” questions that always come up when estimating work. By keeping the effort and the schedule separate, you have total control over the model.
Delivering on the Estimate
Once the effort and duration of the project are stabilized, a project planning tool (e.g., Microsoft Project) can be used to dive into the details of the work breakdown structure and further map out the details of the project.
It is important to continue to validate your estimate throughout the project. As you finish each project phase, revisiting the estimate to evaluate assumptions and estimating factors will help make future estimates better, which is especially important if you expect to do additional projects in the same department.
Conclusion
In my experience, bottom up estimates produce the most accurate results, but often the information required to produce such an estimate is not available. The bottom up technique allows the work to be broken down to a very detailed level. To estimate an ETL project bottom up effectively, you typically need detail at the level of the number of reports, data elements, data sources, or metrics required for the project.
When a low level of detail is not available, using a top down technique is the best option. Top down estimates are derived using a qualitative model and are more likely to be skewed by the experience of the person doing the estimate. I find that these estimates are also much more difficult to defend because of their qualitative nature. When doing a top down estimate for a proposal, I like to include additional money in the budget for contingency to cover the unknowns that certainly lie in the missing details.
There is an argument that a bottom up estimate is no more precise than a top down estimate. The thinking here is that with a lower level of detail, you make smaller estimating errors more often, netting the same result as the large errors made in a top down approach. Although this is a compelling argument (and why I do both estimates when I can), the more granular the estimate you have, the quicker you can identify flaws and make corrections. With a top down estimate, errors take longer to be revealed and are harder to correct.
An estimate is only as good as the data used to start it and the assumptions captured. Providing clear and consistent estimates helps build credibility with business customers and clients and provides a concrete, defensible position on how you plan to deliver against scope, as well as a constant reminder of the impact of additional scope. No matter how easy or small a project appears to be, always start with an estimate and be prepared for that estimate to need fine tuning as new information becomes available.
Glossary of Terms
ETL – Extract, Transform, and Load. A technique used to move data from one database (the source) to another database (the target)
Business Intelligence – A technique used to analyze data to support better business decision making
Data Integration – The process of combining data from multiple sources to provide end users with a unified view of the data
Data Steward – The person responsible for maintaining the metadata repository that describes the data within the data warehouse.
Data Warehouse – A repository of data designed to facilitate reporting and business intelligence analysis
RFP – A request for proposal (RFP) is an early stage in the procurement process, issuing an invitation for suppliers, often through a bidding process, to submit a proposal on a specific commodity or service.
Source – An ETL term used to describe the source system that provides data to the ETL process
Subject Area – A term used in data warehousing that describes a set of data with a common theme or set of related measurements (e.g., customer, account, or claim)
Target – An ETL term used to describe the database that receives the transformed data