Extract, Transform, and Load (ETL)

Key Points

  • Be pragmatic about data integration. Always balance the cost of automated data integration with the business value
  • Ensure the BI developers speak directly with the business owners to clarify where the real business value lies. This can get lost in translation if there is a sole reliance on formal specifications, resulting in unnecessary development
  • Always balance ETL spend with developing end user facing capability. ETL successes alone rarely are rarely enough to gain future sponsorship and engage end users

One of the inconvenient truths of data warehousing and business intelligence is that you can easily spend far more time analysing data quality, cleansing data, and moving data around, than developing the end user interaction layer that will provide the real value to the business processes

Extract, transform, and load (ETL)

ETL is the term used to describe taking data from the source system (EXTRACT); data cleansing, integration, and applying business rules (TRANSFORM); and moving it into the final destination – often a DW (LOAD). An array of specialist tools exists to help with every aspect of this process.
One of the inconvenient truths of data warehousing and business intelligence is that you can easily spend far more time analysing data quality, cleansing data, and moving data around, than developing the end user interaction layer that will provide the real value to the business processes. This can be a source of great frustration for many BI specialists, who would rather be getting to grips with the business problems and creating the end user deliverables than writing ETL procedures.
A certain amount of data manipulation is essential to provide the users with quality integrated data and apply business rules. A successful ETL project that integrates and cleanses frequently used data will drastically reduce the cost of developing subsequent BI solutions that use the same data. These potential gains must be balanced by the fact that ETL is expensive, time consuming and error prone; so you need to be sure of the return on investment before you do any more than is required for the immediate purpose. 
The most effective method of containing ETL spend is for the BI specialist to speak directly with the subject matter expert. From these discussions, it should become apparent which data sources will contribute most to supporting the business process.
It is likely that some of the data will be difficult and expensive to integrate. This may not be an issue if the user can retrieve it efficiently using the existing process. Do not assume that just because the data is difficult to extract that it is pivotal to supporting the business process.

BI in practice
Practical compromises on data integration
An events manager organises a weekly social evening. The company is an IT consultancy and the event provides an informal setting for the consultants to meet with clients that have recently started a new project, or finished an existing one.
The manager wants a BI solution to support the process by providing details of customers with relevant projects (recently started or finished), consultants’ diaries for availability, and information to help determine the venue.
The events manager identifies the following data sources:
HR database containing staff diary information
CRM database for client contacts and emails
Billing system for project initiation and billing
Websites of venues
Website and web service for weather information
The manager has noticed that outdoor events have better attendance, so he uses weather data to determine if an outdoor venue is practical. Ideally, he would like venue information and weather data integrated into the solution. However, it transpires that due to firewall restrictions and complexities with the venue and weather websites, it will be expensive and time consuming to integrate this data.
The other systems are less problematic and so the solution will automatically identify potential clients, match them with staff availability and suggest the day with the greatest potential attendance of clients and staff. From that point, the manager will continue the manual effort of browsing the weather and venue websites to complete the process.
Although not a perfect solution, the manager spent the majority of time identifying projects, clients, and staff availability. Now that we have supported this process, the manager has more time to select the right venue, albeit by continuing with their original process.

 

 

See Also