Data warehouse
Data warehousing is the most common method for increasing business intelligence data availability. We use a data warehouse (DW) to store and integrate data from other systems. Once the data is loaded into the DW, we typically keep it there for as long as it is required for analysis and decision support; this may be far longer than we would keep the data in the source systems.
The highly influential books Building the Data Warehouse by W.H. Inmon (Inmon, 2005), and The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling by Ralph Kimball (Kimball and Ross, 2002), established and popularised DW as a formal methodology when they were first published in the 1990s. These texts provide a detailed explanation of the justifications for DW and practical advice on how to organise a DW project. There are some differences in emphasis between the two books but there is broad agreement on the advantages and purpose of a data warehousing approach.
Data storage options
Before we look at the benefits of data warehousing in more detail, it is worth briefly considering the alternatives. Data warehousing is not a prerequisite for business intelligence and I have built many BI applications without using one. In the early days of your BI strategy, it is perfectly acceptable to develop BI solutions without following a strict data warehouse methodology. In fact, if you wait until a data warehouse is in place you may miss valuable opportunities to make quick wins. Business intelligence technology can often support simple business processes without the requirement for a high degree of data integration and availability. When we identify data that will be useful for BI, we have three basic choices:
- Leave the data where it is and provide the technology, security, and information required to access it
- Extract some or all of the data verbatim into a shared database repository (possibly the staging area of a DW)
- Extract some or all of the data and integrate it with data from other repositories in a DW
Storage strategy | Relative initial cost | Data availability | Advantages |
---|---|---|---|
Source system |
Low |
Low-High depending on structure and system |
Data is nearer real time and we reduce data storage costs by not duplicating it in a DW |
Extract only |
Low-Medium |
Low-High depending on structure and system |
Record a history of snapshots, and avoid security and load issues on source system |
Extract and integrate |
High |
High |
Integration benefits many of the factors affecting availability (see Table below) |
The data strategy is likely to mature over time, moving from documentation, to a regular extraction, right through to full integration. We can use DW methods to identify, extract, and integrate data.
We have said that DW is not a prerequisite for BI development. Our primary concern is that the data is sufficiently available to use in the current BI project and that when we use it, we also increase the data availability for future projects. A data warehouse is a very good way to achieve these dual aims but not the only choice.
Data warehouse and data availability
The primary focus of this series of articles is not data warehousing and it is enough to discuss how DW contributes to data availability and look at some of the more critical decision points you may have when choosing to warehouse data. Table 12 (below) looks at some of the main benefits of DW and relates them to the availability factors we discussed earlier in the chapter.
Benefit | Data availability |
---|---|
Data integration |
Integrating data from different source systems is time consuming and potentially complex. However, the process of integration usually increases the data quality by raising awareness of data issues and cleansing the data during ETL. Because the DW is a central source of data from different systems, this increases the visibility of the data throughout the organisation. If the DW holds integrated data, the BI project can concentrate on publishing the data to users. This reduces project costs and time-to-market |
Query performance |
Operational database design optimises record inserts, updates, and deletions. DW databases optimise query performance by using different physical and logical modelling strategies |
Query availability compared to operational system |
System administrators may impose restrictions on querying the operational data to avoid disruption to critical operational processes. A DW does need to support operational processes directly allowing us to run resource intensive queries |
Historical data |
An operational system may store a limited history of transaction data and then archive or delete old records to maintain performance. A DW will typically record a complete history of interesting transactions for historical analysis. An operational system may store only the current values of master data attributes like a company address or an employee’s department. The DW can take snapshots of these values and store a history preventing loss of information as described in the non-existing data scenario |
Derived data and custom groupings |
Operational systems store the minimum information necessary for supporting operational processes. DWs can store whatever information is useful for analysis. For example, we may want to classify customers using our own grouping attributes, or analyse a product description field by extracting all the keywords into separate database fields. We can use the DW to capture any information that was previously non-existing in a structured format or data that is not captured early enough due to the update timing of the operational system |
Unstructured data |
A DW can be a central repository for data that is useful but not stored in core operational systems. Examples include spreadsheet data; flat files from suppliers, customers, or other third parties; and log files. The ETL process may improve the data structure and integrity of the data, raising the availability to other applications |
Data warehouse modelling
DW data modelling is a technical subject and we cover it very briefly in this section. If you are a non-technical reader, you may find the following discussion a bit dry. If so, please move to the subsequent articles on ETL, data ownership, and business rules, which return to a more business oriented perspective.
3NF and dimensional modelling are the two main approaches to modelling data in the DW. We normally implement a DW using an RDBMS and we can use either modelling approach in this environment.
3NF data modelling
3NF modelling is the approach used in most operational systems but is also used for some DW. A relational data model consisting of tables, columns, and relationships is in the 3NF if it adheres to a set of rules designed to avoid data duplication and integrity issues. 3NF modelling is a very good way of maintaining data integrity. The constraints also provide a level of self-documentation about the meaning of the data that can be lost using dimensional modelling.
There are two perceived disadvantages of using a 3NF data model for the DW. Firstly, it is difficult for non-technical users to understand the data model and write queries directly against it. Secondly, because of the data model complexity, we normally have to create dimensional models in additional to the 3NF model to support analysis and BI tools. This means that a DW in 3NF will probably be more expensive to implement than a dimensional DW.
Dimensional modelling
The second approach to DW modelling is dimensional modelling. A dimensional model does not enforce data integrity to the same extent as 3NF. However, the model is much better suited to analysis because it is more intuitive to users and equivalent queries execute much faster. Many analytical databases and end user tools anticipate a dimensional model.
We can construct a dimensional model from queries on a 3NF database or materialise the structure in the DW. In common with 3NF models, it is not a prerequisite of BI applications that you implement a dimensional model. The driving factor is the extent to which the chosen model increases data availability – by lowering the cost of acquisition – for the intended purpose.
Guidance on choosing a modelling approach
You do not need to choose a single modelling approach for the entire DW. In practice, you will probably model different data sources using different approaches depending on your goal.
If the DW has to integrate and maintain data from a variety of complex structured and unstructured sources then you will probably want to validate the integrity of the integrated data. The process of integration is effectively creating a repository of original data and the same considerations apply as if you were maintaining the data for operational purposes. A 3NF model is a proven method of maintaining data integrity.
If you are confident about the integrity of the source system data, do not have much data integration, and have a specific purpose in mind then a dimensional modelling approach will probably be quicker and less costly than creating a 3NF DW model.
There is plenty of information about the relative merits of the two approaches. I recommend that those wanting a deeper understanding read both Imhoff (Imhoff, C. et al. (2003) Mastering Data Warehouse Design: Relational and Dimensional Techniques; Wiley Publishing Inc) and Kimball (Kimball, R. and Ross, M. (2002) The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling 2nd Edition, John Wiley & Sons: New York 2002)
Modelling time in the warehouse
One of the great benefits of a data warehousing is the ability to track and record changes in data over time. For instance, if we wanted to look at the effect of product packaging on sales, it is interesting to know that a particular product had red packaging in year 1, and blue packaging in year 2. The history of package information may not be available in the original source system – it may only record the current colour of the packaging. Using a DW, we can capture the package colour and record a history of changes.
When business users think about examples like the one above, their first impulse is to request that we track the history of changes for all their reference data. We should strongly resist this impulse because accurately modelling changing data is conceptually complex (see Malinowski, E. & Zimányi, E. (2006) A Conceptual Solution for Representing Time in Data Warehouse Dimensions, The Third Asia-Pacific Conference on Conceptual Modelling (APCCM 2006)), validating it is resource intensive, and interpreting the results is, in my experience, the most frequent cause of confusion for users and developers of the DW. In any given scenario, a user can describe exactly how they would expect a change in attribute value to be treated. Unfortunately, each new scenario leads to a different expectation.
It helps to consider the reality of operational processes before deciding whether to record the history of an attribute in the DW. Data entry mistakes, delays in updating databases, and backdated corrections can all frustrate our ability to record history correctly. When we integrate data from different sources, each updated at different frequencies, it is even more difficult to appreciate the real meaning of the resulting data.
Put simply, if we record the history of an attribute in a DW we make a statement about our ability to record it accurately. Users may not be sympathetic to the nuances of the underlying operational process that render our history inaccurate.
BI in practice
Historical attribute values in the DW
A company uses a DW to integrate data from two systems. The first is a billing system that records invoices and payments. The second is a CRM system that records the current holding company and industry sector for each customer.
We can relate records from the billing system with records from the CRM system using a customer identifier. This allows us to analyse billings by holding company and industry type over time.
The company updates the billing system every day with new invoices and receipts. The DW ETL process extracts the new records from the billing system each night. The nightly process also takes a snapshot of the CRM attributes, building a history of changes to the holding company and industry sector of each customer. The tables below show how the history of CRM attribute changes might look in the DW.
----- Holding company history table (CRM data source)
Customer Holding Co. Start Date End Date
10021 Small Corp 21-Mar-09 29-Jun-10
10021 Big Corp 30-Jun-10
----- Industry sector history table (CRM data source)
Customer Sector Start Date End Date
10021 Courier 01-Feb-08 25-Jan-10
10021 Freight 26-Jan-10
The following table shows a single record from the billing history table
----- Billing history table (Billing data source)
Customer Id Invoice No. Amount Inv. Date
10021 301 $3,000 15-Jan-10
Based on the invoice date and the CRM information in the DW, we can allocate the invoice for $3,000 to the Small Corp holding company and the Courier industry sector.
Unfortunately, what is invisible to the DW user is that the company only updates the CRM system periodically, when a sales representative meets with a customer. At this point, the sales representative asks the customer for their latest details and records them in the CRM system. The Start Date and End Date recorded in the DW represent the update time in the CRM system, not when the event occurred in the real world.
Let us say that customer 10021 changed holding company and industry sector on the 15-Dec-09. The sales representative meets with the customer on the 25-Jan-10. After this meeting, the sales representative updates the industry sector but forgets about the holding company. It is only at the second meeting, on 29-Jun-10, that they update the holding company.
The realities of the operational process thwart our attempts to accurately record history in the DW. Operational processes can also change over time, subtly changing the semantics of the DW data. Even if we understand the sales representative’s business process and update timings, it does not necessarily mean that the process was always this way, or that it will be so in the future.
The example is worth remembering because it is so common. There are a number of possible solutions for creating a more accurate history, or making the meaning of the history tables clearer, but they all involve additional cost. If we are selective about the historical information we record, we have a far better chance of correctly representing the real world in the data warehouse