Business Intelligence Data Storage Architecture

Key Points

  • For the vast majority of business intelligence projects, an off-the-shelf standard RDBMS is sufficient for building data marts and data warehouses
  • Business intelligence user tools often query analytical data stores rather than interacting directly with the RDBMS data warehouse layer. However, an RDBMS layer is still the most robust mechanism for preserving data integrity and historical information
  • In the early stages of a BI strategy you will spend the majority of your development time doing ETL. A specialist ETL tool will ensure this work is performed as efficiently as possible.

Data storage architecture

You will typically require three layers of accessible data storage for your business intelligence solutions: primary data storage; data warehouse/historical storage; and analytical databases. In addition, you will require an ETL tool to move data between each of the layers.

Data sources (primary data storage)

The first layer comprises the source systems that hold the primary data. I use the term primary data to emphasise that these systems are the original source of the data rather than holding data collected from other systems. Examples might include ERP systems, semi-structured data such as spreadsheets or other documents, and external sources like web services or subscriptions.
Where the BI solution requires near-real-time data you may point your BI user tools and applications directly at these primary data sources.
In addition to real-time capability, an advantage of using the source systems directly is that you do not have to replicate the data to a data warehouse, which can be time consuming. Of course, a data warehouse has many advantages that you do not get with this approach.
The technology strategy should identify appropriate tools to extract data from these data sources. It should also provide guidance as to when it is appropriate to move data directly from the source system to the end user tool, and when it should go via the data warehouse layer.

Data warehouse (historical data storage)

In the article on data strategy, we discussed the reasons for building a data warehouse. Here we look at the technology requirements to support this functionality.
The persistent historical storage layer should be supported by a server-based (as opposed to personal) RDBMS. If immediate requirements envisage many terabytes of data, then you may consider specialist data warehouse software or even a data warehouse appliance.
However, I cannot imagine recommending that an organisation with a new or immature business intelligence strategy start by implementing these specialist products. It would be the equivalent of buying a Formula 1 racing car for driving lessons. Even if you have many terabytes of data, you should initially consider building your business intelligence capability with the organisation’s existing RDBMS software and less ambitious goals. After a few short projects, you will be in a much better position to know the exact challenges you need to address.     
For the vast majority of BI projects, an off-the-shelf standard RDBMS is sufficient. In the previous chapter, we briefly discussed the differences between 3NF DW modelling and dimensional modelling. In either case, we would use an RDBMS.
BI user tools often query analytical data stores and do not interact directly with the RDBMS DW layer. This leads to an obvious question: why keep this apparently redundant storage layer rather than storing the historical data in the analytical database? The answer is that an RDBMS has a number of features that are unavailable or poorly supported in analytical databases. These include:

  • Data constraints including referential integrity, uniqueness, and complex data types
  • Fine grained control over updating and amending data to support backdated corrections, amendments and incremental loads
  • Standard SQL interface for querying, manipulating and extracting data
  • Logging, backup scheduling, and failover
  • Scalability management

This functionality is essential for the ongoing maintenance of the DW.
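To illustrate, the sketch below shows roughly what such constraints might look like in a standard RDBMS data warehouse. The table and column names are hypothetical and the exact syntax will vary slightly between products.

    -- Hypothetical customer dimension and sales fact tables in the DW layer.
    -- The constraints protect data integrity during incremental loads and
    -- backdated corrections; most analytical databases cannot enforce them.
    CREATE TABLE dim_customer (
        customer_key   INTEGER      PRIMARY KEY,
        source_id      VARCHAR(20)  NOT NULL UNIQUE,  -- business key from the source system
        customer_name  VARCHAR(100) NOT NULL,
        region         VARCHAR(50)
    );

    CREATE TABLE fact_sales (
        sale_id        INTEGER       PRIMARY KEY,
        customer_key   INTEGER       NOT NULL REFERENCES dim_customer (customer_key),
        sale_date      DATE          NOT NULL,
        sales_amount   DECIMAL(12,2) NOT NULL
    );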

Analytical databases (OLAP)

Analytical databases (OLAP) come in a number of styles but they share three common attributes:

  • they are designed and optimised for efficient querying
  • their data model is easier and more intuitive for users to interact with directly for ad hoc analysis
  • they provide a simpler and more powerful expression language for defining typical analytical calculations

Analytical databases provide faster and more flexible query analysis than can be achieved through an RDBMS with equivalent hardware. They are able to do this because the data model is not concerned with supporting the ongoing maintenance and integrity of the data. Analytical databases are one of the most exciting areas of BI, with the latest products able to consume huge quantities of data whilst maintaining very fast query response times. For this reason, they are well suited to ad hoc querying and interactive dashboards and reports.

Logical data model

The logical data model for an analytical database typically stores more information (metadata) about the data than a relational model. For instance, it may store the display order for a series of attributes. A simple example is the days of the week. Rather than displaying them in alphabetical order, we may specify the sort order as Monday, Tuesday, Wednesday…Sunday. This additional information increases the usability of the data model for ad hoc analysis.
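As a rough illustration of the same idea in relational terms (hypothetical table and column names): in an RDBMS the display order has to be modelled and applied explicitly in every query, whereas an analytical database stores it once as metadata on the data model.

    -- Day-of-week attribute with an explicit display order.
    CREATE TABLE dim_day_of_week (
        day_key     INTEGER     PRIMARY KEY,
        day_name    VARCHAR(10) NOT NULL,
        sort_order  SMALLINT    NOT NULL   -- 1 = Monday ... 7 = Sunday
    );

    -- Every query that wants Monday-to-Sunday ordering must repeat the ORDER BY.
    SELECT day_name
    FROM dim_day_of_week
    ORDER BY sort_order;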
Many analytical databases use a multidimensional data model comprised of dimensions (representing things like customers and products) and fact tables (representing transactions and events like sales). Some data models make no logical distinction between master data and transaction data, and allow slicing and dicing on any attribute in the database.

Physical data model and performance

Most analytical databases store at least part of the data in memory; compile and manage frequently requested queries and data aggregations; and implement a highly compressed and indexed physical data model. These factors all contribute to the exceptional query response times that analytical databases provide.
It is worth noting that complex data models and algorithmic calculations can have a detrimental impact on query performance. It would be a mistake to assume that, just because a vendor demonstration showed sub-second query response times on 100 GB of data, the same performance will be possible once your business rules and algorithms are applied to the model, even if you have significantly less data.

Query language

Another common attribute of analytical databases is an expressive query language better suited to typical analytical queries than SQL. One popular query language is MDX (see Spofford, G. et al. (2006) MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and Hyperion Essbase, 2nd edition; John Wiley & Sons). It exploits the additional metadata stored in a multidimensional data model. Even simple MDX calculations can provide answers to business questions that would traditionally have required a SQL programmer.
MDX is not the only query language available for analytical databases. Most either support this standard or provide their own bespoke language. Generally, the language makes it straightforward to define calculations including:

  • Time trend analysis
  • Ratios at different levels of aggregation
  • Set definition and manipulation

A typical question for an analytical database might be:
For each of my ten most popular products this year by volume in each region, where volume has been decreasing over the last three years: show me the sales and margin for the current year by period, the top sales rep for each product in the period, and that rep's percentage contribution to sales.

Table - Result set from an analytical database query
Product    Region   Sales Rep     Period   Sales   Margin   Sales Rep %
Computer   North    Janet Jones   Jan      100K    10%      30%
Printer    North    Henry House   Jan      30K     15%      20%
Scanner    South    Toby Talbot   Jan      25K     25%      40%
Mouse      South    Prue Percy    Jan      130K    30%      5%

One interesting feature of such a query is that the selection criteria for the list of products are independent of the information we wish to report. This type of question would require a lot of SQL code and take a long time to execute on a SQL RDBMS. In contrast, we could express the same query with just a few lines of MDX and expect it to return results much more rapidly than the equivalent SQL query.
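To give a flavour of the SQL involved, just the sales rep percentage column, a ratio at two levels of aggregation, might look like the sketch below against a hypothetical fact table; the top-ten product selection, the three-year declining-volume filter and the top-rep ranking would each add further subqueries on top of it.

    -- Each rep's share of their product's sales for the period (hypothetical names).
    SELECT product_key,
           sales_rep,
           SUM(sales_amount) AS rep_sales,
           SUM(sales_amount) * 100.0
             / SUM(SUM(sales_amount)) OVER (PARTITION BY product_key) AS rep_share_pct
    FROM fact_sales
    WHERE sale_period = 'Jan'
    GROUP BY product_key, sales_rep;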

Query interface

One shortcoming of some analytical databases is that they lack a standard query interface. If the database supports a standard language like MDX then you should be able to query it with any compatible tool, which gives you more flexibility to embed queries within your business applications. However, a large number of analytical databases have a tightly coupled database and user interface. If the database is proprietary with no published query interface, it may be difficult to query it using anything other than the vendor's own tools.

Data mining models

The traditional pattern of analysis normally starts with a hypothesis. For instance, we might hypothesise that the location of a customer influences how quickly they pay their invoices. We then ask questions to prove or disprove the hypothesis.
Are my international customers better at paying their invoices than domestic customers?
We perform the analysis, ending with a conclusion that either supports or refutes that hypothesis. Data mining questions are more open-ended. You might ask questions like:
What products should I be marketing to which customers?
What are the key factors that determine the success of a new shop?
There are a number of data mining algorithms, each best suited to particular questions and outputs. Data mining is not an entry-level BI project. It requires a good knowledge of the data, an appreciation of data mining algorithms, and a deep understanding of the business domain to identify interesting patterns.
Advocates of data mining emphasise that the preparation work can be as informative as the process of running the algorithm. A mining algorithm can make hundreds of passes over the data set so it is important that the data is clean and that the structure is suitable for machine processing.

BI in practice
Analysing the results of data mining
A company provides travel consultancy to organisations around the world. They book flights and hotels and charge a commission for the service. They collect information about their customers and want to use this to understand more about them.
The first thing they want to know is what makes for a profitable customer. If they can understand the commonalities of profitable customers, they can concentrate their resources on marketing to those customers.
They run a mining algorithm on all the attributes that they have collected about the customer. The result tells them that the most significant contributors to profitability are customers with high revenue and high margin. Well that is undeniably true but ultimately unhelpful as it only asserts a mathematical relationship between margin, revenue, and profit.
However, the results also reveal that long-term customers based in city centres with a presence in Asia are a high predictor of profitability. This leads to further questions and analysis but also reveals potential opportunities. For instance, they may want to target similar companies who are not existing customers.
One of the questions we should always ask is whether we have really found the predictive attributes or a side effect of an attribute outside of the data we have analysed.
In our present scenario, further analysis reveals that the company’s top sales manager manages all the customers in the high profitability group. So are the customers highly profitable because they are long-term, city-based customers with a presence in Asia, or is it because the high-flying sales manager manages them? Is the sales manager’s performance attributable to hard work and natural talent, or are they just lucky enough to be managing the long-term customers with a presence in Asia?

In my experience, data mining is not as popular as reports and dashboards. The results of a data mining exercise are hard to predict and it is even harder to find specialists in data mining. However, there are plenty of applications of data mining that are intuitive to business people and do not require a huge investment.
The most commonly cited application of data mining is market basket analysis. Larger retailers have used this technique for many years. Market basket analysis finds relationships between collections of items that would be difficult to detect using more manual analysis techniques. The basket is the shopping basket or cart and the analysis finds the likelihood of different collections of items appearing together in a shopping cart.
It may be obvious to everyone that if a person buys a tennis racket, that same person is more likely to buy tennis balls. Market basket analysis can provide insight into subtle relationships between items that are not so intuitive. For instance, we might find that if a parent buys children’s football boots they also buy a doll or toy car; the rationale being that if one child gets football boots the other child should also get a present. This understanding can help us with cross-selling, promotions, and product placement in the store.
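As a very simple illustration of the underlying idea, basket-level co-occurrence can even be approximated in SQL; the sketch below (hypothetical order_lines table) counts how often two products appear in the same basket. Dedicated data mining algorithms go much further, scoring the strength and significance of such associations.

    -- Count the baskets in which each pair of products appears together.
    SELECT a.product_id AS product_a,
           b.product_id AS product_b,
           COUNT(*)     AS baskets_together
    FROM order_lines a
    JOIN order_lines b
      ON a.order_id = b.order_id
     AND a.product_id < b.product_id   -- each pair counted once, no self-pairs
    GROUP BY a.product_id, b.product_id
    ORDER BY baskets_together DESC;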

Extract, transform, and load (ETL)

The technology strategy should specify the tools for moving data between the different storage layers to make it more available for end user applications. You should definitely use a specialist ETL tool for this purpose.
ETL tools can perform an array of functions including data profiling to determine data quality, advanced data manipulation and transformations, job scheduling, data load management, exception handling, logging, and workflow. Typically, ETL tools automate many of the routine steps in ETL development and provide an intuitive graphical interface for constructing and testing ETL procedures.
A number of end user tools and analytical databases come with crude ETL capability designed to get you started without the need for IT involvement or specialist tools. This may be appropriate for the Extract and Load part of ETL, or for moving well-organised data from the DW into a query tool or analytical database. However, ETL functionality tied to a specific end user tool does not increase availability of the data to the organisation as a whole. You may also find that complex data Transformations are difficult to achieve with bolt-on ETL tools and even more difficult to troubleshoot if they do not provide the expected results.
Good ETL tools automate or semi-automate the majority of common data manipulation tasks and provide significant productivity gains over bespoke programming solutions. The reality in the early stages of a BI strategy is that you will spend the majority of your development time doing ETL. For this reason, you want to be sure you perform the ETL work as efficiently as possible.
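For contrast, the sketch below shows the kind of hand-coded incremental load step (hypothetical staging and warehouse tables) that a specialist ETL tool would typically generate, schedule, log and restart for you.

    -- Load new sales rows from staging into the warehouse fact table,
    -- applying a simple transformation and skipping rows already loaded.
    INSERT INTO fact_sales (sale_id, customer_key, sale_date, sales_amount)
    SELECT s.sale_id,
           c.customer_key,
           s.sale_date,
           s.sales_amount * s.exchange_rate        -- convert to reporting currency
    FROM staging_sales s
    JOIN dim_customer c
      ON c.source_id = s.customer_source_id        -- look up the surrogate key
    WHERE NOT EXISTS (
          SELECT 1 FROM fact_sales f WHERE f.sale_id = s.sale_id
    );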
