Data availability
Data availability encapsulates all the factors affecting our ability to use the data for a BI solution. In this section, we explore some of the most common factors that affect availability.
Visibility
Visibility within the organisation
If you do not know the data exists, you will not specify a solution to use it. This situation is surprisingly common. Often people in different departments or business units have no idea what is stored in the operational systems used by other areas of the business. Because this does not appear to affect their operational processes, there is little motivation to find out.
Just as common are personal data repositories, often invisible even within a business unit or team. They manifest themselves as spreadsheets or personal databases and may have a critical role in supporting the owner’s business process. The owner may be reluctant to share these sources because they do not want to be tied to a service level agreement, with others expecting updates to a schedule. Equally, they may never have considered that the data might be valuable to others.
Visibility of the outside world
Invisibility of data within an organisation has always been commonplace, but we may also be unaware of the increasing variety of useful data that sits outside of the organisation.
Web services and data provisioning services, offered by both government and private companies, are a very efficient way of improving the availability of reference and statistical data. Many of the government services are free.
The cost of buying address lists, company registers, stock exchange data, and weather data, to name but a few, is far lower than the cost of trying to cleanse poorly maintained reference data during ETL development.
Industry survey results are an example of data that is generally visible to at least some people in the organisation, because they contribute the company’s data to the survey provider. However, there can be a missed opportunity to capture the results in a structured format and share them with the rest of the organisation.
Data quality
Data integrity
Data integrity is the extent to which the data conforms to a set of formal constraints and business rules. A 3NF database design is the most common way of enforcing data integrity. The sample customer table below demonstrates common problems found in data without constraints.
Name | Age | Gender | Occupation |
---|---|---|---|
Amy Dracula Adams | 25 | Female | Teacher |
Jimmy Vampire Jones | | male | |
Henry Wolf man Hope | Builder | MALE | 20 |
Amy Dracula Adams | Twenty Five | F | Teacher |
| | 5 | M | Lawyer |
Integrity problems include:
- Duplicate rows in the table. Is it likely that we have two twenty-five year old teachers called Amy Dracula Adams?
- Missing data in the Name, Age, and Occupation columns for some rows
- Incorrect and inconsistent data in the Age column. If we wanted to find the average age of our customers, the jumbled mix of numeric and character data would hamper our analysis
- Data allocated to the incorrect column.
- Inconsistent representations in the Gender column of the two valid values
- A poor choice of data types for representing the Age of customers
This type of table is an ETL developer’s worst nightmare! The problem only gets worse when you introduce more tables and try to relate customer data to other areas like sales or address lists.
A well-designed and maintained operational database would avoid all these problems. In reality, not all useful data is stored in an operational database. Even when it is, we may still turn to unstructured sources if they are a more current or frequently updated record than the operational system. There is often a strong motivation to use semi-structured repositories, but it is almost inevitable that a manually maintained spreadsheet of more than a few hundred lines will have at least some of the issues seen in the table above.
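As a sketch of how declarative constraints would have caught these problems at the point of entry, the following example uses SQLite from Python; the table and column names are illustrative, not taken from any particular system:

```python
import sqlite3

# Declarative constraints reject the integrity problems from the
# sample customer table at insert time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        name       TEXT NOT NULL,
        age        INTEGER NOT NULL CHECK (age BETWEEN 0 AND 130),
        gender     TEXT NOT NULL CHECK (gender IN ('Male', 'Female')),
        occupation TEXT NOT NULL,
        UNIQUE (name, age, gender, occupation)  -- blocks exact duplicate rows
    )
""")

# One clean row succeeds.
conn.execute(
    "INSERT INTO customer VALUES ('Amy Dracula Adams', 25, 'Female', 'Teacher')"
)

bad_rows = [
    ("Amy Dracula Adams", 25, "Female", "Teacher"),  # duplicate row
    ("Jimmy Vampire Jones", None, "male", None),     # missing age/occupation
    (None, 5, "M", "Lawyer"),                        # missing name
]
for row in bad_rows:
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        print(f"rejected {row}: {exc}")
```

In a production schema a surrogate key would replace the UNIQUE constraint across all columns; it is used here only to make the duplicate-row problem visible.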
Data accuracy
Constraints can only go so far in preserving data quality. The following table is an improvement on our original example but still has issues.
Name | Date of birth | Gender | Occupation |
---|---|---|---|
Amy Dracula Adams | 15-Jan-80 | Female | Teacher |
Jimmy Vampire Jones | 25-May-65 | Male | Actor |
Henry Wolf man Hope | 27-Aug-73 | Male | Builder |
June Dread James | 02-Mar-82 | Female | Teacher |
Terry Troll Timms | 25-May-65 | Female | Lawyer |
The data has consistency, but accuracy is still an issue. In the real world:
- Amy Dracula Adams has quit teaching to work as a security guard on the graveyard shift
- Terry Troll Timms is a man, not a woman
- Jimmy Vampire Jones was born on 30-Dec-1958 but somehow Terry’s birthdate has been entered instead
These data correctness issues are more difficult to spot but will obviously affect the quality of our analysis and reporting. When users do not trust the data, they will not use it. The data continues to be effectively unavailable even though it is accessible.
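Simple data profiling can flag candidates for accuracy problems even where constraints cannot. The sketch below, with the sample rows hard-coded for illustration, flags customers who share a date of birth; in a small customer list a shared birthdate is not invalid, but it is exactly the kind of coincidence worth a manual review, and here it would surface the Jimmy/Terry error:

```python
from collections import Counter

# Sample rows from the (constrained but inaccurate) customer table.
customers = [
    ("Amy Dracula Adams",   "15-Jan-80", "Female", "Teacher"),
    ("Jimmy Vampire Jones", "25-May-65", "Male",   "Actor"),
    ("Henry Wolf man Hope", "27-Aug-73", "Male",   "Builder"),
    ("June Dread James",    "02-Mar-82", "Female", "Teacher"),
    ("Terry Troll Timms",   "25-May-65", "Female", "Lawyer"),
]

# Count how often each date of birth appears, then flag every row
# whose date of birth is shared with another row.
dob_counts = Counter(dob for _, dob, _, _ in customers)
suspects = [name for name, dob, _, _ in customers if dob_counts[dob] > 1]
print(suspects)  # → ['Jimmy Vampire Jones', 'Terry Troll Timms']
```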
Security
Security is frequently a barrier to data availability. You know the data exists but it is not accessible. Statutory obligations, personal privacy, industry standards, or protection of intellectual property may all be legitimate reasons for this.
However, organisations will interpret these standards differently and in ways that will affect the ability to share data. From my observations, it appears that security and access to data within an organisation – we are not talking about exposing the information to the world at large – are shaped by the culture of the organisation rather than by a particular industry or set of standards. Put simply, some organisations have a culture of trust, expecting their employees to act responsibly with the data, while others prefer to avoid the risk by exposing only what is essential for each person’s immediate responsibilities.
The most effective BI solutions utilise data from across business processes to provide a broader perspective on the decision. As with our Financial Controller example in Chapter 1, where we considered HR, CRM, and GL data, many decisions would benefit from having more exposure to processes in other business functions.
Data source structure
Common data source structures include relational databases, XML documents, binary word processing documents, proprietary spreadsheet structures, text files, and media files for music and photographs. Sometimes a single document type can be stored in several data structures; a word processing document, for instance, may be stored as an XML document or in a binary file format.
It is possible to make some reasonable assumptions about other availability factors once you know the data source structure. The following table provides some relative indicators for common data sources.
Data source type | Structure | Interface | Tools support | Visibility | Integrity |
---|---|---|---|---|---|
Relational database | 3NF tables and columns | SQL | Very high | High | High |
Multi-dimensional database | Dimensional model | SQL, MDX | High | Med | Med-High |
XML document | XML | XML | High | Low | Low-High |
Text file | CSV, fixed width | File | Med | Low | Low-Med |
Word processing document | Binary, XML, HTML | API, XML | Low | Low | Low |
Spreadsheet | Binary, XML | API, XML | Med | Low | Low-Med |
Web page | HTML, XML | HTML, XML | Med | Med | Low-Med |
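The "Tools support" column partly reflects how directly standard tooling can read each structure. As a rough illustration, the Python standard library parses both CSV text and XML in a few lines, whereas a binary word processing file would need a vendor API. The sample data below is invented:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Text file (CSV): the csv module maps each row to a dict keyed by header.
csv_text = "name,occupation\nAmy Dracula Adams,Teacher\nJune Dread James,Teacher\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["occupation"])  # → Teacher

# XML document: ElementTree exposes elements and attributes directly.
xml_text = (
    "<customers>"
    "<customer name='Amy Dracula Adams' occupation='Teacher'/>"
    "</customers>"
)
root = ET.fromstring(xml_text)
print(root[0].get("name"))  # → Amy Dracula Adams
```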
Non-existing data
Non-existing data for our purposes is data or information that was once known and possibly recorded, but is now lost with the passing of time. We may not record the data, as with the thought processes of a manager making a decision; or we may record, but subsequently overwrite it, as with a database field that records the current state of an entity.
The purpose of identifying non-existing data is firstly, to draw a line at where BI can support a business process at the current time; and secondly, to lay the groundwork for improving data availability by capturing the data for future use. In the early stages of a BI program, we may find a number of our ideas require data that does not exist. These ideas should feed into the data strategy where we determine appropriate methods for capturing the data going forwards. The following example is a common scenario of non-existing data.
BI in practice
Analysing our analysis – data that is never captured
Recently, I was discussing industry survey data with a marketing analyst. The survey provided detailed statistics about sales of the company’s products and their competitors’ products at a national and regional level. The marketing department used the data to see how they were tracking within their industry.
The analyst explained to me that they discussed the results in a meeting and could normally correlate unusual trends to a specific promotion, television advertising campaign, or new product release.
I was curious to know whether they recorded these insights in a structured format for retrospective analysis. I was not surprised to find that they did not. We review reports, draw conclusions, and then move on. We might take meeting minutes, but we rarely analyse the effectiveness of our analysis!
The conclusions we draw from analysis are often the most valuable data for looking at trends. We could use a history of these inferences to quantify the value of advertising or predict the impact of future industry events.
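One way to stop these insights evaporating is to capture them in a structured format at the meeting itself. The following is a minimal sketch of such a store, not a recommended design; the table and column names are hypothetical:

```python
import sqlite3

# A minimal structure for recording analytical conclusions so they
# can be queried retrospectively. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE insight (
        observed_period TEXT NOT NULL,  -- e.g. the survey quarter
        metric          TEXT NOT NULL,  -- what moved
        direction       TEXT NOT NULL CHECK (direction IN ('up', 'down')),
        attributed_to   TEXT NOT NULL,  -- the cause the meeting agreed on
        recorded_by     TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO insight VALUES (?, ?, ?, ?, ?)",
    ("Q2", "regional market share", "up",
     "television campaign", "marketing analyst"),
)

# Later, the history supports questions such as: how often did a
# television campaign coincide with a rise in a tracked metric?
hits = conn.execute(
    "SELECT COUNT(*) FROM insight "
    "WHERE attributed_to LIKE '%television%' AND direction = 'up'"
).fetchone()[0]
print(hits)  # → 1
```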
Update timing
Update timing can be crucial to data availability. For instance, it is very easy to get free share price data from the internet as long as you can accept a 15-minute delay. For people who want a rough indication of their portfolio value, this delay is acceptable. For a professional trader in need of real time information, the delay means the data has little value.
A similar scenario is common for operational systems. Good quality data may exist in the system and be easy to retrieve. However, if the record-keeping process captures an event a week after it was completed in the real world, that is too late when we need to know what happened yesterday.
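A simple way to make such tolerances explicit is to treat freshness as a property of the report rather than of the data. The sketch below shows the same week-old data passing a monthly summary's tolerance while failing a trader's; the dates and tolerances are invented for illustration:

```python
from datetime import datetime, timedelta

def is_fresh(last_updated: datetime, tolerance: timedelta,
             now: datetime) -> bool:
    """Data is fresh for a report if its age is within that
    report's tolerance for delay."""
    return now - last_updated <= tolerance

now = datetime(2011, 6, 8, 9, 0)
last_load = datetime(2011, 6, 1, 9, 0)  # records entered a week late

# The same data is unavailable for one purpose, fine for another.
print(is_fresh(last_load, timedelta(minutes=15), now))  # trader: False
print(is_fresh(last_load, timedelta(days=30), now))     # monthly report: True
```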