Data availability
Data availability encapsulates all the factors affecting our ability to use the data for a BI solution. In this section, we explore some of the most common factors that affect availability.
Visibility
Visibility within the organisation
If you do not know the data exists, you will not specify a solution to use it. This situation is surprisingly common. Often people in different departments or business units have no idea what is stored in the operational systems used by other areas of the business. Because this does not appear to affect their operational processes, there is little motivation to find out.
Just as common are personal data repositories, often invisible even within a business unit or team. They manifest themselves as spreadsheets or personal databases and may have a critical role in supporting the owner’s business process. The owner may be reluctant to share these sources because they do not want to be tied to a service level agreement, with others expecting updates to a schedule. Equally, they may never have considered that the data might be valuable to others.
Visibility of the outside world
Invisibility of data within an organisation has always been commonplace, but we may also be unaware of the increasing variety of useful data that sits outside of the organisation.
Web services and data provisioning services, offered by both government and private companies, are a very efficient way of improving the availability of reference and statistical data. Many of the government services are free.
The cost of buying address lists, company registers, stock exchange data, and weather data, to name but a few, is far lower than the cost of trying to cleanse poorly maintained reference data during ETL development.
Industry survey results are an example of data that is generally visible to at least some people in the organisation, because they contribute the company’s data to the survey provider. However, there can be a missed opportunity to capture the results in a structured format and share them with the rest of the organisation.
Data quality
Data integrity
Data integrity is the extent to which the data conforms to a set of formal constraints and business rules. A 3NF database design is the most common way of enforcing data integrity. The sample customer table below demonstrates common problems found in data without constraints.
Name | Age | Gender | Occupation |
---|---|---|---|
Amy Dracula Adams | 25 | Female | Teacher |
Jimmy Vampire Jones | | male | |
Henry Wolf man Hope | Builder | MALE | 20 |
Amy Dracula Adams | Twenty Five | F | Teacher |
| | 5 | M | Lawyer |
Integrity problems include:
- Duplicate rows in the table. Is it likely that we have two twenty-five year old teachers called Amy Dracula Adams?
- Missing data in the Name, Age, and Occupation columns for some rows
- Incorrect and inconsistent data in the Age column. If we wanted to find the average age of our customers, the jumbled mix of numeric and character data would hamper our analysis
- Data allocated to the incorrect column.
- Inconsistent representations in the Gender column of the two valid values
- A poor choice of data types for representing the Age of customers
This type of table is an ETL developer’s worst nightmare! The problem only gets worse when you introduce more tables and try to relate customer data to other areas like sales or address lists.
A well-designed and maintained operational database would avoid all these problems. In reality, not all useful data is stored in an operational database. Even when it is, we may still turn to unstructured sources if they are a more current or frequently updated record than the operational system. There is often a strong motivation to use semi-structured repositories, but it is almost inevitable that a manually maintained spreadsheet of more than a few hundred lines will have at least some of the issues seen in the table above.
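As a sketch of how declarative constraints would have caught these problems at the point of entry, the following example uses SQLite from Python; the table and column names are illustrative, not taken from any particular system:

```python
import sqlite3

# Declarative constraints reject the integrity problems from the
# sample customer table at insert time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        name       TEXT NOT NULL,
        age        INTEGER NOT NULL CHECK (age BETWEEN 0 AND 130),
        gender     TEXT NOT NULL CHECK (gender IN ('Male', 'Female')),
        occupation TEXT NOT NULL,
        UNIQUE (name, age, gender, occupation)  -- blocks exact duplicate rows
    )
""")

# One clean row succeeds.
conn.execute(
    "INSERT INTO customer VALUES ('Amy Dracula Adams', 25, 'Female', 'Teacher')"
)

bad_rows = [
    ("Amy Dracula Adams", 25, "Female", "Teacher"),  # duplicate row
    ("Jimmy Vampire Jones", None, "male", None),     # missing age/occupation
    (None, 5, "M", "Lawyer"),                        # missing name
]
for row in bad_rows:
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        print(f"rejected {row}: {exc}")
```

In a production schema a surrogate key would replace the UNIQUE constraint across all columns; it is used here only to make the duplicate-row problem visible.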
Data accuracy
Constraints can only go so far in preserving data quality. The following table is an improvement on our original example but still has issues.
Name | Date of birth | Gender | Occupation |
---|---|---|---|
Amy Dracula Adams | 15-Jan-80 | Female | Teacher |
Jimmy Vampire Jones | 25-May-65 | Male | Actor |
Henry Wolf man Hope | 27-Aug-73 | Male | Builder |
June Dread James | 02-Mar-82 | Female | Teacher |
Terry Troll Timms | 25-May-65 | Female | Lawyer |
The data has consistency, but accuracy is still an issue. In the real world:
- Amy Dracula Adams has quit teaching to work as a security guard on the graveyard shift
- Terry Troll Timms is a man, not a woman
- Jimmy Vampire Jones was born on 30-Dec-1958 but somehow Terry’s birthdate has been entered instead
These data correctness issues are more difficult to spot but will obviously affect the quality of our analysis and reporting. When users do not trust the data, they will not use it. The data continues to be effectively unavailable even though it is accessible.
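Simple data profiling can flag candidates for accuracy problems even where constraints cannot. The sketch below, with the sample rows hard-coded for illustration, flags customers who share a date of birth; in a small customer list a shared birthdate is not invalid, but it is exactly the kind of coincidence worth a manual review, and here it would surface the Jimmy/Terry error:

```python
from collections import Counter

# Sample rows from the (constrained but inaccurate) customer table.
customers = [
    ("Amy Dracula Adams",   "15-Jan-80", "Female", "Teacher"),
    ("Jimmy Vampire Jones", "25-May-65", "Male",   "Actor"),
    ("Henry Wolf man Hope", "27-Aug-73", "Male",   "Builder"),
    ("June Dread James",    "02-Mar-82", "Female", "Teacher"),
    ("Terry Troll Timms",   "25-May-65", "Female", "Lawyer"),
]

# Count how often each date of birth appears, then flag every row
# whose date of birth is shared with another row.
dob_counts = Counter(dob for _, dob, _, _ in customers)
suspects = [name for name, dob, _, _ in customers if dob_counts[dob] > 1]
print(suspects)  # → ['Jimmy Vampire Jones', 'Terry Troll Timms']
```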
Security
Security is frequently a barrier to data availability. You know the data exists but it is not accessible. Statutory obligations, personal privacy, industry standards, or protection of intellectual property may all be legitimate reasons for this.
However, organisations will interpret these standards differently and in ways that will affect the ability to share data. From my observations, it appears that security and access to data within an organisation – we are not talking about exposing the information to the world at large – are shaped by the culture of the organisation rather than by a particular industry or set of standards. Put simply, some organisations have a culture of trust, expecting their employees to act responsibly with the data, while others prefer to avoid the risk by exposing only what is essential for each person’s immediate responsibilities.
The most effective BI solutions utilise data from across business processes to provide a broader perspective on the decision. As with our Financial Controller example in Chapter 1, where we considered HR, CRM, and GL data, many decisions would benefit from having more exposure to processes in other business functions.
Data source structure
Common data source structures include relational databases, XML documents, binary word processing documents, proprietary spreadsheet structures, text files, and media files for music and photographs. Sometimes a single document type can be stored in several data structures; a word processing document, for instance, may be stored as an XML document or in a binary file format.
It is possible to make some reasonable assumptions about other availability factors once you know the data source structure. The following table provides some relative indicators for common data sources.
Data source type | Structure | Interface | Tools support | Visibility | Integrity |
---|---|---|---|---|---|
Relational database | 3NF tables and columns | SQL | Very high | High | High |
Multi-dimensional database | Dimensional model | SQL, MDX | High | Med | Med-High |
XML document | XML | XML | High | Low | Low-High |
Text file | CSV, fixed width | File | Med | Low | Low-Med |
Word processing document | Binary, XML, HTML | API, XML | Low | Low | Low |
Spreadsheet | Binary, XML | API, XML | Med | Low | Low-Med |
Web page | HTML, XML | HTML, XML | Med | Med | Low-Med |
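The "Tools support" column partly reflects how directly standard tooling can read each structure. As a rough illustration, the Python standard library parses both CSV text and XML in a few lines, whereas a binary word processing file would need a vendor API. The sample data below is invented:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Text file (CSV): the csv module maps each row to a dict keyed by header.
csv_text = "name,occupation\nAmy Dracula Adams,Teacher\nJune Dread James,Teacher\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["occupation"])  # → Teacher

# XML document: ElementTree exposes elements and attributes directly.
xml_text = (
    "<customers>"
    "<customer name='Amy Dracula Adams' occupation='Teacher'/>"
    "</customers>"
)
root = ET.fromstring(xml_text)
print(root[0].get("name"))  # → Amy Dracula Adams
```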
Non-existing data
Non-existing data for our purposes is data or information that was once known and possibly recorded, but is now lost with the passing of time. We may not record the data, as with the thought processes of a manager making a decision; or we may record, but subsequently overwrite it, as with a database field that records the current state of an entity.
The purpose of identifying non-existing data is firstly, to draw a line at where BI can support a business process at the current time; and secondly, to lay the groundwork for improving data availability by capturing the data for future use. In the early stages of a BI program, we may find a number of our ideas require data that does not exist. These ideas should feed into the data strategy where we determine appropriate methods for capturing the data going forwards. The following example is a common scenario of non-existing data.
BI in practice
Analysing our analysis – data that is never captured
Recently, I was discussing industry survey data with a marketing analyst. The survey provided detailed statistics about sales of the company’s products and their competitors’ products at a national and regional level. The marketing department used the data to see how they were tracking within their industry.
The analyst explained to me that they discussed the results in a meeting and could normally correlate unusual trends to a specific promotion, television advertising campaign, or new product release.
I was curious to know whether they recorded these insights in a structured format for retrospective analysis. I was not surprised to find that they did not. We review reports, draw conclusions, and then move on. We might take meeting minutes, but we rarely analyse the effectiveness of our analysis!
The conclusions we draw from analysis are often the most valuable data for looking at trends. We could use a history of these inferences to quantify the value of advertising or predict the impact of future industry events.
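One way to stop these insights evaporating is to capture them in a structured format at the meeting itself. The following is a minimal sketch of such a store, not a recommended design; the table and column names are hypothetical:

```python
import sqlite3

# A minimal structure for recording analytical conclusions so they
# can be queried retrospectively. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE insight (
        observed_period TEXT NOT NULL,  -- e.g. the survey quarter
        metric          TEXT NOT NULL,  -- what moved
        direction       TEXT NOT NULL CHECK (direction IN ('up', 'down')),
        attributed_to   TEXT NOT NULL,  -- the cause the meeting agreed on
        recorded_by     TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO insight VALUES (?, ?, ?, ?, ?)",
    ("Q2", "regional market share", "up",
     "television campaign", "marketing analyst"),
)

# Later, the history supports questions such as: how often did a
# television campaign coincide with a rise in a tracked metric?
hits = conn.execute(
    "SELECT COUNT(*) FROM insight "
    "WHERE attributed_to LIKE '%television%' AND direction = 'up'"
).fetchone()[0]
print(hits)  # → 1
```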
Update timing
Update timing can be crucial to data availability. For instance, it is very easy to get free share price data from the internet as long as you can accept a 15-minute delay. For people who want a rough indication of their portfolio value, this delay is acceptable. For a professional trader in need of real time information, the delay means the data has little value.
A similar scenario is common for operational systems. Good quality data may exist in the system and be easy to retrieve. However, if the record-keeping process captures an event a week after it was completed in the real world, that is too late when we need to know what happened yesterday.
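A simple way to make such tolerances explicit is to treat freshness as a property of the report rather than of the data. The sketch below shows the same week-old data passing a monthly summary's tolerance while failing a trader's; the dates and tolerances are invented for illustration:

```python
from datetime import datetime, timedelta

def is_fresh(last_updated: datetime, tolerance: timedelta,
             now: datetime) -> bool:
    """Data is fresh for a report if its age is within that
    report's tolerance for delay."""
    return now - last_updated <= tolerance

now = datetime(2011, 6, 8, 9, 0)
last_load = datetime(2011, 6, 1, 9, 0)  # records entered a week late

# The same data is unavailable for one purpose, fine for another.
print(is_fresh(last_load, timedelta(minutes=15), now))  # trader: False
print(is_fresh(last_load, timedelta(days=30), now))     # monthly report: True
```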