Strategies for promoting data availability for business intelligence

Key Points

  • The first steps towards promoting data availability should provide quick wins at low expense. Publishing an inventory of data sources is a good place to start. Promoting a culture that favours responsible data sharing between business units and departments is also effective
  • Be pragmatic about the level of data quality required. 100% data quality and consistency is very difficult to achieve and is not necessary for many business intelligence applications
  • Do not expect data source owners to jump through hoops to clean up their data in the early stages of a BI program. We must gain sponsorship by delivering value to the business
  • Unstructured content is often the presentation layer for more structured data. Before considering ETL, investigate whether the same information can be obtained from another source

Data availability encapsulates all the factors affecting our ability to use the data for a BI solution. Finding pragmatic and inexpensive solutions to raise data availability should be at the heart of your business intelligence strategy.

Strategies for promoting data availability

In the article Data Availability For Business Intelligence, we looked at the factors that influence the availability of data. We now consider those factors once again, this time looking at methods to raise availability.

Visibility

Increasing the visibility of data should be the first task in the data strategy. It has the distinct advantage of being relatively quick and inexpensive. Practically speaking, we can start by creating a master list of data repositories and publishing it to the company. We can request a short interview with data owners or ask everyone to complete a survey. The process should be a quick, painless exercise so that colleagues do not see it as a burden or an invasion of privacy. At this stage, we do not need a blow-by-blow account of every column in every database and spreadsheet.
You may end up with something similar to the table below. We aim to keep detail to a minimum and to include unstructured and semi-structured information as well as formal systems.


Table – Data asset list

Source    | Owner           | Store             | Business process                | Master reference data
----------|-----------------|-------------------|---------------------------------|------------------------------
GL        | Finance         | RDBMS             | Accounts                        | Org structure, Accounts
Inventory | Ops             | RDBMS             | Orders, warehouse, distribution | Suppliers, Property, Products
Training  | Human Resources | Personal database | Training and staff development  | Staff
Budgets   | Finance         | Spreadsheet      | Budgeting and forecasting       | Products, Staff
Events    | Marketing       | Doc               | Events management               | Customers

A useful check is to compare the business processes in this table with those identified in the process improvement strategy. If any process you identified in the organisation chart is not present in the data asset list above, then you may have missed a repository. It is a sure bet that most office-based workers will be able to contribute to the list. However, you will probably miss many of these personal data repositories in the first sweep.
This simple output is of value in its own right, but it can also contribute towards a DW strategy. It may be the start of a discovery exercise for planning the data content of a DW.
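As an illustration, the register itself need be nothing more elaborate than a single table. The sketch below is a minimal example using Python's built-in sqlite3 module; the table and column names are my own for illustration, not a prescribed standard.

```python
# Minimal sketch of a data asset register using Python's built-in sqlite3.
# Table and column names are illustrative assumptions only.
import sqlite3

conn = sqlite3.connect("data_asset_register.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS data_asset (
        source            TEXT,   -- e.g. 'GL', 'Inventory'
        owner             TEXT,   -- business unit responsible
        store             TEXT,   -- RDBMS, spreadsheet, document...
        business_process  TEXT,   -- process the source supports
        master_reference  TEXT    -- master data the source holds
    )
""")
conn.execute(
    "INSERT INTO data_asset VALUES (?, ?, ?, ?, ?)",
    ("GL", "Finance", "RDBMS", "Accounts", "Org structure, Accounts"),
)
conn.commit()
```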

Visibility and trust

Everyone who contributes to the list should be able to see the full results, save for exceptionally sensitive repositories. This is part of building a shared sense of trust and ownership of the company's data assets. BI requires everyone in the business to contribute, but this sponsorship will only last as long as the advantages are obvious to all.
Even the seemingly innocuous task of creating a list of repositories may be controversial to some. The owner of a spreadsheet entitled ‘Employee redundancy list – these guys have got to go!’ or ‘Top secret new products’ may have understandable reservations about publishing its existence to the whole company. My view is that we should still declare these data assets, publishing their existence to those with the requisite level of trust and authority in other business units or functions.
Nor should we shy away from publishing the existence of data assets that record normal business processes, even though their content – which we are not publishing at this time – may be private or sensitive. It follows that we should publish the existence of a spreadsheet that records employee remuneration or project outcomes.
A compelling reason for a bias in favour of visibility is that these sensitive repositories are also a treasure trove of master reference data. The remuneration review spreadsheet may be the only source in the company where we can find an employee’s current qualifications and experience. Similarly, the project review spreadsheet may be the only accurate source of each project’s duration and in fact the only place in the company where you can get a complete list of past projects. Although some of the data in the spreadsheet is sensitive, it also contains valuable master data that is not sensitive, and could be useful to other business processes. Raising the visibility of sensitive data sources creates an opportunity to share parts of the data whilst preserving the security of the sensitive elements.

Data quality

When a data quality issue arises, the obvious first step is to raise it with the source system owner. Ideally, they will clean up the mess in the source to resolve the problem; in reality, this may only fix a subset of the issues. For example, it may be a matter of urgency to correct customer addresses so that the invoices go to the right place. However, there will be far less incentive to look at historical transactions because they have no immediate impact on today’s business.
Once you have exhausted the quick wins from tidying up at source, ETL tools can help cleanse the data automatically. You can apply business rules to fill missing values or create consistency. It is easy to obsess over data quality, but keep in mind that cleansing data is expensive and applying business rules too early can make it difficult for end users to reconcile back to source. You need to ask what level of quality is required for the intended application. For instance, if we need the average sales and transaction count per month for an EPOS (electronic point of sale) system in a supermarket, it is very unlikely that a few thousand bad transaction records over 5 years will have a material influence on our decision process. We return to the theme of continuous interaction between the BI specialist and the subject matter expert. This is the quickest route to judging the return on investment of an expensive data cleansing exercise.
It is great news if the data is one hundred per cent accurate, but in the absence of this nirvana, it is more important that the target users understand the degree of accuracy. They need to judge whether their decision process can tolerate the level of error. Different decision processes will require different levels of confidence.
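To make the idea of rule-based cleansing concrete, the sketch below shows the kind of rules an ETL step might apply. This is a minimal pandas example; the column names and the rules themselves are assumptions for illustration, and a real implementation would document each rule so users can reconcile back to source.

```python
# Minimal sketch of rule-based cleansing in an ETL step, using pandas.
# Column names and the business rules are illustrative assumptions.
import pandas as pd

sales = pd.DataFrame({
    "store":  ["S01", "S01", None, "S02"],
    "amount": [19.99, None, 4.50, -3.20],
})

# Business rule 1: a missing store code defaults to the flagship store.
sales["store"] = sales["store"].fillna("S01")

# Business rule 2: a missing amount is treated as zero rather than dropped.
sales["amount"] = sales["amount"].fillna(0.0)

# Business rule 3: flag refunds (negative amounts) instead of silently
# excluding them, so users can still reconcile totals back to source.
sales["is_refund"] = sales["amount"] < 0

print(sales)
```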

Security

Security requires a cultural solution rather than a technical one. It seems inevitable that more information, be it personal, technical, or political, will be in the public domain as time moves on. The music industry tried to swim against the tide by being too protective of their data. They appeared to be out of touch, and some companies have come close to going out of business.
Meanwhile, search engine providers have published valuable proprietary information and services free of charge, whilst still earning huge revenues from other channels. Although this might seem more a discussion of intellectual property than security, I think the two share similar qualities.
When I first specialised in BI, I found data protectionism to be more of an issue than it is today. Security can avoid embarrassment, maintain a monopoly on the provision of information (as with the music industry), or enforce a misplaced or politically motivated need-to-know culture. Denying access to data is easier than permitting it, especially if the benefits of sharing are not obvious.
Sharing data is riskier than not sharing it. The trick, as with any other business activity, is measuring the relative risks and benefits. The BI process-improvement strategy should highlight real benefits to sharing data. This tips the scales towards the benefits.

Business goals and security

One mechanism for removing security barriers starts with the organisation hierarchy and business goals that we discussed in Chapter 3. This high-level view is ideal for exposing goals that cross business units and functions. With this in mind, sharing data becomes a natural and mutually beneficial next step.
The culture of security is one that must filter down from the top of the organisation. If the boss is relaxed about close cooperation with other business units or functions, others will follow suit. Of course, the opposite applies if senior management foster a culture of protectionism.

Shared responsibility

A culture of openness requires everyone to be aware of the value of information. In the early days of email and the internet, companies invested a lot of effort in educating staff on using these tools responsibly. A single misplaced email could do irrevocable damage to a company’s reputation. The same is true for inappropriate web browsing, and yet these tools are now pervasive and businesses must accept the risk to remain competitive.
However, even here, discrepancies exist between different organisations that are indicative of the culture of trust. Some companies filter out webmail, social network sites, and job searching sites. Some allow them all. I have not observed a logical pattern to explain these discrepancies. Industries you might expect to have heightened security are often the most trusting and open. 
You have to ask yourself: if users wanted to jeopardise the company, could they already do so with the data they currently use? If the answer is yes, then this alone should not be a reason for restricting access to data outside their immediate responsibility.

Data source structure

There are several approaches to working with unstructured data.

Alternative source

It might seem a cop-out to say ‘find another source’, but this should be the first avenue to exhaust before considering expensive ETL. Unstructured content is often the presentation layer for more structured data. A webpage might retrieve data from a database. An invoice is just a formatted report of data in the operational system. In some cases, you will not be able to access the alternative source, but it is worth checking nonetheless.

Request a different format

External entities are a common source of unstructured data. We may receive invoices from suppliers, bank statements, and industry survey results, all of which arrive in complex or apparently unstructured document formats. More often than not, these entities are happy and able to provide the data in a more structured format such as CSV or XML. If you are one of their customers, then they have a strong incentive to keep you happy. If they are the customer, then consider what additional information you could provide them and discuss a reciprocal agreement. In Chapter 1, we identified numerous benefits to providing our customers with BI solutions.

ETL and DW

ETL tools provide built-in support for automatically parsing and extracting data from an array of different formats. Ideally, we should extract the data from the source format and load it into the DW. Once the data is stored in an RDBMS then, irrespective of the data modelling approach, it should be much easier to work with and to share across the organisation.
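For illustration, a hand-rolled equivalent of the parse-and-load step might look like the sketch below, written in Python using only the standard library. The file names, XML layout, and staging table are assumptions; a real ETL tool adds the connectors, scheduling, and error handling that this omits.

```python
# Illustrative sketch of the parse-and-load step an ETL tool automates.
# File names, XML tags, and the target table are assumptions.
import csv
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("dw.db")
conn.execute("CREATE TABLE IF NOT EXISTS staging_invoice (supplier TEXT, amount REAL)")

# Source 1: a CSV extract with a header row of supplier,amount.
with open("invoices.csv", newline="") as f:
    rows = [(r["supplier"], float(r["amount"])) for r in csv.DictReader(f)]

# Source 2: an XML feed of <invoice supplier="..." amount="..."/> elements.
for inv in ET.parse("invoices.xml").getroot().iter("invoice"):
    rows.append((inv.get("supplier"), float(inv.get("amount"))))

# Load both into the same staging table, irrespective of source format.
conn.executemany("INSERT INTO staging_invoice VALUES (?, ?)", rows)
conn.commit()
```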

Non-existent data

A DW is the best place to preserve data that would otherwise be lost with the passing of time. Operational systems frequently overwrite or delete data not required for current processes; DWs, by contrast, are designed to store a history of data. We can take snapshots of volatile operational data and load them into the DW to capture a point-in-time record of attribute values.
If no operational system captures the data, it may still exist in spreadsheets or documents. Again, we use the DW to store a snapshot of the artefact at regular intervals or in response to an event, such as the creation of a new document or a change to an existing one. This may require some coordination with the process owner to ensure the documents have a defined structure and are located in an accessible area.
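The snapshot pattern itself is simple. The sketch below is an illustrative Python/SQLite example; the source query and table names are assumptions. Each run copies the current state of a volatile source table into the DW together with a snapshot date, so history accumulates instead of being overwritten.

```python
# Minimal sketch of the point-in-time snapshot pattern.
# Source and target names are illustrative assumptions.
import sqlite3
from datetime import date

src = sqlite3.connect("operational.db")   # volatile operational data
dw = sqlite3.connect("dw.db")             # history is preserved here

dw.execute("""
    CREATE TABLE IF NOT EXISTS product_snapshot (
        snapshot_date TEXT,
        product_id    TEXT,
        list_price    REAL
    )
""")

# Copy today's state of each product; earlier snapshots remain untouched.
today = date.today().isoformat()
for product_id, list_price in src.execute("SELECT product_id, list_price FROM product"):
    dw.execute(
        "INSERT INTO product_snapshot VALUES (?, ?, ?)",
        (today, product_id, list_price),
    )
dw.commit()
```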

Update timing

If the data does not exist in the source system at the time we need it, then we can either look at alternative sources or streamline the process of updating the original system.
It may be impractical to update an operational system until after the time we need the data. This is often the case where the information we need is just a by-product of recording another process in the system. In such cases, we should look upstream of the operational system. It is likely that the information is recorded elsewhere in a semi-structured or unstructured format. If we can capture this data, we can load it into a DW, increasing its availability through faster publishing.

BI in practice
Dealing with delays in updating operational systems
A university system tracks the expenses related to laboratory experiments. It also records the outcome of each experiment. For operational efficiency, the payments clerk does not add an experiment to the system until they have accounted for all its expenses. This can take several weeks.
We are interested in analysing recent trends in experiment effectiveness. The delay in updating the operational system made previous analyses unreliable. We cannot convince the clerk to change their working practices, so we ask the lead scientist to copy us in on the email that first notifies the clerk of a completed experiment. The email contains a spreadsheet with information about the experiment. We can load the spreadsheet data into a DW and create timely analysis and reports, as sketched below.
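A sketch of the load step might look like the following. This is illustrative Python using pandas; the file name, sheet layout, and target table are assumptions based on the scenario above.

```python
# Illustrative sketch: load the experiment spreadsheet into the DW.
# File name, sheet layout, and target table are assumptions.
import sqlite3
from datetime import date

import pandas as pd  # reading .xlsx files also requires the openpyxl package

# The spreadsheet attached to the notification email, saved locally.
experiments = pd.read_excel("completed_experiment.xlsx")
experiments["received_date"] = date.today().isoformat()

# Append to the DW so each notification adds a timely, timestamped record.
with sqlite3.connect("dw.db") as dw:
    experiments.to_sql("experiment_fact", dw, if_exists="append", index=False)
```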
The solution is ideal because it does not require any additional work from the clerk or the lead scientist. In early iterations, we want to prove the value of the BI approach before we burden others with additional work.
The solution also demonstrates the additional flexibility of having a DW to hold data that is not present in operational systems.


See Also