If you want to buy clothing from an online retailer, would you ask a friend to point you directly to the items you should buy or would you consult the website to see what options were available? Most of us would choose the latter in order to get the best combination of selection and price. We might rely on friends to point in the right direction, but not to make the selections for us.
The same dynamic applies to data, especially as the age of self-service analytics approaches. Gartner predicts that self-service platforms will comprise 80% of all enterprise reporting by 2020. Democratized analytics is a great trend, but giving users the power to choose and manage their own data is the equivalent of throwing them into the deep end of the pool. Data catalogs have never been more critical.
A data catalog works much like a retail catalog. It presents an inventory of all the data available in the organization by maintaining the metadata that describes it. It shows people not only what data is available, but where to find it and how to use it. It may also include crowdsourcing capabilities that let users apply their own meta tags and comments. IT organizations have used data catalogs for a long time, but exposing them to a non-technical audience creates a new set of challenges.
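To make the idea concrete, here is a minimal sketch of what a catalog holds and how users might search it. The `CatalogEntry` class and `search` function are hypothetical illustrations, not the API of any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset in the catalog, described by its metadata."""
    name: str            # e.g. "customer_orders"
    location: str        # where to find it: a table, path, or URL
    description: str     # how to use it
    tags: set = field(default_factory=set)  # crowdsourced labels

def search(catalog, term):
    """Return entries whose name, description, or tags mention the term."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)]

catalog = [
    CatalogEntry("customer_orders", "warehouse.sales.orders",
                 "One row per order line", {"sales", "pii"}),
    CatalogEntry("web_clicks", "s3://logs/clicks/",
                 "Raw clickstream events", {"marketing"}),
]
hits = search(catalog, "sales")  # finds customer_orders via its tag
```

The point is that users query descriptive metadata, not the data itself: the catalog answers "what exists and where" before anyone touches a warehouse table.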
Most users have only a small snapshot of what data an organization holds. Absent a catalog, they go hunting or ask friends for advice. Both approaches invite disaster.
Searching or relying on tribal knowledge yields, at best, an incomplete view of what's available. When users can't find the best data, they tend to settle for good enough. Worse, their searches or colleagues may point them to data that is out of date, incorrect, or incomplete. When users copy and share that data, the quality problem multiplies: organizations end up with multiple, conflicting versions of the same data instead of a single canonical view.
Foraging for data also wastes time. Dave Wells, an analyst at Eckerson Group, tells of one healthcare CEO who said he never gets analyses of his operations because his analysts spend 80% of their time finding data and 15% whipping it into shape. That leaves precious little time for the job they're paid to do.
A data catalog that can automatically discover your organization's data and tag it with meaningful, consistent labels that business users understand can reverse those ratios. It eliminates the risk of duplication and synchronization errors. More importantly, it puts the data users really need into their hands.
The need for data catalogs is becoming more pressing as the number of data sources grows. In addition to the standard customer and product data that companies create and own, many organizations now acquire information from third-party sources like data brokers and public records. These external sources can shed valuable light on factors that influence the business, but they also introduce new demands. For example, imported data may not match the format or meta tags that the organization uses, which increases the need to automatically re-tag data with consistent labels. As the volume of data grows, managing it manually can become a major drain on resources. Finally, if users don't know the data is even there, they can't take advantage of it.
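Re-tagging imported data often starts with mapping an external source's labels onto the organization's own vocabulary. A minimal sketch, assuming a hypothetical `CANONICAL` mapping for one data broker's column names:

```python
# Hypothetical mapping from labels used by an external data broker
# to the organization's canonical tag vocabulary.
CANONICAL = {
    "cust_id": "customer_id",
    "custno": "customer_id",
    "zip": "postal_code",
    "dob": "date_of_birth",
}

def retag(columns):
    """Re-label imported column names with canonical tags; unknown
    names pass through unchanged so a human can review them later."""
    return [CANONICAL.get(c.strip().lower(), c) for c in columns]

retag(["CustNo", "zip", "email"])
# "email" has no canonical mapping yet, so it is left as-is
```

A static mapping like this works for a handful of sources; it is exactly the kind of manual upkeep that stops scaling as the number of feeds grows, which motivates the learning-based approach below.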
The answer is a flexible, scalable data catalog that uses machine learning to automatically tag and label your data. Today's artificial intelligence technology can tag data automatically, learn from feedback provided by human reviewers, and quickly adapt to the organization's classification, formatting, and tagging rules. This lets companies scale their data resources smoothly and make them easily available to everyone who needs them. Without a data catalog, a self-service BI initiative won't get out of the starting blocks.
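The feedback loop can be illustrated with a toy tagger that compares character trigrams of field names and improves as reviewers confirm labels. This is a deliberately simplified stand-in for the statistical models a real catalog would use; the class and method names are invented for illustration:

```python
from collections import defaultdict

class FeedbackTagger:
    """Toy tagger: suggest a tag by comparing character trigrams of a
    field name against names a human reviewer has already labeled."""

    def __init__(self):
        self.examples = defaultdict(set)  # tag -> trigrams seen so far

    @staticmethod
    def _trigrams(name):
        name = name.lower()
        return {name[i:i + 3] for i in range(len(name) - 2)}

    def learn(self, field_name, correct_tag):
        """Record a human-confirmed label (the feedback step)."""
        self.examples[correct_tag] |= self._trigrams(field_name)

    def suggest(self, field_name):
        """Return the tag whose confirmed examples best overlap this name,
        or None if nothing overlaps at all."""
        grams = self._trigrams(field_name)
        scores = {tag: len(grams & seen)
                  for tag, seen in self.examples.items()}
        best = max(scores, key=scores.get, default=None)
        return best if best is not None and scores[best] > 0 else None

tagger = FeedbackTagger()
tagger.learn("customer_email", "contact_info")
tagger.learn("billing_address", "contact_info")
tagger.learn("order_total", "financial")
tagger.suggest("customer_address")  # overlaps most with contact_info
```

Each reviewer correction enlarges the labeled examples, so the suggestions adapt to the organization's own naming conventions over time, which is the property that lets automated tagging keep pace as sources multiply.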