This post is about approaches to “big-data” technology. However, I start with a little detour and an analogy on innovation. The purpose of this analogy is to prompt “Enterprise IT Organizations” to think outside convention before embarking on a journey to adopt or build “big-data”, “data management”, “data integration” and related services.
Start with the “Why” (not the “What”): If you have not already, I would highly recommend watching Simon Sinek’s famous TED Talk, “Start with Why”. In this talk, Simon explains how successful and inspirational leaders / brands, such as Dr. Martin Luther King Jr., Apple Inc., and the Wright Brothers, drew strategic intent from their respective missions / purpose, and how they brought all that thinking together for seamless execution.
Whether we build products, sell services, look for a job, or support a social cause, most of us start with “what we do”, go on to explain “how we do it”, and finally try to get to “why we do it”. While this inside-out approach feels logical to each of us – going from our zone of comfort of what we know best (i.e., “what” we do) to the fuzzier and relatively unknown area (i.e., “why” we do something) – it does not resonate with our customers / citizens / followers / audience. To clarify, success / rewards / money are not a “why” but a “result” of the things we do.
Simon Sinek summarizes it as “People buy into WHY we do something, not WHAT we do”. It’s not about us, it’s about them. So we have to reverse that order and draft our strategic intent as Why -> How -> What.
So what does this have to do with “Big-Data Services” and “Enterprise IT Organizations”?: Over the last couple of weeks, I have had three different conversations on the topic of big-data technology – with (i) a senior IT executive of a Tier 1 bank, (ii) a senior business executive / CFO, and (iii) Ray Wang from Constellation Research – futurist, big-data strategist, digital-disruption thought leader, and industry analyst. All three conversations, directly or indirectly, alluded to “how should enterprises think about “data services” (including big-data), and what should their strategy be?”. Questions that we all noodled on were:
- What will be the next killer technology – Hadoop with MapReduce vs. NoSQL databases, graph databases vs. columnar databases, etc.?
- Will Hadoop (or a variant such as Apache Spark) replace Enterprise Data Warehouses (EDWs) such as Teradata?
- Will data stack vendors with integrated in-memory analytical processing engines (e.g., SAP HANA or OBIEE) be disrupted?
- Will self-service enterprise data-marts, data discovery & visualization tools (e.g., Tableau or Qlik), and business-user-driven ETL tools (e.g., Paxata) replace traditional BI / ETL?
The single theme that came up in all three conversations: IT organizations are struggling to understand how to bring all these technologies together, given the rapid pace of evolution (and the marketing hype from technology vendors).
While we have all heard about the macro trends or the 3Vs of big-data – Volume, Variety, Velocity – little has been debated, beyond technical communities and Silicon Valley think tanks, about the many technical factors that need to be evaluated before defining a “big-data” technology services strategy. Here are some key technical factors driving data architecture discussions (this list is by no means comprehensive):
- Analytics is going to Data (not the other way around): With storage becoming cheap, data volumes exploding, and analytical processing over large data sets becoming performance intensive, data no longer has to be brought into an analytical engine. It is the other way around, with analytics now taken to the data lake (e.g., MapReduce takes analytical processing into Hadoop; a minimal sketch of this pattern follows this list).
- Data stores reflecting real-world semantics: Some special-purpose databases reflect the semantic structure of data and its inherent relationships rather than storing it in relational or object constructs (e.g., graph databases represent Customer -> Orders -> Products Purchased as a graph construct, i.e., how our minds think, not how we store data in rows and columns).
- Capture versus analyze: Is the big-data coming into the enterprise being captured for future analysis or analyzed as it is captured? (e.g., Hadoop Distributed File Systems versus Complex Event Processing)
- Optimizing analytical queries: While SQL-based querying has become the standard over the last couple of decades, newer optimization techniques are becoming popular and require different data stores (e.g., columnar databases using key-value pairs versus SQL result sets).
- Self-service, data agility and late binding: Business use of last-mile analytical processing / self-service layers is becoming very popular versus pre-prepared data (i.e., data that has gone through ETL and been loaded into warehouses). In fact, business is no longer willing to wait on turnaround times from IT and is demanding data agility. Consequently, with “late binding”, the transformation step of the ETL process is moving closer to the analytical query – ETL is increasingly becoming ELT (e.g., data discovery & visualization tools such as Tableau, or business-user-centric tools such as Paxata or Trifacta, are becoming popular). If you are interested in the topic, please read one of my earlier blog posts – Business is winning the BI battle, but should it be a battle?
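To make the first factor above (“analytics is going to data”) concrete, here is a minimal PySpark sketch. It is purely illustrative: the HDFS path and field names are hypothetical, and it assumes a Spark cluster co-located with the data lake, so the aggregation runs on the nodes that already hold the data and only a small summarized result comes back.

```python
# Minimal sketch: push the aggregation to the data lake instead of pulling
# raw records into a separate analytical engine.
# The HDFS path and field names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-to-data").getOrCreate()

# Read click events directly where they were captured (illustrative path)
events = spark.read.json("hdfs:///datalake/web/clickstream/")

# The group-by runs distributed across the cluster nodes holding the data;
# only the small aggregated result is returned to the driver
daily_counts = (
    events.groupBy("event_date", "page")
          .count()
          .orderBy("event_date")
)

daily_counts.show(20)
spark.stop()
```

The same idea underpins classic MapReduce on Hadoop: the code travels to the data blocks, not the other way around.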
So how do IT organizations decide which big-data technology to use?: As a “business analytics solutions” professional, I have always focused on “business outcomes based on insights from big-data”. Please see my earlier blog post “Data = Opportunity, but is your company monetizing information?” for an overview of outcomes-driven business analytics, or “How to transform Healthcare Performance through Data Analytics?” if you want to know more about my views within a particular industry.
If we agree that the choice of big-data technology depends on the specific use cases for business outcomes using better insights from data (the “why”), then the choice of technology (the “what”) becomes fairly simple. In such a scenario, all we have to do is follow a straightforward process to weigh the pros and cons of a particular technology’s factors (as listed above) with respect to the specific use case for business insight.
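As a purely hypothetical illustration of that weighing process (the factor names, candidate technologies, weights, and scores below are invented for demonstration, not recommendations), a simple weighted scoring of candidates against a specific use case might look like this:

```python
# Hypothetical decision sketch: weigh candidate technologies against the
# technical factors a specific use case cares about. All names, weights,
# and scores are made up for illustration.

# How much each technical factor matters for this use case (weights sum to 1)
use_case_weights = {
    "volume_scalability": 0.4,
    "query_performance": 0.2,
    "self_service_agility": 0.3,
    "governance_and_reconciliation": 0.1,
}

# How well each candidate fares on each factor (1 = weak, 5 = strong; illustrative)
candidates = {
    "Enterprise data warehouse": {
        "volume_scalability": 3, "query_performance": 4,
        "self_service_agility": 2, "governance_and_reconciliation": 5,
    },
    "Hadoop / Spark data lake": {
        "volume_scalability": 5, "query_performance": 3,
        "self_service_agility": 3, "governance_and_reconciliation": 2,
    },
    "Self-service BI / data discovery": {
        "volume_scalability": 2, "query_performance": 4,
        "self_service_agility": 5, "governance_and_reconciliation": 2,
    },
}

# Weighted score per candidate for this particular use case
for name, scores in candidates.items():
    total = sum(use_case_weights[f] * scores[f] for f in use_case_weights)
    print(f"{name}: {total:.2f}")
```

Change the weights to reflect a different use case (say, comptroller-grade financial reporting versus web-scale clickstream capture) and a different technology wins – which is exactly the point of starting with the “why”.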
Here are some high-level use cases for business insights from data, with an indicative choice of technology (again, this list is by no means comprehensive and can change based on exact needs):
- Integrated vendor-specific analytics stacks: This is the best bet for purpose-driven applications that require seamless reporting / analytics. For example, it is better for General Ledger (GL) systems to have core analytics / reporting integrated within the application stack. Having supported financial comptrollers, I can say with a fair degree of certainty that financial reporting within, or close to, an integrated System of Record (say Oracle Financials or the SAP Finance GL module) is best achieved in an integrated analytical stack (Oracle’s OBIEE or SAP’s HANA, respectively). Comptrollers are very uncomfortable about data integrity when a transaction leaves the System of Record – even if it is for reporting purposes (accuracy and reconciliation are paramount concerns).
- Data warehouses: While I am not a big fan of data warehouses from an agility standpoint, I tend to disagree with the pundits who predict that the demise of the data warehouse is around the corner. While I believe technologies such as the Hadoop Distributed File System will co-exist with it from a data capture / storage standpoint, data warehouses still have a role to play within the ecosystem, particularly given their relative analytical strength (stemming from ETL). For example, the data warehouse as a cross-system “Single Version of the Truth” – e.g., marrying the GL with operational transactional systems or sub-ledgers – will continue to stay. So if you are asking me, Teradata is here to stay! No, that’s not a stock tip.
- Data-marts / last-mile OLAP / business-driven ETL / data visualization: As discussed earlier, there is a fast-growing, business-user-driven need for departmental analytics. In particular, finance and marketing users who are data-savvy (e.g., with financial performance management or digital marketing analytics) will use OLAP-cubing-style tools such as Oracle Essbase or IBM Cognos TM1, or even some of the next-generation SaaS tools such as Anaplan (I hear a lot of good things in the market about this offering). Tableau and Qlik will emerge as the data discovery and visualization tools of choice for this user base. The issue with this particular approach will be data governance. However, my belief is that, with sufficient governance and better processes, these late-binding capabilities / last-mile analytics are here to stay.
- Hadoop Distributed File System / MapReduce (or Apache Spark): The strength of this approach is its ability to capture large volumes of data in a scalable fashion for subsequent processing. Expect this approach to become popular with emerging large data sets such as mobile, social, and web data. For example, a company might capture fan engagement on its Facebook page, track interactivity on its website, or collect location details from mobile devices to analyze and tailor campaigns.
- Machine Learning / Artificial Intelligence: This is an emerging area and will require large volumes of data to be pumped into analytical engines to find patterns, correlations, etc. (a small illustrative sketch follows this list). These platforms / frameworks are in the early stages of their evolution cycle and will evolve based on needs.
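As a toy illustration of the pattern-finding point in the last item (synthetic data, invented feature names, and scikit-learn chosen simply as an example engine), clustering captured engagement data into behavioral segments might look like this:

```python
# Toy sketch: feed captured "engagement" data into an analytical engine to
# find patterns (here, behavioral segments via k-means clustering).
# The data and feature names are synthetic, purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Fake per-user features: [visits_per_week, avg_session_minutes]
engagement = np.vstack([
    rng.normal([2, 3], 0.5, size=(100, 2)),    # casual visitors
    rng.normal([10, 15], 2.0, size=(100, 2)),  # highly engaged fans
])

# Look for two behavioral segments in the captured data
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(engagement)
print("Segment centers (visits/week, minutes/session):")
print(model.cluster_centers_)
```

In practice, such an engine would sit on top of the capture layer described in the previous item, reading from the data lake rather than from an in-memory array.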
Conclusion: There was a time when data technology lagged business needs. Today, that is not the case. There are quite a few sophisticated offerings, as discussed above, and more in the offing. In my opinion, the choice of technology will follow an evolutionary path to support the disruptive business models being designed by companies. Disrupt or be disrupted – and for the ones that disrupt, big-data is here to stay (and grow) by their side.
I welcome all thoughts.