The topic of big data continues to pulsate with vigor in the market, as demonstrated by the wide variety of data innovations emerging daily and the talented professionals successfully pursuing the creation and use of big data solutions.
We are reaching an inflection point in the market where the level of hype and frequent confusion about “Big Data” will soon be replaced by customer success stories.
Already we’re seeing such stories emerge as businesses willingly share their triumphs. As with any paradigm shift in computing that draws great attention from the press, investors and innovators, this shift is supported by strong business value proof points. This was the case during the paradigm shifts and hype cycles of client-server computing, distributed computing, the internet, service-oriented architectures, and languages such as Java.
We’re also seeing the emergence of a beneficial ecosystem that complements or extends the capabilities of the core enabling technology; in this case, technologies that complement big data systems such as Hadoop, Cassandra, Accumulo and solutions from industry data titans Oracle and IBM.
So what trends might we see emerge in the Big Data ecosystem?
Continuous Expansion and Unification of SQL on Hadoop. A number of technology companies are working hard to build a SQL layer on top of big data solutions that lack native SQL support, such as Hadoop. The depth and breadth of support for the SQL language varies, but SQL-savvy professionals will be able to take advantage of these advances to run highly interactive SQL on big data. Examples include Hadapt, Impala, Teradata Aster and EMC Greenplum's Pivotal HD.
Unified Support for Structured, Unstructured and Semi-Structured Data as the growth of unstructured data continues. IDC projects that the amount of digital data, mostly in the form of unstructured data, will grow 40-50% per year. By 2020, that total will reach 40 zettabytes. Unstructured data comes from email, forums, blogs, social networks, point-of-sale systems and machine-generated data. In order to capture and analyze this massive amount of varied data, innovators are expanding their big data solutions beyond capturing only one type of data.
In addition, we will see the emergence and adoption of solutions including the Oracle MDEX engine, Accumulo and Attivio to capture this varied data in a single store.
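To put the 40-50% annual growth figure cited above in perspective, here is a small back-of-the-envelope sketch of what compound growth at those rates implies for doubling time. The function name is purely illustrative, and the only inputs are the two growth rates quoted from IDC; no starting volume is assumed.

```python
import math

def doubling_time_years(annual_growth_rate):
    """Years for a quantity to double under compound annual growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)

# The two ends of the growth range quoted in the text.
low_growth = doubling_time_years(0.40)   # roughly 2.06 years
high_growth = doubling_time_years(0.50)  # roughly 1.71 years

print(f"At 40% per year, data doubles about every {low_growth:.2f} years")
print(f"At 50% per year, data doubles about every {high_growth:.2f} years")
```

In other words, at these rates the total volume of data roughly doubles every two years or less.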
Advances in Search. Sifting through massive amounts of data to find that proverbial needle in the haystack is no simple task. Over time we will likely see more big data vendors building search support into their solutions. Leading the way in this endeavor are LucidWorks, IBM, Oracle through the acquisition of Endeca (full disclosure, I’m a former Endeca employee), Autonomy and MarkLogic. LucidWorks combines an open source stack of Lucene/Solr, Hadoop, Mahout and NLP.
Expanded ETL and ELT Support. Many have spoken about Hadoop’s primary use case being ETL workflows, because of the batch nature of Hadoop. However, if you were to look at all of the pieces of infrastructure necessary to build and maintain a complex Hadoop-based ETL solution, you might end up running the other way, towards pure-play ETL solutions from Informatica, Talend, Syncsort and CloverETL. For years these vendors have focused on building best-of-breed ETL solutions, which are now more frequently called data integration solutions.
Pure-play ETL vendors have worked diligently to ensure support for big data solutions. This includes support not only for ETL, but also for ELT, where the data is loaded first and the transforms are then executed inside Hadoop itself. This enables one to pair the development environments of popular ETL solutions with the strong processing capabilities of Hadoop. Over time, these ETL pure plays will support a wide range of big data solutions from the NewSQL and NoSQL providers.
In addition, I expect that many of the big data solutions will embed ETL and ELT support within their stacks, just as many of the traditional database vendors have done through embedding or acquisition of ETL solutions.
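To make the ETL versus ELT distinction above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the target engine (it is not Hadoop), and the sample records, table names and function names are all hypothetical. The point is simply where the transform runs: in the integration layer before loading (ETL), or inside the engine after loading (ELT).

```python
import sqlite3

# Hypothetical raw records: (customer, amount as messy text, currency code).
RAW_ROWS = [
    ("alice", " 10.50", "usd"),
    ("bob", "3.25 ", "USD"),
    ("alice", "4.00", "Usd"),
]

def run_etl(conn):
    """ETL: clean the rows outside the engine, then load the cleaned result."""
    cleaned = [
        (customer, float(amount.strip()), currency.upper())
        for customer, amount, currency in RAW_ROWS
    ]
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", cleaned)

def run_elt(conn):
    """ELT: load the raw rows as-is, then transform inside the engine with SQL."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT, currency TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT customer, "
        "       CAST(TRIM(amount) AS REAL) AS amount, "
        "       UPPER(currency) AS currency "
        "FROM sales_raw"
    )

def totals_by_customer(conn, table):
    """Sum the cleaned amounts per customer for the given table."""
    query = (
        f"SELECT customer, SUM(amount) FROM {table} "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_totals = totals_by_customer(conn, "sales_etl")
elt_totals = totals_by_customer(conn, "sales_elt")
print(etl_totals)
print(elt_totals)
```

Both paths arrive at the same totals. The practical difference is which system does the heavy lifting, which matters a great deal once the data no longer fits comfortably on the integration server.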
Big Data In Motion Takes Hold. As I’ve previously written (Big Data “In Motion”: The Next Phase of Big Data), the open source framework Apache Hadoop has traditionally been used for batch-oriented processing of very large data sets in a distributed environment, primarily in the context of analytics. As brands begin to focus more on how to rein in and leverage the vast data assets available today for real-time decisioning, we anticipate significant impact and growth of “Big Data In Motion”. The “in motion” refers to the real-time information flow required to handle the extremely large streams of data present in a variety of businesses, including capital markets, healthcare, energy and social media.
Added Data Mining and Analytic Functions. Industry leaders in the big data space understand the need to expand the underlying analytics and statistical capabilities of their platforms. This goes beyond typical analytic functions into the world of very sophisticated data mining functionality. Teradata Aster Data includes a wide variety of analytic capabilities, including support for statistical analysis, text analytics, graph analysis, sentiment analysis and in-database PMML execution through the support of Zementis. Other companies, including IBM Netezza, have embedded support for the popular R statistical language as well as the Matrix engine, a parallelized linear algebra package. Over time, we will see a significant expansion of these capabilities across a broad range of big data solutions.
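As a rough illustration of the difference between batch processing and data “in motion”, the sketch below keeps a running total over a sliding time window and updates it as each event arrives, rather than waiting for the full data set to land. This is a toy, single-process example in plain Python; the class name, the ten-second window and the sample events are assumptions made for illustration and do not represent any particular streaming product.

```python
from collections import deque

class SlidingWindowSum:
    """Running sum of the values seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Ingest one event and return the up-to-date window total."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, expired_value = self.events.popleft()
            self.total -= expired_value
        return self.total

# Hypothetical stream of (timestamp in seconds, trade size) events.
stream = [(0, 5.0), (1, 3.0), (4, 2.0), (11, 7.0)]

window = SlidingWindowSum(window_seconds=10)
running_totals = [window.add(ts, value) for ts, value in stream]
print(running_totals)
```

A batch job would compute one answer after all four events were stored. Here an answer is available after every event, which is the property real-time decisioning depends on.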
Gains in Popularity of the R Language. There is no doubt that R is becoming more and more popular as an open statistical language. Revolution Analytics has made significant progress in developing a “production-grade” version of R with performance enhancements and other enterprise features. Furthermore, they have developed solutions including R for Hadoop, R for IBM PureData as well as R for Big Data.
Universities are also ramping up courses in R that will expose many students to the powerful capabilities of this language and equip them with the skills required to perform complex statistical analysis. We will likely see R embedded in many more big data solutions, along with significant improvements to the language and its performance.
As the big data ecosystem evolves, so must your business. Those implementing data-driven strategies will surpass the competition and thrive in today’s marketplace.
(image: Big Data / shutterstock)