Decision! Decision! Decision! What a hazardous and difficult human endeavor it is! Those of us who have had to make decisions in personal life, business or profession know that the chance of our decision producing the desired end result is always in doubt. This is so mainly because decisions made today fructify tomorrow. If all decision makers were clairvoyant, no one would make wrong decisions; making decisions would be a routine job.
Unfortunately, this is never going to happen. We will keep making wrong decisions, as we do today. However, we could substantially enhance our chances of making the right decision by reviewing everything that is happening now, or has happened in the past, in our target area of activity. This may be a tall order, but in the information age we live in it seems feasible. So much data is available from diverse sources that if we evolve a scientific and practical way of analyzing it all, we will find answers to our queries as we never have before. That ‘way’ is Big Data Analytics.
Genesis of Big Data
Big Data, one of the hottest IT buzzwords of 2012, has emerged as a new technology paradigm to address the volume, velocity and variety of massive data coming from different sources. Social media is one well-known source of big data. A somewhat less known but nonetheless big source is the data generated by data acquisition systems (DAS) in machinery and structures in the field of engineering. Large volumes of data are also being generated by health monitoring devices of interest to medical professionals. There are many other sources too. Within these heaps of massive data lies a treasure of information that can be extracted to avert major disasters and accidents, detect outbreaks of epidemics, and so on. In the field of business and marketing, big data available through social networks is already being proactively used to propel business growth.
Businesses have long relied on tools such as Business Intelligence (BI) dashboards and reports for decisions based on transactional data stored in relational databases. With the evolution of social media, we started seeing the emergence of non-traditional, less structured data such as weblogs, social media feeds, email, sensor readings, photographs and YouTube videos that can be analyzed for useful information. With the reduction in the cost of both storage and compute power, it is now feasible to store and analyze this data as well for meaningful purposes. As a result, it is important that businesses cast a new look at this extended range of data, i.e. Big Data, for business intelligence and decision making.
Sources of Big Data
The major sources of Big Data may be listed as follows:
- Enterprise applications data, which generally includes data emanating from Enterprise Resource Planning (ERP) systems, customer information from Customer Relationship Management (CRM) systems, Supply Chain Management systems, e-commerce transactions, and Human Resource (HR) and payroll transactions.
- Machine-generated (semantic) data, comprising Call Detail Records (CDRs) from call centers, weblogs, smart meters, manufacturing sensors, equipment logs and trading-system data generated by machines and computer systems.
- Social media data, which includes customer feedback streams, micro-blogging sites like Twitter, and social media platforms like Facebook.
The McKinsey Global Institute [1] estimates that data volume is growing 40% per year, and will grow 44-fold between 2009 and 2020 (broadly consistent with 40% compound growth over eleven years, since 1.4^11 ≈ 40). Four key characteristics, the ‘4 Vs’ of volume, velocity, variety and value, are commonly used to characterize different aspects of big data.
Characteristics of Big Data
We consider these characteristics in some detail in the following few paragraphs.
Volume
Social media (Facebook, Twitter, LinkedIn, Foursquare, YouTube and many more) is an obvious source of a large volume of data. Machine-generated data, or semantic web data, is another large but somewhat less known source. To judge the volume of this type of data, it may be sufficient to know that a single jet aircraft engine can generate 10 TB of data in 30 minutes. With more than 25,000 airline flights per day, the daily volume of just this single data source runs into petabytes (10^15 bytes): even one 30-minute window per engine per flight amounts to 25,000 × 10 TB = 250 PB a day. Smart meters and heavy industrial equipment like oil refineries and drilling rigs generate similar data volumes.
Velocity
Data comes into the data management system rapidly and often requires quick analysis for decision making. The speed of the feedback loop, taking data from input through analysis to decision, is extremely important: the tighter the feedback loop, the greater the usefulness of the data. It is this need for speed, particularly on the web, that has driven the development of key-value stores and columnar databases, optimized for the fast retrieval of pre-computed information. These databases form part of an umbrella category known as NoSQL (Not Only SQL), used when relational models do not suffice. Social media data streams bring a large inflow of opinions and relationships valuable to customer relationship management in the retail business. Even at 140 characters per tweet, the high velocity of Twitter data generates large volumes (over 8 TB per day). Much of this data may be of low value, and analytical processing may be required to transform it into a usable form or derive meaningful information.
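To illustrate the idea of pre-computed retrieval, the following minimal Python sketch stands in for a key-value store: counters are updated on the write path as each message arrives, so a dashboard query on the read path is a single key lookup. The store and field names here are purely illustrative; in practice this role is played by systems such as Redis or HBase.

```python
from collections import defaultdict

# In-memory stand-in for a key-value store such as Redis or HBase.
# Values are pre-computed aggregates, so reads never re-scan the raw stream.
kv_store = defaultdict(int)

def ingest(message):
    """Write path: update pre-computed counters as each message arrives."""
    for word in message["text"].split():
        if word.startswith("#"):
            kv_store["hashtag:" + word.lower()] += 1

def lookup(hashtag):
    """Read path: serve a dashboard query with a single key lookup."""
    return kv_store["hashtag:" + hashtag.lower()]

# Simulated high-velocity stream of two messages.
stream = [
    {"text": "Loving the new phone #gadgets"},
    {"text": "#gadgets are getting cheaper"},
]
for m in stream:
    ingest(m)
print(lookup("#gadgets"))  # -> 2, with no scan of the stream at query time
```

The tighter feedback loop comes precisely from doing the aggregation work on ingest rather than at query time.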
Variety
Big Data brings a variety of data types, from text, images and video on social networks, to raw feeds coming directly from sensors, to semantic weblogs generated by machines. These data are often not associated with any particular application. A common use of big data processing is to take unstructured data and extract meaningful information for consumption either by humans or as structured input to an application. Big data carries patterns, sentiments and behavioral information that need analysis. Relational Database Management Systems (RDBMS) were not designed to address this sort of data.
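As a small illustration of extracting structured input from unstructured text, the Python sketch below turns a raw social media post into a structured record. The choice of fields (hashtags, mentions, links) is an assumption made for the example, not a standard.

```python
import re

def structure(post):
    """Extract a structured record from one raw, unstructured post.
    The fields chosen (hashtags, mentions, links) are illustrative."""
    return {
        "hashtags": re.findall(r"#(\w+)", post),
        "mentions": re.findall(r"@(\w+)", post),
        "links": re.findall(r"https?://\S+", post),
        "text": re.sub(r"[#@]\w+|https?://\S+", "", post).strip(),
    }

raw = "Great service @acme_support! See https://example.com #happycustomer"
print(structure(raw))
# {'hashtags': ['happycustomer'], 'mentions': ['acme_support'],
#  'links': ['https://example.com'], 'text': 'Great service ! See'}
```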
Value
The value of different data varies significantly. Generally, good information lies hidden within a larger body of non-traditional data. Big data offers great value to businesses by bringing real-time market and customer insights, enabling improvement of products and services. Big data analytics can reveal insights such as peer influence among customers, uncovered by analyzing shoppers’ transactions together with social and geographical data. The successful web startups of the past decade are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user’s actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. Recently, the machine tool industry has developed the MTConnect protocol, through which it will be possible to collect and broadcast key performance indicators (KPIs) to interested parties, who can evaluate how efficiently machine tools are operating and anticipate machine problems [2].
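As a hedged illustration of the kind of client MTConnect enables, the sketch below polls an agent's /current endpoint, which returns an XML snapshot of the device's data items. The agent URL is hypothetical, and the data item names shown vary by device and protocol version.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical agent address; real MTConnect agents expose read-only
# HTTP endpoints such as /probe, /current and /sample returning XML.
AGENT_URL = "http://mtconnect-agent.example.com:5000/current"

def read_kpis():
    """Fetch the current snapshot of the machine's data items and print
    a few of them. Element names differ by device and protocol version."""
    with urllib.request.urlopen(AGENT_URL) as resp:
        root = ET.fromstring(resp.read())
    for elem in root.iter():
        name = elem.tag.split("}")[-1]  # drop the XML namespace prefix
        if name in ("Availability", "SpindleSpeed", "PathFeedrate"):
            print(name, elem.get("timestamp"), elem.text)

read_kpis()
```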
Big Data Solutions
With the evolution of the cloud deployment model, the majority of big data solutions are offered as software-only products, as appliances, or as cloud-based offerings. As with any other application deployment, big data deployment will depend on several issues such as data locality, privacy and governmental regulations, human resources and project requirements. Many organizations are opting for a hybrid solution, using on-demand cloud resources to supplement in-house deployments.
Big data is messy and needs enormous effort in cleansing and quality enhancement. The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming and scientific instinct.
Current data warehousing projects take a long time to offer meaningful analytics to business users, since they depend on extract, transform and load (ETL) processes from various data sources. Big data analytics, on the other hand, can be defined as the process of parsing large data sets from multiple sources and producing information in real time or near-real time.
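For contrast, here is a toy batch ETL step of the traditional kind in Python, using SQLite in place of real source and warehouse systems (the table names are illustrative): data is extracted from a source, transformed, and loaded into a warehouse table on a periodic schedule, which is why insights lag behind events.

```python
import sqlite3

# Source system with raw transactional data.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE raw_sales (item TEXT, price_cents INTEGER)")
src.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                [("book", 1250), ("pen", 199)])

# Warehouse target with a cleaned, analysis-ready schema.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (item TEXT, price_usd REAL)")

# Extract, transform (cents -> dollars), load: runs as a periodic batch,
# the delay big data pipelines aim to shrink toward real time.
for item, cents in src.execute("SELECT item, price_cents FROM raw_sales"):
    dwh.execute("INSERT INTO sales VALUES (?, ?)", (item, cents / 100.0))

print(dwh.execute("SELECT * FROM sales").fetchall())
# [('book', 12.5), ('pen', 1.99)]
```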
Big data analytics represents a big opportunity. Many large businesses are exploring analytics capabilities to parse web-based data sources and extract value from social media. An even larger opportunity, however, is the Internet of Things (IoT) emerging as a data source. Cisco Systems Inc. estimates [3] that there are approximately 35 billion electronic devices that can connect to the Internet. In fact, any electronic device can be connected to the Internet, and even automakers are building Internet connectivity into vehicles. “Connected” cars are becoming commonplace and generate millions of transient data streams.
The big data market (comprising technology and services) is on the verge of rapid growth, heading toward the $50 billion mark worldwide within the next five years. As of early 2012, the big data market stands at just over $5 billion based on related software, hardware and services revenue. Increased interest in, and awareness of, the power of big data and related analytic capabilities to gain competitive advantage and improve operational efficiencies, coupled with developments in the technologies and services that make big data a practical reality, will result in a super-charged compound annual growth rate (CAGR) of 58% between now and 2017. Of the current market, big data pure-play vendors account for $310 million in revenue. Despite their relatively small share of current overall revenue (approximately 5%), vendors such as Vertica, Splunk and Cloudera are responsible for the vast majority of the new innovations and modern approaches to data management and analytics that have emerged over the last several years and made big data the hottest sector in IT. Wikibon [4] considers big data pure-plays to be those independent hardware, software or services vendors whose big data-related revenue accounts for 50% or more of total revenue.
The big data market includes technologies, tools, and services designed to address these opportunities, including:
- Hadoop distributions, software, projects and related hardware;
- Next-generation data warehouses and related hardware;
- Data integration tools and platforms as applied to big data;
- Big data analytic platforms, applications, and data visualization tools.
Pure-play Vendors Delivering Big Data Innovation
The most impactful innovations in the big data market are coming from the numerous pure-play vendors that own just a small share of the overall market. Hadoop distribution vendors Cloudera and Hortonworks are significant contributors to the Apache Hadoop project, markedly improving the open source big data framework’s performance and enterprise-readiness. Cloudera, for example, contributes significantly to Apache HBase, the Hadoop-based non-relational database that allows for low-latency, quick lookups.
Hortonworks engineers are working on a next-generation MapReduce architecture that promises to increase the maximum Hadoop cluster size beyond its current practical limitation of 4,000 nodes. MapR takes a more proprietary approach to Hadoop, supplementing HDFS (Hadoop Distributed File System) with its API-compatible Direct Access NFS in its enterprise Hadoop distribution, adding significant performance capabilities. Next-generation vendors such as Vertica, Greenplum and Aster Data are redefining the traditional enterprise data warehouse market with massively parallel, columnar analytic databases that deliver lightning-fast data loading and real-time analytic capabilities.
The latest iteration of the Vertica Analytic Platform, Vertica 5.0, for example, includes new elastic capabilities to easily expand or contract deployments and many in-database analytic functions.
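The essence of the columnar approach can be shown with a toy Python sketch (the data is illustrative): because each attribute is stored contiguously, an aggregate needs to scan only the one column it touches, rather than every record.

```python
# Row-oriented vs column-oriented layout: a toy illustration of why
# columnar analytic databases answer aggregates quickly.
rows = [  # row store: each record kept together (good for record lookups)
    {"id": 1, "region": "EU", "sales": 120.0},
    {"id": 2, "region": "US", "sales": 340.0},
    {"id": 3, "region": "EU", "sales": 75.0},
]

columns = {  # column store: each attribute kept together (good for scans)
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "sales": [120.0, 340.0, 75.0],
}

# The aggregate reads only the 'sales' column; real engines add
# compression and massive parallelism on top of this layout.
print(sum(columns["sales"]))              # column scan: 535.0
print(sum(r["sales"] for r in rows))      # row store must read every record
```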
Big Data Analytics Platforms and Applications
Hadoop-based platforms: A few niche vendors are developing applications and platforms that leverage the underlying Hadoop infrastructure to provide both data scientists and business users with easy-to-use tools for experimenting with big data. These include Datameer [5], which has developed a Hadoop-based business intelligence platform with a familiar spreadsheet-like interface; Karmasphere [6], whose platform allows data scientists to perform ad hoc queries on Hadoop-based data via a SQL interface; and Digital Reasoning [7], whose Synthesis platform sits on top of Hadoop to analyze text-based communication.
Cloud-based applications and services are increasingly allowing small and mid-sized businesses to take advantage of big data without needing to deploy on-premises hardware or software. Tresata’s [8] cloud-based platform, for example, leverages Hadoop to process and analyze large volumes of financial data, returning results via on-demand visualizations for banks, financial data companies and other financial services firms. 1010data [9] offers a cloud-based application that allows business users and analysts to manipulate data in the familiar spreadsheet format, but at big data scale. And the ClickFox [10] platform mines large volumes of customer touch-point data to map the total customer experience, with visuals and analytics delivered on demand.
Non-Hadoop big data platforms: Other vendors contributing significant innovation to the big data landscape include Splunk [11], which specializes in processing and analyzing log file data, allowing administrators to monitor IT infrastructure performance and identify bottlenecks and other disruptions to service; HPCC (High Performance Computing Cluster) Systems, a spin-off of LexisNexis [12], which offers a big data framework competing with Hadoop that its engineers built internally over the last ten years to help the company process and analyze large volumes of data for clients in finance, utilities, education, research and government; and DataStax [13], which offers a commercial version of the open source Apache Cassandra NoSQL database, along with related support services, bundled with Hadoop.
To make the most meaningful use of big data, businesses must evolve their IT infrastructures to handle extreme volumes of data, delivered at a rapid rate and in varying data types, which can then be integrated with an organization’s other enterprise data and analyzed. When big data is captured, optimized and analyzed in combination with traditional enterprise data, businesses can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation, all with an impact on the bottom line. For example, in the delivery of healthcare services, management of chronic or long-term conditions is expensive. The use of in-home monitoring devices to measure vital signs and monitor progress is just one way that sensor data can be used to improve patient health and reduce both office visits and hospital admissions.
Manufacturing companies deploy sensors in their products to return a stream of telemetry. The proliferation of smart phones and other GPS devices offers advertisers an opportunity to target consumers when they are in close proximity to a store, a coffee shop or a restaurant. This opens up new revenue for service providers and offers many businesses a chance to target new customers.
The use of social media and web log files from their e-commerce sites can help retailers understand their customers’ buying patterns, behavior, likes and dislikes. This can enable much more effective micro customer segmentation and targeted marketing campaigns, as well as improve supply chain efficiencies. As with data warehousing, web stores or any IT platform, an infrastructure for big data has unique requirements. In considering all the components of a big data platform, it is important to be able to easily integrate big data with enterprise data so as to conduct deep analytics on the combined data set.
The requirements of a big data infrastructure span data acquisition, data organization and data analysis. Because big data involves data streams of higher velocity and higher variety, the infrastructure supporting big data acquisition must deliver low, predictable latency both in capturing data and in executing short, simple queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible, dynamic data structures.
NoSQL databases are frequently used to acquire and store big data. They are well suited to dynamic data structures and are highly scalable. The data stored in a NoSQL database is typically of wide variety, because such systems are intended simply to capture all data without categorizing or parsing it. For example, NoSQL databases are often used to collect and store social media data. To allow use across varying customer applications, the underlying storage structures are kept simple.
Instead of designing a schema with relationships between entities, these simple structures often just contain a major key identifying the data point and a content container holding the relevant data. This simple, dynamic structure allows changes to take place without costly reorganizations at the storage layer. In classical data warehousing terms, organizing data is called data integration. Because the volume of big data is so high, there is a tendency to organize data at its original storage location, saving both time and money by not moving large volumes of data around. The infrastructure required for organizing big data must be able to process and manipulate data in its original storage location; support very high throughput (often in batch) to deal with large data processing steps; and handle a great variety of data formats, from unstructured to structured.
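A minimal Python sketch of this key-plus-content-container idea follows, with an in-memory dict standing in for a real NoSQL store and illustrative record fields. Note that the second record introduces a field the first one lacks, with no schema change or storage reorganization.

```python
import json

store = {}  # stand-in for a NoSQL key-value store

def put(major_key, content):
    """Store an arbitrary content container under a single identifying key;
    nothing is parsed or validated on the way in."""
    store[major_key] = json.dumps(content)

def get(major_key):
    return json.loads(store[major_key])

# Two records with different shapes coexist without any schema migration.
put("tweet:1001", {"user": "alice", "text": "New phone! #gadgets"})
put("tweet:1002", {"user": "bob", "text": "Nice cafe", "geo": [51.5, -0.1]})

print(get("tweet:1002")["geo"])  # the new 'geo' field needed no reorganization
```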
Apache Hadoop is a new technology that allows large data volumes to be organized and processed while keeping the data on the original data storage cluster. HDFS is the long-term storage system for web logs, for example. These web logs are turned into browsing behavior (sessions) by running MapReduce programs on the cluster and generating aggregated results on the same cluster; these aggregated results are then loaded into an RDBMS. Since data is not always moved during the organization phase, the analysis may also be done in a distributed environment, where some data stays where it was originally stored and is transparently accessed from a data warehouse. The infrastructure required for analyzing big data must be able to support deeper analytics, such as statistical analysis and data mining, on a wider variety of data types stored in diverse systems; scale to extreme data volumes; deliver faster response times driven by changes in behavior; and automate decisions based on analytical models. Most importantly, the infrastructure must be able to integrate analysis on the combination of big data and traditional enterprise data. New insight comes not just from analyzing new data, but from analyzing it within the context of the old to provide new perspectives on old problems. For example, analyzing inventory data from a smart vending machine in combination with the events calendar for the venue in which the vending machine is located will dictate the optimal product mix and replenishment schedule for the vending machine.
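The sessionization step described above can be sketched in a few lines of Python. This is a single-process simulation of the map, shuffle and reduce phases, assuming a simplified log format of (user_id, timestamp, url) and a 30-minute inactivity gap; a real job would run the same two functions in parallel across the Hadoop cluster.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # 30 minutes of inactivity starts a new session

# Simplified web log records: (user_id, unix_timestamp, url)
log = [
    ("u1", 1000, "/home"), ("u1", 1300, "/cart"),
    ("u1", 9000, "/home"), ("u2", 1100, "/search"),
]

def map_phase(records):
    """Map: emit (user_id, (timestamp, url)) so records shuffle by user."""
    for user, ts, url in records:
        yield user, (ts, url)

def reduce_phase(user, visits):
    """Reduce: split one user's time-sorted visits into sessions."""
    visits = sorted(visits)
    sessions, current = [], [visits[0]]
    for prev, cur in zip(visits, visits[1:]):
        if cur[0] - prev[0] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(cur)
    sessions.append(current)
    return user, len(sessions)

# Simulate Hadoop's shuffle-and-sort, then reduce per user.
pairs = sorted(map_phase(log), key=itemgetter(0))
for user, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_phase(user, [v for _, v in group]))  # ('u1', 2), ('u2', 1)
```

Only the small aggregated results (sessions per user) would then be loaded into the RDBMS; the bulk data never leaves the cluster.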
Many new technologies have emerged to address the IT infrastructure requirements outlined above.
- NoSQL solutions: developer-centric, specialized systems
- SQL solutions: the world of relational database management systems (RDBMS), typically equated with manageability, security and trustworthiness
NoSQL systems are designed to capture all data without categorizing and parsing it upon entry into the system, and therefore the data is highly varied. SQL systems, on the other hand, typically place data in well-defined structures and impose metadata on the data captured to ensure consistency and validate data types.
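The contrast can be made concrete with a short Python sketch: the SQL side uses the standard library's sqlite3 module with a typed, well-defined table, while the NoSQL side simply appends whatever arrives. Table and field names are illustrative.

```python
import sqlite3

# SQL side: a well-defined structure with types imposed up front.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY,"
           " customer TEXT, amount REAL)")
db.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, "alice", 42.50))
# The declared schema documents and constrains the data; a strict RDBMS
# would reject an ill-typed row (SQLite itself is loosely typed).

# NoSQL side: capture everything as-is; no categorizing or parsing on entry.
events = []
events.append({"type": "click", "page": "/home"})
events.append({"raw": "<xml>anything at all</xml>", "source": "sensor-7"})

print(db.execute("SELECT * FROM orders").fetchall(), len(events))
```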
Distributed file systems and transactional (key-value) stores are primarily used to capture data, and are generally in line with the requirements discussed earlier in this paper. To interpret and distill information from the data in these systems, a programming paradigm called MapReduce is used. MapReduce programs are custom-written programs that run in parallel on the distributed data nodes.
Key-value stores, or NoSQL databases, are the OLTP (online transaction processing) databases of the big data world: they are optimized for very fast data capture and simple query patterns. NoSQL databases achieve very fast performance because captured data is quickly stored with a single identifying key, rather than being interpreted and cast into a schema. This lets a NoSQL database rapidly store large numbers of transactions.
However, due to the changing nature of the data in the NoSQL database, any data organization effort requires programming to interpret the storage logic used. This, combined with the lack of support for complex query patterns, makes it difficult for end users to distill value out of data in a NoSQL database.
To get the most from NoSQL solutions, and to turn them from specialized, developer-centric solutions into solutions for the enterprise, they must be combined with SQL solutions into a single, proven infrastructure that meets the manageability and security requirements of today’s enterprises.
References
1. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh and Angela Hung Byers, “Big Data: The Next Frontier for Innovation, Competition and Productivity”, McKinsey Global Institute, May 2011.
2. Kathy Levy, “A New Age in Manufacturing Is At Hand”, The Shot Peener (ISSN 1069-2010), vol. 26, issue 2, Spring 2012.
3. John Webster, “Understanding Big Data Analytics”, May 2012.
4. Jeff Kelly, David Vellante and David Floyer, “Big Data Market Size and Vendor Revenues”, wikibon.org, May 2012.