Big Data, All Data, PureData, BLU Data

For some time now, when it comes to big data, my mantra has been “big data is simply all data”. IBM’s April 3 announcement served admirably to reinforce that point of view. Was it a big data announcement, a DB2 announcement, or a hardware announcement? The short answer is “yes”, to all of the above and more.

Weaving together a number of threads, Big Blue created a credible storyline that can be summarized in three key thoughts: larger, faster and simpler. As many of you may know, I worked for IBM until early 2008, so my views on this announcement are informed by my knowledge of how the company works or, perhaps, used to work. Last Wednesday, I came away impressed. Here were a number of diverse, individual product developments that conform to a single theme across different lines and businesses.

Take BLU Acceleration as a case in point. The headline, of course, is that DB2 10.5 for LUW (Linux, Unix and Windows) introduces a hybrid architecture. Data can be stored in columnar tables with extensive compression, making use of in-memory storage and taking further advantage of the parallel and vector processing techniques available on modern processors. The result is an improvement of up to 25 times in analytic and reporting performance (and considerably more in specific queries) and up to 90% data compression. In addition, the elimination of indexes and aggregates considerably reduces the need for manual tuning and maintenance of the database. This is a direction that has long been shown by smaller, newer vendors such as ParAccel and Vertica (now part of HP), so it is hardly a surprise. IBM can claim a technically superior implementation, but more impressive is the successful retrofitting into the existing product base, along with the re-use of the technology in the separate Informix TimeSeries code base to enhance analytics and reporting there too, and the promise that it will be extended to other data workloads in the future. It seems the product development organization is really pulling together across different product lines. That’s no mean feat within IBM.
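For readers who want a feel for why a columnar layout compresses so well and can answer an analytic query without an index, here is a minimal Python sketch. It is a toy illustration of the general technique only, not IBM's implementation; the sample rows and the function names (`to_columns`, `dictionary_encode`, `sum_by_region`) are assumptions of mine.

```python
# Hypothetical sample rows: (order_id, region, amount)
ROWS = [
    (1, "EMEA", 120.0),
    (2, "EMEA", 80.0),
    (3, "APAC", 200.0),
    (4, "EMEA", 50.0),
    (5, "APAC", 25.0),
]


def to_columns(rows):
    """Pivot row tuples into one list per column (a columnar layout)."""
    order_ids, regions, amounts = zip(*rows)
    return {
        "order_id": list(order_ids),
        "region": list(regions),
        "amount": list(amounts),
    }


def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table.

    Columns with few distinct values shrink a great deal this way.
    """
    dictionary = []   # code -> original value
    positions = {}    # original value -> code
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def sum_by_region(columns):
    """Total the amounts per region by scanning only the two columns needed.

    No index or pre-built aggregate is involved; the query works directly
    on the encoded column.
    """
    dictionary, codes = dictionary_encode(columns["region"])
    totals = [0.0] * len(dictionary)
    for code, amount in zip(codes, columns["amount"]):
        totals[code] += amount
    return dict(zip(dictionary, totals))
```

Calling `sum_by_region(to_columns(ROWS))` touches the region and amount columns and never reads `order_id`, which is the essence of the columnar advantage for reporting queries. Real engines add vectorized processing and much more sophisticated encodings on top of this idea.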

Another hint at the strength of the development team was the quiet announcement of a technology preview of JSON support in DB2 at the same time as the availability of 10.5. JSON is one of the darlings of the NoSQL movement, providing significant agility to support unpredictable and changing data needs. See my May 2012 white paper “Business Intelligence–NoSQL… No Problem” for more details. As with its support for other NoSQL technologies, such as XML and RDF graph databases, IBM has chosen to incorporate support for JSON into DB2. There are pros and cons to this approach. Performance and scalability may not match a pure JSON database, but the ability to take advantage of the ACID and RAS characteristics of an existing, full-featured database like DB2 makes it a good choice where business continuity is a strong requirement. IBM clearly recognizes that the world of data is no longer all SQL, but also that, for certain types of non-relational data, the difference is small enough that they can be handled as an adjunct to the relational model through a “subservient” engine, allowing easier joining of NoSQL and SQL data types. This is a vital consideration for machine-generated data, one of three information domains I’ve defined in a recent white paper, “The Big Data Zoo–Taming the Beasts”.
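To make the “subservient engine” idea a little more concrete, here is a minimal Python sketch that keeps JSON documents in an ordinary relational table and joins them to relational rows in SQL. It uses SQLite purely as a stand-in and says nothing about how DB2 implements its JSON support; the tables, the sample data and the `json_field` helper are all hypothetical.

```python
import json
import sqlite3


def json_field(document, key):
    """Return one top-level field from a JSON document stored as text."""
    return json.loads(document).get(key)


def build_database():
    """Create an in-memory database holding relational rows and JSON documents."""
    conn = sqlite3.connect(":memory:")
    # Register the hypothetical helper so it can be called from SQL.
    conn.create_function("json_field", 2, json_field)
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE events (doc TEXT)")
    # One transaction covers both kinds of data: it commits or rolls back as a unit.
    with conn:
        conn.executemany(
            "INSERT INTO customers VALUES (?, ?)",
            [(1, "Acme"), (2, "Globex")],
        )
        conn.executemany(
            "INSERT INTO events VALUES (?)",
            [
                (json.dumps({"customer_id": 1, "type": "login"}),),
                (json.dumps({"customer_id": 1, "type": "purchase", "amount": 40}),),
                (json.dumps({"customer_id": 2, "type": "login", "device": "mobile"}),),
            ],
        )
    return conn


def events_per_customer(conn):
    """Join schema-free JSON events to the relational customers table."""
    query = """
        SELECT c.name, COUNT(*)
        FROM customers AS c
        JOIN events AS e ON json_field(e.doc, 'customer_id') = c.id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()
```

Note that the three event documents do not share a fixed set of fields, yet they join cleanly to the customer rows and are written under the same transaction. The trade-off is the one described above: extracting fields at query time like this is unlikely to match the performance of a purpose-built document store.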

The announcement didn’t ignore the little yellow elephant, either. The PureData System family has been expanded with the PureData System for Hadoop, which has built-in analytics acceleration and archiving and provides significantly simpler and faster deployment of projects requiring the MapReduce environment. And InfoSphere BigInsights 2.1 offers the Big SQL interface to Hadoop and an alternative file system, GPFS-FPO, with enhanced security, high availability and no single point of failure.

While the announcement clearly targeted “Big Data–at the Speed of Business”, the underlying message, as seen above, is much broader. The view is of an emerging information ecosystem that must be considered from a fully holistic viewpoint. A key role, and perhaps even the primary role, for BigInsights / Hadoop is in exploratory analytics, where innovative, what-if thinking is given free rein. But the useful insights gained here must eventually be transferred to production (and back) in a reliable, secure, managed environment–typically a relational database. This environment must also operate at speed, with large data volumes and with ease of management and use. These are characteristics that are clearly emphasized in this announcement. They are also key components of the integrated information platform I described in the Data Zoo white paper already mentioned. Still missing are some of the integration-oriented aspects, such as comprehensive, cross-platform metadata management, data integration and virtualization, that are required to tie it all together. IBM has more to do to deliver on the full breadth of this vision, but this announcement is a big step in the right direction.