I was at DataWeek/API World in mid-September 2014 (last week at the time of this writing) and saw some interesting things, almost entirely around Big Data. The two items that stood out for me were the graph database system Neo4j (which I wish I had the time and a reason to dig into more) and SiSense, which absolutely blew my mind.
Ever since I first heard of Hadoop and researched it, it has seemed like a very poor solution: way too much work, detached data, not real time, reliant on IT to put queries together, etc. SiSense saw these issues as well, but they addressed them with a totally different, two-pronged approach. This isn’t a product review, but rather an overview of the technology and its possibilities. SiSense is a provider of Business Intelligence (BI) technology that includes a back end powered by “in-chip” technology, which lets non-techies easily access and analyze large data sets from multiple sources, and a front end for creating dashboards and reports that will display on any device, including mobile. I’m going to focus on the former.
BI applications typically process an extraordinarily large amount of data to provide useful feedback in an easily digestible fashion. Because of that, they tend to be fairly slow, as there is a constant stream of data from disk to memory to the CPU. Many vendors process as much in RAM as possible to speed things up, but this requires lots and lots of RAM. To make more efficient use of that RAM, pretty much all the BI vendors use columnar databases as the staging area. This is about as efficient as you can get with your result set, because you only load the columns a query needs and drop the data you don’t need.
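To make the columnar idea concrete, here is a minimal sketch in Python. The sample records and the names `to_columns` and `revenue_by_region` are invented for illustration and are not taken from any BI product. The point is only that an aggregate query touches the two columns it needs and never reads the others.

```python
# Row-oriented layout: each record keeps all of its fields together.
# (Hypothetical sample data, for illustration only.)
rows = [
    {"region": "west", "product": "widget", "units": 12, "revenue": 240.0},
    {"region": "east", "product": "gadget", "units": 5, "revenue": 150.0},
    {"region": "west", "product": "gadget", "units": 7, "revenue": 210.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def revenue_by_region(cols):
    """Aggregate revenue per region, reading only the two columns needed."""
    totals = {}
    for region, revenue in zip(cols["region"], cols["revenue"]):
        totals[region] = totals.get(region, 0.0) + revenue
    return totals


columns = to_columns(rows)
print(revenue_by_region(columns))
```

In a real columnar store each column would live in its own contiguous region on disk or in memory, so the "product" and "units" values would never be loaded for this query at all.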
This is where SiSense diverges from the crowd and how they can declare they can process 100 times the data at 10 times the speed of the competition. SiSense leaves that data on the disk instead of trying to jam it into RAM, then compresses the heck out of it.
Now here is where their secret sriracha sauce comes in: They do the decompression in the CPU cache.
As you can see from our graphic, even the L3 cache of the CPU is much faster than doing the work in RAM, and because the data is all highly compressed, it moves off disk and through RAM at a significantly faster pace than could normally be achieved.
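The general shape of "keep it compressed, decompress a little at a time" can be sketched as follows. This is not SiSense's implementation, which I have not seen. It uses Python's standard `zlib` as a stand-in codec, and `BLOCK_VALUES` is a hypothetical block size chosen so that one decompressed block is about 32 KB, in the range of a typical L1 data cache. Python gives no control over what actually sits in the CPU cache, so this only shows the data flow.

```python
import zlib
from array import array

# Hypothetical block size: 4096 eight-byte integers is about 32 KB decompressed.
BLOCK_VALUES = 4096


def compress_column(values, block_values=BLOCK_VALUES):
    """Split a column of integers into blocks and compress each block separately."""
    blocks = []
    for start in range(0, len(values), block_values):
        chunk = array("q", values[start:start + block_values])  # signed 64-bit ints
        blocks.append(zlib.compress(chunk.tobytes()))
    return blocks


def sum_column(blocks):
    """Decompress one block at a time so only a small buffer is live at once."""
    total = 0
    for block in blocks:
        chunk = array("q")
        chunk.frombytes(zlib.decompress(block))
        total += sum(chunk)
    return total


column = [i % 10 for i in range(100_000)]  # repetitive data compresses well
blocks = compress_column(column)
raw_bytes = len(column) * array("q").itemsize
packed_bytes = sum(len(block) for block in blocks)
print(sum_column(blocks), raw_bytes, packed_bytes)
```

The compressed blocks are what travel from disk through RAM, and the full-size data only ever exists one small block at a time.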
The application doing all this work is Prism, and it doesn’t stop there. Prism keeps a memory map of the current location of all data. When it processes data or does any type of calculation, it applies vector algebra to the data, which lets Prism take advantage of the x86 in-chip Single Instruction Multiple Data (SIMD) vector instructions. These allow a short array of data to be processed by a single instruction. As a result, the CPU cores are able to process data much faster and in parallel.
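Plain Python cannot issue SIMD instructions, so the following is only a conceptual model of the data flow, not real vector code and not anything taken from Prism. The names `LANES`, `add_lanes`, and `vector_add` are hypothetical. Each call to `add_lanes` stands in for one vector instruction that adds several pairs of values at once, and the step counter shows why fewer instructions are needed.

```python
LANES = 4  # hypothetical vector width: values handled per "instruction"


def add_lanes(a, b):
    """Stand-in for one vector instruction: add up to LANES pairs in one step."""
    return [x + y for x, y in zip(a, b)]


def vector_add(left, right, lanes=LANES):
    """Add two equal-length columns lane-group by lane-group.

    Returns the result and the number of vector steps that were needed.
    """
    if len(left) != len(right):
        raise ValueError("columns must have the same length")
    result = []
    steps = 0
    for start in range(0, len(left), lanes):
        result.extend(add_lanes(left[start:start + lanes],
                                right[start:start + lanes]))
        steps += 1
    return result, steps


prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
taxes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
totals, steps = vector_add(prices, taxes)
print(totals, steps)
```

A scalar loop would need ten additions here. With four lanes the same work takes three steps, and wider hardware vectors shrink that further.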
Prism is designed to keep the CPU as active as possible. The CPU is the fastest piece of hardware in your box and is often underutilized, but Prism changes all that. The learning algorithm built into Prism also means that it gets faster over time, as it more intelligently optimizes and pre-fetches data. Counterintuitively, this also means it can get faster with more people using it, because the pre-loading optimizations improve as more queries are performed.
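I don't know how Prism's learning algorithm actually works, so the following is just one simple way the "more queries, better pre-loading" effect could arise. It assumes a plain frequency count over which columns past queries touched. The class `ColumnPreloader` and its methods are invented for this sketch.

```python
from collections import Counter


class ColumnPreloader:
    """Keep the most frequently queried columns marked as pre-loaded."""

    def __init__(self, capacity):
        self.capacity = capacity  # how many columns fit in the warm set
        self.hits = Counter()     # column name -> number of queries that used it
        self.preloaded = set()

    def record_query(self, columns_used):
        """Update usage counts, then refresh the warm set from the history."""
        self.hits.update(columns_used)
        self.preloaded = {
            name for name, _ in self.hits.most_common(self.capacity)
        }

    def is_warm(self, column):
        """Report whether a column is currently in the pre-loaded set."""
        return column in self.preloaded


preloader = ColumnPreloader(capacity=2)
preloader.record_query(["region", "revenue"])
preloader.record_query(["region", "units"])
preloader.record_query(["region", "revenue"])
print(sorted(preloader.preloaded))
```

Every additional query, from any user, sharpens the usage counts, which is why a shared history can make the system faster as more people use it.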
Back to my opening statement about Hadoop. SiSense and Prism will work with Hadoop and MapReduce data. However, the whole distributed and difficult nature of Hadoop evolved around having to manage insanely large amounts of data with no realistic way to do it in real time. With SiSense, you are able to crunch through terabytes of data with a large number of concurrent users on a single commodity server, without having a team of IT guys constantly managing it.
CPUs just get more and more powerful, with more cores and, presumably, more cache. The technology that SiSense has created is about as durable as it gets in terms of future-proofing. Nearly three quintillion bytes of data are created every day; 80 percent of it is unstructured, and only 20 percent of it is available to be processed.
The need for tools to effectively dig through all that data and present useful results is clear, and there are many vendors providing them, but SiSense, in my opinion, is genuinely addressing it the way it needs to be addressed.