A chat yesterday with Max Schireson, President of 10gen, makers of MongoDB (the name comes from “humongous database”), provided some food for thought on our assumptions about the best database choices for different applications. Such thinking is particularly relevant for BI at the moment, as the range of database choices expands rapidly.
But first, for traditionalist BI readers, a brief introduction to MongoDB, which is one of the growing horde of so-called NoSQL “databases”, some of which have very few of the characteristics of databases. NoSQL stores come in half a dozen generic classes, and MongoDB is in the class of “document stores” along with tools such as Apache’s CouchDB and Terrastore. Documents? If you’re confused, you are not alone. In this context, we don’t mean textual documents used by humans, but rather use the word in a programming sense: a collection of data items stored together for convenience and ease of processing, generally without a predefined schema. From a database point of view, such a document is rather like the pre-First Normal Form set of data fields that users present to you as what they need in their application. Think, for example, of an order consisting of an order header and multiple line items. In the relational paradigm, you’ll make two (or more) tables and join them via foreign keys. In a document paradigm, you’ll keep all the data related to that order in one document.
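The order example above can be sketched in code. This is a minimal illustration using plain Python dictionaries (the field names are illustrative, not from MongoDB itself): the relational paradigm splits the order across two tables joined by a foreign key, while the document paradigm keeps everything together.

```python
# Relational paradigm: two "tables", linked by the order_id foreign key.
order_headers = [
    {"order_id": 1001, "customer": "Acme Corp", "date": "2011-05-10"},
]
order_lines = [
    {"order_id": 1001, "sku": "A-100", "qty": 2},
    {"order_id": 1001, "sku": "B-200", "qty": 1},
]

def fetch_order_relational(order_id):
    """Reassemble a full order: a join-like lookup across both tables."""
    header = next(h for h in order_headers if h["order_id"] == order_id)
    lines = [l for l in order_lines if l["order_id"] == order_id]
    return {**header, "line_items": lines}

# Document paradigm: header and line items live in ONE document,
# so retrieving the complete order needs no join at all.
order_document = {
    "_id": 1001,
    "customer": "Acme Corp",
    "date": "2011-05-10",
    "line_items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}
```

Both shapes hold the same information; the difference is whether reassembly happens at query time (relational) or at write time (document).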
Two characteristics–the lack of a predefined schema and the absence of joins–are very attractive in certain situations, and these turn out to be key design points for MongoDB. The lack of a schema makes it very easy to add new data fields to an existing database without having to reload the old data; so if you are in an emerging industry or application space, especially where data volumes are large, this is very attractive. The absence of joins also plays well for large data volumes; if you have to shard your data over multiple servers, joins can be very expensive. So MongoDB, like most NoSQL tools, plays strongly in the Web space with companies needing fast processing of large volumes of data with emergent processing needs. Sample customers include Shutterfly, foursquare, Intuit, IGN Entertainment, Craigslist and Disney. In many cases, the use of the database would be classed as operational by most BI experts. However, there are some customers using it for real-time analytics, and that leads us to the question of using non-relational databases for BI.
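The schema-flexibility point can be made concrete with a small sketch. Here a plain Python list stands in for a schema-less collection (field names are hypothetical): a new field simply appears on newer documents, with no migration or reload of older ones, and queries cope with its absence explicitly.

```python
# A schema-less "collection": documents of differing shapes coexist.
collection = [
    {"_id": 1, "user": "alice", "photos": 42},             # original shape
    {"_id": 2, "user": "bob", "photos": 7, "videos": 3},   # "videos" added later
]

# Older documents simply lack the new field; a query must allow for that,
# here by defaulting the missing field to zero.
total_videos = sum(doc.get("videos", 0) for doc in collection)
total_photos = sum(doc.get("photos", 0) for doc in collection)
```

The trade-off, of course, is that the burden of handling shape variation moves from a one-off schema migration into every query that touches the data.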
When considering implementing Operational BI solutions, many implementers first think of copying the operational data to an operational data store (ODS), data warehouse or data mart and analysing it there. They are immediately faced with the problem of how to update the informational environment fast enough to satisfy the timeliness requirement of the users. As that approaches real-time, traditional ETL tools begin to struggle. Furthermore, in the case of the data warehouse, the question arises of the level of consistency among these real-time updates and between the updates and the existing content. The way MongoDB is used points immediately to an alternative, viable approach–go directly against the operational data.
As always, there are pros and cons. Avoiding storing and maintaining a second copy of large volumes of data is always a good thing. And if the analysis doesn’t require joining with data from another source, using the original source data can be advantageous. There are always questions about performance impacts on the operational source, and sometimes security implications as well. However, the main question is around the types of query possible against a NoSQL store in general or a document-oriented database in this case. It is generally accepted that normalizing data in a relational database leads to a more query-neutral structure, allowing a wider variety of queries to be handled. On the other hand, as we saw with the emergence of dimensional schemas and now columnar databases, query performance against normalized databases often leaves much to be desired. In the case of Operational BI, however, most experience indicates that the queries are usually relatively simple, and closely related to the primary access paths used operationally for the data concerned. The experience with MongoDB bears this out, at least in the initial analyses users have required.
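The kind of relatively simple, access-path-aligned query described above can be sketched as follows. This stays self-contained by running over in-memory documents rather than a live MongoDB server (a pymongo query would look broadly similar); all names and figures are illustrative.

```python
from collections import Counter

# Operational order documents, queried in place for real-time analytics.
orders = [
    {"_id": 1, "customer": "Acme", "total": 120.0},
    {"_id": 2, "customer": "Acme", "total": 80.0},
    {"_id": 3, "customer": "Globex", "total": 200.0},
]

# Simple queries that follow the primary operational access path (customer):
orders_per_customer = Counter(doc["customer"] for doc in orders)
acme_revenue = sum(doc["total"] for doc in orders if doc["customer"] == "Acme")
```

Note what is absent: no join to a second source, no star schema, no ETL hop. That is precisely why such queries can run directly against the operational store without the informational-environment machinery described earlier.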
I’d be interested to hear details of readers’ experience with analytics use of this and other non-relational approaches.