Isn't In-database processing old news yet?

A bit of a Clementine plug, but hear me through…

I’m puzzled by a few recent articles I’ve read describing in-database processing, the practice of doing very sophisticated data warehouse analysis (lets call it data mining 🙂 on large amounts of data without having to extract the data into an external analytics tool (for example, a tool like SPSS or SAS).

As an example see the current Teradata magazine article. I was fortunate enough to spend a …

A bit of a Clementine plug, but hear me through…

As an example see the current Teradata magazine article. I was fortunate enough to spend a few evenings chatting with Stephen Brobst (chief technology officer of Teradata) on these topics during a Teradata conference in Beijing last June 2008 *. I think he’s right on the money concerning his top 4 predictions for data warehousing. As a Data Miner I am concerned with how I might be expected to analyse the data, so in-database processing is the biggest topic for me. I’m not so sure it is a ‘future’ thing though. In my view its here now, just maybe not so widespread. My only guess is that it’s another plug for the SAS partnership. Although I don’t use SAS I do like the thinking and development plan going forward. I simply don’t think its the new concept it being touted as. I’m sure its not necessary for a data miner or data analyst to need custom plug-ins (and corresponding expense) to reach in-database data mining nirvana.

In-database processing is nothing new. I’ve doing it using SPSS Clementine and Teradata for a few years now. SPSS Clementine has supported this functionality for quite a few years. In real-time Clementine will convert the stream (a graphical icon-based proprietary query file) into SQL (structured query language) and submits the SQL query(s) to the data warehouse. Any computation that cannot be represented as SQL will cause a data extraction and further processing by the Clementine analysis engine itself (commonly the Clementine Server on a dedicated analytical server box will do this, and keep the data and temp files on the server file system. Not the desktop). In practice I usually avoid heavy statistical functions and all my data processing is usually performed in the Teradata warehouse and only the data sample required to build a predictive or clustering model is extracted. When it comes to scoring the created data mining model (such as a neural network or decision tree) Clementine also converts the data mining model into SQL transparently for truly high scale processing on the data warehouse.

The real advantage comes from not just being able to score existing data mining models, but also being able to build predictive models entirely in the data warehouse, and this is a comparatively new development (a couple of years). Not something I do much of (I’ve done it for fun on my home SQL Server, but not in a corporate production environment). If the data warehouse provides the capability to create neural networks or clustering models (which some now do) then there is no need to ever extract data from the data warehouse into an external analysis application such as SPSS or SAS. More data can therefore be used to build models, and this usually beats tweaking algorithm options.

The data warehouses have actually supported embedded code and adding custom functions that might include a data mining algorithm for quite a while. For info see a recent post by Seth Grimes titled In-Database Analytics: A Passing Lane for Complex Analysis.

Only in the past year or so have full blown embedded data mining algorithms taken off. The question is though will these algorithms always run fast(er)? Custom code can be good or bad! One advantage of the ‘algorithms converted into SQL’ route is that the data warehouse can quite easily determine and control how to process and prioritise the SQL query and be optimised specifically for it. Custom code and embedded data mining algorithms can also be optimised, but I’m guessing that requires far more effort (and expense!). One worry is also that custom code brings dangers and risks (not to mention the testing and issues for IT and the DB admin). Still, its necessary for in-database data mining model building capability.

Ok, I’m guessing some of my peers might know this stuff anyway, but one thing has recently occurred to me; considering that once we have data processing, model building and model scoring all occurring within the data warehouse, what need have we for the data mining tool?

My preference is for a easy tool that makes querying the data warehouse and constructing highly complex analysis easy. My queries could not possibly be prepared by hand since they are often transformed into many thousands of lines of SQL code. I use a clever user interface to make understanding the logic of the analysis possible. The data mining tool I use primarily is a tool that optimises my interaction with the data warehouse.

So considering these things, my current view is that data mining applications/tools such as SPSS Clementine (or even SAS 🙂 will stick around for quite a bit longer because the user interface optimises a data miner’s ability to query the data warehouse and perform data mining efficiently, but maybe in a few years we start to see what we commonly refer to as data mining ‘algorithms’ being developed for data warehouses and no longer in data mining tools (or simply as plug-ins for data warehouses). An interesting thought indeed!

– Tim

* Whilst at the Teradata User Conference in Beijing I presented some data mining analysis work, mostly my data analysis work on churn prediction and product upsell in telcommunications, and chatted to the China mobile analysts afterward. On a more personal note, that is also when my soon-to-be-born son was conceived. Don’t worry, my wife was with me at the time! In true Hollywood fashion I thought it appropriate we therefore name him ‘Beijing’ or ‘Teradata’ but my wife doesn’t share my enthusiasm 🙂

Link to original post