Here's an interesting use of R outside the "usual" statistics domains: using advanced analytics to estimate how long a typical content-management system (CMS) remains in use. Some industry analysts cite a lifetime of 3 years, but can that estimate be backed up with data? To investigate, Michael Marth uses the customer records from a CMS provider and looks at how long their support contracts were maintained (as a proxy for the system actually being in use). These data require a special kind of analysis, so let's take a look in detail.
In the data, some of the contracts are still active: for example, the customer took out a support contract 4 years ago, and the contract has not yet been terminated. In statistics, this is called a right-censored data point: we know the contract will terminate eventually, but as of today, we don't know when. We do know that when it does terminate, it will have lasted at least 4 years. A naive analysis would just include this data point with a duration of 4 years, but that would bias the estimated average lifetime downwards. By the same token, we can't just ignore this data point either (not least because it would waste much of our data!).
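To make the bias concrete, here is a minimal Python sketch using made-up numbers (none of them come from Michael Marth's data). Each hypothetical contract has a true lifetime that we pretend to know, plus the number of years since it started; a contract is right-censored when its true lifetime exceeds the time elapsed so far.

```python
# Hypothetical contracts: (years_since_start, true_lifetime_in_years).
# In real data the true lifetime of an active contract is unknown;
# we pretend to know it here only so the bias can be measured.
contracts = [(6, 2), (6, 8), (4, 1), (4, 7), (2, 3), (2, 5)]


def observe(elapsed, lifetime):
    """Return (observed_duration, ended) as seen on the analysis date."""
    if lifetime <= elapsed:
        return lifetime, True   # contract already terminated: fully observed
    return elapsed, False       # still active: right-censored at `elapsed`


observed = [observe(elapsed, lifetime) for elapsed, lifetime in contracts]

true_mean = sum(lifetime for _, lifetime in contracts) / len(contracts)

# Naive approach 1: treat every censored contract as if it ended today.
naive_mean = sum(duration for duration, _ in observed) / len(observed)

# Naive approach 2: throw the censored contracts away.
ended_only = [duration for duration, ended in observed if ended]
dropped_mean = sum(ended_only) / len(ended_only)

print(f"true mean:              {true_mean:.2f}")     # 4.33
print(f"censored counted as-is: {naive_mean:.2f}")    # 2.83
print(f"censored dropped:       {dropped_mean:.2f}")  # 1.50
```

In this toy example both shortcuts underestimate the true mean, and dropping the active contracts also discards four of the six records.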
Fortunately, statistics (and R) comes to the rescue with a technique called survival analysis. As the name suggests, it originated in the medical field, where the goal was to identify medical treatments that prolong life (lung cancer treatments, for example) without having to wait for all the patients to die before identifying a life-saving treatment. In that situation, the analysis includes data from some patients who lived some years after treatment and then died, but also includes patients who had the treatment some years ago and remain alive today. You can apply the same technique to any kind of duration data where the ultimate duration may not be known at the time of analysis. Examples include: time to failure of a machine component, time to resolution of a customer support issue, and duration of a service contract.
One of the useful things about survival analysis is that you don't simply get an estimate of the average contract time (in our original example): you can also find out what percentage of contracts last at least 3 years (or any other duration), and get error bars on that estimate to boot. R calculates this readily using the "survfit" function from the survival package (bundled with standard R distributions), and plotting the result gives a chart called a Kaplan-Meier chart.
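For readers curious about what sits behind such a chart, here is a from-scratch Python sketch of the Kaplan-Meier estimator with Greenwood's formula for the standard error. It is an illustration only, not R's implementation: the function names and the eight data points are made up, and the plain "estimate plus or minus 1.96 standard errors" interval is a simplification (survfit offers other interval types).

```python
import math
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival, std_error)] at each distinct event time.

    durations: observed time for each subject
    events:    True if the event (e.g. contract termination) was observed,
               False if the subject is right-censored at that time
    """
    ended = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)   # everyone leaves the risk set at their time
    at_risk = len(durations)
    survival, greenwood = 1.0, 0.0
    curve = []
    for t in sorted(leaving):
        d = ended.get(t, 0)
        if d > 0:
            survival *= 1 - d / at_risk
            if at_risk > d:        # avoid dividing by zero when all remaining end
                greenwood += d / (at_risk * (at_risk - d))
            curve.append((t, survival, survival * math.sqrt(greenwood)))
        at_risk -= leaving[t]      # ended and censored subjects both drop out
    return curve


def interval_at(curve, t, z=1.96):
    """Survival estimate at time t with a plain (untransformed) interval."""
    s, se = 1.0, 0.0
    for time, surv, err in curve:
        if time > t:
            break
        s, se = surv, err
    return s, max(0.0, s - z * se), min(1.0, s + z * se)


# Made-up durations in years; False marks a still-active (censored) contract.
durations = [1, 2, 2, 3, 4, 5, 6, 6]
events = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(durations, events)
estimate, lower, upper = interval_at(curve, 3)
print(f"P(contract lasts beyond 3 years) = {estimate:.2f}")  # 0.60
print(f"approximate 95% interval: ({lower:.2f}, {upper:.2f})")
```

Censored contracts never pull the curve down; they only shrink the number of contracts still "at risk" after their censoring time, which is how the method uses them without pretending they have ended.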
The chart looks complicated, but it's easy to read with practice. Look along the horizontal axis to choose a time period: let's choose 1100 days (about 3 years). Now look upwards to find the solid line between the two dashed lines, and check its position on the vertical axis; I read it to be about 0.75. This indicates that 75% of support contracts (plus or minus about 10%) last at least three years, much better than the average duration of 3 years indicated by the analyst. In fact, according to this analysis the mean survival time is 6.75 years.
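A mean survival time like this can be read off a survival curve as the area underneath it. The Python sketch below shows that calculation for a made-up step curve (the numbers are illustrative and are not taken from the chart discussed here). Because a curve that ends with censored observations never reaches zero, the area is computed up to a chosen horizon, giving what is usually called a restricted mean; the function name and the horizon are assumptions for the example.

```python
def restricted_mean(curve, horizon):
    """Area under a survival step function between time 0 and `horizon`.

    curve: [(time, survival)] sorted by time; survival is 1.0 before the
           first step and stays at each value until the next step.
    """
    area = 0.0
    prev_time, prev_survival = 0.0, 1.0
    for time, survival in curve:
        if time >= horizon:
            break
        area += prev_survival * (time - prev_time)
        prev_time, prev_survival = time, survival
    area += prev_survival * (horizon - prev_time)  # last flat stretch
    return area


# Hypothetical Kaplan-Meier steps: (years, fraction of contracts still active).
steps = [(1, 0.875), (2, 0.75), (3, 0.6), (5, 0.4)]

mean_years = restricted_mean(steps, horizon=6)
print(f"restricted mean lifetime over 6 years: {mean_years:.3f}")  # 4.225
```

The choice of horizon matters: a longer horizon can only increase the restricted mean, so it is worth reporting the two together.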
I've seen plenty of such analyses done the "naive" way in Excel, and you can see why this might lead you to the wrong conclusion. If you have data like this, it's worth taking a look at R to get better information out of censored data.