When some models are significantly better than others

I’m not a statistician, nor have I played one on TV. That’s not to say I’m not a big fan of statistics. In the age-old debate between data mining and statistics, there is much to say on both sides of the aisle. While I find much of this debate unnecessary, and the conflicts have arisen as much over terminology as over the actual concepts, there are some areas where I have found a sharp divide.

One of these areas is the idea of significance. Most statisticians who excel in their craft that I have spoken with are well-versed in discussions of p-values, t-values, and confidence intervals. Most data miners, on the other hand, have probably never heard of these, or even if they have, never use them. Aside from the good reasons to use or not use these kinds of metrics, I think it typifies an interesting phenomenon in the data mining world, which is the lack of measures of significance. I want to consider that issue in the context of model selection: how does one assess whether or not two models are different enough so that there are compelling reasons to select one over the other?

One example of this is what one sees when using a tool like Affinium Model (Unica Corporation), a tool I like very much. If you are building a binary classification model, it will automatically build dozens, hundreds, potentially even thousands of models of all sorts (regression, neural networks, C&RT trees, CHAID trees, Naïve Bayes). After the models have been built, you get a list of the best models, sorted by whatever metric you have chosen (typically area under the lift curve or response rate at a specified file depth). All of this is great. The table below shows a sample result:

Model           Rank   Total Lift   Algorithm
NeuralNet1131     1      79.23%     Backpropagation Neural Network
NeuralNet1097     2      79.20%     Backpropagation Neural Network
NeuralNet1136     3      79.18%     Backpropagation Neural Network
NeuralNet1117     4      79.10%     Backpropagation Neural Network
NeuralNet1103     5      79.09%     Backpropagation Neural Network
Logit774          6      78.91%     Logistic Regression
Bayes236          7      78.50%     Naive Bayes
LinReg461         8      78.48%     Linear Regression
CART39            9      75.75%     CART
CHAID5           10      75.27%     CHAID

Yes, the Neural Network model (NeuralNet1131) has won the competition and has the best total lift. But the question is this: is it significantly better than the other models? (Yes, linear regression was one of the options for a binary classification model—and this is a good thing, but a topic for another day). How much improvement is significant? There is no significance test applied here to tell us this.
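To make the question concrete, here is a minimal sketch of one test a tool could apply: McNemar's test, which compares two classifiers scored on the same validation records by looking only at the records on which they disagree. The discordant counts below are hypothetical, invented purely for illustration; this is a sketch of the idea, not anything Affinium Model actually computes.

```python
from math import sqrt, erf

def mcnemar_z(b, c):
    """McNemar's test (normal approximation) for two classifiers
    scored on the same validation records.
    b = records model A classified correctly and model B incorrectly
    c = records model B classified correctly and model A incorrectly
    Returns (z statistic, two-sided p-value)."""
    z = (b - c) / sqrt(b + c)
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Hypothetical discordant counts for, say, the top neural network
# versus the logistic regression:
z, p = mcnemar_z(130, 110)
```

With these made-up counts the p-value comes out around 0.2, so a gap of a few hundredths of a percent in total lift could easily be noise rather than a compelling reason to prefer one model over the other.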


What is Big Data?

One of my favorite terms at the moment is “Big Data”. While all terms are by nature subjective, in this post I will try to explain what Big Data means to me.

So what is Big Data?

Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to think seriously about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume, this isn’t quite the case. Many existing technologies have little problem physically handling large volumes (terabytes or petabytes) of data. Instead, the Big Data challenges result from the combination of volume and our usage demands on that data. And those usage demands are nearly always tied to timeliness.

Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are, of course, relative and constantly changing; right now we are somewhere along the path towards the end goal: the ability to handle an unlimited volume of data, processing all requests in real time.

So what are Big Data technologies?

More than at any point in the past, data-related technologies are the focus of research and innovation. But Big Data challenges won’t be solved anytime soon by a single approach. Keeping in mind all the platforms Big Data is having an impact on (web, cloud, enterprise, mobile), combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization), as well as the characteristic Big Data requirements (volume, timeliness, availability, consistency), it is easy to see that no single technology will provide a cover-all solution for this eclectic mix of needs. Instead, a broad set of technologies, each focused on meeting a specific set of needs, is improving our ability to manage data at scale.

A few common areas of innovation that I describe as Big Data technologies include:

MPP Analytics
Cloud Data Services
Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive)
In-Memory Databases
Distributed NoSQL databases and some Distributed Transaction Processing databases
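To give a flavor of the Map/Reduce model mentioned above: the canonical example is word counting, where a map step emits (key, value) pairs and a reduce step aggregates them per key. The toy sketch below runs in a single Python process with no Hadoop API at all; in a real cluster the map, shuffle, and reduce phases would run distributed across many machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group the pairs by key (word); sorting first because
    # itertools.groupby only groups consecutive equal keys.
    # Reduce: sum the counts within each group.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["big data big ideas", "data at scale"]
counts = dict(reduce_phase(map_phase(docs)))
```

The appeal of the model is that map and reduce are both embarrassingly parallel, which is what lets frameworks like Hadoop spread the work across commodity hardware.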

So what is the point of Big Data?

Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”. While up-sell and targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field results in achievements more significant than these.

When describing the point of Big Data I like to think about how the Internet has changed my life in general. By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before. However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form. To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives. This is what those working in Big Data are setting out to achieve.
