Which Data Mining Algorithm Is Right For You?

Choosing a data mining algorithm is not an easy task. According to the “Data Mining Guide”, if you’re just starting out, it’s probably a good idea to experiment with several techniques to get a feel for how they work. Your choice of algorithm will depend upon:

the data you’ve gathered,
the problem you’re trying to solve,
the computing tools you have available to you.
Let’s take a brief look at four of the more popular algorithms.

1. Regression

Regression is the oldest and most well-known statistical technique that the data mining community utilizes. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you’re ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula and you’ve got a prediction! The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). If you’re working with categorical data where order is not significant (like color, name or gender) you’re better off choosing another technique.
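To make the "fit a formula, then plug in new data" idea concrete, here is a minimal sketch of simple linear regression using ordinary least squares. The function names (fit_line, predict) and the toy age/weight numbers are made up for illustration; they do not come from any particular tool.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Plug a new value into the fitted formula."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data: ages (years) and weights (kg).
ages = [20, 30, 40, 50]
weights = [60.0, 65.0, 70.0, 75.0]

model = fit_line(ages, weights)
print(predict(model, 60))
```

Real datasets are rarely this clean, and most tools fit several input variables at once, but the workflow is the same: fit on known data, then evaluate the formula on new data.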

2. Classification

Working with categorical data or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique is capable of processing a wider variety of data than regression and is growing in popularity. You’ll also find output that is much easier to interpret. Instead of the complicated mathematical formula given by the regression technique, you’ll receive a decision tree that requires a series of binary decisions. Popular decision tree algorithms include CART and C4.5. (k-means, which is sometimes listed as a classification algorithm, is actually a clustering algorithm: it groups unlabeled data rather than predicting known categories.) Take a look at the Classification Trees chapter from the Electronic Statistics Textbook for in-depth coverage of this technique.
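As a rough illustration of what "a series of binary decisions" looks like, here is a small sketch that applies a decision tree to records mixing categorical and numeric fields. The tree, the field names (outlook, humidity) and the labels are hypothetical and written by hand; a real classification algorithm would learn the tree from training data.

```python
# Hypothetical hand-built tree. Each internal node asks one yes/no question;
# each leaf is a class label. A real tool would learn this from data.
TREE = {
    "question": ("outlook", "==", "sunny"),      # categorical test
    "yes": {
        "question": ("humidity", "<=", 70),      # numeric test
        "yes": "play",
        "no": "stay home",
    },
    "no": "play",
}


def answer(question, record):
    """Evaluate one binary question against a record."""
    feature, op, value = question
    if op == "==":
        return record[feature] == value
    return record[feature] <= value


def classify(node, record):
    """Follow yes/no branches until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if answer(node["question"], record) else node["no"]
    return node


print(classify(TREE, {"outlook": "sunny", "humidity": 85}))
```

The appeal for interpretation is that you can read the path taken for any record and see exactly which questions led to the prediction.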

3. Neural Networks

Neural networks have seen an explosion of interest over the last few years, and are being successfully applied across an extraordinary range of problem domains, in areas as diverse as finance, medicine, engineering, geology and physics. Indeed, anywhere that there are problems of prediction, classification or control, neural networks are being introduced. This sweeping success can be attributed to a few key factors:

Power. Neural networks are very sophisticated modeling techniques capable of modeling extremely complex functions. In particular, neural networks are nonlinear. For many years linear modeling has been the commonly used technique in most modeling domains, since linear models have well-known optimization strategies. Where the linear approximation was not valid (which was frequently the case), the models suffered accordingly. Neural networks also keep in check the curse of dimensionality problem that bedevils attempts to model nonlinear functions with large numbers of variables.
Ease of use. Neural networks learn by example. The neural network user gathers representative data, and then invokes training algorithms to automatically learn the structure of the data. Although the user does need to have some heuristic knowledge of how to select and prepare data, how to select an appropriate neural network, and how to interpret the results, the level of user knowledge needed to successfully apply neural networks is much lower than would be the case using (for example) some more traditional nonlinear statistical methods.
Neural networks are also intuitively appealing, based as they are on a crude low-level model of biological neural systems. In the future, the development of this neurobiological modeling may lead to genuinely intelligent computers.
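To show what "learning by example" means in the simplest possible case, here is a sketch of a single perceptron, the most basic neural unit, trained on the logical AND function. Keep in mind that this is an assumption-laden toy: one unit on its own can only separate classes with a straight line, and the nonlinear power described above comes from stacking many units in layers with nonlinear activations. The function names and the training data are illustrative only.

```python
def step(value):
    """Threshold activation: fire (1) only for positive input."""
    return 1 if value > 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(examples, learning_rate=1, max_epochs=100):
    """Adjust weights after every misclassified example."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                # Nudge the weights toward the correct answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every example is now classified correctly
            break
    return weights, bias


# Representative data: the truth table of logical AND.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
for inputs, target in AND_EXAMPLES:
    print(inputs, predict(weights, bias, inputs), target)
```

Nobody tells the unit the rule for AND; it only sees input/output pairs and corrects itself when it gets one wrong.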

4. Evolutionary Computation

Evolutionary algorithms borrow the powerful design philosophy of natural evolution to find solutions to hard problems. Generally speaking, evolutionary techniques can be viewed either as search methods or as optimization techniques. An evolutionary algorithm (EA) is a stochastic search method based on abstractions of the processes of Darwinian evolution. An EA maintains a population of “individuals”, each of them a candidate solution to a given problem. Each individual is evaluated by a fitness function, which measures the quality of its corresponding candidate solution. Individuals evolve towards better and better individuals via a selection procedure based on natural selection (survival of the fittest) and operators based on genetics (crossover and mutation). In essence, the crossover operator swaps genetic material between individuals, whereas the mutation operator changes the value of a “gene” (a small part of the genetic material of an individual) to a new random value. Genetic algorithms (GAs) are the most popular paradigm of evolutionary algorithms.
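The loop described above (population, fitness, selection, crossover, mutation) can be sketched in a few lines. This is only a toy genetic algorithm under assumptions of my own: individuals are lists of 0/1 genes, fitness is simply the number of 1s, selection is a small tournament, and the best individual is carried over unchanged each generation. All names and parameter values are hypothetical.

```python
import random


def fitness(individual):
    """Toy fitness: count the 1-genes (more is better)."""
    return sum(individual)


def crossover(a, b, rng):
    """Swap genetic material: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(individual, rate, rng):
    """With a small probability, replace a gene by a new random value."""
    return [rng.randint(0, 1) if rng.random() < rate else gene
            for gene in individual]


def tournament(population, rng, size=3):
    """Selection: the fittest of a few randomly drawn individuals wins."""
    return max(rng.sample(population, size), key=fitness)


def evolve(genes=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genes)]
                  for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in population)]
    for _ in range(generations):
        # Keep the current best unchanged so quality never gets worse.
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            child = crossover(tournament(population, rng),
                              tournament(population, rng), rng)
            children.append(mutate(child, mutation_rate, rng))
        population = children
        history.append(max(fitness(ind) for ind in population))
    return max(population, key=fitness), history


best, history = evolve()
print(history[0], history[-1], best)
```

For a real problem you would replace the gene encoding and the fitness function with something that describes your own candidate solutions; the surrounding loop stays much the same.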



Top 10 Data Mining Mistakes

Some of you may have read this white paper before, but I want to add it here as a resource for future data mining beginners. The paper is an excerpt from the book “Handbook of Statistical Analysis and Data Mining Applications”, Elsevier (ISBN: 978-0-123747655). According to the authors, mining data to extract useful and enduring patterns remains a skill that is arguably more art than science. In the paper, they briefly describe, and illustrate with examples, what they believe are the “Top 10” mistakes of data mining in terms of frequency and seriousness.

Top 10 DM Mistakes (white paper)

0. Lack of Data (important too!)
1. Focus on Training
2. Rely on One Technique
3. Ask the Wrong Question
4. Listen (Only) to the Data
5. Accept Leaks from the Future
6. Discount Pesky Cases
7. Extrapolate
8. Answer Every Inquiry
9. Sample Casually
10. Believe the Best Model

I would like to emphasize mistake no. 2 (Rely on One Technique), which I think is important for us to consider. In a data mining task, it is important to try several modeling algorithms to make sure that we get the best result. Look for new algorithms and tools that are available on the market (it is sometimes worth reading recent conference and journal publications to learn about the latest improvements) to mine your data. The popular “No Free Lunch” (NFL) theorem states that no single algorithm is the best one for all problems!
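Here is one way the "try more than one technique" advice might look in practice: train each candidate on the same data and compare them on examples that were held back from training. This is only a sketch with two deliberately simple candidates (a majority-class baseline and a 1-nearest-neighbour rule) and a made-up one-dimensional dataset; none of it comes from the white paper.

```python
from collections import Counter


def train_majority(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def train_nearest_neighbour(train):
    """Predict the label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical toy data: (measurement, label) pairs.
train = [(1, "a"), (2, "a"), (3, "a"), (8, "b"), (9, "b")]
held_out = [(1.5, "a"), (2.5, "a"), (8.5, "b"), (7.5, "b")]

candidates = {
    "majority class": train_majority,
    "1-nearest neighbour": train_nearest_neighbour,
}

# Score every candidate on data it did not see during training.
scores = {name: accuracy(fit(train), held_out)
          for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The winner on this toy data says nothing about other datasets, which is exactly the point of the No Free Lunch idea: the comparison has to be repeated for each new problem.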


Data Mining Trends (2004-2010): By Country, City and Language

I had a look at the current trend (using Google Trends) for data mining by country (search volume), city (traffic source) and language (search language). Although I am not surprised that most of the interest comes from India, what interests me is that Iranians are also joining the data mining community year by year!

So have a look: Data Mining Trend (2004-2010)

Country Ranking (Search Volume)

1. India
2. Pakistan
3. Taiwan
4. Hong Kong
5. Iran
6. Indonesia
7. Singapore
8. South Korea
9. Malaysia
10. Thailand

City Ranking (Traffic Volume)

1. Chennai (India)
2. Mahape (India)
3. Bangalore (India)
4. Delhi (India)
5. Mumbai (India)
6. Taipei (Taiwan)
7. Hong Kong (Hong Kong)
8. Singapore (Singapore)
9. Jakarta (Indonesia)
10. Bangkok (Thailand)

Language Ranking (Search Language)

1. Indonesian
2. Korean
3. Thai
4. English
5. Chinese
6. Portuguese
7. Russian
8. Arabic
9. Italian
10. German

 


Watch Online Data Mining Tutorial with Weka

Hi, if you would rather not learn data mining tasks by reading, I recommend watching this online video series (also downloadable as MP4) on practical data mining using Weka (an open source data mining tool written in Java). The presenter shows you step by step how to install Weka, select a real-world data source, clean the data, load the data into Weka, select a learning algorithm and interpret the result (model). The currently available data mining videos are:

  • text mining
  • neural network
  • clustering
  • cluster and neural network
  • filtering tool
  • experimenter (weka module)

I bet you can learn the tasks faster than you expect (at least it works for me).

Watch Online Data Mining Tutorial with Weka.
