Wednesday, August 01, 2012

Clustering Through Decision Tree Construction



As is often the case, any idea you have, no matter how novel it seems, has already been thought of by someone else. Some time ago I wanted to modify a decision tree induction algorithm to perform clustering. I thought this could be a new clustering method, but fortunately, after an internet search, I came across a paper describing a decision-tree-based clustering method called CLTree. I like to re-use what other people have already done; it saves me a lot of time. Sadly, people writing papers do not always provide the source code, but this paper did give a fairly detailed description of the algorithm. In this post, I present an implementation of that algorithm.

Here are some of the benefits of this algorithm as described by the paper:

· CLTree is able to find clusters in the full dimension space as well as in subspaces.
· It provides descriptions of the resulting clusters in terms of hyper-rectangle regions.
· It provides descriptions of the resulting empty (sparse) regions.
· It deals with outliers effectively.

The basic idea is to assign every instance in the dataset a class Y. Since a decision tree requires at least two classes to partition a dataset, we then assume that the data space is uniformly filled with “other” instances of a second class N, called non-existing or virtual instances. By adding these N instances to the original dataset, the problem of separating dense data regions from sparse empty regions becomes a classification problem. For the details, please read the paper.
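To make the idea concrete, here is a minimal sketch in Python. It is not the code I provide with this post, nor the paper's exact algorithm: it physically generates the uniform virtual N points and trains an off-the-shelf decision tree on the resulting two-class problem, whereas CLTree adds the virtual points implicitly during tree construction and uses a modified purity criterion plus a pruning step. The function name cltree_sketch and its parameters are purely illustrative.

# Simplified, illustrative sketch of the CLTree idea (not the author's code).
# It generates uniform "virtual" N points explicitly, while the paper computes
# them implicitly, and it omits the paper's modified gain criterion and pruning.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cltree_sketch(X, n_virtual=None, min_leaf=10, random_state=0):
    """Partition the data space with a decision tree trained to separate
    real (Y) points from uniform virtual (N) points. Returns, for each real
    point, the id of the tree leaf it falls into and whether that leaf is
    predicted as dense (class Y)."""
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    if n_virtual is None:
        n_virtual = n  # one virtual point per real point, as a rough default

    # Uniform virtual points spanning the bounding box of the data.
    lo, hi = X.min(axis=0), X.max(axis=0)
    virtual = rng.uniform(lo, hi, size=(n_virtual, d))

    # Class Y (1) for real instances, class N (0) for virtual instances.
    data = np.vstack([X, virtual])
    labels = np.concatenate([np.ones(n, dtype=int),
                             np.zeros(n_virtual, dtype=int)])

    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf,
                                  random_state=random_state)
    tree.fit(data, labels)

    # Each leaf is a hyper-rectangular region; leaves predicted Y are dense.
    leaf_ids = tree.apply(X)
    dense = tree.predict(X) == 1
    return leaf_ids, dense, tree

Grouping the real points by the leaf they fall into, and keeping only the leaves predicted as class Y, gives a rough set of hyper-rectangular dense regions; in CLTree the pruning and region-merging steps then turn such regions into the final clusters.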

Below, the algorithm found three clusters in the trivial two-dimensional dataset (gen.arff) that is included with the source code I provide.

Please be aware that this implementation has only been tested on a couple of trivial datasets: one with two dimensions and the other with five (both provided with the source code). It probably will not scale very well to large datasets.