Download: https://github.com/dimitrs/CLTree
As is often the case, any idea you may have, no matter how
novel you may think it is, has already been thought of by someone else. Some
time ago I wanted to modify a decision tree induction algorithm for clustering.
I thought this could be a new clustering method, but fortunately, after
an internet search, I came across a paper describing a method for clustering
with decision trees, called CLTree. I like to re-use what other people have
already done – it saves me a lot of time. Sadly, people writing papers do not
always provide the source code. However, the paper did provide a fairly
detailed description of the algorithm. In this post, I present an
implementation of that algorithm.
Here are some of the benefits of this algorithm as described
by the paper:
· CLTree is able to find clusters in the full dimension space as well as in subspaces.
· It provides descriptions of the resulting clusters in terms of hyper-rectangle regions.
· It provides descriptions of the resulting empty (sparse) regions.
· It deals with outliers effectively.
The basic idea is to assign each instance in the dataset the class Y. Since a
decision tree requires at least two classes to partition a dataset, assume that
the data space is also uniformly populated with "other" instances of a class N,
called non-existing or virtual instances. By adding the N instances to
the original dataset, the problem of partitioning the dataset into dense data
regions and sparse empty regions becomes a classification problem. For the
details please read the paper.
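To make the Y/N trick concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on a synthetic two-blob dataset. This is not the CLTree implementation itself, and unlike the paper it physically samples the virtual N points purely for illustration; it only shows how the classification framing separates dense regions from sparse ones:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two small dense blobs standing in for a real dataset (class Y).
X_real = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(100, 2)),
])

# Uniform "virtual" points over the bounding box of the data (class N).
lo, hi = X_real.min(axis=0), X_real.max(axis=0)
X_virtual = rng.uniform(lo, hi, size=X_real.shape)

X = np.vstack([X_real, X_virtual])
y = np.array([1] * len(X_real) + [0] * len(X_virtual))  # 1 = Y, 0 = N

# The tree partitions the space into dense (Y) and sparse (N) regions;
# each leaf predicting Y is a hyper-rectangle covering a dense region.
tree = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)
print(tree.get_n_leaves(), "leaf regions found")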
Below, the algorithm found three clusters in the trivial
two-dimensional dataset (gen.arff) that you can find with the source code I
provide.
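If you want to eyeball the dataset yourself before running the clustering, a quick way is to load the ARFF file and plot it. The file name gen.arff comes from the repository; the attribute layout is assumed, so adjust the column selection if it differs:

from scipy.io import arff
import pandas as pd
import matplotlib.pyplot as plt

# Load the ARFF file shipped with the source code.
data, meta = arff.loadarff("gen.arff")
df = pd.DataFrame(data)
print(meta)  # attribute names and types

# Plot the first two numeric attributes to see the clusters by eye.
num = df.select_dtypes("number")
plt.scatter(num.iloc[:, 0], num.iloc[:, 1], s=10)
plt.show()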
Please be aware
that this implementation has only been tested on a couple of trivial datasets:
one with two dimensions and the other with five (both provided with the source code).
It probably will not scale very well to large datasets.