Download: https://github.com/dimitrs/CLTree
As is often the case, any idea you have, no matter how
novel you think it is, has already been thought of by someone else. Some
time ago I wanted to modify a decision tree induction algorithm for clustering.
I thought this could be a new clustering method, but fortunately, after
an internet search, I came across a paper describing a method for clustering
with decision trees, called CLTree. I like to re-use what other people have
already done; it saves me a lot of time. Sadly, people writing papers do not
always provide their source code. However, the paper did give a fairly
detailed description of the algorithm. In this post, I present an
implementation of that algorithm.
Here are some of the benefits of this algorithm, as described
in the paper:
· CLTree is able to find clusters in the full dimension space as well as in subspaces.
· It provides descriptions of the resulting clusters in terms of hyper-rectangle regions.
· It provides descriptions of the resulting empty (sparse) regions.
· It deals with outliers effectively.
The basic idea is to assign every instance in the dataset to a
class Y. Since a decision tree needs at least two classes to partition a dataset,
assume the data space is also uniformly populated with “other” instances of
a class N, called non-existing or virtual instances. By adding these N instances to
the original dataset, partitioning the data into dense data
regions and sparse, empty regions becomes a classification problem. For the
details please read the paper.
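To make the reduction concrete, here is a minimal sketch of the idea. It is not the actual CLTree algorithm: the paper adds the virtual N points only conceptually and uses a modified split criterion, whereas this sketch materializes the points and trains scikit-learn's stock DecisionTreeClassifier:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cltree_sketch(X, seed=0):
    # Real points get class Y (1); uniformly sampled virtual points
    # inside the data's bounding box get class N (0).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    virtual = rng.uniform(lo, hi, size=(n, d))  # the "non-existing" N instances
    data = np.vstack([X, virtual])
    labels = np.r_[np.ones(n), np.zeros(n)]     # Y = 1, N = 0
    # Leaves the tree predicts as Y are dense hyper-rectangles
    # (cluster candidates); leaves predicted as N are the sparse regions.
    return DecisionTreeClassifier(min_samples_leaf=5).fit(data, labels)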
Below, the algorithm found three clusters in the trivial
two-dimensional dataset (gen.arff) included with the source code I
provide.
Please be aware
that this implementation has only been tested on a couple of trivial datasets:
one with two dimensions and the other with five (both provided with the source code).
It probably will not scale very well to large datasets.
Hi,
First of all, thanks for the code. I have one doubt regarding the method: I have data without any class information. How can I use this method to cluster it? The example datasets both have class information in them.
Thanks
The quickest way is to just label all instances with the same "class" name -- any name. Otherwise, you could modify the code to ignore the Data.class_names variable.
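For example, a small throwaway helper along these lines (hypothetical, not part of the repository) rewrites a class-less ARFF file so that every instance carries the same dummy class:

def add_dummy_class(in_path, out_path, class_name="Y"):
    # Declare a single-valued class attribute just before @data and
    # append it to each data row, so the loader sees one class for all.
    with open(in_path) as f:
        lines = f.readlines()
    out, in_data = [], False
    for line in lines:
        stripped = line.strip()
        if stripped.lower().startswith("@data"):
            out.append("@attribute class {%s}\n" % class_name)
            out.append(line)
            in_data = True
        elif in_data and stripped and not stripped.startswith("%"):
            out.append(stripped + "," + class_name + "\n")
        else:
            out.append(line)
    with open(out_path, "w") as f:
        f.writelines(out)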
Hi,
Many thanks for your code.
I have a question about the dataset D05.arff that you provide for testing. Do the class labels {100,701,101,201,200,601,501,401,300,301,302,-1} have a special meaning?
Thanks in advance.
How is it different than CART or CHAID?
First, thank you for the source code.
However, there is one problem:
>>
PycharmProjects\CLTree_original\build.py", line 132, in split
    l = dataset.instance_values[0:idx]
TypeError: only integer scalar arrays can be converted to a scalar index
>>
Is there a solution to this problem?
This is with Python 3, for reference.
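That error usually means idx reaches the slice as a one-element NumPy array rather than a plain integer, which newer NumPy versions refuse to use as a scalar index. Assuming that is the cause here, a minimal fix in split() would be to coerce it first:

import numpy as np

# idx may arrive as a one-element array (e.g. the result of np.where);
# convert it to a plain int before slicing.
idx = int(np.asarray(idx).flat[0])
l = dataset.instance_values[0:idx]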