Wednesday, August 01, 2012

Clustering Through Decision Tree Construction



As is often the case, any idea you have, no matter how novel you think it is, has already been thought of by someone else. Some time ago I wanted to modify a decision tree induction algorithm to do clustering. I thought this could be a new clustering method, but fortunately, after an internet search, I came across a paper describing a decision-tree-based clustering method called CLTree. I like to re-use what other people have already done; it saves me a lot of time. Sadly, people writing papers do not always provide the source code. However, the paper did provide a fairly detailed description of the algorithm, and in this post I present an implementation of it.

Here are some of the benefits of this algorithm as described by the paper:

· CLTree is able to find clusters in the full dimension space as well as in subspaces.
· It provides descriptions of the resulting clusters in terms of hyper-rectangle regions.
· It provides descriptions of the resulting empty (sparse) regions.
· It deals with outliers effectively.

The basic idea is to assign every instance in the dataset the same class, Y. Since a decision tree needs at least two classes to partition a dataset, the data space is then assumed to be uniformly filled with "other" instances of a second class N, called non-existing or virtual instances. Once the N instances are added to the original dataset, separating the dense data regions from the sparse empty regions becomes a classification problem. For the details, please read the paper.
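To make the idea concrete, here is a minimal sketch of that construction, assuming NumPy and scikit-learn are available. This is not the code accompanying this post and not the paper's actual algorithm: it literally samples the virtual N instances and trains an ordinary decision tree on the combined data, whereas CLTree works with the virtual points analytically and uses a modified split criterion, so treat it only as an illustration of the idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cltree_sketch(X, n_virtual=None, seed=0):
    """Illustrative only: label the real points Y (1), add uniformly
    distributed virtual points labelled N (0), and fit a decision tree
    that separates dense regions from sparse ones.
    X is a 2-D NumPy array; parameter choices here are assumptions."""
    rng = np.random.default_rng(seed)
    if n_virtual is None:
        n_virtual = len(X)                       # roughly as many N as Y points
    lo, hi = X.min(axis=0), X.max(axis=0)
    virtual = rng.uniform(lo, hi, size=(n_virtual, X.shape[1]))
    data = np.vstack([X, virtual])
    labels = np.r_[np.ones(len(X), dtype=int),   # 1 = Y (real instance)
                   np.zeros(n_virtual, dtype=int)]  # 0 = N (virtual instance)
    tree = DecisionTreeClassifier(min_samples_leaf=5).fit(data, labels)
    leaf_of_point = tree.apply(X)                # leaf id for each real point
    return tree, leaf_of_point
```

Real points that end up in the same leaf predicted as class Y can be read as belonging to the same dense region, i.e. the same cluster candidate.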

The algorithm found three clusters in the trivial two-dimensional dataset (gen.arff) that you can find with the source code I provide.
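Each of those clusters comes with a hyper-rectangle description, which is one of the benefits listed above: every leaf of the tree corresponds to an axis-aligned region bounded by the split thresholds on its path from the root. Continuing the illustrative scikit-learn sketch above (again, not the code provided with this post), those bounds can be read off a fitted tree like this:

```python
import numpy as np

def leaf_rectangles(tree, n_features):
    """Map each leaf of a fitted sklearn DecisionTreeClassifier to its
    axis-aligned bounds, i.e. the hyper-rectangle describing that region."""
    t = tree.tree_
    rects = {}

    def walk(node, bounds):
        if t.children_left[node] == -1:          # reached a leaf
            rects[node] = [tuple(b) for b in bounds]
            return
        f, thr = t.feature[node], t.threshold[node]
        left = [list(b) for b in bounds]
        right = [list(b) for b in bounds]
        left[f][1] = min(left[f][1], thr)        # left branch:  x[f] <= thr
        right[f][0] = max(right[f][0], thr)      # right branch: x[f] >  thr
        walk(t.children_left[node], left)
        walk(t.children_right[node], right)

    walk(0, [[-np.inf, np.inf] for _ in range(n_features)])
    return rects
```

Calling leaf_rectangles(tree, X.shape[1]) on the tree returned by cltree_sketch gives, for each leaf, a lower and upper bound per dimension; the leaves dominated by class Y describe the dense regions and the remaining leaves describe the empty (sparse) regions.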

Please be aware that this implementation has only been tested on a couple of trivial datasets: one with two dimensions and the other with five (both provided with the source code). It probably will not scale very well to large datasets.


5 comments:

  1. Hi,

    First of all, thanks for the code. I have one doubt regarding the method: I have data without any class information. How can I use this method to cluster it, as the example datasets both have class information in them?

    Thanks

  2. The quickest way is to just label all instances with the same "class" name -- any name. Otherwise, you could modify the code to ignore the Data.class_names variable.

  3. Hi,

    Many thanks for your code.

    I have a question about the dataset D05.arff that you provide for testing. Do the class labels {100,701,101,201,200,601,501,401,300,301,302,-1} have a special meaning?
    Thanks in advance.

  4. How is it different than CART or CHAID?

  5. First, thank you for the source code.
    However, there is one problem.
    >>
    PycharmProjects\CLTree_original\build.py", line 132, in split
    l = dataset.instance_values[0:idx]
    TypeError: only integer scalar arrays can be converted to a scalar index
    >>
    Is there a solution to this problem?
    For reference, I am using Python 3.
