Wednesday, August 01, 2012

Clustering Through Decision Tree Construction



As is often the case, any idea you have, no matter how novel you think it is, has already been thought of by someone else. Some time ago I wanted to modify a decision tree induction algorithm to do clustering. I thought this could be a new clustering method, but fortunately, after an internet search, I came across a paper describing a decision-tree-based clustering method called CLTree. I like to re-use what other people have already done; it saves me a lot of time. Sadly, people writing papers do not always provide the source code. However, the paper did provide a fairly detailed description of the algorithm, and in this post I present an implementation of it.

Here are some of the benefits of this algorithm as described by the paper:

· CLTree is able to find clusters in the full dimension space as well as in subspaces.
· It provides descriptions of the resulting clusters in terms of hyper-rectangle regions.
· It provides descriptions of the resulting empty (sparse) regions.
· It deals with outliers effectively.

The basic idea is to assign every instance in the dataset the same class, Y. Since a decision tree needs at least two classes to partition a dataset, the data space is then assumed to be uniformly filled with "other" instances of a second class N, called non-existing or virtual instances. Once the N instances are added to the original dataset, separating the dense data regions from the sparse empty regions becomes a classification problem. For the details, please read the paper.
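To make the idea concrete, here is a minimal sketch of that construction, assuming NumPy and scikit-learn are available. This is not the code accompanying this post and not the paper's actual algorithm: it literally samples the virtual N instances and trains an ordinary decision tree on the combined data, whereas CLTree works with the virtual points analytically and uses a modified split criterion, so treat it only as an illustration of the idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cltree_sketch(X, n_virtual=None, seed=0):
    """Illustrative only: label the real points Y (1), add uniformly
    distributed virtual points labelled N (0), and fit a decision tree
    that separates dense regions from sparse ones.
    X is a 2-D NumPy array; parameter choices here are assumptions."""
    rng = np.random.default_rng(seed)
    if n_virtual is None:
        n_virtual = len(X)                       # roughly as many N as Y points
    lo, hi = X.min(axis=0), X.max(axis=0)
    virtual = rng.uniform(lo, hi, size=(n_virtual, X.shape[1]))
    data = np.vstack([X, virtual])
    labels = np.r_[np.ones(len(X), dtype=int),   # 1 = Y (real instance)
                   np.zeros(n_virtual, dtype=int)]  # 0 = N (virtual instance)
    tree = DecisionTreeClassifier(min_samples_leaf=5).fit(data, labels)
    leaf_of_point = tree.apply(X)                # leaf id for each real point
    return tree, leaf_of_point
```

Real points that end up in the same leaf predicted as class Y can be read as belonging to the same dense region, i.e. the same cluster candidate.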

The algorithm found three clusters in the trivial two-dimensional dataset (gen.arff) that you can find with the source code I provide.
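Each of those clusters comes with a hyper-rectangle description, which is one of the benefits listed above: every leaf of the tree corresponds to an axis-aligned region bounded by the split thresholds on its path from the root. Continuing the illustrative scikit-learn sketch above (again, not the code provided with this post), those bounds can be read off a fitted tree like this:

```python
import numpy as np

def leaf_rectangles(tree, n_features):
    """Map each leaf of a fitted sklearn DecisionTreeClassifier to its
    axis-aligned bounds, i.e. the hyper-rectangle describing that region."""
    t = tree.tree_
    rects = {}

    def walk(node, bounds):
        if t.children_left[node] == -1:          # reached a leaf
            rects[node] = [tuple(b) for b in bounds]
            return
        f, thr = t.feature[node], t.threshold[node]
        left = [list(b) for b in bounds]
        right = [list(b) for b in bounds]
        left[f][1] = min(left[f][1], thr)        # left branch:  x[f] <= thr
        right[f][0] = max(right[f][0], thr)      # right branch: x[f] >  thr
        walk(t.children_left[node], left)
        walk(t.children_right[node], right)

    walk(0, [[-np.inf, np.inf] for _ in range(n_features)])
    return rects
```

Calling leaf_rectangles(tree, X.shape[1]) on the tree returned by cltree_sketch gives, for each leaf, a lower and upper bound per dimension; the leaves dominated by class Y describe the dense regions and the remaining leaves describe the empty (sparse) regions.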

Please be aware that this implementation has only been tested on a couple of trivial datasets: one with two dimensions and the other with five (both provided with the source code). It probably will not scale very well to large datasets.


5 comments:

  1. Hi,

    First of all, thanks for the code. I have one doubt regarding the method: I have data without any class information. How can I use this method to cluster it, as the example datasets both have class information in them?

    Thanks

  2. The quickest way is to just label all instances with the same "class" name -- any name. Otherwise, you could modify the code to ignore the Data.class_names variable.

  3. Hi,

    Many thanks for your code.

    I have a question about the dataset D05.arff that you provide for testing. Do the class labels {100,701,101,201,200,601,501,401,300,301,302,-1} have a special meaning?
    Thanks in advance.

  4. How is it different than CART or CHAID?

  5. First, thank you for the source code.
    However, there is one problem.
    >>
    PycharmProjects\CLTree_original\build.py", line 132, in split
    l = dataset.instance_values[0:idx]
    TypeError: only integer scalar arrays can be converted to a scalar index
    >>
    Is there a solution to this problem?
    For reference, I am using Python 3.
