Author Topic: K-Means initialization (newbie question)  (Read 286 times)
Geoffrey
Newbie
*
Posts: 2


« on: January 28, 2012, 11:49:45 PM »

Hi All,

Being new to both data mining and RapidMiner, I have run into a question regarding the k-means clustering algorithm.

First of all, from my theory book (Introduction to Data Mining by Tan, Steinbach & Kumar) I learned that choosing the initial centroids for k-means is essential for the success of the algorithm. With the examples in the book, I understand that. But I would like to learn how to do this in practice using RapidMiner.

Is there a way to set the initial centroids? I don't see any parameter for it on the k-means operator.

Am I misunderstanding the theory? Or is this something RapidMiner just doesn't support (so that it always uses random initialization)?

Another question is whether RapidMiner is using Euclidean distance, Manhattan distance or another distance algorithm (and if this can be influenced?)

Regards,

Geoffrey
 
Logged
wessel
Sr. Member
****
Posts: 366


« Reply #1 on: January 29, 2012, 01:09:36 AM »

Hey,

@ Option to set the initial centroids yourself
I don't think there is one.
But since RM is open source, it is easy to modify the code.
Here is the source of k-means:
http://pastebin.com/TvGxrwdJ

Within the source of public class CentroidClusterModel extends ClusterModel, the centroids are created like this:
      for (int i = 0; i < k; i++) {
         centroids.add(new Centroid(dimensionNames.size()));
      }

Please also read: http://rapid-i.com/content/view/25/72/lang,en/
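
To make the idea of user-supplied starting points concrete, here is a minimal standalone sketch of k-means that takes the initial centroids as an argument. It is not RapidMiner code and does not use the RapidMiner API: the class SeededKMeans and its methods fit and nearest are hypothetical names invented for this illustration, and plain double arrays stand in for an example set.

```java
/** Illustrative sketch only; hypothetical names, not part of RapidMiner. */
final class SeededKMeans {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the point (ties go to the lower index). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs k-means starting from the caller's centroids; k is initialCentroids.length. */
    static double[][] fit(double[][] data, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dim = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone(); // do not modify the caller's arrays
        }
        int[] assignment = new int[data.length];
        java.util.Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {
                int c = nearest(data[i], centroids);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move every centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < data.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += data[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: keep its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}, {10, 11}, {11, 10}, {11, 11}};
        double[][] start = {{0, 0}, {1, 1}}; // hand-picked initial centroids
        for (double[] c : fit(data, start, 100)) {
            System.out.println(java.util.Arrays.toString(c));
        }
    }
}
```

Patching the same idea into RapidMiner itself would mean replacing the point where the operator fills its centroids with values read from a parameter; the details depend on the source linked above.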




@ Theory
I think random initializations are fine.
You should simply do multiple runs with different random initializations and keep the best result.
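
A rough sketch of that multiple-runs approach, assuming "best" means the lowest sum of squared errors (SSE) over all points. This is standalone Java, not RapidMiner code: the class KMeansRestarts and the methods runOnce, sse and bestOfRuns are hypothetical names, and the initial centroids of each run are simply k distinct data points picked at random.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; hypothetical names, not part of RapidMiner. */
final class KMeansRestarts {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Sum of squared distances from every point to its nearest centroid. */
    static double sse(double[][] data, double[][] centroids) {
        double total = 0.0;
        for (double[] p : data) {
            total += squaredDistance(p, centroids[nearest(p, centroids)]);
        }
        return total;
    }

    /** One k-means run that starts from k distinct, randomly chosen data points. */
    static double[][] runOnce(double[][] data, int k, int maxIterations, Random rng) {
        int n = data.length;
        int dim = data[0].length;
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) { // partial Fisher-Yates shuffle
            int j = i + rng.nextInt(n - i);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
            centroids[i] = data[idx[i]].clone();
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int c = nearest(data[i], centroids);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += data[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                for (int d = 0; d < dim && counts[c] > 0; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    /** Repeats runOnce and returns the centroids of the run with the lowest SSE. */
    static double[][] bestOfRuns(double[][] data, int k, int runs, long seed) {
        Random rng = new Random(seed);
        double[][] best = null;
        double bestSse = Double.POSITIVE_INFINITY;
        for (int r = 0; r < runs; r++) {
            double[][] candidate = runOnce(data, k, 100, rng);
            double error = sse(data, candidate);
            if (error < bestSse) {
                bestSse = error;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}, {10, 11}, {11, 10}, {11, 11}};
        double[][] best = bestOfRuns(data, 2, 10, 42L);
        System.out.println(Arrays.deepToString(best) + " SSE=" + sse(data, best));
    }
}
```

Comparing runs by SSE only makes sense for a fixed k, since SSE always tends to drop as k grows.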
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1196



« Reply #2 on: January 29, 2012, 09:11:01 PM »

Hi,

Quote
Another question is whether RapidMiner is using Euclidean distance, Manhattan distance or another distance algorithm (and if this can be influenced?)

For k-Means, Euclidean distance is used. This is necessary in order to perform the cluster optimization in O(n log n) runtime and hence cannot be changed. If you want to use other distance functions, you could use k-Medoids. It allows for all distance functions but is slightly slower and runs in O(n*n).
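
To illustrate why k-Medoids can work with any distance function, here is a small standalone sketch of the medoid update for a single cluster. It is not RapidMiner code; the class MedoidSketch and its methods are hypothetical names. The medoid is the cluster member with the smallest total distance to all other members, so the update only ever calls the distance function and never averages coordinates. Comparing all pairs within a cluster is also where the quadratic cost comes from.

```java
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only; hypothetical names, not part of RapidMiner. */
final class MedoidSketch {

    /** Euclidean (straight-line) distance. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    /** Manhattan (city-block) distance. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += Math.abs(a[d] - b[d]);
        }
        return sum;
    }

    /**
     * Returns the index of the medoid: the member whose summed distance to all
     * other members is smallest. Needs a pairwise comparison of all members.
     */
    static int medoidIndex(double[][] cluster, ToDoubleBiFunction<double[], double[]> distance) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < cluster.length; i++) {
            double cost = 0.0;
            for (int j = 0; j < cluster.length; j++) {
                cost += distance.applyAsDouble(cluster[i], cluster[j]);
            }
            if (cost < bestCost) {
                bestCost = cost;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] cluster = {{0, 0}, {2, 0}, {1, 1}, {0, 2}, {9, 9}};
        // The same update works for either distance; only the function changes.
        System.out.println(medoidIndex(cluster, MedoidSketch::manhattan));
        System.out.println(medoidIndex(cluster, MedoidSketch::euclidean));
    }
}
```

By contrast, the k-means update replaces each centroid with the arithmetic mean of its points, which is the minimizer of the summed squared Euclidean distance but not, in general, of other distances.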

Cheers,
Ingo