Overview "SOM"
- Summary: Qualitative visualization of high-dimensional data sets on a 2-dimensional "geographical" map
- Number of Dimensions: unlimited in theory, but results tend to get worse for larger numbers
- Data Types: Numerical plus one numerical / nominal for point color
This is the next post in a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend reading the first parts of this series before this one.
Before we start our discussion of the SOM plotter, let us first look at the final result:
A SOM (Self-Organizing Map) is a visual representation of your data set on a two-dimensional area which resembles a geographical map. The basic idea is that data points which are close together in the original high-dimensional space should also be close together in the resulting two-dimensional space. In order to visualize those distances in the resulting space, a color mapping is used.
Have a look at the map above. Mountains indicate that the distances between points are large. Deep sea means that the points are closer together. For example, the green points on the left are less separated from the red points in the lower left corner (upper arrow) than from the blue points (lower arrow).
Another property of the map is that the top border and the bottom border are connected, i.e. it behaves like a world map. You can continue the map seamlessly from top to bottom. The same is true for the left and the right border.
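To illustrate what connected borders mean for distances on the map, here is a minimal Python sketch. The function name and the grid coordinates are my own assumptions for illustration and are not taken from RapidMiner; the point is only that the shortest way between two positions may lead across a border.

```python
def toroidal_distance(a, b, width, height):
    """Distance between grid positions a and b on a map whose borders wrap around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # Going "around the world" may be shorter than going straight across.
    dx = min(dx, width - dx)
    dy = min(dy, height - dy)
    return (dx * dx + dy * dy) ** 0.5


# On a 10 x 10 map, the left-most and right-most columns are direct neighbors.
print(toroidal_distance((0, 0), (9, 0), width=10, height=10))  # 1.0
```

This is also why a cluster that appears cut in two by a border of the map is in fact a single connected group.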
OK, now that we have seen the result we are after, let us look at how to create the plot and configure the plotter.
There is a major difference between the SOM plotter and the other plotters in RapidMiner. Internally, a SOM is an unsupervised neural network, and the data points are assigned to the nodes of this network. As a consequence, the network has to be trained anew for each data set. And as you might know if you are familiar with neural networks: the training can take some time. Therefore, most changes of the SOM settings will not have any effect until you press the Calculate button at the bottom of the plotter options on the left.
After you have pressed the button, the network is calculated, which might take some time. The progress indicator above the Calculate button gives you a hint of how long you will have to wait. After a couple of seconds (or minutes - depending on the data set), you will get the visualization of the two-dimensional map like in the following picture:
Please note that you will have to select a Point Color in order to show the data points on the map. Most often this will be the class of the data points or any other property you might be interested in. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each data point. It does not matter whether the selected column is numerical or nominal; both will work.
The next two options, Matrix and Style, are specific to the SOM visualization. With the Matrix option, you can choose whether you want to display the distances (U-Matrix), the density of the data space (P-Matrix), or a combination of both (U*-Matrix). Please compare the U-Matrix (the picture above) with the U*-Matrix (the first picture in this post). The Style option determines the color scheme which is used for displaying this information. The default, Landscape, produces a geographical map like the one above. Just play around to find the color scheme which is most appropriate for your data set.
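To make the idea behind the U-Matrix more concrete, here is a small Python sketch. A U-Matrix is commonly computed as the average distance between the weight vector of a node and those of its direct neighbors on the grid. The function name, the dictionary layout, and the choice of four neighbors are assumptions for illustration; the exact computation inside RapidMiner may differ.

```python
import math


def u_matrix(weights, net_width, net_height):
    """Average weight distance of each node to its four grid neighbors.

    weights maps a grid position (x, y) to the weight vector of that node.
    The borders wrap around, as on the map described above.
    """
    heights = {}
    for (x, y), w in weights.items():
        neighbors = [
            ((x + 1) % net_width, y),
            ((x - 1) % net_width, y),
            (x, (y + 1) % net_height),
            (x, (y - 1) % net_height),
        ]
        distances = [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(w, weights[n])))
            for n in neighbors
        ]
        heights[(x, y)] = sum(distances) / len(distances)
    return heights


# A 3 x 3 net where only the center node differs from all the others.
weights = {(x, y): [0.0] for x in range(3) for y in range(3)}
weights[(1, 1)] = [1.0]
heights = u_matrix(weights, 3, 3)
print(heights[(1, 1)], heights[(0, 0)])  # 1.0 0.0
```

In the landscape picture, large values would correspond to mountains and small values to deep sea.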
As stated above, the SOM is internally represented by a network consisting of a fixed number of nodes. The size of this network can be set with Net Width and Net Height. There are also two important training parameters for the underlying neural network, namely Training Rounds and Adaptation Radius. The default values are fine for most cases, but you might want to optimize them for certain data sets. After having changed those settings, you will have to recalculate the plot by pressing Calculate.
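How these four settings could enter the training is sketched below in Python. This is a textbook-style SOM training loop under my own assumptions, not RapidMiner's actual implementation: the linearly shrinking radius, the learning rate starting at 0.5, the Gaussian neighborhood, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def wrapped_grid_distance_sq(a, b, width, height):
    """Squared grid distance between two nodes on a map with connected borders."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    dx = min(dx, width - dx)
    dy = min(dy, height - dy)
    return dx * dx + dy * dy


def best_matching_node(weights, point):
    """Grid position of the node whose weight vector is closest to point."""
    return min(
        weights,
        key=lambda node: sum((w - p) ** 2 for w, p in zip(weights[node], point)),
    )


def train_som(data, net_width, net_height, training_rounds, adaptation_radius, seed=None):
    rng = random.Random(seed)
    dim = len(data[0])
    # One randomly initialized weight vector per node of the net.
    weights = {
        (x, y): [rng.random() for _ in range(dim)]
        for x in range(net_width)
        for y in range(net_height)
    }
    for round_index in range(training_rounds):
        remaining = 1.0 - round_index / training_rounds
        radius = max(adaptation_radius * remaining, 0.5)  # assumed: shrinks over time
        rate = 0.5 * remaining  # assumed learning-rate schedule
        for point in rng.sample(data, len(data)):  # visit the points in random order
            winner = best_matching_node(weights, point)
            for node, w in weights.items():
                d_sq = wrapped_grid_distance_sq(node, winner, net_width, net_height)
                influence = math.exp(-d_sq / (2.0 * radius * radius))
                # Pull the node towards the point; nodes near the winner move most.
                for i in range(dim):
                    w[i] += rate * influence * (point[i] - w[i])
    return weights


data = [[0.1, 0.1], [0.15, 0.2], [0.9, 0.8], [0.85, 0.9]]
som = train_som(data, net_width=4, net_height=3, training_rounds=20,
                adaptation_radius=2.0, seed=42)
print(len(som))  # 12 nodes
```

In this sketch, a larger net gives the data points more room to spread out, more training rounds give the net more time to settle, and a larger adaptation radius makes each data point influence a wider area of the map. The hypothetical seed parameter only makes the random initialization repeatable; without it, each run may produce a different map.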
Since the data points are initially all located on the network nodes, it often happens that multiple data points are located on a single node and overlap. For this reason, the Jitter option is very useful for SOMs. Just move the jitter slider and watch what happens: the points move a bit in a random direction, revealing whether and which points are hidden below others.
The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plot with all its settings into one of the dozens of supported image formats.
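The effect of jitter can be sketched in a few lines of Python. The function name and the uniform random offset are assumptions for illustration; how RapidMiner computes the offsets internally is not described here.

```python
import random


def jitter(points, amount, seed=None):
    """Shift each (x, y) position by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Three data points sitting on the same node become distinguishable.
on_same_node = [(3, 2), (3, 2), (3, 2)]
for position in jitter(on_same_node, amount=0.3, seed=1):
    print(position)
```

With an amount of zero, the points stay exactly on their nodes, which corresponds to the slider being at its leftmost position in this sketch.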
A last note on SOM visualizations: the calculation of the neural network depends on random numbers and pressing Calculate another time might deliver a different - and sometimes more appropriate - result. Just try to recalculate a visualization by pressing Calculate again.
Other parts of the plotter series: