The RapidMiner Plotters 11: Survey
By Ingo Mierswa | Plotter | 22 Dec 2010
- Summary: Compact visualization of high-dimensional data sets; often shows correlations quite well
- Number of Dimensions: unlimited in theory, but useful up to several hundred depending on the screen resolution
- Data Types: Numerical, Nominal, Dates
This is the next post in a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.
Before we start our discussion about the Survey plotter, we will first have a look:
The basic idea behind a survey plot is not widely known but it is actually pretty simple: if you have n dimensions, the plot consists of n rectangular areas, and each area represents one of the dimensions. Similar to a parallel plot, the areas are placed side by side, but in the case of a survey plot the axis of each area runs horizontally instead of vertically.
Each data point is then drawn as in a bar chart (covered later in this series but probably widely known). For each dimension, a line represents the data point's value, with its length proportional to that value. The complete data point is then represented by all line segments in the same row.
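To make the construction concrete, here is a minimal text-only sketch of the idea. It is not how RapidMiner draws the plot: the function names, the min-max scaling per attribute, and the left-aligned bars are assumptions made for illustration, and the three data rows are made up.

```python
def normalize_columns(rows):
    """Scale every column to the range [0, 1] (min-max scaling, an assumption)."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column has no spread; draw it at half length.
        scaled_columns.append([0.5 if span == 0 else (v - lo) / span for v in col])
    # Transpose back so that each inner list is one example again.
    return [list(row) for row in zip(*scaled_columns)]


def survey_plot_text(rows, cell_width=10):
    """Return one text line per example with one fixed-width cell per attribute."""
    lines = []
    for scaled in normalize_columns(rows):
        cells = []
        for value in scaled:
            # The bar length is proportional to the scaled attribute value.
            length = round(value * cell_width)
            cells.append(("#" * length).ljust(cell_width))
        lines.append("|".join(cells))
    return lines


# Toy data: three examples (rows) with three attributes (columns).
data = [
    [1.0, 10.0, 5.0],
    [2.0, 30.0, 5.0],
    [3.0, 20.0, 5.0],
]

for line in survey_plot_text(data, cell_width=4):
    print(line)
```

Each printed row corresponds to one example and each cell between the separators to one attribute, which mirrors the layout described above on a very small scale.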
In the screenshot above, you can see about 200 examples, each drawn as one row of short horizontal line segments. Those line segments represent the values of the 200 examples for each of the about 60 attributes (dimensions). This visualization is very compact and works well even for high numbers of dimensions and thousands of examples. One of the first things you can easily see is the distribution of values in the different dimensions. For example, the attributes in the center have many more high values than the attributes on the left or on the right. This is especially useful if the attributes have a natural order, as is often the case for series data.
The second advantage of the survey plot is that it quickly gives some insight into the correlation between any two dimensions. This is even easier if you sort the data by one or several of the dimensions. And this is where the options on the left become important.
Especially for classification data sets, the last option named Color is probably the most important. Here you can select one of the dimensions and define a color scheme based on the selected attribute’s values. The result might look like this:
If you look closely, you can probably already see that for some of the attributes (attribute_10, attribute_11, and attribute_12 for example) the distribution differs a lot between the two classes. For example, the values of attribute_12 are in general higher for one of the classes. You can make this effect even stronger by sorting the data. The survey plot in RapidMiner offers sorting by up to three dimensions, which you select with the parameter boxes named First column, Second column, and Third column. The following picture shows the same data set sorted first according to attribute_12 and then according to attribute_45, i.e. in cases where the values of attribute_12 are equal, the second attribute determines the order:
Now you can see several things: there are other attributes highly correlated with attribute_12, which is now placed as the first dimension in the plot. You can identify those correlated attributes easily since they are also (almost) sorted now. Check for example attribute_10 or attribute_11. By the way, the name of an attribute is shown simply by clicking on one of the graphical columns.
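The sorting behaviour and the "almost sorted" effect can be reproduced outside of RapidMiner with a few lines of code. The following is only a sketch under assumptions: the helper names `sort_examples` and `sortedness` are invented here, ascending order is assumed, the values are toy data, and attribute_30 is a made-up name standing in for an unrelated attribute.

```python
def sort_examples(examples, sort_columns):
    """Sort rows by up to three columns; later columns only break ties."""
    if len(sort_columns) > 3:
        raise ValueError("the Survey plotter offers at most three sort columns")
    # Tuples compare element by element, so the second column only matters
    # when the values of the first column are equal, and so on.
    return sorted(examples, key=lambda row: tuple(row[c] for c in sort_columns))


def sortedness(values):
    """Fraction of adjacent pairs that are in non-decreasing order."""
    pairs = list(zip(values, values[1:]))
    return sum(a <= b for a, b in pairs) / len(pairs)


# Toy data; attribute_10 follows attribute_12 closely, attribute_30 does not.
examples = [
    {"attribute_12": 0.7, "attribute_45": 0.9, "attribute_10": 0.65, "attribute_30": 0.1},
    {"attribute_12": 0.2, "attribute_45": 0.3, "attribute_10": 0.25, "attribute_30": 0.8},
    {"attribute_12": 0.7, "attribute_45": 0.1, "attribute_10": 0.60, "attribute_30": 0.5},
    {"attribute_12": 0.4, "attribute_45": 0.6, "attribute_10": 0.45, "attribute_30": 0.9},
    {"attribute_12": 0.9, "attribute_45": 0.5, "attribute_10": 0.95, "attribute_30": 0.3},
]

ordered = sort_examples(examples, ["attribute_12", "attribute_45"])

for name in ("attribute_10", "attribute_30"):
    column = [row[name] for row in ordered]
    print(name, sortedness(column))
```

An attribute that is strongly correlated with the first sort column ends up with a sortedness close to 1, which is the numeric counterpart of the visual impression that its column looks sorted as well.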
When color is used for different classes, as is the case here, sorting can sometimes reveal which dimensions are best at classifying the data. For example, the combination of attribute_12 and attribute_45 already produces colored bands of the same class, i.e. a classification scheme should be able to define those regions and identify those with a high probability for one of the classes.
The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.
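One rough way to quantify how clean those bands are is to count the runs of identical class labels after sorting: the fewer runs, the clearer the bands. The sketch below assumes a hypothetical helper `class_bands`, a label column named "label" with the made-up classes "A" and "B", and toy values; none of this comes from RapidMiner itself.

```python
from itertools import groupby


def class_bands(examples, sort_columns, label="label"):
    """Sort the examples and return the runs of equal labels as (label, size)."""
    ordered = sorted(examples, key=lambda row: tuple(row[c] for c in sort_columns))
    labels = (row[label] for row in ordered)
    # groupby merges consecutive equal labels into one band.
    return [(key, len(list(group))) for key, group in groupby(labels)]


# Toy data with two classes; high attribute_12 values belong to class "A".
examples = [
    {"attribute_12": 0.9, "attribute_45": 0.2, "label": "A"},
    {"attribute_12": 0.1, "attribute_45": 0.5, "label": "B"},
    {"attribute_12": 0.8, "attribute_45": 0.1, "label": "A"},
    {"attribute_12": 0.2, "attribute_45": 0.3, "label": "B"},
    {"attribute_12": 0.7, "attribute_45": 0.9, "label": "A"},
    {"attribute_12": 0.3, "attribute_45": 0.4, "label": "B"},
]

unsorted_bands = class_bands(examples, [])  # empty key keeps the original order
sorted_bands = class_bands(examples, ["attribute_12", "attribute_45"])

print(len(unsorted_bands), "bands before sorting")
print(len(sorted_bands), "bands after sorting:", sorted_bands)
```

On this toy data the alternating labels collapse into two bands once the rows are sorted, which corresponds to the colored regions a classifier could pick up.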
Other parts of the plotter series:
- Scatter Multiple
- Scatter Matrix
- Scatter 3D
- Scatter 3D Color
- Parallel Plot
- Series Multiple