Rapid-I Blog
Tag >> Plotter
Release RapidMiner Plotter 23 Jan 2012
New Plotters for RapidMiner by Marius Helf Comment (1)
After quite some time of hard development, the Rapid-I team is proud to announce the birth of its latest baby: a brand new plot component presenting you a shiny, powerful and flexible visualization of your data and process results.

The new plotters support bar charts, area charts, scatter and series plots with a single configuration. Instead of preselecting a diagram type from a list of templates the new plotters allow you to freely choose the visualization type of each attribute. You can plot more than one attribute at a time, create additional y-axes, combine aggregated bar charts with scatter plots and add a number of error indicators if you feel the need for it. Enough talking, this is what the new plotters can do for you (of course with your all-time favourite data set):

 

What do we see in this plot? As you might recognize, the points depict a scatter plot of two attributes of the Iris dataset, namely sepal length versus sepal width, where sepal length is placed on the domain axis (x-axis) and sepal width on the left range axis (y-axis). The colors and shapes of the points are chosen according to the label of each data point. This is also represented in the legend on the right.

Talking about the legend, you might want to have a closer look at it. The upper part lists the plots in this diagram. The first entry, labelled sepal length (cm) with the circle in front of it, shows us that the plot consists of single data points, i.e. it is the scatter plot we just talked about. The missing color and rather undefined shape tell us to look at the bottom part of the legend to get the semantics of colors and shapes: there we discover that each unique color and shape represents one of the label values iris setosa, iris virginica and iris versicolor.

Now all that is left to explain is the bar chart, which is also easily spotted in the legend: it is a histogram of Iris, grouped by label, over the sepal length. Note that the heights of the bars refer to a second range axis on the right.
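For readers who like to see the numbers behind such a grouped histogram, here is a minimal Python sketch of the counting step. It is not how the plotter is implemented; the function name `grouped_histogram`, the bin width and the sample values are all assumptions made up for illustration.

```python
from collections import Counter


def grouped_histogram(values, labels, bin_width):
    """Count how many values fall into each (label, bin) pair.

    Bin k covers the half-open interval [k * bin_width, (k + 1) * bin_width).
    """
    counts = Counter()
    for value, label in zip(values, labels):
        bin_index = int(value // bin_width)
        counts[(label, bin_index)] += 1
    return counts


# Made-up sample values for illustration, not the real Iris measurements.
sepal_length = [5.1, 4.9, 5.8, 6.3, 6.4, 5.0]
label = ["setosa", "setosa", "versicolor", "virginica", "virginica", "setosa"]

hist = grouped_histogram(sepal_length, label, bin_width=1.0)
for (name, bin_index), count in sorted(hist.items()):
    print(f"{name:<10} bin {bin_index}: {count}")
```

Each (label, bin) count corresponds to the height of one bar, read against the second range axis.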

The attentive reader will have noted that the bars are slightly transparent: this shows another feature of our new plotters - everything is formattable and customizable, starting at customizable presets and gradients for the plot colors, different shapes for each data series, plot and legend background up to the fonts of the title and the axes. What else do you desire? Bars oriented from left to right instead of vertical ones? No problem, two clicks and you are done. Aggregate your data to calculate averages and plot the standard deviation of each data point? No problem, everything is possible :)

The true plotter experts will even be able to beam good old Iris to New York and celebrate the arrival of the new plot engine with fireworks never seen before in RapidMiner:

Oh yes, this truly is the Iris dataset. Can you guess from the legend what you are seeing?

We hope that we could spark your interest in this new feature. It will be part of RapidMiner 5.2 beta, which is expected to ship at the end of this week. As usual, you will be notified of its availability via RapidMiner's auto update, or you can just download it from our website.

Plotter 22 Sep 2011
The RapidMiner Plotters 16: Bars by Ingo Mierswa Comment (2)

Overview "Bars"

  • Summary: Perform simple aggregations on your data (like sums, min or max) and show those values with respect to defined groups
  • Number of Dimensions: 2, one for the grouping and one for the (aggregated) values
  • Data Types: Numerical, Nominal

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the Bars plotter, we will again first have a look:

 

As you can see, the bar plotter consists of several bars representing values (on the y-axis) for selected groups (on the x-axis). In principle, the bar plotter is very similar to the plotters Pie, Pie 3D, and Ring which we have discussed in a previous blog post. The basic idea of this type of chart is to present a number of numerical values where each value represents a group. There are two typical application areas for this:

  • You have a data set with two columns, one column with a set of (un-)ordered nominal values and a second one containing a numerical value for each group;
  • You again have a data set with a nominal and a numerical column, but now you have each nominal value several times in your table. The goal then is to aggregate the numerical values for each group defined by each of the nominal labels.

It is important to see that in the first case, each nominal value only occurs once and hence there is no need for any calculation on the numerical values. In the second case, you usually would like to perform simple aggregations on your numerical data (like sums, min or max) or at least to calculate the count of your nominal values for each group. Hence, you would like to show those calculated / aggregated values with respect to the defined groups.

Each (calculated) number will be presented by a bar where the height of the bar corresponds to the absolute value. This differs from the Pie charts, where each slice represents the number's relative share of the total sum. Look at the example above, where we used the famous Iris data set and where you can see the different average values for attribute "a3" with respect to the three groups defined by the labels / classes.

As always, you can find a list of settings on the left. The first setting is the Group-By Column. This will typically be a nominal-valued column from your data set which defines the groups into which the data set will be divided and which are presented by the elements of the chart. The setting Legend Column changes the labels at the bars to the values of the selected column. Since the only useful options are None or the grouping column, it can be ignored in most cases and will probably be removed in one of the next versions anyway.

The next important setting is the Value Column. Here you can select the usually numerical column which is used for value calculation. If you only have one row for each nominal value in the grouping column, you most often already have aggregated values ready for displaying. In other cases, you will have to define a matching Aggregation function, for example the sum or average of the values in each group. There are two additional settings which can be used to further fine-tune the plotting: Absolute Values means that only absolute values of the value column are used as input for the aggregation function. The setting Use Only Distinct means that each value is used exactly once in the aggregation, i.e. additional equal values are ignored.
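To make the interplay of Group-By Column, Value Column, Aggregation, Absolute Values and Use Only Distinct more tangible, here is a small Python sketch. It only mirrors the behaviour described above and is not RapidMiner code: the function `aggregate_by_group`, its parameter names and the sample rows are hypothetical, and the order in which the two options are applied (absolute values first, then removal of duplicates) is an assumption.

```python
from collections import defaultdict


def aggregate_by_group(rows, aggregation, absolute_values=False, only_distinct=False):
    """Reduce (group, value) pairs to one aggregated number per group."""
    groups = defaultdict(list)
    for group, value in rows:
        # "Absolute Values": feed |value| into the aggregation instead of value.
        groups[group].append(abs(value) if absolute_values else value)

    result = {}
    for group, values in groups.items():
        if only_distinct:
            # "Use Only Distinct": keep the first occurrence of each value.
            values = list(dict.fromkeys(values))
        result[group] = aggregation(values)
    return result


def average(values):
    return sum(values) / len(values)


# Made-up rows: (grouping column, value column)
rows = [("a", 1.0), ("a", 1.0), ("a", -4.0), ("b", 2.0), ("b", 6.0)]

plain_sum = aggregate_by_group(rows, sum)
abs_average = aggregate_by_group(rows, average, absolute_values=True)
distinct_sum = aggregate_by_group(rows, sum, only_distinct=True)
print(plain_sum, abs_average, distinct_sum)
```

Each entry of the resulting dictionary corresponds to one bar: the key is the group on the x-axis and the value is the bar height.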

The next setting rotates the labels on the x-axis by 90 degrees, which makes longer labels readable and prevents labels from overlapping when there are many groups. Finally, you can define the orientation of the bar plotter, i.e. whether the bars should be displayed vertically (default) or horizontally.

Other parts of the plotter series:

Plotter 8 Sep 2011
The RapidMiner Plotters 15: Pie, Pie 3D, and Ring by Ingo Mierswa Comment (0)

Overview "Pie", "Pie 3D", and "Ring"

  • Summary: Perform simple aggregations on your data (like sums, min or max) and show those values with respect to defined groups
  • Number of Dimensions: 2, one for the grouping and one for the (aggregated) values
  • Data Types: Numerical, Nominal

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the plotters Pie, Pie 3D, and Ring, we will again first have a look:

 

 

 

The three plotters Pie, Pie 3D, and Ring are very similar to each other. We will demonstrate all plotter functions with the Pie chart and show screenshots for the other two plotters later on. The basic idea of this type of chart - which also includes Bar charts, discussed in the next part of the series - is to present a number of numerical values where each value represents a group. There are two typical application areas for this:

  • You have a data set with two columns, one column with a set of (un-)ordered nominal values and a second one containing a numerical value for each group;
  • You again have a data set with a nominal and a numerical column, but now you have each nominal value several times in your table. The goal then is to aggregate the numerical values for each group defined by each of the nominal labels.

It is important to see that in the first case, each nominal value only occurs once and hence there is no need for any calculation on the numerical values. In the second case, you usually would like to perform simple aggregations on your numerical data (like sums, min or max) or at least to calculate the count of your nominal values for each group. Hence, you would like to show those calculated / aggregated values with respect to the defined groups.

The charts Pie, Pie 3D, and Ring are different from almost all other types of charts: there is no background and there are no scales involved. Instead, each (calculated) number will be presented by a slice of the pie where the area of the slice corresponds to the number's relative share of the total sum. Look at the example above, where we used the famous Iris data set and where you can see the different average values for attribute "a3" with respect to the three groups defined by the labels / classes.
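The step from aggregated values to slice sizes is simple arithmetic, sketched below in Python. The helper name `slice_shares` and the numbers are made up for illustration and are not taken from the Iris data set.

```python
def slice_shares(group_values):
    """Return each group's share of the total sum (the shares add up to 1)."""
    total = sum(group_values.values())
    return {group: value / total for group, value in group_values.items()}


# Hypothetical aggregated values, one per group.
averages = {"setosa": 1.5, "versicolor": 4.5, "virginica": 6.0}

shares = slice_shares(averages)
# A slice covering a share s of the pie spans an angle of s * 360 degrees.
angles = {group: share * 360 for group, share in shares.items()}
print(shares)
print(angles)
```

This also shows why a pie chart carries no scale: only the proportions survive, not the absolute values.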

 

 

As always, you can find a list of settings on the left. The first setting is the Group-By Column. This will typically be a nominal-valued column from your data set which defines the groups into which the data set will be divided and which are presented by the elements of the chart. The setting Legend Column changes the labels at the slices to the values of the selected column. Since the only useful options are None or the grouping column, it can be ignored in most cases and will probably be removed in one of the next versions anyway.

The next important setting is the Value Column. Here you can select the usually numerical column which is used for value calculation. If you only have one row for each nominal value in the grouping column, you most often already have aggregated values ready for displaying. In other cases, you will have to define a matching Aggregation function, for example the sum or average of the values in each group. There are two additional settings which can be used to further fine-tune the plotting: Absolute Values means that only absolute values of the value column are used as input for the aggregation function. The setting Use Only Distinct means that each value is used exactly once in the aggregation, i.e. additional equal values are ignored.

 

 

The last possible setting, which is only available for Pie and Ring but not for Pie 3D, is the definition of so-called Explosion Groups. Here you can select one or several of the possible groups and move them away from the rest with the slider Explosion Amount. This can help to highlight selected groups as shown in the first picture above.

Other parts of the plotter series:

Plotter 1 Apr 2011
The RapidMiner Plotters 14: Density by Ingo Mierswa Comment (0)

Overview "Density"

  • Summary: Showing dependencies between up to four dimensions in a two-dimensional space
  • Number of Dimensions: 2 plus 1 encoded by background (density) color plus 1 encoded by point color
  • Data Types: Numerical, Nominal, Dates

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the Density plotter, we will again first have a look:

 

The density plotter is very similar to the block plotter described previously and is basically also a scatter plot with two dimensions on the x-axis and the y-axis and one dimension used for the color of the data points. But in addition to the block plotter, the density plotter also allows the selection of an additional attribute which is used to colorize the background of the plot. Another difference from the block plotter is the fact that the data points are actually points instead of the blocks we have seen before. The background can be interpreted as a heat map.

Like the scatter plot, the density plot is a simple two-dimensional plot with two axes: x and y. The x-axis is plotted horizontally and the y-axis vertically. If you plot a data set, each point will be located at the position which corresponds to its values with respect to those two axes.

As always, you can find two boxes where you can select the attributes (variables, dimensions) of your data set or model which should be used for the x-Axis and for the y-Axis. Both options have to be set; otherwise the plotter will not show anything. By the way, you can use numerical attributes as well as nominal attributes for the axes. Even date attributes are supported.

The next option is called Point Color Column. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each of the data points just as for the traditional scatter plot.

The most important setting of this plotter is the Density Color. Here you can select the attribute which is used for colorizing the background. The colorization algorithm used is quite simple: each data point contributes to all pixels depending on the distance of the data point to the pixel. The color is then calculated as the distance-weighted average of all points for each pixel position.
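A distance-weighted average of this kind can be sketched in a few lines of Python. The exact weighting function of the plotter is not stated here, so the inverse-distance weights below are an assumption; the function name `density_value`, the `eps` parameter and the sample points are likewise hypothetical.

```python
import math


def density_value(pixel, points, eps=1e-9):
    """Inverse-distance-weighted average of the point values at one pixel.

    points is a list of (x, y, value) triples; eps avoids a division by
    zero when the pixel lies exactly on a data point.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in points:
        distance = math.hypot(pixel[0] - x, pixel[1] - y)
        weight = 1.0 / (distance + eps)  # assumed weighting: closer = stronger
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two made-up data points with density values 0 and 10.
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 10.0)]

midpoint = density_value((1.0, 0.0), points)    # equally far from both points
near_first = density_value((0.1, 0.0), points)  # dominated by the first point
print(midpoint, near_first)
```

Evaluating such a function for every pixel and mapping the result to a color scale yields a heat map like the plot background.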

Of course the density plot also supports zooming and panning as described here.

Other parts of the plotter series:

Plotter 23 Feb 2011
The RapidMiner Plotters 13: Block by Ingo Mierswa Comment (1)

Overview "Block"

  • Summary: Showing dependencies between two or three dimensions
  • Number of Dimensions: 2 plus 1 encoded by color
  • Data Types: Numerical, Nominal, Dates

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the Block plotter, we will again first have a look:

 

 

The block plotter is basically a scatter plot with two dimensions on the x-axis and the y-axis and one dimension used for the color of the data points. The main difference from a scatter plot is the point format, which is not a dot but a block. This is quite useful if the data set contains points on a two-dimensional grid.

Like the scatter plot, the block plot is a simple two-dimensional plot with two axes: x and y. The x-axis is plotted horizontally and the y-axis vertically. If you plot a data set, each point will be located at the position which corresponds to the values with respect to those two axes but instead of a point a block is printed.

As always, you can find two boxes where you can select the attributes (variables, dimensions) of your data set or model which should be used for the x-Axis and for the y-Axis. Both options have to be set; otherwise the plotter will not show anything. By the way, you can use numerical attributes as well as nominal attributes for the axes. Even date attributes are supported.

As you can see, you can also specify whether the selected attribute should be transformed to a log scale. Just check the box below the corresponding axis.

The next option is called Color Column. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each of the blocks.

The block plot also provides the Jitter option, although it is certainly not used as often as for the scatter plot. However, this option is quite useful if several data points are located at the same position in the two-dimensional space. Just move the jitter slider around and see what happens: the blocks move a bit in a random direction, revealing whether and which points are lying below.
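The idea behind jittering can be illustrated with a few lines of Python. This is only a sketch under assumptions: the function name `jitter`, the uniform random offsets and the `amount` parameter (standing in for the slider position) are hypothetical and not taken from RapidMiner.

```python
import random


def jitter(points, amount, seed=None):
    """Shift every (x, y) point by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Three made-up points lying exactly on top of each other.
points = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]

jittered = jitter(points, amount=0.1, seed=42)
for x, y in jittered:
    print(f"({x:.3f}, {y:.3f})")
```

With an amount of zero the points stay where they are; any larger amount pulls overlapping points apart so that all of them become visible.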

The last two options are pretty simple: Rotate Labels rotates the labels of the x-Axis by 90 degrees. Especially if you use a nominal attribute for the x-axis, the values can then be read easily. Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

Of course the block plot also supports zooming and panning as described here.

Below you can find another useful example for the block plot, namely the visualization of a correlation matrix (here done for the data set "Sonar"). You can easily see on the diagonal that attributes close to each other are more correlated and that there are also regions of attribute combinations with a high (negative) correlation.
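The data behind such a correlation plot is a table with one row per pair of attributes. Here is a minimal Python sketch of how it could be computed, assuming the Pearson correlation coefficient; the helper names `pearson` and `correlation_cells` and the three tiny columns are made up for illustration and have nothing to do with the actual Sonar data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_cells(columns):
    """Return (first attribute, second attribute, correlation) triples."""
    names = list(columns)
    return [(a, b, pearson(columns[a], columns[b])) for a in names for b in names]


# Made-up columns: a2 rises with a1, a3 falls with a1.
columns = {
    "a1": [1.0, 2.0, 3.0, 4.0],
    "a2": [2.0, 4.0, 6.0, 8.0],
    "a3": [4.0, 3.0, 2.0, 1.0],
}

cells = correlation_cells(columns)
for first, second, value in cells:
    print(first, second, round(value, 3))
```

In the block plotter, the first two entries of each triple would go on the x-axis and y-axis and the third would be chosen as the Color Column.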

 

 

Other parts of the plotter series:

Plotter 9 Feb 2011
The RapidMiner Plotters 12: SOM by Ingo Mierswa Comment (0)

Overview "SOM"

  • Summary: Qualitative visualization of high-dimensional data sets on a 2-dimensional "geographical" map
  • Number of Dimensions: unlimited in theory, but results tend to get worse for larger numbers
  • Data Types: Numerical plus one numerical / nominal for point color

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the SOM plotter, we will first have a look at the final result:

 

A SOM (Self-Organizing Map) is a visual representation of your data set on a two-dimensional area which resembles a geographical map. The basic idea is that data points which are close together in the original high-dimensional space should also be close together in the resulting two-dimensional space. In order to visualize those distances in the resulting space, a color mapping is used.

Have a look at the map above. Mountains indicate that the distances between points are high. Deep sea means that those points are closer together. For example, the green points on the left are separated less from the red points in the lower left corner (upper arrow) than the green and blue points (lower arrow).

Another property of the map is that the top border and the bottom border are connected, i.e. it behaves like a world map. You can continue the map seamlessly from top to bottom. The same is true for the left and the right border.

OK, having seen the results we are after, we will now have a look at how to create the plot and configure the plotter.

 

There is a major difference between the SOM plotter and other plotters in RapidMiner. Internally, a SOM is an unsupervised neural network. The data points are assigned to the nodes of the network. As a consequence, the network has to be trained anew for each data set. And as you might know if you are familiar with neural networks: the training can take some time. Therefore, most changes of the SOM settings will not have any effect until you press the Calculate button at the bottom of the plotter options on the left.

 

 

After having pressed the button, the calculation of the network is performed which might take some time. The progress indicator above the calculate button might give you a hint how long you will have to wait. After a couple of seconds (or minutes - depending on the data set), you will get the visualization of the two-dimensional map like in the following picture:

 

Please note that you will have to select a Point Color in order to show the data points on the map. This most often will be the class of the data points or any other property you might be interested in. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each of the data points. It does not matter if the selected column is numerical or nominal; both scenarios will work.

The next two options, Matrix and Style, are specific to the SOM visualization. With the Matrix option, you can choose whether you want to display the distances (U-Matrix), the density of the data space (P-Matrix) or a combination of both (U*-Matrix). Please compare the U-Matrix (the picture above) with the U*-Matrix (the first picture in this post). The Style option defines the color scheme which is used for displaying this information. The default Landscape produces a geographical map like the one above; just play around to find the color scheme which is most appropriate for your data set.

As we have stated above, the SOM is internally represented by a network consisting of a fixed number of nodes. The size of this network can be determined with the settings Net Width and Net Height. There are also two important training options for the underlying neural network, namely the two training parameters Training Rounds and Adaptation Radius. The default values are fine for most settings but you might want to optimize those for certain data sets. After having changed those settings, you will have to recalculate the plot by pressing Calculate.
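To give an idea of what these four settings control, here is a deliberately tiny Python sketch of SOM training on a wrap-around grid. It is not the implementation used by the plotter: the function names, the learning rate, the linear decay schedule and the grid distance are all assumptions, and only the parameter names `net_width`, `net_height`, `training_rounds` and `adaptation_radius` are borrowed from the settings above.

```python
import random


def wrapped_grid_distance(p, q, width, height):
    """Grid distance on a map whose opposite borders are connected."""
    dc = abs(p[0] - q[0])
    dr = abs(p[1] - q[1])
    return min(dc, width - dc) + min(dr, height - dr)


def best_matching_node(nodes, vector):
    """Position of the node whose weights are closest to the vector."""
    return min(
        nodes,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(nodes[pos], vector)),
    )


def train_som(data, net_width, net_height, training_rounds,
              adaptation_radius, learning_rate=0.5, seed=0):
    """Train a small SOM and return {(column, row): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = {(c, r): [rng.random() for _ in range(dim)]
             for c in range(net_width) for r in range(net_height)}

    for round_index in range(training_rounds):
        # Assumed schedule: step size and radius shrink linearly per round.
        decay = 1.0 - round_index / training_rounds
        radius = adaptation_radius * decay
        for vector in data:
            winner = best_matching_node(nodes, vector)
            for pos, weights in nodes.items():
                if wrapped_grid_distance(pos, winner, net_width, net_height) <= radius:
                    # Pull the node's weights towards the data point.
                    for i in range(dim):
                        weights[i] += learning_rate * decay * (vector[i] - weights[i])
    return nodes


# Made-up two-dimensional data with values between 0 and 1.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]

som = train_som(data, net_width=4, net_height=3,
                training_rounds=20, adaptation_radius=2)
for vector in data:
    print(vector, "->", best_matching_node(som, vector))
```

The random initial weights also hint at why pressing Calculate a second time can lead to a different map: in this sketch, a different `seed` plays that role.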

Since the data points are initially all located on the network nodes, it often happens that multiple data points are located on a single node and overlap. For this reason, the Jitter option is very useful for SOMs. Just move the jitter slider around and see what happens: the points move a bit in a random direction, revealing whether and which points are lying below.

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

A last note on SOM visualizations: the calculation of the neural network depends on random numbers and pressing Calculate another time might deliver a different - and sometimes more appropriate - result. Just try to recalculate a visualization by pressing Calculate again.

Other parts of the plotter series:

 

Plotter 22 Dec 2010
The RapidMiner Plotters 11: Survey by Ingo Mierswa Comment (0)

Overview "Survey"

  • Summary: Compact visualization of high-dimensional data sets; often shows correlations quite well
  • Number of Dimensions: unlimited in theory, but useful up to several hundred depending on the screen resolution
  • Data Types: Numerical, Nominal, Dates

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.


Before we start our discussion about the Survey plotter, we will first have a look:

 

 

The basic idea behind a survey plot is not widely known but it is actually pretty simple: the plot consists of n rectangular areas if you have n dimensions, and each area represents one of the dimensions. Similar to a parallel plot, each area is placed beside the other dimensions, but in the case of a survey plot, the axis is placed horizontally instead of vertically.

Each data point is then printed like in a bar chart (covered later in this series but probably widely known). A line is used to represent the data for each dimension, with its length proportional to the dimensional value it represents. The complete data point is then represented by all line segments in the same row.

In the example above, you can see about 200 examples, one row consisting of short horizontal line segments for each of those examples. Those short line segments represent the values of those 200 examples for each of the about 60 attributes (dimensions). This visualization is very compact and works well even for high numbers of dimensions and thousands of examples. One of the first things you can easily see is the distribution of values in the different dimensions. For example, the center part has many more high values compared to the attributes on the left or those on the right. This is especially useful if the attributes are ordered, as is often the case for series data.

The second advantage of the survey plot is that it quickly gives some insight into the correlation between any two dimensions. This is even easier if you sort the data by one or several of the dimensions. And this is where the options on the left become important.

Especially for classification data sets, the last option named Color is probably the most important. Here you can select one of the dimensions and define a color scheme based on the selected attribute’s values. The result might look like this:

 

If you look closely, you can probably already see that for some of the attributes (attribute_10, attribute_11, and attribute_12 for example) the distribution differs a lot between the two classes. For example, the values of attribute 12 are in general higher for one of the classes. You can even make this effect stronger by sorting the data. The survey plot in RapidMiner offers a sorting for up to three dimensions by selecting them with the parameter boxes named First column, Second column, and Third column. The following picture shows the same data set first sorted according to attribute_12 and then according to attribute_45, i.e. in cases where the values of attribute_12 are equal, the second attribute determines the order:

 

Now you can see several things: there are other attributes highly correlated to attribute_12, which is now placed as the first dimension in the plot. You can identify those correlated attributes easily since they are also (almost) sorted now. Check for example attribute_10 or attribute_11. By the way, the name of an attribute is shown simply by clicking on one of the graphical columns.
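Sorting by a first and a second column, where the second column only breaks ties of the first, is an ordinary multi-key sort. The following Python sketch shows the principle; the rows, their values and the class names are made up for illustration.

```python
from operator import itemgetter

# Made-up examples; only the attribute names follow the text above.
rows = [
    {"attribute_12": 0.3, "attribute_45": 0.9, "label": "class_a"},
    {"attribute_12": 0.1, "attribute_45": 0.5, "label": "class_b"},
    {"attribute_12": 0.3, "attribute_45": 0.2, "label": "class_b"},
    {"attribute_12": 0.1, "attribute_45": 0.7, "label": "class_a"},
]

# First column: attribute_12, second column: attribute_45 (used on ties only).
sorted_rows = sorted(rows, key=itemgetter("attribute_12", "attribute_45"))

for row in sorted_rows:
    print(row["attribute_12"], row["attribute_45"], row["label"])
```

Drawing the rows of the survey plot in this order is what makes attributes correlated with the first sort column appear (almost) sorted as well.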

When color is used for different classes, as is the case here, sorting can sometimes reveal which dimensions are best at classifying the data. For example, the combination of attribute_12 and attribute_45 already defines colored bands of the same class, i.e. a classification scheme should be able to define those regions and identify those with a high probability for one of the classes.

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

Other parts of the plotter series:

 

Plotter 13 Dec 2010
The RapidMiner Plotters 10: Series Multiple by Ingo Mierswa Comment (0)

Overview "Series Multiple"

  • Summary: Useful plotter for multiple series in different value ranges
  • Number of Dimensions: unlimited in theory, but useful up to a dozen
  • Data Types: Numerical, Dates

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Series Multiple plotter, we will first have a look:

 

 Series Multiple Plotter

 

The series multiple plot is very similar to the series plotter discussed in the last session. Like the non-multiple variant, it is the basic representation for series values, i.e. in cases where the values are ordered by a known or unknown dimension. It is most often used for time series plots where the x-axis represents the time and the change of values in the series is plotted according to the value range depicted by the y-axis.

The series values have to be encoded in the attributes / columns of the data table. This means that each series is stored by a single column in the table. In the example above, we have selected three series in the Plot Series selection list on the left. The trajectory of those series is now plotted and each series gets its own color. Those colors are explained in a legend at the left of the plot together with the different value ranges. The fact that different ranges are used for the selected dimensions is the major difference to the traditional series plot.
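Giving each series its own value range amounts to rescaling every column independently before drawing it into the same plot area. Here is a minimal Python sketch of that idea; the helper names `value_ranges` and `scale_to_unit` and the two sample columns are hypothetical, and the plotter itself may handle the ranges differently.

```python
def value_ranges(columns):
    """Return the (minimum, maximum) pair of every series."""
    return {name: (min(values), max(values)) for name, values in columns.items()}


def scale_to_unit(values):
    """Map the values of one series linearly onto the interval [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.5 for _ in values]  # constant series: draw it in the middle
    return [(value - low) / (high - low) for value in values]


# Two made-up series living in very different value ranges.
columns = {
    "temperature": [10.0, 20.0, 30.0],
    "pressure": [1000.0, 1010.0, 1005.0],
}

ranges = value_ranges(columns)
scaled = {name: scale_to_unit(values) for name, values in columns.items()}
print(ranges)
print(scaled)
```

After rescaling, both series use the full height of the plot, while the ranges shown in the legend tell you which original values the top and the bottom correspond to.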

Just as for the traditional series plotter, the multiple variant also offers a setting which is called index dimension. With this setting, you can select one of the dimensions which should be used for defining the range on the x-axis. If your data set contains, for example, a date column specifying on which date and / or time the measurements of the other series (the other columns) were taken, you probably want to define the date as index dimension. You can see an example of using an index dimension in our last session about the traditional series plotter.

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

Other parts of the plotter series:

Plotter 2 Dec 2010
The RapidMiner Plotters 9: Series by Ingo Mierswa Comment (0)

Overview "Series"

  • Summary: Natural representation for ordered values like for measurements changing over time
  • Number of Dimensions: unlimited in theory, but useful up to a couple of dozen
  • Data Types: Numerical, Dates

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Series plotter, we will first have a look:

 

 

The series plot is the basic representation for series values, i.e. in cases where the values are ordered by a known or unknown dimension. It is most often used for time series plots where the x-axis represents the time and the change of values in the series is plotted according to the value range depicted by the y-axis.

The series values have to be encoded in the attributes / columns of the data table. This means that each series is stored in a single column of the table. In the example above, we have selected three series in the Plot Series selection list on the left. The trajectory of those series is now plotted and each series gets its own color. Those colors are explained in a legend at the top of the plot. As you can see, the values are quite similar for the three series. They differ a bit more only between, for example, 30 and 40 on the x-axis.

Sometimes you want to display a value range which is defined by two series (for example, one series for the possible minimum values at each point in time and one for the corresponding maximum values). It is then nice to show one or several additional series and to be able to compare them against each other or against the depicted range. This could look like the following plot:

 

 

As you can see, we have selected one series for the lower bound and one for the upper bound. The range in between is painted in a transparent grey and the mean value is depicted by a dashed line. This grey area is named bounds in the legend at the top. We have created two new columns named Min and Max with the operator Generate Aggregation in advance and selected those two attributes for the bounds.
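If you want to get a feeling for what such Min and Max columns contain, the following Python sketch computes them row by row. It only mimics the idea and is not the Generate Aggregation operator itself; the column names and values are invented, and whether the plotter derives its dashed mean line from the same columns is an assumption of this sketch.

```python
from statistics import mean

# Hypothetical example table: each key is one series (one column),
# each list position is one point on the x-axis.
table = {
    "series_a": [1.0, 2.0, 4.0],
    "series_b": [2.0, 2.5, 3.0],
    "series_c": [0.5, 3.0, 5.0],
}

def row_bounds(columns):
    """Return per-row minimum, maximum and mean across the given columns."""
    rows = list(zip(*columns.values()))   # one tuple of values per row
    lower = [min(row) for row in rows]    # would become the "Min" column
    upper = [max(row) for row in rows]    # would become the "Max" column
    center = [mean(row) for row in rows]  # assumed basis of the dashed line
    return lower, upper, center

lower, upper, center = row_bounds(table)
```

The two lists `lower` and `upper` play the role of the attributes selected for the bounds, and every other series can then be compared against the band between them.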

Another setting which is often used is the index dimension. With this setting, you can select one of the dimensions which should be used for defining the range on the x-axis. If your data set contains, for example, a date column specifying on which date and / or time the measurements of the other series (the other columns) were taken, you probably want to define the date as index dimension. The following picture shows what this looks like:

 

 

As you can easily see, the year is now written on the x-axis. We have data spanning about 10 years here. Of course, you can also zoom in and pan (move) in the zoomed plot. This is described in the first part of our plotter series. If you zoom in, the index dimension is updated accordingly, as can be seen in the following picture where we concentrate on one and a half years only:

 

 

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

Other parts of the plotter series:

 

Plotter 25 Nov 2010
The RapidMiner Plotters 8: Deviation by Ingo Mierswa Comment (0)

Overview "Deviation"

  • Summary: Great for large numbers of dimensions and also high numbers of examples
  • Number of Dimensions: unlimited plus 1 as color
  • Data Types: Numerical, Nominal, Dates
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Deviation plotter, we will first have a look:

 

Deviation Plot

 

The deviation plot is a high-dimensional visualization technique for almost arbitrary numbers of dimensions. It is very similar to the Parallel plot discussed in the last session. But in contrast to the parallel plot, the deviation plot does not show each data point as a line in an n-dimensional space but only one line for a group of examples.

As in the parallel plot, each dimension is displayed as a vertical grid line which is parallel to the other dimension grid lines. Now, the average and the standard deviation for each attribute are calculated, and all average values are then represented as a line consisting of multiple line segments. This average line (printed in bold) intersects each dimension line at the position corresponding to the average value of that attribute. In addition to the average values, a transparent region is drawn around the bold average line indicating the range of the standard deviations for all attributes. You can now easily see the mean value together with the range in which most examples are usually located, and this for a quite large number of attributes.
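The calculation behind the bold line and the transparent region can be sketched in a few lines of Python. This is only an illustration under assumptions: the example data and the function name `deviation_band` are invented, the band is taken as mean plus and minus one standard deviation, and the sample standard deviation is used, which may differ from what RapidMiner does internally.

```python
from statistics import mean, stdev

# Hypothetical examples: one dict per example, one key per attribute.
examples = [
    {"attribute_1": 1.0, "attribute_2": 10.0},
    {"attribute_1": 2.0, "attribute_2": 10.0},
    {"attribute_1": 3.0, "attribute_2": 10.0},
]

def deviation_band(examples):
    """Return (lower, average, upper) per attribute.

    Assumption: the band spans one sample standard deviation on
    either side of the average.
    """
    band = {}
    for name in examples[0]:
        values = [example[name] for example in examples]
        avg = mean(values)
        sd = stdev(values)
        band[name] = (avg - sd, avg, avg + sd)
    return band

band = deviation_band(examples)
```

The middle values, connected from attribute to attribute, correspond to the bold average line, while the lower and upper values outline the transparent region.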

The deviation plot becomes even more powerful if you select a nominal attribute to be used as the Color of the lines. This setting basically works like a grouping of the examples before the average values and standard deviations are calculated. The lines and transparent regions are then drawn for each of the possible nominal values of the attribute selected for the color. Those colors are explained in a legend at the top of the plot. This becomes quite handy if you select a class label for the color, since in many cases it gives you first hints about which attributes are well suited for classification. Those attributes differ more in the average values (the bold lines) and have less overlapping transparent regions. Here is a plot of the same data set as the one above (it is called Sonar and is delivered as sample data together with RapidMiner), but now we use the class label for the color:

 

Deviation Plot with Color

 

It is very easy to see that there are two or three attribute regions which might be more helpful to distinguish between the two classes. For exactly this reason, the deviation plot often offers insights where the traditional parallel plot is not able to show anything, simply because of the high number of examples.
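The same reading of the plot can be imitated numerically. The following Python sketch groups examples by a class label, computes one band per class and attribute, and then checks whether the bands of two classes overlap. All names and values are hypothetical, the band is assumed to be the mean plus and minus one sample standard deviation, and the overlap check is just one simple way to express "less overlapping transparent regions".

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical labelled examples: (class label, attribute values).
examples = [
    ("Rock", {"attribute_1": 0.10, "attribute_2": 0.50}),
    ("Rock", {"attribute_1": 0.20, "attribute_2": 0.70}),
    ("Rock", {"attribute_1": 0.30, "attribute_2": 0.30}),
    ("Mine", {"attribute_1": 0.70, "attribute_2": 0.40}),
    ("Mine", {"attribute_1": 0.80, "attribute_2": 0.60}),
    ("Mine", {"attribute_1": 0.90, "attribute_2": 0.50}),
]

def group_bands(examples):
    """Return {label: {attribute: (lower, average, upper)}}."""
    groups = defaultdict(list)
    for label, values in examples:
        groups[label].append(values)
    result = {}
    for label, rows in groups.items():
        result[label] = {}
        for name in rows[0]:
            column = [row[name] for row in rows]
            avg, sd = mean(column), stdev(column)
            result[label][name] = (avg - sd, avg, avg + sd)
    return result

def bands_overlap(band_a, band_b):
    """True if the two (lower, average, upper) bands share any value."""
    return band_a[0] <= band_b[2] and band_b[0] <= band_a[2]

grouped = group_bands(examples)
separating = [
    name
    for name in grouped["Rock"]
    if not bands_overlap(grouped["Rock"][name], grouped["Mine"][name])
]
```

In this toy data only `attribute_1` ends up in `separating`, which matches what the eye would pick out in the plot: the two bands for that attribute do not touch.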

The deviation plot also has an additional parameter called Local Normalization which can be used to rescale the values in all dimensions to a range between 0 and 1 (normalization). This is quite helpful if the values differ a lot between the dimensions.
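Rescaling to the range between 0 and 1 is commonly done with a min-max normalization per dimension. The Python sketch below shows this; that Local Normalization uses exactly this formula is an assumption, and the function name and the handling of constant columns are my own choices.

```python
def normalize_column(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]

# Two hypothetical dimensions with very different value ranges.
small = normalize_column([0.1, 0.2, 0.3])
large = normalize_column([10, 20, 30])
```

After normalization, both dimensions run from 0 to 1, so neither dominates the plot just because its raw values are larger.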

The last two options are pretty simple and do the same as for the other plotters: Rotate Labels rotates the labels of the x-axis by 90 degrees. Especially if you use a nominal attribute for the x-axis, the values can then be read easily. Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

Other parts of the plotter series:

 
