Simple Example for R in RapidMiner
by Ingo Mierswa | 15 Dec 2010 | Tags: R, Example
We got a lot of positive feedback after the release of the R extension, which allows R scripts to be integrated directly into the analysis processes of RapidMiner. Many people really like this approach, and for exactly that reason I would like to ease the first steps for those of you who are less experienced in programming in general and in programming with R.
The following example performs probably one of the simplest data transformations you can think of: we want to use R to add two columns of a data set and store the results in a new column called “Sum”.
Of course it is even simpler to use a dedicated operator for this task, namely the operator “Generate Attributes”. However, the process below is simple enough to demonstrate some of the necessary R concepts to less experienced users. In a programming lesson, the example below would probably be called the “Hello World” example for R in RapidMiner.
You will need a correctly installed R extension to follow this short tutorial. Please refer to our forum if you run into any problems during the installation.
Ok, let’s start. We assume we have a data set with four columns named a1 to a4 and another special attribute, the label. We take this input from our RapidMiner repository, which is the first step in the process below:
After loading the data with “Retrieve” we simply add a new operator “Execute Script (R)” and connect the output port of Retrieve, which delivers the data set during execution, to the input port of the new operator. We now define the inputs of the script by clicking on the parameter button “inputs”, which opens the following dialog:
We define the first input (we only have one) by giving it the name “data”. You can then reference the delivered data set in the script by this name.
The second definition is the R script itself. Click on the parameter button “script” in order to open a dialog where you can enter an arbitrary R script. This dialog looks like the following one:
Here is what the script does:
Line 1: sum_column <- data[1] + data[2]
This line generates a new data vector named “sum_column” and calculates the sum of the first column of data – indicated by the 1 in brackets – with the second one. Please note that we have used the defined name “data” here.
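To see what the single brackets do, the first line can be tried in a plain R session outside RapidMiner. The sketch below assumes a small made-up data frame standing in for the input “data”. Only the column names a1 to a4 and label come from the example; the values are invented for illustration.

```r
# Hypothetical stand-in for the RapidMiner input named "data".
# Column names follow the tutorial; the values are made up.
data <- data.frame(
  a1 = c(1, 2, 3),
  a2 = c(10, 20, 30),
  a3 = c(0.5, 0.6, 0.7),
  a4 = c(7, 8, 9),
  label = c("x", "y", "x")
)

# Single brackets select by position and keep the data frame structure,
# so data[1] is a one-column data frame holding a1.
first_column <- data[1]
print(class(first_column))

# Adding two one-column data frames adds them row by row.
sum_column <- data[1] + data[2]
print(sum_column)
```

Note that the result of the addition keeps the name of the first operand's column, which is one reason a later renaming step is useful.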
Line 2: complete_data <- c(data, sum_column)
We now concatenate (command: c) the newly generated column “sum_column” with the given data set named “data” and store it under the name “complete_data”.
Line 3: result <- as.data.frame(complete_data)
We now transform the result into a data frame. Data frames are the R concept for data tables: unlike matrices, their columns can have different types, and each column can have a name. They are pretty similar to the Example Sets known from RapidMiner. Please note that you have to transform your results to data frames with the command “as.data.frame” if you want to deliver the results back to RapidMiner as an Example Set (see below).
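The reason for the conversion becomes visible when the two steps are run in a plain R session: concatenating with c yields an ordinary list, not a data frame. The following sketch again assumes an invented stand-in for “data”, with the column names from the example and made-up values.

```r
# Hypothetical stand-in for the RapidMiner input named "data" (values made up).
data <- data.frame(
  a1 = c(1, 2, 3),
  a2 = c(10, 20, 30),
  a3 = c(0.5, 0.6, 0.7),
  a4 = c(7, 8, 9),
  label = c("x", "y", "x")
)
sum_column <- data[1] + data[2]

# c() treats both data frames as lists of columns and joins them
# into one plain list; the data frame class is lost.
complete_data <- c(data, sum_column)
print(class(complete_data))
print(is.data.frame(complete_data))

# as.data.frame() turns the list of equally long columns back into a table.
result <- as.data.frame(complete_data)
print(is.data.frame(result))
print(dim(result))
```

After the conversion the table has the original five columns plus the new one, with one row per example.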
Line 4: colnames(result)[ncol(result)] = "Sum"
This last step is optional and simply renames the new column to “Sum”. Of course this could also be done afterward with the operator “Rename”.
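For experimenting outside RapidMiner, the four steps can be put together as below. This is only a sketch under assumptions: the data frame is an invented stand-in for the input “data”, and the new column is addressed by its position as the last column via ncol(result). A shorter variant that adds the column by name is included for comparison; it is an alternative way to write the same calculation, not the script from the process.

```r
# Hypothetical stand-in for the RapidMiner input named "data" (values made up).
data <- data.frame(
  a1 = c(1, 2, 3),
  a2 = c(10, 20, 30),
  a3 = c(0.5, 0.6, 0.7),
  a4 = c(7, 8, 9),
  label = c("x", "y", "x")
)

# The four steps of the script, assembled.
sum_column <- data[1] + data[2]            # add first and second column
complete_data <- c(data, sum_column)       # append the new column (gives a list)
result <- as.data.frame(complete_data)     # convert back to a data frame
colnames(result)[ncol(result)] <- "Sum"    # rename the last (new) column

# Compact alternative: create the named column directly on a copy.
result_short <- data
result_short$Sum <- data$a1 + data$a2

# Both variants should contain the same sums.
print(identical(result$Sum, result_short$Sum))
print(result)
```

The compact variant avoids the detour through a list and the renaming step, at the cost of referring to the columns by name rather than by position.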
The final step is to define the results and how they are delivered back to RapidMiner. Simply click on the parameter button named “results” and the following dialog will be shown:
Here you can define which variables used in the script should be delivered. In our case it should only be the variable “result”, which contains the resulting data set. If a variable is a data frame (see above), it can be converted directly into a RapidMiner Data Table / Example Set. Otherwise, you can only deliver a generic R result.
There you go. Now you can simply run the process and add two columns with R directly within a RapidMiner process. Have fun trying out other data transformations!
I have also uploaded the process to myExperiment with our Community Extension. You can simply download it from there and try the scripting operator directly. The uploaded process also contains a parallel branch that performs the same calculation with the native operator “Generate Attributes” instead.