First Steps With RapidMiner



This page describes some basic concepts of RapidMiner. In the description, we assume that most of the processes are performed in batch mode (or command line mode). Of course, you can also use RapidMiner in the Graphical User Interface (GUI) mode, which is more convenient and offers a large number of additional features. Short documentation of the GUI mode is available separately in the download section of the RapidMiner website. RapidMiner also provides an online tutorial that describes the usage of the GUI mode and the basic concepts of machine learning with RapidMiner. You will probably not need to read all sections of this tutorial after completing the online tutorial and reading the short GUI manual. However, you should at least read this page to get a first idea of some of the RapidMiner concepts.

All examples described in this tutorial are part of the sample directories of RapidMiner. Although only a few of these examples are discussed here, you should take a look at all of them since they will give you some helpful hints. We suggest that you first run approximately the first half of the process definitions in each of the sample directories, in the order in which the directories are named, i.e. first the first half of directory 01_IO, then the first half of 02_Learner, and so on. After this round, start again with the first directory and run the second half of the process setups. This way, the more complicated processes will be run after you have had a look at almost all of the simple building blocks and operators.


First example

Let us start with a simple example, 03_XValidation_Numerical.xml, which you can find in the 04_Validation subdirectory. This example process loads an example set from a file, generates a model using a support vector machine (SVM), and evaluates the performance of the SVM on this dataset by estimating the expected absolute and squared error by means of a ten-fold cross-validation. In the following, we will describe what the parameters mean without going into too much detail. We will describe the operators used later in this section.

<source lang="xml"> <operator name="Root" class="Process">

 <parameter key="logfile" value="XValidation.log"/>
 <operator name="Input" class="ExampleSource">
   <parameter key="attributes" value="../data/polynomial.aml"/>
 </operator>
 <operator name="XVal" class="XValidation">
   <operator name="Training" class="LibSVMLearner">
     <parameter key="svm_type" value="epsilon-SVR"/>
     <parameter key="kernel_type" value="poly"/>
     <parameter key="C" value="1000.0"/>
   </operator>
   <operator name="ApplierChain" class="OperatorChain">
     <operator name="Test" class="ModelApplier">
     </operator>
     <operator name="Evaluation" class="PerformanceEvaluator">
       <parameter key="squared_error"	value="true"/>
     </operator>
   </operator>
 </operator>

</operator> </source>

Figure: Simple example configuration file. This is the 03_XValidation_Numerical.xml sample process.

But first of all, let's start the process. We assume that your current folder contains the file 03_XValidation_Numerical.xml (see figure Simple example configuration file). Now start RapidMiner by typing rapidminer 03_XValidation_Numerical.xml or by opening that file with the GUI and pressing the start button. After a short while you should see the words "Process finished successfully". Congratulations, you have just run your first RapidMiner process. If you see "Process not successful" instead, something went wrong. In either case you should get some information messages on your console (using RapidMiner in batch mode) or in the message viewer (GUI mode). If the process failed, these messages should tell you what went wrong. All kinds of debug messages as well as information messages and results like the calculated errors are written to this output. Have a look at it now.

The log output starts with the process tree and contains a lot of warnings, because most of the parameters are not set. Don't panic: reasonable default values are used for all of them. At the end, you will find the process tree again. The number in square brackets following each operator gives the number of times the operator was applied. It is one for the outer operators and ten within the ten-fold cross-validation. Every time an operator is applied, a message is written to the log indicating its input objects (like example sets and models). When the operator terminates its application, it writes its output to the log stream as well. You can find the average performance estimated by the cross-validation close to the end of the messages.

Taking a look at the process tree in the log messages once again, you will quickly understand how the configuration file is structured. There is one operator tag for each operator, specifying its name and class. Names must be unique; their only purpose is to distinguish between instances of the same class. Operator chains like the cross-validation chain may contain one or more inner operators. Parameters can be specified in the form of key-value pairs using a parameter tag.

We will now focus on the operators without going into detail too much.

The outermost operator called "Root" is a Process operator, a subclass of a simple OperatorChain. An operator chain works in a very simple manner. It applies its inner operators successively passing their respective output to the next inner operator. The output of an operator chain is the output of the last inner operator. While usual operator chains do not take any parameters, this particular operator chain (being the outermost operator) has some parameters that are important for the process as a whole, e.g. the name of the log file (logfile) and the name of the directory for temporary files (temp_dir).
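The behavior of a simple operator chain can be sketched in a few lines of Python. This is only an illustration of the idea described above, not RapidMiner code: the names run_chain, load and scale are hypothetical, and operators are modeled as plain functions.

```python
from typing import Any, Callable, List

# Hypothetical stand-in for an operator: a function from input to output.
Operator = Callable[[Any], Any]


def run_chain(inner_operators: List[Operator], chain_input: Any) -> Any:
    """Apply the inner operators in order, feeding each output to the next."""
    current = chain_input
    for operator in inner_operators:
        current = operator(current)
    # The output of the chain is the output of the last inner operator.
    return current


def load(_: Any) -> List[float]:
    """Hypothetical source operator that ignores its input."""
    return [1.0, 2.0, 3.0]


def scale(values: List[float]) -> List[float]:
    """Hypothetical operator that doubles every value."""
    return [2.0 * v for v in values]


result = run_chain([load, scale], None)
print(result)
```

An empty chain simply returns its input unchanged in this sketch.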

The ExampleSource operator loads an example set from a file. An additional file containing the attribute descriptions is specified (../data/polynomial.aml). The resulting example set is then passed to the cross-validation chain.

The XValidation operator evaluates the learning method by splitting the input example set into ten subsets S_1,..., S_10. The inner operators are applied ten times. In run number i, the first inner operator, which is a LibSVMLearner, generates a model using all subsets except S_i as the training set. The second inner operator, an evaluation chain, evaluates this model by applying it to the remaining test set S_i. The ModelApplier predicts labels for the test set and the PerformanceEvaluator compares them to the real labels. Afterwards the absolute and squared errors are calculated. Finally, the cross-validation chain returns the average absolute and squared errors over the ten runs and their variances.
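The cross-validation scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the RapidMiner implementation: the function names are hypothetical, the folds are built by dealing out examples in turn (RapidMiner's actual sampling may differ), and a trivial mean predictor stands in for the SVM learner.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (attribute value, label)


def split_folds(examples: Sequence[Example], k: int) -> List[List[Example]]:
    """Split the examples into k subsets S_1..S_k by dealing them out in turn."""
    return [list(examples[i::k]) for i in range(k)]


def cross_validate(examples: Sequence[Example], k: int = 10) -> Tuple[float, float]:
    """Return the average absolute and squared error over k runs."""
    folds = split_folds(examples, k)
    abs_errors, sq_errors = [], []
    for i, test_set in enumerate(folds):
        # Training set: all subsets except S_i.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        # Placeholder "learner": always predicts the mean training label.
        prediction = mean(label for _, label in training)
        # "Model application" and "evaluation" on the held-out subset S_i.
        errors = [label - prediction for _, label in test_set]
        abs_errors.append(mean(abs(e) for e in errors))
        sq_errors.append(mean(e * e for e in errors))
    return mean(abs_errors), mean(sq_errors)


data = [(float(x), 2.0 * x) for x in range(20)]
avg_abs, avg_sq = cross_validate(data, k=10)
print(avg_abs, avg_sq)
```

Each example is used for testing exactly once and for training in the other nine runs, which is what the per-operator counts of ten in the log reflect.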

The processing of RapidMiner operator trees is similar to a depth-first search of normal trees. In contrast to this usual way of traversing a tree, RapidMiner allows loops during the run (each learning child is used 10 times, and the applier chain is used 10 times, too). Additionally, inner nodes may perform some operations before they pass the output of the first child to the next child. The traversal of a RapidMiner operator tree containing only leaf operators and simple operator chains is actually equivalent to the usual depth-first search traversal.

Process configuration files

Process configuration files are XML documents containing only four types of tags (extension: .xml). If you use the GUI version of RapidMiner, you can display the configuration file by clicking on the XML tab. Process files define the process tree consisting of operators and the parameters for these operators. Parameters are single values or lists of values. Descriptions can be used to comment your operators.

<operator>

The operator tag represents one instance of an operator class. Exactly two attributes must be present:

name 
A unique name identifying this particular operator instance
class 
The operator class. See the operator reference for a list of operators.

For instance, an operator tag for an operator that reads an example set from a file might look like this:

<source lang="xml"> <operator name="MyExampleSource" class="ExampleSource"> </operator> </source>

If class is a subclass of OperatorChain, then nested operators may be contained within the opening and closing tag.

<parameter> and <list>

As discussed above, a parameter can have a single value or a set of values. For single value parameters the <parameter> tag is used. The attributes of the <parameter> tag are as follows:

key 
The unique name of the parameter.
value
The value of the parameter.

In order to specify a filename for the example above, the following parameter might be used:

<source lang="xml"> <operator name="MyExampleSource" class="ExampleSource">

 <parameter key="attributes" value="myexamples.dat"/>

</operator> </source>

If the parameter accepts a list of values, the <list> tag must be used. The list must have a key attribute, just as the <parameter> tag. The elements of the list are specified by nested <parameter> tags, e.g. in case of a FeatureGeneration operator.

<source lang="xml"> <list key="functions">

 <parameter key="sum"     value="+(a1,a2)"/>
 <parameter key="product" value="*(a3,a4)"/>
 <parameter key="nested"  value="+(*(a1,a3),a4)"/>

</list> </source>


<description>

All operators can have an inner tag named <description>. It has only one attribute named text. This attribute contains a comment for the enclosing operator. If the root operator of the process has an inner description tag, the text is displayed after loading the process setup.

<source lang="xml"> <operator name="MyExampleSource" class="ExampleSource">

 <description text="Loads the data from file." />

</operator> </source>


Parameter Macros

All text-based parameters may contain so-called macros, which are replaced by RapidMiner at runtime. For example, you can write a learned model into a file with the operator ModelWriter. If you want to do this for each learned model in a cross-validation run, each model would be overwritten by the next one. How can this be prevented?

To save the model of each iteration in its own file, you need parameter macros. In a parameter value, the character '%' has a special meaning. Parameter values are expanded as follows:

%{a} 
is replaced by the number of times the operator was applied.
%{b} 
is replaced by the number of times the operator was applied plus one, i.e. %{a} + 1. This is a shortcut for %{p[1]}.
%{p[number]} 
is replaced by the number of times the operator was applied plus the given number, i.e. %{a} + number.
%{t} 
is replaced by the system time.
%{n} 
is replaced by the name of the operator.
%{c} 
is replaced by the class of the operator.
%{%} 
becomes %.
%{process_name} 
becomes the name of the process file (without path and extension).
%{process_file} 
becomes the name of the process file (with extension).
%{process_path} 
becomes the path of the process file.

For example, to enumerate your files with ascending numbers, use the following value for the key model_file:

<source lang="xml"> <operator name="ModelWriter" class="ModelWriter">

 <parameter key="model_file"	value="model_%{a}.mod"/>

</operator> </source>

The macro %{a} will be replaced by the number of times the operator was applied. For a ModelWriter placed after the learner of a 10-fold cross-validation, it will hence be replaced by the numbers 1 to 10.
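The expansion rules can be illustrated with a small Python sketch. It covers only a subset of the macros listed above and is not RapidMiner's own code: the function expand_macros and its arguments are hypothetical, and the representation of the system time for %{t} is an assumption.

```python
import re
import time


def expand_macros(value: str, applied: int, name: str, op_class: str) -> str:
    """Expand a subset of the %{...} macros in one parameter value."""

    def replace(match):
        key = match.group(1)
        if key == "a":
            return str(applied)
        if key == "b":
            return str(applied + 1)
        if key == "t":
            # Assumption: the system time is rendered as seconds since the epoch.
            return str(int(time.time()))
        if key == "n":
            return name
        if key == "c":
            return op_class
        if key == "%":
            return "%"
        offset = re.fullmatch(r"p\[(-?\d+)\]", key)
        if offset:
            return str(applied + int(offset.group(1)))
        # Unknown macros are left untouched in this sketch.
        return match.group(0)

    return re.sub(r"%\{([^}]*)\}", replace, value)


for run in range(1, 4):
    print(expand_macros("model_%{a}.mod", run, "ModelWriter", "ModelWriter"))
```

With the application counts 1, 2 and 3, the loop prints one distinct file name per run, so no model file overwrites another.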

You can also define your own macros with the help of the MacroDefinition operator.

File formats

RapidMiner can read a number of input files. Apart from data files it can read and write models, parameter sets and attribute sets. Generally, RapidMiner is able to read all files it generates. Some of the file formats are less important for the user, since they are mainly used for intermediate results. The most important file formats are those for "examples" or "instances". These data sets are provided by the user and almost all processes contain an operator that reads them.

Data files and the attribute description file

If the data files are in the popular ARFF format (extension: .arff), which provides some meta data, they can be read by the ArffExampleSource. Other operators for special file formats are also available. Additionally, data can be read from a database using the DatabaseExampleSource. In that case, meta data is read from the database as well.

The ExampleSource operator allows for a variety of other file formats in which instances are separated by newline characters. It is the main data input operator for RapidMiner. Comment characters can be specified arbitrarily and attributes can be spread over several files. This is especially useful in cases where attribute data and the label are kept in different files.

Sparse data files can be read using the SparseFormatExampleSource. We call data sparse if almost all values are equal to a default, e.g. zero.

The ExampleSource (for dense data) and some sparse formats need an attribute description file (extension: .aml) in order to retrieve meta data about the instances. This file is a simple XML document defining the properties of the attributes (like their name and range) and their source files. The data may be spread over several files. Therefore, the actual data files do not have to be specified as a parameter of the input operator.

The outer tag must be an <attributeset> tag. The only attribute this tag may have is default_source=filename. This file is used as the default data file for every attribute that does not specify its source file explicitly.

The inner tags can be any number of <attribute> tags plus at most one tag for each special attribute. The most frequently used special attributes are <label>, <weight>, <id>, and <cluster>. Note that arbitrary names for special attributes may be used. Though the set of special attributes used by the core RapidMiner operators is limited to the ones mentioned above, plugins or any other additional operators may use more special attributes. Please refer to the operator documentation to learn more about the specific special attributes used or generated by these operators.

The following XML attributes may be set to specify the properties of the RapidMiner attribute declared by the corresponding XML tag (mandatory XML attributes are set in italic font):

name 
The unique name of the attribute.
sourcefile
The name of the file containing the data. If this name is not specified, the default file is used (specified for the parent attributeset tag).
sourcecol 
The column within this file (numbering starts at 1). Can be omitted for sparse data file formats.
sourcecol_end 
If this parameter is set, its value must be greater than the value of sourcecol. In that case, "sourcecol-sourcecol_end" attributes are generated with the same properties. Their names are generated by appending numbers to the value of name. If the blocktype is value_series, then value_series_start and value_series_end are used as the block types of the first and last attribute in the series, respectively.
valuetype 
One out of nominal, numeric, integer, real, ordered, binominal, polynominal, and file_path
blocktype 
One out of single_value, value_series, value_series_start, value_series_end, interval, interval_start, and interval_end.

Each nominal attribute, i.e. each attribute with a nominal (binominal, polynominal) value type definition, should define its possible values with the help of inner tags: <source lang="xml"> <value>nominal_value_1</value> <value>nominal_value_2</value> ... </source>

See this example attribute description file. <source lang="xml"> <attributeset default_source="golf.dat">

 <attribute
   name       ="Outlook"
   sourcecol  ="1"
   valuetype  ="nominal"
   blocktype  ="single_value"
   classes    ="rain overcast sunny"
 />
 <attribute
   name       ="Temperature"
   sourcecol  ="2"
   valuetype  ="integer"
   blocktype  ="single_value"
 />
 <attribute
   name       ="Humidity"
   sourcecol  ="3"
   valuetype  ="integer"
   blocktype  ="single_value"
 />
 <attribute
   name       ="Wind"
   sourcecol  ="4"
   valuetype  ="nominal"
   blocktype  ="single_value"
   classes    ="true false"
 />
 <label
   name       ="Play"
   sourcecol  ="5"
   valuetype  ="nominal"
   blocktype  ="single_value"
   classes    ="yes no"
 />

</attributeset> </source>
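To make the structure of such a file concrete, the following Python sketch reads an attribute description with the standard library XML parser. It is an illustration only: the function read_attribute_description and the dictionary layout are hypothetical, and the embedded text is a shortened version of the example above.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example attribute description file above.
AML_TEXT = """
<attributeset default_source="golf.dat">
  <attribute name="Outlook" sourcecol="1" valuetype="nominal"
             blocktype="single_value" classes="rain overcast sunny"/>
  <attribute name="Temperature" sourcecol="2" valuetype="integer"
             blocktype="single_value"/>
  <label name="Play" sourcecol="5" valuetype="nominal"
         blocktype="single_value" classes="yes no"/>
</attributeset>
"""


def read_attribute_description(text):
    """Return one dictionary per declared attribute, regular or special."""
    root = ET.fromstring(text)
    default_source = root.get("default_source")
    entries = []
    for element in root:
        column = element.get("sourcecol")
        entries.append({
            # The tag name tells regular attributes from special ones.
            "role": element.tag,
            "name": element.get("name"),
            # Fall back to the default file of the parent attributeset tag.
            "sourcefile": element.get("sourcefile", default_source),
            "sourcecol": int(column) if column is not None else None,
            "valuetype": element.get("valuetype"),
        })
    return entries


for entry in read_attribute_description(AML_TEXT):
    print(entry)
```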

For classification learners that can handle only binary classifications (e.g. "yes" and "no"), the first defined value in the list of nominal values is assumed to be the negative label. This means that the class "yes" is not necessarily the positive label (it depends on the order). This is important, for example, for the calculation of some performance measures like precision and recall.

Note: Omitting the inner value tags for nominal attributes will usually "seem" to work (and indeed, in many cases no problems might occur), but since the internal representation of nominal values depends on this definition, it might happen that the nominal values of learned models do not fit the given data set. Since this might lead to drastically reduced prediction accuracies, you should always define the nominal values for nominal attributes.

Note: You do not need to specify a label attribute in cases where you only want to predict a label with a learned model. Simply describe the attributes in the same manner as in the learning process setup; the label attribute can be omitted.


Dense data files

The data files are in a very simple format (extension: .dat). By default, comments start with #. When a comment character is encountered, the rest of the line is discarded. Lines that are empty after comment removal are ignored. If the data is spread over several files, one non-empty line is read from every file at a time. If the end of one of the files is reached, reading stops. By default, the lines are split into tokens separated by whitespace, commas, or semicolons. The positions of the tokens are mapped to the sourcecol attributes specified in the attribute description file. Additional or other separators can be specified as a regular expression using the respective parameters of the ExampleSource. The same applies to comment characters.
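The reading rules for a single dense data file can be sketched in Python as follows. This is a simplified illustration, not the ExampleSource implementation: the function read_dense_lines and its parameter names are hypothetical, and the sample lines are made up for the demonstration.

```python
import re
from typing import Iterable, List


def read_dense_lines(lines: Iterable[str], comment: str = "#",
                     separators: str = r"[\s,;]+") -> List[List[str]]:
    """Strip comments, skip empty lines and split each line into tokens."""
    rows = []
    for line in lines:
        # Everything after the comment character is discarded.
        content = line.split(comment, 1)[0].strip()
        if not content:
            continue  # empty after comment removal
        # Default separators: whitespace, comma or semicolon.
        rows.append(re.split(separators, content))
    return rows


sample = [
    "# made-up demonstration data",
    "sunny 85 85 false no  # trailing comment",
    "",
    "rain,70;96 true yes",
]
for row in read_dense_lines(sample):
    print(row)
```

In this sketch the token at list position i corresponds to sourcecol i + 1, since column numbering in the attribute description file starts at 1.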


Sparse data files

If almost all of the entries in a data file are zero or have a default nominal value, it may be appropriate to use a SparseFormatExampleSource. This operator can read an attribute description file as described above. If the attribute_description_file parameter is supplied, the attribute descriptions are read from this file and the default_source is used as the single data file. The sourcecol and sourcefile attributes are ignored. If the attribute_description_file parameter is not supplied, the data is read from the file data_file and attributes are generated with default value types. Regular attributes are assumed to be real numbers and the label is assumed to be nominal. In that case, the dimension parameter, which specifies the number of regular attributes, must be set.

Comments in the data file start with a '#'-character, empty lines are ignored. Lines are split into whitespace separated tokens of the form index:value where value is the attribute value, i.e. a number or a string, and index is either an index number referencing a regular attribute or a prefix for a special attribute defined by the parameter list prefix_map of the SparseFormatExampleSource. Please note that index counting starts with 1.


The SparseFormatExampleSource parameter format specifies the way labels are read.

xy
The label is the last token in the line.
yx
The label is the first token in the line.
prefix
The label is treated like all other special attributes.
separate_file
The label is read from a separate file. In that case, parameter label_file must be set.
no_label
The example set is unlabeled.

All attributes that are not found in a line are assumed to have default values. The default value for numerical data is 0; the default value for nominal attributes is the first string specified by the classes attribute in the attribute description file.

Example: Suppose you have a sparse file which looks like this: <source lang="xml">
w:1.0 5:1 305:5 798:1 yes
w:0.2 305:2 562:1 yes
w:0.8 49:1 782:1 823:2 no
...
</source>

You may want each example to have a special attribute "weight", a nominal label taking the values "yes" and "no", and 1,000 regular numerical attributes, most of which are 0. The best way to read this file is to use a SparseFormatExampleSource, set the parameter format to xy (since the label is the last token in each line), and use a prefix_map that maps the prefix "w" to the attribute "weight". See the following configuration of a SparseFormatExampleSource:

<source lang="xml"> <operator name="SparseFormatExampleSource" class="SparseFormatExampleSource">

 <parameter key="dimension"      value="1000"/>
 <parameter key="data_file"      value="mydata.dat"/>
 <parameter key="format"         value="xy"/>
 <list key="prefix_map">
   <parameter key="w"	value="weight"/>
 </list>

</operator> </source>
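How one line of this sparse file is interpreted can be sketched in Python. This is a minimal illustration of the xy format only, not the SparseFormatExampleSource implementation: the function parse_sparse_line is hypothetical, and callers are assumed to skip empty lines before calling it.

```python
from typing import Dict, List, Tuple


def parse_sparse_line(line: str, dimension: int,
                      prefix_map: Dict[str, str]
                      ) -> Tuple[List[float], Dict[str, str], str]:
    """Parse one non-empty line in 'xy' format: index:value tokens, label last."""
    content = line.split("#", 1)[0].split()
    *tokens, label = content
    # Attributes that do not appear in the line keep the default value 0.
    regular = [0.0] * dimension
    special: Dict[str, str] = {}
    for token in tokens:
        index, value = token.split(":", 1)
        if index in prefix_map:
            # A prefix such as "w" refers to a special attribute.
            special[prefix_map[index]] = value
        else:
            # Index counting starts with 1.
            regular[int(index) - 1] = float(value)
    return regular, special, label


regular, special, label = parse_sparse_line(
    "w:1.0 5:1 305:5 798:1 yes", dimension=1000, prefix_map={"w": "weight"})
print(special, label, regular[4], regular[304], regular[797])
```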

Model files

Model files contain the models generated by learning operators in previous RapidMiner runs (extension: .mod). Models can be written to a file by using the operator ModelWriter. They can be read by using a ModelLoader and applied by using a ModelApplier.

Attribute construction files

An AttributeConstructionsWriter writes an attribute set to a text file (extension: .att). Later, this file can be used by an AttributeConstructionsLoader operator to generate the same set of attributes in another process and/or for another set of data.

Attribute construction files can also be written by hand. Every line is of the form

<source lang="xml"> <attribute name="attribute_name" construction="generation_description"/> </source>

The generation description is defined by functions, with prefix-order notation. The functions can be nested as well. An example of a nested generation description might be: "f(g(a), h(b), c)". See page FeatureGeneration for a reference of the available functions.

Example of an attribute constructions file:

<source lang="xml"> <constructions version="4.0">

   <attribute name="a2" construction="a2"/>
   <attribute name="gensym8" construction="*(*(a1, a2), a3)"/>
   <attribute name="gensym32" construction="*(a2, a2)"/>
   <attribute name="gensym4" construction="*(a1, a2)"/>
   <attribute name="gensym19" construction="*(a2, *(*(a1, a2), a3))"/>

</constructions> </source>

Parameter set files

Operators such as GridParameterOptimization generate a set of optimal parameters for a particular task (extension: .par). Since parameters of several operators can be optimized at once, each line of a parameter set file is of the form <source lang="xml"> OperatorName.parameter_name = value </source> These files can also be written by hand. They can be read by a ParameterSetLoader and set by a ParameterSetter.
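The line format of a parameter set file is simple enough to parse in a few lines of Python. The sketch below is an illustration only: the function parse_parameter_set is hypothetical, the sample lines reuse the operator name and parameters from the first example process, and splitting at the last dot is an assumption about how operator names and parameter names are separated.

```python
from typing import Dict, Iterable, Tuple


def parse_parameter_set(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Map (operator name, parameter name) to the value given in the file."""
    parameters = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        target, value = line.split("=", 1)
        # Assumption: the parameter name follows the last dot.
        operator_name, parameter_name = target.strip().rsplit(".", 1)
        parameters[(operator_name, parameter_name)] = value.strip()
    return parameters


sample = ["Training.C = 1000.0", "", "Training.kernel_type = poly"]
print(parse_parameter_set(sample))
```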


Attribute weight files

All operators for feature weighting and selection generate a set of feature weights (extension: .wgt). Attribute selection is seen as a special case of attribute weighting, which allows more flexible operators. For each attribute the weight is stored, where a weight of 0 means that the attribute was not used at all. To write the weights to a file, the operator AttributeWeightsWriter can be used. In such a weights file each line is of the form <source lang="xml"> <weight name="attribute_name" value="weight"/> </source> These files can also be written by hand. They can be read by an AttributeWeightsLoader and applied to example sets with the operator AttributeWeightsApplier. They can also be read and adapted with the InteractiveAttributeWeighting operator. Feature operators like forward selection, genetic algorithms and the weighting operators can deliver either an example set with the selection / weighting already applied or, optionally, the original example set. In the latter case the weights can be adapted and changed before they are applied.

Example of an attribute weight file: <source lang="xml"> <attributeweights version="4.0">

   <weight name="a1" value="0.8"/>
   <weight name="a2" value="1.0"/>
   <weight name="a3" value="0.0"/>
   <weight name="a4" value="0.5"/>
   <weight name="a5" value="0.0"/>

</attributeweights> </source>
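The view of attribute selection as attribute weighting can be illustrated with a short Python sketch that reads the example file above and lists the attributes with a non-zero weight. The function names are hypothetical and the code is not part of RapidMiner.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

WEIGHTS_TEXT = """
<attributeweights version="4.0">
    <weight name="a1" value="0.8"/>
    <weight name="a2" value="1.0"/>
    <weight name="a3" value="0.0"/>
    <weight name="a4" value="0.5"/>
    <weight name="a5" value="0.0"/>
</attributeweights>
"""


def read_weights(text: str) -> Dict[str, float]:
    """Return a mapping from attribute name to weight."""
    root = ET.fromstring(text)
    return {w.get("name"): float(w.get("value")) for w in root.findall("weight")}


def selected_attributes(weights: Dict[str, float]) -> List[str]:
    """A weight of 0 means the attribute was not used at all."""
    return [name for name, weight in weights.items() if weight != 0.0]


weights = read_weights(WEIGHTS_TEXT)
print(selected_attributes(weights))
```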

File format summary

The following table summarizes all file formats and the corresponding file extensions.

Extension Description
.aml attribute description file (standard XML meta data format)
.arff attribute relation file format (known from Weka)
.att attribute set file
.bib BibTeX data file format
.clm cluster model file (clustering plugin)
.cms cluster model set file (clustering plugin)
.cri population criteria file
.csv comma separated values data file format
.dat (dense) data files
.ioc IOContainer file format
.log log file / process log file
.mat matrix file (clustering plugin)
.mod model file
.obf obfuscation map
.par parameter set file
.per performance file
.res results file
.sim similarity matrix file (clustering plugin)
.thr threshold file
.wgt attribute weight file
.wls word list file (word vector tool plugin)
.xrff extended attribute relation file format (known from Weka)