| RapidMiner, Preprocessing, Modeling | 29 Jun 2009 |
| Preprocessing Models by Ingo Mierswa |
A really nice feature of RapidMiner is the ability to create preprocessing models, i.e. models that are used not for predictions but for transformations of the data set.
Most of the preprocessing operators can generate a preprocessing model; you simply activate the parameter "return_preprocessing_model". The following process, for example, generates two preprocessing models: one for transforming nominal attributes into binominal attributes (each attribute then has only two different values) and a second one for transforming all binominal attributes into numerical ones (taking the values 0 and 1 instead). This is a very common preprocessing chain for turning data sets that contain nominal attributes into data sets consisting of numerical attributes only.
But what if this transformation has to be performed not only during the model training phase of your data mining project but also during the model application phase (scoring)? In that case, the preprocessing models come in really handy, since they can be applied to new data sets just like ordinary prediction models. Check out the process below to see the details!
<operator name="Root" class="Process" expanded="yes">
    <!-- Training data: generate 1000 examples and use the attribute "name" as id -->
    <operator name="DirectMailingExampleSetGenerator (Training Set)" class="DirectMailingExampleSetGenerator">
        <parameter key="number_examples" value="1000"/>
    </operator>
    <operator name="ChangeAttributeRole (Training Set)" class="ChangeAttributeRole">
        <parameter key="name" value="name"/>
        <parameter key="target_role" value="id"/>
    </operator>
    <!-- Both preprocessing steps also return their preprocessing model -->
    <operator name="Preprocessing Models" class="OperatorChain" expanded="yes">
        <operator name="Nominal2Binominal" class="Nominal2Binominal">
            <parameter key="return_preprocessing_model" value="true"/>
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
            <parameter key="return_preprocessing_model" value="true"/>
        </operator>
    </operator>
    <!-- Learn the prediction model on the transformed training data -->
    <operator name="Training" class="LinearRegression" breakpoints="after">
        <parameter key="feature_selection" value="none"/>
    </operator>
    <!-- Test data: generated and prepared exactly like the training data -->
    <operator name="DirectMailingExampleSetGenerator (Test Set)" class="DirectMailingExampleSetGenerator">
        <parameter key="number_examples" value="1000"/>
    </operator>
    <operator name="ChangeAttributeRole (Test Set)" class="ChangeAttributeRole">
        <parameter key="name" value="name"/>
        <parameter key="target_role" value="id"/>
    </operator>
    <!-- Select the third model, which should be the Nominal2Binominal preprocessing model
         (the test data has to pass through the same steps, in the same order, as the training data), and apply it -->
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object" value="Model"/>
        <parameter key="select_which" value="3"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
    <!-- Of the remaining models, select the second one, which should be the Nominal2Numerical preprocessing model, and apply it -->
    <operator name="IOSelector (2)" class="IOSelector">
        <parameter key="io_object" value="Model"/>
        <parameter key="select_which" value="2"/>
    </operator>
    <operator name="ModelApplier (2)" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
    <!-- Finally apply the remaining model, the linear regression, to the transformed test data -->
    <operator name="ModelApplier (3)" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>
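For readers who want to see the idea outside of RapidMiner, here is a minimal Python sketch of what a preprocessing model does conceptually: the transformation is learned once on the training data, stored as an object, and then applied unchanged to new data. This is not RapidMiner's implementation. The class and function names, the "true"/"false" encoding, the "attribute = value" column names, and the example values for the "sports" attribute are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NominalToBinominalModel:
    """Hypothetical model: remembers the values each nominal attribute had in training."""
    values: dict  # attribute name -> sorted list of values seen in the training data

    def apply(self, rows):
        out = []
        for row in rows:
            new_row = {}
            for name, value in row.items():
                if name in self.values:
                    # One true/false column per value seen during training.
                    for known in self.values[name]:
                        new_row[f"{name} = {known}"] = "true" if value == known else "false"
                else:
                    new_row[name] = value
            out.append(new_row)
        return out


@dataclass(frozen=True)
class BinominalToNumericalModel:
    """Hypothetical model: remembers which columns were binominal in training."""
    columns: tuple

    def apply(self, rows):
        return [
            {k: (1 if v == "true" else 0) if k in self.columns else v
             for k, v in row.items()}
            for row in rows
        ]


def fit_nominal_to_binominal(rows, nominal_attributes):
    values = {a: sorted({row[a] for row in rows}) for a in nominal_attributes}
    return NominalToBinominalModel(values)


def fit_binominal_to_numerical(rows):
    columns = tuple(
        k for k in rows[0] if all(r[k] in ("true", "false") for r in rows)
    )
    return BinominalToNumericalModel(columns)


# Training phase: learn both preprocessing models on the training data.
train = [{"sports": "soccer", "age": 34}, {"sports": "tennis", "age": 51}]
model_1 = fit_nominal_to_binominal(train, ["sports"])
train_binominal = model_1.apply(train)
model_2 = fit_binominal_to_numerical(train_binominal)
train_numerical = model_2.apply(train_binominal)

# Scoring phase: apply the stored models, in the same order, to new data.
test = [{"sports": "golf", "age": 40}, {"sports": "tennis", "age": 29}]
test_numerical = model_2.apply(model_1.apply(test))
print(test_numerical)
```

Because the models are fitted only on the training data, the test set ends up with exactly the same columns as the training set, even though it contains a value ("golf") that never occurred during training. In this sketch that value simply maps to 0 in every derived column, which is one possible design choice among several.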



) and the easiness of a flow / graph based layout when it comes to visualizing and understanding the data flows.