Rapid-I Blog
Tag >> Preprocessing
Tags: RapidMiner, Preprocessing, Operator | 21 Jul 2009
Subtract Mean Value from each Attribute, by Ingo Mierswa

A question that has been posted several times in the forum, and that also comes up often in our training courses, is the following:

"How can I calculate the mean value for each attribute and subtract it from the attribute values?"

Of course, one could use the Normalization operator with the normalization type set to "standardization". But in this case, not only is the mean value subtracted; the value range is also rescaled so that the standard deviation equals 1. This is, of course, not always desired.
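To make the difference concrete, here is a minimal sketch in plain Python. It is not RapidMiner code: the function names center and standardize are made up for this illustration, and using the population standard deviation is an assumption of the sketch. Centering only shifts the values so that their mean becomes 0, while standardization additionally rescales them so that the standard deviation becomes 1.

```python
from statistics import mean, pstdev


def center(values):
    """Subtract the mean; the spread of the values stays unchanged."""
    m = mean(values)
    return [v - m for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    m = mean(values)
    s = pstdev(values)  # assumption: population (not sample) standard deviation
    return [(v - m) / s for v in values]


values = [2.0, 4.0, 6.0, 8.0]
centered = center(values)            # mean 0, same spread as the input
standardized = standardize(values)   # mean 0, standard deviation 1

print(centered)
print(standardized)
```

If only the shift is wanted, the first variant is the one to use; the rest of this post builds exactly that with RapidMiner operators.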

The following process shows how you can use the operator FeatureIterator in combination with a standard aggregation and a macro to achieve the desired goal. For each feature, the mean value is calculated with the operator Aggregation and stored in a macro. Then the operator AttributeConstruction creates a new feature in which this mean value is subtracted from each value of the original feature.

After this has been done, the old features are removed and the new ones are renamed to the old names. That's it. Here is a picture of the process:

 

 

 And here is the complete XML code:

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function"    value="sum"/>
    </operator>
    <operator name="FeatureIterator" class="FeatureIterator" expanded="yes">
        <parameter key="work_on_input"    value="false"/>
        <operator name="Aggregation" class="Aggregation">
            <list key="aggregation_attributes">
              <parameter key="%{loop_feature}"    value="average"/>
            </list>
        </operator>
        <operator name="DataMacroDefinition" class="DataMacroDefinition">
            <parameter key="macro"    value="current_average"/>
            <parameter key="macro_type"    value="data_value"/>
            <parameter key="attribute_name"    value="average(%{loop_feature})"/>
            <parameter key="example_index"    value="1"/>
        </operator>
        <operator name="IOConsumer" class="IOConsumer">
            <parameter key="io_object"    value="ExampleSet"/>
            <parameter key="deletion_type"    value="delete_one"/>
        </operator>
        <operator name="AttributeConstruction" class="AttributeConstruction">
            <list key="function_descriptions">
              <parameter key="norm_%{loop_feature}"    value="%{loop_feature} - %{current_average}"/>
            </list>
        </operator>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class"    value="attribute_name_filter"/>
        <parameter key="parameter_string"    value="norm_.*"/>
    </operator>
    <operator name="ChangeAttributeNamesReplace" class="ChangeAttributeNamesReplace">
        <parameter key="replace_what"    value="norm_"/>
    </operator>
</operator>
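For readers who prefer to see the logic outside of RapidMiner, the same steps can be sketched in plain Python. This is only an illustration under assumptions: the function name subtract_means, the attribute names att1 and att2, and the data values are made up, and the comments merely indicate which operator of the process plays the corresponding role.

```python
from statistics import mean


def subtract_means(rows):
    """rows: list of dicts mapping attribute name -> numeric value."""
    features = list(rows[0].keys())
    # Work on copies so that the input rows are left untouched
    rows = [dict(row) for row in rows]

    # Loop over all features (the role of FeatureIterator)
    for feature in features:
        # Mean of the current feature (Aggregation + DataMacroDefinition)
        current_average = mean(row[feature] for row in rows)
        # New attribute norm_<feature> (AttributeConstruction)
        for row in rows:
            row["norm_" + feature] = row[feature] - current_average

    # Keep only the norm_ attributes and strip the prefix again
    # (AttributeFilter + ChangeAttributeNamesReplace)
    prefix = "norm_"
    return [
        {name[len(prefix):]: value
         for name, value in row.items() if name.startswith(prefix)}
        for row in rows
    ]


# Hypothetical example data
data = [
    {"att1": 1.0, "att2": 10.0},
    {"att1": 2.0, "att2": 20.0},
    {"att1": 3.0, "att2": 60.0},
]
result = subtract_means(data)
print(result)
```

As in the process above, the sketch relies on the norm_ prefix to tell new attributes from old ones, so it assumes that no original attribute name already starts with norm_.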

Have fun!

Tags: Scalability, RapidMiner, Preprocessing, PMML | 1 Jul 2009
PMML 4 and Preprocessing Models, by Ingo Mierswa

A short time ago, the Data Mining Group (DMG) announced the release of the new version of the PMML standard:

http://www.dmg.org/pmml-v4-0.html

 

One of my main concerns about PMML has always been that preprocessing models were not supported. As I have written before, preprocessing models are a very important feature of RapidMiner. The reason is simple: in my opinion, processing and transforming the data is probably the most important, but also the most complicated, part of data mining. For that reason, operator development for RapidMiner has always focused on this vital part of data analysis rather than on producing hundreds of additional, highly sophisticated learning schemes that would not scale well enough for large real-world data sets anyway. As a consequence, preprocessing models were introduced into RapidMiner several years ago, together with lots of operators for the different data transformation tasks.

 

With version 3, the PMML standard started to support the first preprocessing models. This was actually the first step towards more scalable processing and analysis directly in the database. I was really keen on seeing the next major version of PMML and checking whether more preprocessing steps would be supported. And indeed, several new models were added, and PMML seems to be becoming a standard for describing (almost) the complete analysis process, which can then be deployed directly in the database. For us, this means that PMML will gain more weight in future development and will be a topic for future releases again.

 

Until then, a combination of the cached database access from RapidMiner, the layered view stack on data sets, and scalable analysis methods like Naive Bayes will be the solution for using RapidMiner on data sets of almost arbitrary size.

Tags: RapidMiner, Preprocessing, Operator, Modeling | 29 Jun 2009
Grouping Models, by Ingo Mierswa

In the last blog entry, we discussed how preprocessing models can be created with RapidMiner and applied to new data sets. In the described setup, it was necessary to use the operator IOSelector twice in order to get the correct ordering of models for model application.

Since preprocessing models are an important feature of RapidMiner, we of course also provide a much easier way of handling the different models and applying them to new data sets. All models, including preprocessing models as well as prediction models, can easily be grouped together with the operator ModelGrouper. So you do not have to cope with several models but only with a single model, which can be applied to new data sets and performs the preprocessing as well as the prediction. This makes the previously posted process much cleaner and easier to understand. Just have a look at this picture of the process layout:

 

 

Here is the XML setup of the complete process:

<operator name="Root" class="Process" expanded="yes">
    <operator name="DirectMailingExampleSetGenerator (Training Set)" class="DirectMailingExampleSetGenerator">
        <parameter key="number_examples"    value="1000"/>
    </operator>
    <operator name="ChangeAttributeRole (Training Set)" class="ChangeAttributeRole">
        <parameter key="name"    value="name"/>
        <parameter key="target_role"    value="id"/>
    </operator>
    <operator name="Preprocessing Models" class="OperatorChain" expanded="yes">
        <operator name="Nominal2Binominal" class="Nominal2Binominal">
            <parameter key="return_preprocessing_model"    value="true"/>
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
            <parameter key="return_preprocessing_model"    value="true"/>
        </operator>
    </operator>
    <operator name="Training" class="LinearRegression">
        <parameter key="feature_selection"    value="none"/>
    </operator>
    <operator name="ModelGrouper" class="ModelGrouper" breakpoints="after">
    </operator>
    <operator name="DirectMailingExampleSetGenerator (Test Set)" class="DirectMailingExampleSetGenerator">
        <parameter key="number_examples"    value="1000"/>
    </operator>
    <operator name="ChangeAttributeRole (Test Set)" class="ChangeAttributeRole">
        <parameter key="name"    value="name"/>
        <parameter key="target_role"    value="id"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>
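The idea behind grouping can be sketched in a few lines of plain Python. This is not how ModelGrouper is implemented; the classes GroupedModel, ScaleModel and LinearModel, the attribute name age, and all numbers are hypothetical stand-ins chosen for this illustration. The point is only that a grouped model is itself a model that applies its members in a fixed order.

```python
class GroupedModel:
    """Applies a list of models one after another, in the given order."""

    def __init__(self, models):
        self.models = list(models)

    def apply(self, rows):
        for model in self.models:
            rows = model.apply(rows)
        return rows


class ScaleModel:
    """Stand-in for a preprocessing model: divides one attribute by a constant."""

    def __init__(self, attribute, divisor):
        self.attribute = attribute
        self.divisor = divisor

    def apply(self, rows):
        return [{**row, self.attribute: row[self.attribute] / self.divisor}
                for row in rows]


class LinearModel:
    """Stand-in for a prediction model: adds a 'prediction' value to each row."""

    def __init__(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept

    def apply(self, rows):
        return [
            {**row, "prediction": self.intercept
                + sum(w * row[a] for a, w in self.weights.items())}
            for row in rows
        ]


# Preprocessing first, prediction last: the order inside the group matters
grouped = GroupedModel([ScaleModel("age", 100.0), LinearModel({"age": 2.0}, 1.0)])
scored = grouped.apply([{"age": 50.0}])
print(scored)
```

Because the group carries the order of its members, the caller needs only a single apply step and no longer has to select the individual models by hand.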
 

 

 

Tags: RapidMiner, Preprocessing, Modeling | 29 Jun 2009
Preprocessing Models, by Ingo Mierswa

A really nice feature of RapidMiner is the possibility to create preprocessing models, i.e. models which are not used for predictions but for transformations of the data set.

Most of the preprocessing operators support the generation of preprocessing models by simply activating the parameter "return_preprocessing_model". The following process, for example, generates two preprocessing models: one for transforming nominal attributes into binominal attributes (each attribute then has only two different values) and a second one for transforming all binominal attributes into numerical ones (with the values 0 and 1). This is a very common preprocessing chain for transforming data sets containing nominal attributes into data sets consisting of numerical attributes only.

But what if this transformation should be performed not only during the model training phase of your data mining project but also during the model application phase (scoring)? In that case, the preprocessing models come in really handy, since they can easily be applied to new data sets, just like the usual prediction models. Check out the process below to see the details!

<operator name="Root" class="Process" expanded="yes">
    <operator name="DirectMailingExampleSetGenerator (Training Set)" class="DirectMailingExampleSetGenerator">
        <parameter key="number_examples"	value="1000"/>
    </operator>
    <operator name="ChangeAttributeRole (Training Set)" class="ChangeAttributeRole">
        <parameter key="name"	value="name"/>
        <parameter key="target_role"	value="id"/>
    </operator>
    <operator name="Preprocessing Models" class="OperatorChain" expanded="yes">
        <operator name="Nominal2Binominal" class="Nominal2Binominal">
            <parameter key="return_preprocessing_model"	value="true"/>
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
            <parameter key="return_preprocessing_model"	value="true"/>
        </operator>
    </operator>
    <operator name="Training" class="LinearRegression" breakpoints="after">
        <parameter key="feature_selection"	value="none"/>
    </operator>
    <operator name="DirectMailingExampleSetGenerator (Test Set)" class="DirectMailingExampleSetGenerator">
        <parameter key="number_examples"	value="1000"/>
    </operator>
    <operator name="ChangeAttributeRole (Test Set)" class="ChangeAttributeRole">
        <parameter key="name"	value="name"/>
        <parameter key="target_role"	value="id"/>
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object"	value="Model"/>
        <parameter key="select_which"	value="3"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
    <operator name="IOSelector (2)" class="IOSelector">
        <parameter key="io_object"	value="Model"/>
        <parameter key="select_which"	value="2"/>
    </operator>
    <operator name="ModelApplier (2)" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
    <operator name="ModelApplier (3)" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>
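Why a preprocessing model is worth keeping can be illustrated with a small sketch in plain Python. It is not RapidMiner code and it simplifies things: the hypothetical class DummyCodingModel merges the two steps (nominal to binominal, binominal to numerical) into one, and the attribute names sports and age as well as the data are made up. The model remembers which nominal values occurred in the training data, so new data is transformed into exactly the same columns, even if it contains only some of those values.

```python
class DummyCodingModel:
    """A toy preprocessing model: learned once, applied many times."""

    def __init__(self, attribute, values):
        self.attribute = attribute
        self.values = values  # nominal values seen in the training data

    @classmethod
    def learn(cls, rows, attribute):
        # Remember the nominal values in order of first appearance
        values = list(dict.fromkeys(row[attribute] for row in rows))
        return cls(attribute, values)

    def apply(self, rows):
        # Replace the nominal attribute by one 0/1 column per learned value
        out = []
        for row in rows:
            new_row = {k: v for k, v in row.items() if k != self.attribute}
            for value in self.values:
                column = f"{self.attribute} = {value}"
                new_row[column] = 1 if row[self.attribute] == value else 0
            out.append(new_row)
        return out


# Hypothetical training and test data
training = [
    {"sports": "soccer", "age": 30},
    {"sports": "golf", "age": 50},
    {"sports": "soccer", "age": 25},
]
test = [{"sports": "golf", "age": 41}]

model = DummyCodingModel.learn(training, "sports")
train_transformed = model.apply(training)
test_transformed = model.apply(test)  # same columns as the training data
print(train_transformed)
print(test_transformed)
```

Had the transformation been learned again on the test data alone, the column for soccer would be missing and a prediction model trained on the transformed training data could no longer be applied.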