| Scalability, RapidMiner, Preprocessing, PMML | 1 Jul 2009 |
| PMML 4 and Preprocessing Models by Ingo Mierswa | Comment (1) |
A short time ago, the Data Mining Group (DMG) announced the release of version 4.0 of the PMML standard:
http://www.dmg.org/pmml-v4-0.html
One of my main concerns about PMML has always been that no preprocessing models were supported. As I have written before, preprocessing models are a very important feature of RapidMiner. The reason is simple: in my opinion, processing and transforming the data is probably the most important, but also the most complicated, part of data mining. For that reason, operator development for RapidMiner has always focused on this vital part of data analysis instead of producing hundreds of additional, highly sophisticated learning schemes that would not scale well enough for large real-world data sets anyway. As a consequence, our preprocessing models were introduced into RapidMiner several years ago, together with lots of operators for the different data transformation tasks.
With version 3, the PMML standard started to support its first preprocessing models. This was the first step towards more scalable processing and analysis directly in the database. I was really keen on seeing the next major version of PMML and checking whether more preprocessing steps would be supported. And indeed, several new models were added, and PMML seems to be becoming a standard for describing (almost) the complete analysis process, which can then be deployed directly in the database. For us, this means that PMML will carry more weight in our future development and will be a topic for future releases again.
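To make the idea of a preprocessing model a bit more concrete, here is a minimal Python sketch that builds a PMML-style DerivedField containing a NormContinuous transformation (a linear mapping of a numeric field onto [0, 1]) and then evaluates it. The field name "age", the range 0 to 100, and the helper functions build_norm_field and apply_norm are assumptions made for illustration only. This is not how RapidMiner produces PMML, and in a real document the element would sit inside a TransformationDictionary or LocalTransformations section.

```python
import xml.etree.ElementTree as ET


def build_norm_field(field, lo, hi):
    """Build a PMML-style DerivedField that maps [lo, hi] linearly onto [0, 1].

    Hypothetical helper for illustration; only the two-point case is covered.
    """
    derived = ET.Element(
        "DerivedField",
        name=f"{field}_norm",
        optype="continuous",
        dataType="double",
    )
    norm = ET.SubElement(derived, "NormContinuous", field=field)
    ET.SubElement(norm, "LinearNorm", orig=str(lo), norm="0")
    ET.SubElement(norm, "LinearNorm", orig=str(hi), norm="1")
    return derived


def apply_norm(derived, value):
    """Evaluate the two-point linear mapping stored in the element."""
    points = [
        (float(p.get("orig")), float(p.get("norm")))
        for p in derived.iter("LinearNorm")
    ]
    (x0, y0), (x1, y1) = points
    return y0 + (value - x0) * (y1 - y0) / (x1 - x0)


# Assumed example: normalize a hypothetical "age" field from the range 0..100.
field = build_norm_field("age", 0, 100)
print(ET.tostring(field, encoding="unicode"))
print(apply_norm(field, 25))
```

The point of the sketch is that the transformation is stored as data (the two LinearNorm points) rather than as code, which is what allows a consumer such as a database to apply the same preprocessing step at deployment time.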
Until then, a combination of the cached database access from RapidMiner, the layered view stack on data sets, and scalable analysis methods like Naive Bayes will be the solution for using RapidMiner on data sets of almost arbitrary size.
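Why Naive Bayes counts as a scalable method can be seen in a small sketch: training only needs counts, so the data can be read once, row by row (for example from a database cursor), and never has to fit into memory. The class below is a minimal illustration for nominal attributes with Laplace smoothing; the class name StreamingNaiveBayes, its methods, and the toy weather rows are assumptions for this example and not RapidMiner's implementation.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Naive Bayes for nominal attributes, trained one row at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)   # label -> number of rows
        self.value_counts = defaultdict(int)   # (label, attr index, value) -> count
        self.values_seen = defaultdict(set)    # attr index -> distinct values

    def update(self, row, label):
        """Fold a single example into the counts (constant work per attribute)."""
        self.class_counts[label] += 1
        for i, value in enumerate(row):
            self.value_counts[(label, i, value)] += 1
            self.values_seen[i].add(value)

    def predict(self, row):
        """Return the label with the highest smoothed log-probability."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, value in enumerate(row):
                k = max(len(self.values_seen[i]), 1)
                count = self.value_counts.get((label, i, value), 0)
                score += math.log((count + 1) / (n + k))  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy rows standing in for a stream of records (assumed data).
rows = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
]

model = StreamingNaiveBayes()
for attributes, label in rows:
    model.update(attributes, label)

print(model.predict(("rainy", "cool")))
```

Memory use depends on the number of classes and distinct attribute values, not on the number of rows, which is why this kind of learner combines well with cached, chunk-wise database access.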

