The data stream plugin for RapidMiner (formerly YALE)
provides operators for simulating data streams from one or
several data sets, for simulating concept drifts, and for
handling concept drifts on data streams with simulated or
real-world concept drift.
Features
The key features of the data stream plugin for RapidMiner are:
- enables data stream mining experiments and prototype application development and testing in RapidMiner
- integrates with all RapidMiner data mining operators
- 100% Java implementation
- very easy to extend
- operators for simulating data streams from one or several data sets (sources)
- operators for simulating different types of concept drift for controlled experiments
- operators for handling concept drift on data streams with simulated or real-world concept drift, including:
  - full memory: naive strategy that always keeps all training examples for training classifiers,
    i.e. old examples are never forgotten
  - no memory: naive strategy that keeps only the training examples in the most recent batch
    and forgets all older examples
  - time window of fixed size: this strategy uses a time window of configurable fixed length
    on the data stream for training classifiers and discards examples outside the time window
  - time window of adaptive size: this strategy automatically adjusts the length of the time
    window on the data stream to the current amount of concept drift, so that the expected prediction
    error is minimized
  - example weighting: local, global, and combined example weighting strategies consider
    the age of examples and/or their helpfulness in predicting future example labels and assign
    corresponding weights to the examples during the training process, allowing gradual forgetting
    and performance-based weighting
  - example selection: this strategy is a special case of example weighting that uses only
    weights of zero or one, i.e. select or discard. It allows more flexible example selection than
    a time window of adaptive or fixed size, because old data can be reconsidered if it becomes
    helpful again for classifying new instances. This approach selects the examples for training
    that minimize the expected error rate of the resulting classification model; this strategy
    usually outperforms all of the above approaches in terms of accuracy
  - ensemble-based learning: knowledge-based sampling (KBS) on data streams (KBS-stream)
    is a very efficient and effective concept drift handling strategy that uses a boosted ensemble of
    base classifiers and difference modelling and typically achieves a higher accuracy than any of the
    other approaches; in RapidMiner, this operator is called BayBoostStream (Bayesian Boosting on Data Streams).
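The memory and time window strategies in the list above differ only in which batches of the stream are passed to the learner. The following Java sketch illustrates that idea under stated assumptions: the class `BatchWindow` and its methods are hypothetical names chosen for this illustration and are not part of the plugin's API, and the error estimate used for the adaptive window is a caller-supplied callback, because this page does not describe how the plugin estimates the error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntToDoubleFunction;

/** Hypothetical illustration only; not a class of the data stream plugin. */
final class BatchWindow {

    private BatchWindow() {}

    /**
     * Returns the training examples after batch number `current` (0-based) has arrived.
     * windowSize <= 0 : full memory (all batches so far)
     * windowSize == 1 : no memory (most recent batch only)
     * windowSize == k : fixed time window over the last k batches
     */
    static <T> List<T> select(List<List<T>> batches, int current, int windowSize) {
        int first = windowSize <= 0 ? 0 : Math.max(0, current - windowSize + 1);
        List<T> training = new ArrayList<>();
        for (int b = first; b <= current; b++) {
            training.addAll(batches.get(b));
        }
        return training;
    }

    /**
     * Adaptive window: tries every window size from 1 to current + 1 batches and
     * returns the one with the lowest estimated error. The estimator is an
     * assumption of this sketch and must be supplied by the caller.
     */
    static int bestWindowSize(int current, IntToDoubleFunction estimatedError) {
        int best = 1;
        double bestError = Double.POSITIVE_INFINITY;
        for (int size = 1; size <= current + 1; size++) {
            double error = estimatedError.applyAsDouble(size);
            if (error < bestError) {
                bestError = error;
                best = size;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a1", "a2"),
                Arrays.asList("b1"),
                Arrays.asList("c1", "c2"),
                Arrays.asList("d1"));
        System.out.println(select(batches, 3, 0)); // full memory
        System.out.println(select(batches, 3, 1)); // no memory
        System.out.println(select(batches, 3, 2)); // fixed window of two batches
    }
}
```

Treating full memory and no memory as the two extremes of one window parameter keeps the sketch short; a real operator would work on RapidMiner example sets rather than on plain lists.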
For the theoretical background on these concept drift handling methods, please
refer to the following two publications:
- Ralf Klinkenberg: Learning Drifting Concepts: Example Selection vs. Example Weighting,
Intelligent Data Analysis (IDA) Journal, Volume 8, Number 3, 2004, pages 281-300.
- Martin Scholz and Ralf Klinkenberg: Boosting Classifiers for Drifting Concepts,
Intelligent Data Analysis (IDA) Journal, Volume 11, Number 1, March 2007,
Special Issue on Knowledge Discovery from Data Streams, pages 3-28.
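As a rough illustration of the age-based part of example weighting named in the feature list, the Java sketch below assigns each example a weight that decays exponentially with the number of batches since it arrived. The class `AgeWeighting`, the exponential decay function, and the parameter `lambda` are assumptions made for this sketch; the weighting schemes actually implemented by the plugin, including the performance-based ones, are not specified on this page.

```java
/** Hypothetical illustration only; not a class of the data stream plugin. */
final class AgeWeighting {

    private AgeWeighting() {}

    /**
     * Weight of an example that arrived `age` batches ago.
     * lambda >= 0 controls how fast old examples fade (assumed decay function);
     * lambda == 0 gives every example weight 1, i.e. no forgetting.
     */
    static double weight(int age, double lambda) {
        return Math.exp(-lambda * age);
    }

    /**
     * Computes one weight per example, given the batch index in which each
     * example arrived and the index of the current batch.
     */
    static double[] weights(int[] batchOfExample, int currentBatch, double lambda) {
        double[] w = new double[batchOfExample.length];
        for (int i = 0; i < w.length; i++) {
            w[i] = weight(currentBatch - batchOfExample[i], lambda);
        }
        return w;
    }

    public static void main(String[] args) {
        int[] arrival = {0, 1, 2, 3};
        for (double w : weights(arrival, 3, 0.5)) {
            System.out.println(w); // weights grow towards the most recent batch
        }
    }
}
```

The resulting weights would then be handed to a learner that supports weighted examples; larger values of `lambda` forget old batches faster.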
Please note that this plugin is currently in beta status, i.e. it is not yet as well
tested or as stable as the RapidMiner core.
Download and Documentation
The following files are available from the
RapidMiner Plugins download page:
| Type | Filename | Description |
| Plugin | rapidminer-datastream-XXX.jar | The main plugin as jar file |
| Plugin | rapidminer-datastream-XXX-installer.exe | The main plugin as Windows installer |
| Source | rapidminer-datastream-XXX-src.jar | The source code of the plugin |
| Javadoc | rapidminer-datastream-XXX-javadoc.jar | The Javadoc of the plugin |