Integrating RapidMiner into your application
RapidMiner can easily be invoked from other Java applications. You can either read process configurations from XML files or Readers, or construct a process by starting with an empty Process and adding operators to it in a tree-like manner. You can also create single operators and apply them to some input objects, e.g. to learn a model or perform a single preprocessing step. However, creating a process lets RapidMiner handle the data management and process traversal. If operators are created without being part of a process, the developer must ensure that each operator is used correctly.
Initializing RapidMiner
Before RapidMiner can be used (and especially before any operator can be created), it has to be properly initialized. The method RapidMiner.init() must be invoked before the OperatorService can be used to create operators. Several other initialization methods exist; make sure that you invoke at least one of them. If you want to configure the initialization, you can use the method
RapidMiner.init(InputStream operatorsXMLStream, File pluginDir, boolean addWekaOperators, boolean searchJDBCInLibDir, boolean searchJDBCInClasspath, boolean addPlugins)
Setting some of these parameters to false (e.g. the loading of database drivers or of the Weka operators) can drastically reduce the start-up time. If you want to use only a subset of the available operators, you can provide a stream to a reduced operator description (operators.xml). If the parameter operatorsXMLStream is null, all core operators are used. Please refer to the API documentation for more details on the initialization of RapidMiner.
You can also use the simple method RapidMiner.init() and configure the settings via this list of environment variables:
- rapidminer.init.operators (file name)
- rapidminer.init.plugins.location (directory name)
- rapidminer.init.weka (boolean)
- rapidminer.init.jdbc.lib (boolean)
- rapidminer.init.jdbc.classpath (boolean)
- rapidminer.init.plugins (boolean)
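The following is a minimal sketch of this configuration style. It assumes that the settings listed above are read as Java system properties; the text calls them environment variables, so treat the property-based reading as an assumption (it matches the System.setProperty call used for rapidminer.home further down this page). The class InitSettings and the chosen values are hypothetical, and the actual RapidMiner.init() call is left as a comment so that the snippet compiles without the RapidMiner jar.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of RapidMiner: collects init settings named
// in the list above and publishes them as Java system properties.
final class InitSettings {

    // Settings for a lean start-up: no Weka operators, no JDBC driver search,
    // no plugins. The values are illustrative only.
    static Map<String, String> leanStartup() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("rapidminer.init.weka", "false");
        settings.put("rapidminer.init.jdbc.lib", "false");
        settings.put("rapidminer.init.jdbc.classpath", "false");
        settings.put("rapidminer.init.plugins", "false");
        return settings;
    }

    // Copies every setting into the JVM-wide system properties.
    static void apply(Map<String, String> settings) {
        settings.forEach(System::setProperty);
    }

    public static void main(String[] args) {
        apply(leanStartup());
        // RapidMiner.init();  // assumed to be called here, after the properties are set
        System.out.println("rapidminer.init.weka = "
                + System.getProperty("rapidminer.init.weka"));
    }
}
```

Setting the properties in code keeps the configuration next to the init() call; the same values could presumably also be passed on the command line with -D options.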
Creating Operators
It is important that operators are created using one of the createOperator(...) methods of
com.rapidminer.tools.OperatorService
The table shows the factory methods for operators provided by OperatorService. Please note that a few operators have to be added to a process in order to work properly.
| Method | Description |
|---|---|
| createOperator(String name) | Use this method for the creation of an operator from its name. The name is the name which is defined in the operators.xml file and displayed in the GUI. |
| createOperator(OperatorDescription description) | Use this method for the creation of an operator whose OperatorDescription is already known. Please refer to the RapidMiner API. |
| createOperator(Class clazz) | Use this method for the creation of an operator whose Class is known. This is the recommended method, since correctness can be checked at compile time. However, some operators do not have a class of their own (e.g. the learners derived from the Weka library); in these cases one of the other methods must be used. |
Creating a complete process
As of RapidMiner 5.1, the approach described below is deprecated because of a major redesign of the RapidMiner architecture.
The following code shows a detailed example of using the RapidMiner API to create operators and set their parameters.
import com.rapidminer.tools.OperatorService;
import com.rapidminer.RapidMiner;
import com.rapidminer.Process;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorException;
import java.io.IOException;
// Note: ExampleSetGenerator must be imported as well; its package is not given here.

public class ProcessCreator {

    public static Process createProcess() {
        try {
            // invoke init before using the OperatorService
            RapidMiner.init();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // create process
        Process process = new Process();
        try {
            // create operator
            Operator inputOperator =
                OperatorService.createOperator(ExampleSetGenerator.class);
            // set parameters
            inputOperator.setParameter("target_function", "sum classification");
            // add operator to process
            process.getRootOperator().addOperator(inputOperator);
            // add other operators and set parameters
            // [...]
        } catch (Exception e) {
            e.printStackTrace();
        }
        return process;
    }

    public static void main(String[] argv) {
        // create process
        Process process = createProcess();
        // print process setup
        System.out.println(process.getRootOperator().createProcessTree(0));
        try {
            // perform process
            process.run();
            // to run the process with input created by your application use
            // process.run(new IOContainer(new IOObject[] { ... your objects ... }));
        } catch (OperatorException e) {
            e.printStackTrace();
        }
    }
}
We can create a new process setup via new Process() and add operators to it. The root of the process's operator tree is obtained via process.getRootOperator(). Operators are added to it like children to a parent node in a tree. For each operator you have to
- create the operator with help of the OperatorService,
- set the necessary parameters,
- add the operator at the correct position of the operator tree of the process.
After the process has been created, you can start it via
process.run();
If you want to provide some initial input you can also use the method
process.run(IOContainer);
If you want to use a log file, set the parameter logfile of the process root operator like this
process.getRootOperator().setParameter(ProcessRootOperator.PARAMETER_LOGFILE, filename);
before the run method is invoked. If you also want to keep the global logging messages in a file, i.e. those logging messages that are not associated with a single process, you should also invoke the method
LogService.initGlobalLogging(OutputStream out, int logVerbosity);
before the run method is invoked.
If you have already defined a process configuration file, for example with the help of the graphical user interface, there is another very simple way of creating a process setup: reading the process from that file (or from a stream). The code below shows how a process can be read from a process configuration file and run on input provided by your application.
public static IOContainer createInput() {
    // create a wrapper that implements the ExampleSet interface and
    // encapsulates your data
    // ...
    return new IOContainer(new IOObject[] { myExampleSet });
}

public static void main(String[] argv) throws Exception {
    // MUST BE INVOKED BEFORE ANYTHING ELSE !!!
    RapidMiner.init();
    // create the process from the command line argument file
    Process process = new Process(new File(argv[0]));
    // create some input from your application, e.g. an example set
    IOContainer input = createInput();
    // run the process on the input
    process.run(input);
}
As noted above, please ensure that RapidMiner has been properly initialized by one of the init methods presented above.
RapidMiner 5.1
RapidMiner 5 introduced the concept of ports, which makes it possible to implement multi-branch logic in a single RapidMiner process. In RapidMiner 5, the recommended way of integrating RapidMiner into a Java application is to create the process in the RapidMiner GUI and then pass the XML to the constructor of Process. Although chaining operators dynamically is still possible, creating the process from XML provides various validity checks, such as version compatibility of operators and checks for deprecated classes, which are difficult to handle manually.
Process rm5 = new Process(new File("pathtoProcessXML"));
rm5.run();
Chaining of Operators Dynamically
The code below shows a simple example of how to chain operators dynamically. From RapidMiner 5 on, you need to create the operators using OperatorService, add them to the root process, connect the ports between the operators correctly, and set the operator parameters correctly. The example reads the Golf data set from the repository, builds a Naive Bayes model on it, and saves the model to a file. Keep in mind that this method is neither supported nor welcomed by RapidMiner support or the forums.
import com.rapidminer.RapidMiner;
import com.rapidminer.RapidMiner.ExecutionMode;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorCreationException;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.io.ModelWriter;
import com.rapidminer.operator.io.RepositorySource;
import com.rapidminer.operator.learner.bayes.NaiveBayes;
import com.rapidminer.tools.OperatorService;
import com.rapidminer.Process;

public class SimpleClassifier {

    public static void main(String[] args) {
        String rapidMinerHome = "/usr/local/rapidminer";
        System.setProperty("rapidminer.home", rapidMinerHome);
        RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
        RapidMiner.init();
        try {
            /* Reading data */
            Operator trainingDataReader = OperatorService.createOperator(RepositorySource.class);
            trainingDataReader.setParameter(RepositorySource.PARAMETER_REPOSITORY_ENTRY, "//Samples/data/Golf");

            /* Classifier */
            Operator bayesClassifier = OperatorService.createOperator(NaiveBayes.class);

            /* Save model */
            Operator modelWriter = OperatorService.createOperator(ModelWriter.class);
            modelWriter.setParameter("model_file", "/home/venki/test_model");

            /* Add all operators to the root process */
            Process process = new Process();
            process.getRootOperator().getSubprocess(0).addOperator(trainingDataReader);
            process.getRootOperator().getSubprocess(0).addOperator(bayesClassifier);
            process.getRootOperator().getSubprocess(0).addOperator(modelWriter);

            /* Connect the ports: reader -> classifier -> model writer */
            trainingDataReader.getOutputPorts().getPortByName("output")
                .connectTo(bayesClassifier.getInputPorts().getPortByName("training set"));
            bayesClassifier.getOutputPorts().getPortByName("model")
                .connectTo(modelWriter.getInputPorts().getPortByName("input"));

            process.run();
        } catch (OperatorCreationException e) {
            // one of the operators could not be created
            e.printStackTrace();
        } catch (OperatorException e) {
            // a parameter was invalid or the process failed while running
            e.printStackTrace();
        }
    }
}
Using single operators (not applicable for RapidMiner 5)
The creation of a Process object is the intended way of performing a complete data mining process within your application. For small processes like a single learning or preprocessing step, the creation of a complete process object might include a lot of overhead. In these cases you can easily manage the data flow yourself and create and use single operators.
The data flow is managed via the class IOContainer. Just create the operators you want to use, set the necessary parameters, and invoke the method apply(IOContainer). The result is again an IOContainer, which can deliver the desired output object. The code below shows a small program which loads some training data, learns a model, and applies it to an unseen data set.
public static void main(String[] args) {
    try {
        RapidMiner.init();

        // learn
        Operator exampleSource =
            OperatorService.createOperator(ExampleSource.class);
        exampleSource.setParameter("attributes",
            "/path/to/your/training_data.xml");
        IOContainer container = exampleSource.apply(new IOContainer());
        ExampleSet exampleSet = container.get(ExampleSet.class);

        // here the string-based creation must be used since the J48 operator
        // does not have a class of its own (it is derived from the Weka library)
        Learner learner = (Learner) OperatorService.createOperator("J48");
        Model model = learner.learn(exampleSet);

        // load the test set (and add the model to the result container)
        Operator testSource =
            OperatorService.createOperator(ExampleSource.class);
        testSource.setParameter("attributes", "/path/to/your/test_data.xml");
        container = testSource.apply(new IOContainer());
        container = container.append(model);

        // apply the model
        Operator modelApp = OperatorService.createOperator(ModelApplier.class);
        container = modelApp.apply(container);

        // print results
        ExampleSet resultSet = container.get(ExampleSet.class);
        Attribute predictedLabel = resultSet.getPredictedLabel();
        for (ExampleReader reader = resultSet.getExampleReader(); reader.hasNext(); ) {
            System.out.println(reader.next().getValueAsString(predictedLabel));
        }
    } catch (IOException e) {
        System.err.println("Cannot initialize RapidMiner: " + e.getMessage());
    } catch (OperatorCreationException e) {
        System.err.println("Cannot create operator: " + e.getMessage());
    } catch (OperatorException e) {
        System.err.println("Cannot create model: " + e.getMessage());
    }
}
Please note that using an operator without a surrounding process is only supported for operators that do not directly depend on others in a process configuration. This is true for almost all operators available in RapidMiner. There are, however, some exceptions: some of the meta optimization operators (e.g. the parameter optimization operators) and the ProcessLog operator only work if they are part of the same process as the operators to be optimized or logged, respectively. The same applies to the MacroDefinition operator, which can only be used properly if it is embedded in a Process. Hence, those operators cannot be used without a Process, and an error will occur.
Please note also that the method
RapidMiner.init();
or any other init() taking some parameters must be invoked before the OperatorService can be used to create operators (see above).
RapidMiner as a library
If RapidMiner is separately installed and your program uses the RapidMiner classes you can just adapt the examples given above. However, you might also want to integrate RapidMiner into your application so that users do not have to download and install RapidMiner themselves. In that case you have to consider that
- RapidMiner needs a rapidminerrc file in the rapidminer.home/etc directory
- RapidMiner might search for some library files located in the directory rapidminer.home/lib
For the Weka jar file, you can define a system property named rapidminer.weka.jar which defines where the Weka jar file is located. This is especially useful if your application already contains Weka. However, you can also just omit all of the library jar files, if you do not need their functionality in your application. RapidMiner will then just work without this additional functionality, for example, it simply does not provide the Weka learners if the weka.jar library was omitted.
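When bundling RapidMiner this way, it can help to check the directory layout before initialization. The sketch below is only an illustration under stated assumptions: the class BundledHomeCheck and its method findProblems are hypothetical and not part of RapidMiner, and the check covers just the two locations mentioned above (etc/rapidminerrc and lib). The RapidMiner.init() call is left as a comment so that the snippet runs without the RapidMiner jar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check, not part of RapidMiner.
final class BundledHomeCheck {

    // Returns a list of human-readable problems; an empty list means the
    // directory contains etc/rapidminerrc and a lib directory.
    static List<String> findProblems(Path home) {
        List<String> problems = new ArrayList<>();
        if (!Files.isRegularFile(home.resolve("etc").resolve("rapidminerrc"))) {
            problems.add("missing etc/rapidminerrc");
        }
        if (!Files.isDirectory(home.resolve("lib"))) {
            problems.add("missing lib directory (optional libraries will be unavailable)");
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory so the sketch runs anywhere.
        Path home = Files.createTempDirectory("rapidminer-home");
        Files.createDirectories(home.resolve("etc"));
        Files.createFile(home.resolve("etc").resolve("rapidminerrc"));
        Files.createDirectories(home.resolve("lib"));

        // Point RapidMiner at the bundled directory before initialization.
        System.setProperty("rapidminer.home", home.toString());
        System.out.println("Problems found: " + findProblems(home));
        // RapidMiner.init();  // would follow here in a real application
    }
}
```

Reporting all problems at once, rather than failing on the first one, makes it easier to fix an incomplete installation in a single pass.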
Transform data for RapidMiner
Often you already have some data in your application to which operators should be applied. In this case, it would be very inconvenient to write your data to a file, load it into RapidMiner with an ExampleSource operator, and apply other operators to the resulting ExampleSet. It is therefore useful to be able to use your own application data directly as input. This section describes the basic ideas of this approach.
As we can see in Data core, all data is stored in a central data table (called ExampleTable) and one or more views on this table (called ExampleSets) can be created and will be used by operators. The code below shows how this central ExampleTable can be created.
import com.rapidminer.Process;
import com.rapidminer.example.*;
import com.rapidminer.example.table.*;
import com.rapidminer.example.set.*;
import com.rapidminer.operator.IOContainer;
import com.rapidminer.tools.Ontology;
import java.util.*;

public class CreatingExampleTables {

    // getMyNumOfAttributes(), getMyNumOfDataRows(), getMyValue(d, a) and
    // getMyClassification(d) stand for your application's own data access;
    // someOperator stands for an operator you have created beforehand.
    public static void main(String[] argv) throws Exception {
        // create attribute list
        List<Attribute> attributes = new LinkedList<Attribute>();
        for (int a = 0; a < getMyNumOfAttributes(); a++) {
            attributes.add(AttributeFactory.createAttribute("att" + a,
                Ontology.REAL));
        }
        Attribute label = AttributeFactory.createAttribute("label",
            Ontology.NOMINAL);
        attributes.add(label);

        // create table
        MemoryExampleTable table = new MemoryExampleTable(attributes);

        // fill table (here: only real values)
        for (int d = 0; d < getMyNumOfDataRows(); d++) {
            double[] data = new double[attributes.size()];
            for (int a = 0; a < getMyNumOfAttributes(); a++) {
                // fill with proper data here
                data[a] = getMyValue(d, a);
            }
            // maps the nominal classification to a double value
            data[data.length - 1] =
                label.getMapping().mapString(getMyClassification(d));
            // add data row
            table.addDataRow(new DoubleArrayDataRow(data));
        }

        // create example set
        ExampleSet exampleSet = table.createExampleSet(label);

        // create a process and connect its first inner source to the operator
        Process process = new Process();
        process.getRootOperator().getSubprocess(0).addOperator(someOperator);
        process.getRootOperator().getSubprocess(0).getInnerSources().getPortByIndex(0)
            .connectTo(someOperator.getInputPorts().getPortByName("input"));

        // run the process with new IOContainer using the created exampleSet
        process.run(new IOContainer(exampleSet));
    }
}
First of all, a list containing all attributes must be created. Each Attribute represents a column in the final example table. We assume that the method getMyNumOfAttributes() returns the number of regular attributes. We also assume that all regular attributes have a numerical type. We create all attributes with the help of the class AttributeFactory and add them to the attribute list.
For example tables, it does not matter if a specific column (attribute) is a special attribute like a classification label or just a regular attribute which is used for learning. We therefore just create a nominal classification label and add it to the attribute list, too.
After all attributes have been added, the example table can be created. In this example we create a MemoryExampleTable which will keep all data in main memory. The attribute list is given to the constructor of the example table. One can think of this list as a description of the column meta data or column headers. At this point, the table is still empty, i.e. it does not contain any data rows.
The next step is to fill the created table with data. To do so, we create a DataRow object for each of the getMyNumOfDataRows() data rows and add it to the table. We create a simple double array and fill it with the values from your application. In this example, we assume that the method getMyValue(d,a) delivers the value for the a-th attribute of the d-th data row. Please note that the order of values and the order of attributes added to the attribute list must be the same!
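The ordering requirement can be illustrated without the RapidMiner classes. The following sketch uses hypothetical names (RowOrderSketch, toRow) and plain Java collections: it assembles a row by looking up each value by attribute name, so the resulting array always follows the attribute list regardless of how the application stores its records.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration; it does not use the RapidMiner classes.
final class RowOrderSketch {

    // Builds one row whose i-th value belongs to the i-th attribute name.
    // Values are looked up by name, so the record may store them in any order.
    static double[] toRow(List<String> attributeNames, Map<String, Double> record) {
        double[] row = new double[attributeNames.size()];
        for (int i = 0; i < row.length; i++) {
            String name = attributeNames.get(i);
            Double value = record.get(name);
            if (value == null) {
                throw new IllegalArgumentException("No value for attribute " + name);
            }
            row[i] = value;
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> attributeNames = Arrays.asList("att0", "att1", "att2");
        Map<String, Double> record = new HashMap<>();
        record.put("att2", 3.0);
        record.put("att0", 1.0);
        record.put("att1", 2.0);
        System.out.println(Arrays.toString(toRow(attributeNames, record))); // prints [1.0, 2.0, 3.0]
    }
}
```

An array produced this way could then be wrapped in a DoubleArrayDataRow as in the listing above.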
For the label attribute, which is a nominal classification value, we have to map the String delivered by getMyClassification(d) to a proper double value. This is done with the method mapString(String) of the attribute's nominal mapping, which is obtained via label.getMapping(). This method ensures that subsequent mappings always produce the same double index for equal strings.
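To make the idea of such a mapping concrete, here is a toy stand-in written in plain Java. It is not RapidMiner's implementation; the class NominalMappingSketch and the method mapIndex are hypothetical names for illustration. It assumes the simplest possible scheme, in which each new string receives the next free index and repeated strings return the index assigned earlier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a nominal mapping; RapidMiner's own implementation may differ.
final class NominalMappingSketch {
    private final Map<String, Integer> indexByValue = new HashMap<>();
    private final List<String> valueByIndex = new ArrayList<>();

    // Returns the index for the string, assigning the next free index the
    // first time a string is seen.
    int mapString(String value) {
        Integer index = indexByValue.get(value);
        if (index == null) {
            index = valueByIndex.size();
            indexByValue.put(value, index);
            valueByIndex.add(value);
        }
        return index;
    }

    // Reverse lookup: the string that was assigned the given index.
    String mapIndex(int index) {
        return valueByIndex.get(index);
    }

    public static void main(String[] args) {
        NominalMappingSketch mapping = new NominalMappingSketch();
        double first = mapping.mapString("yes");
        double second = mapping.mapString("no");
        double again = mapping.mapString("yes");
        System.out.println(first + " " + second + " " + again); // prints 0.0 1.0 0.0
    }
}
```

Because the index is stored as a double in the data row, the table can hold nominal and numerical columns in the same array.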
The last step in the loop is to add a newly created DoubleArrayDataRow to the example table. Please note that only MemoryExampleTable provides a method addDataRow(DataRow); other example tables might have to be initialized in other ways.
The last thing which must be done is to produce a view on this example table. Such views are called ExampleSet in RapidMiner. In the code above, the view is created by the method createExampleSet(label). The resulting example set can be encapsulated in an IOContainer and given to operators.
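The relationship between a central table and its views can be sketched with a toy model in plain Java. None of the types below are RapidMiner classes; Table and ColumnView are hypothetical names, and the sketch only assumes that a view reads through to the shared table instead of copying the data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "one table, several views"; none of these types are RapidMiner classes.
final class TableViewSketch {

    // The central table owns the data rows.
    static final class Table {
        private final List<double[]> rows = new ArrayList<>();

        void addRow(double[] row) { rows.add(row); }
        int size() { return rows.size(); }
        double get(int row, int column) { return rows.get(row)[column]; }
    }

    // A view exposes a subset of the columns without copying any data.
    static final class ColumnView {
        private final Table table;
        private final int[] columns;

        ColumnView(Table table, int... columns) {
            this.table = table;
            this.columns = columns;
        }

        int size() { return table.size(); }
        double get(int row, int viewColumn) { return table.get(row, columns[viewColumn]); }
    }

    public static void main(String[] args) {
        Table table = new Table();
        ColumnView lastColumnOnly = new ColumnView(table, 2);
        table.addRow(new double[] {1.0, 2.0, 3.0});
        // The row added after the view was created is visible through the view.
        System.out.println(lastColumnOnly.size() + " row(s), value " + lastColumnOnly.get(0, 0));
    }
}
```

In this toy model several views with different column selections can share one table, which is the property that makes views cheap to create.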
Remark: Since Attribute, DataRow, ExampleTable, and ExampleSet are all interfaces, you can of course implement one or several of these interfaces in order to supply RapidMiner with data directly, even without creating a MemoryExampleTable.