[...]I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?
DataCleaner has very handy features for string analysis. Typically, pattern analysis in the "profiler" produces aggregates, a kind of OLAP example set in which each class is in fact a string pattern. If your string is the email "email@example.com", it will be classified in the category "firstname.lastname@example.org". This is powerful for two reasons:
- It prepares preprocessing by verifying a few consistency points in your data
- It gives you the main patterns to use in predicates or regexps when you do linguistic analysis, NER, indexing, etc.
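To illustrate the idea of pattern aggregation, here is a minimal Python sketch. It is not DataCleaner's actual implementation: the symbol choice (`a`, `A`, `9`) and the function names `string_pattern` and `profile_patterns` are assumptions made for the example.

```python
from collections import Counter


def string_pattern(value: str) -> str:
    """Replace each letter by 'a' or 'A' and each digit by '9'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # punctuation and blanks stay visible in the pattern
    return "".join(out)


def profile_patterns(values):
    """Count how many values fall into each pattern (the 'aggregate' view)."""
    return Counter(string_pattern(v) for v in values)


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [
        "john.doe@example.org",
        "jane.roe@example.org",
        "bob42@example.com",
        "not an email",
    ]
    for pattern, count in profile_patterns(sample).most_common():
        print(count, pattern)
```

Rare patterns in such a count usually point at values that need cleaning, and the dominant pattern is a good starting point for writing a regexp.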
For each string, "String analysis" can give the number of blank spaces (useful for trimming), the lowercase/uppercase usage, and the number of words in the string (string vs. nominal).
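The measures above are easy to approximate by hand. The following Python sketch shows one possible way to compute them; the function name `string_stats` and the dictionary keys are hypothetical and do not correspond to DataCleaner's output columns.

```python
def string_stats(value: str) -> dict:
    """Compute a few per-string measures similar to a string-analysis profile."""
    return {
        "length": len(value),
        "blanks": value.count(" "),
        # True when the value has leading or trailing whitespace.
        "needs_trim": value != value.strip(),
        "uppercase": sum(ch.isupper() for ch in value),
        "lowercase": sum(ch.islower() for ch in value),
        # One word suggests a nominal value, several words suggest free text.
        "words": len(value.split()),
    }


if __name__ == "__main__":
    print(string_stats(" New York "))
```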
In another profiler (FEBRL, to name it), you can use distances between words for phonetic indexing, which is widely used in data quality:
- Soundex, Phonex, Phonix, Metaphone, NYSIIS, etc.
- block/canopy indexing
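As an illustration of phonetic indexing combined with simple blocking, here is a Python sketch of the classic American Soundex code, used as a blocking key. It is written from the published Soundex rules, not taken from FEBRL, and it does not cover canopy indexing; `soundex` and `block_by_soundex` are names chosen for the example.

```python
from collections import defaultdict

# Consonant groups of the classic American Soundex code.
_CODES = {}
for _digit, _letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                         ("4", "L"), ("5", "MN"), ("6", "R")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate two letters with the same code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def block_by_soundex(names):
    """Group names by Soundex code so that only names in the same block get compared."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    return dict(blocks)


if __name__ == "__main__":
    print(block_by_soundex(["Robert", "Rupert", "Smith", "Smyth"]))
```

Blocking this way keeps the number of pairwise comparisons small: a more expensive distance is then computed only inside each block.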
Other distances, such as Jaro-Winkler or Levenshtein, are available here: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
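For reference, the Levenshtein distance is short enough to write out. The Python sketch below uses the standard dynamic-programming formulation and is independent of the library linked above; the normalized `similarity` helper is an assumption added for convenience, not part of the original definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("levenstein", "levenshtein"))
```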
All of this is string data quality functionality and is not in RapidMiner, except for a few algorithms in TextInput (TF-IDF, cosine distance).