Yesterday I stumbled upon an article called "Be cautious about open source data mining", written by Anh Nguyen about a talk given by Jos van Dongen at Predictive Analytics World in London. My initial thought was: "OK, the author is probably just a partner of some proprietary software vendor, living well off the sales commissions for the licenses sold".
Hence, I did not expect anything neutral and objective, but rather a completely proprietary-vendor-X-oriented article describing with the greatest eloquence why proprietary solution X is so much better than any open source solution. Things like: Those open source solutions are free. They simply cannot work, for exactly this reason. And they are of course a danger not only to the complete IT infrastructure but also to the analyst's mind and to the whole enterprise, which is, by the way, very likely to break down simply by introducing something it did not pay millions in license fees for. I have read enough articles like that before, and initially I did not want to give this one a chance.
But since I had to wait a couple of minutes before a meeting started, I clicked on the link and was deeply surprised. There was a set of theses which were completely reasonable. I liked them, and hence I want to comment on them and extend them a bit:
"It's free but should be evaluated like any other software"
This is actually nothing new and I fully agree. Of course I like what we are doing here at Rapid-I and personally I think RapidMiner / RapidAnalytics are among the best solutions for almost every aspect of data analysis you can think of. Nevertheless, there are situations where other solutions might be more appropriate. At least there is a chance for this, so you should give all options a try. What did you just say? This is not easy since not all options are delivered as open source solutions? Right. But that's hardly our fault...
"It doesn’t matter if the software is free if it takes longer to build, manage and deploy solutions to end users, or if it is unstable, or missing key features. Don’t select just because it is open source”
Again I fully agree. Choosing a solution simply because it is open source is probably as stupid as avoiding it for exactly that reason. Regarding the potential drawbacks connected to software maintenance or software quality, I would like to add that this is exactly why successful commercial open source companies like Rapid-I offer Enterprise Editions. Those editions help to overcome such issues by providing stabilized releases, higher levels of quality assurance, and full support. If you want a fair comparison, you should take the now-no-longer-free Enterprise Editions and compare those against proprietary solutions. By the way: in my experience, maintaining software or worrying about missing features feels exactly the same for open and closed source products. The difference lies not in the software per se but in the service quality of the companies behind it.
"van Dongen believes that if a business does not have any existing tools for data mining, they should make open source the default option. "
This is the strongest claim and I want to support it. The essence here is: if you already have a software solution for data mining, I think the best way is not to rip it out of your infrastructure and completely replace it with an open source solution right away. Move gradually and employ RapidMiner for the next project before stocking up on licenses for the other solution. Or make it the default if you don't have a solution at all and have to get used to a new solution for data mining or business analytics anyway. We have experienced all three ways over the last few years: moving gradually from a closed-source solution to RapidMiner from project to project, starting with RapidMiner as the primary data mining solution right away, and replacing the old solution with RapidMiner all at once. I must say that the last way was the hardest option for everyone involved in those projects. But again, this is not specific to open source; it applies to replacing or migrating between different types of software in general.
Oh, and by the way: another thing I really liked is that van Dongen and Anh Nguyen recommended RapidMiner as an open source solution for data mining. That made me like this article even more than I already did :-)
Here is a PDF file containing the article in case it has been removed from the web.