Rapid-I Blog
RapidAnalytics, Business Intelligence, Business Analytics 7 Apr 2011
Among the Top 5 Commercial OSBI Solutions by Ingo Mierswa Comment (0)

RapidMiner was just listed as being one of the Top 5 Commercial Open Source Business Intelligence Solutions - yeah!

It is great to see that advanced analytics techniques like data and text mining are finally seen as a fully integrated part of BI, and this is exactly why I think that Business Analytics (BA) will become the next big thing. Since its release, RapidAnalytics has already been widely accepted, and as the first open source solution for BA it will certainly push the limits in the field much further.

Link to the List (with an annoying advertisement screen before)

reporting, RapidAnalytics, Video Tutorial 6 Apr 2011
RapidAnalytics 6: Using Views by Simon Fischer Comment (0)

Up to now, the report we created in the RapidAnalytics video tutorial consisted of a single view: A collection of report components like charts, tables, etc. You can think of views as "pages" of your report or as "sub-reports". Now we are going to add more views between which the user can navigate by clicking on the navigation bar or by interacting with charts and tables.

 

Web Scraping, Web Mining, Videos 5 Apr 2011
Web Mining Video Series by Ingo Mierswa Comment (1)

Neil McGuigan, who already made a great series of Text Mining videos, has started a new video series about web crawling and web scraping. So far, the series consists of three parts:

 

Web Mining and Web Scraping

 

Part 1: Web Scraping with Google Spreadsheets and XPath

In his first video, Neil demonstrates how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath. Although RapidMiner is not used here, the explanation of XPath expressions and his list of useful XPath constructs are really helpful if you want to set up a web scraping process with RapidMiner. 

 

Part 2: Web Crawling with RapidMiner

Here, Neil shows how to crawl about 500 pages from a site with a simple RapidMiner process. He also discusses user agents, crawling rules, and robot exclusion files.

 

Part 3: Web Scraping with RapidMiner and XPath

In this video, Neil shows how to load the 500 HTML files from the previous web crawl, loop through each of them, use XPath to grab values from each page, and put them in a data table for later analysis. This is where the XPath introduction from part 1 comes in handy.
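The loop described in part 3 can be sketched outside RapidMiner as well. The following is only a minimal illustration in Python, not the process from the video: the two tiny pages, their element names, and the `scrape` helper are invented for this example, and the standard library's ElementTree supports only a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for crawled pages (well-formed markup is assumed).
PAGES = {
    "page1.html": "<html><body><h1>Widget A</h1><span class='price'>9.99</span></body></html>",
    "page2.html": "<html><body><h1>Widget B</h1><span class='price'>14.50</span></body></html>",
}

def scrape(pages):
    """Loop over the pages, grab values via XPath, and collect one row per page."""
    rows = []
    for name, markup in sorted(pages.items()):
        root = ET.fromstring(markup)
        title = root.findtext(".//h1")                      # first h1 anywhere in the page
        price = root.findtext(".//span[@class='price']")    # span selected by attribute value
        rows.append({"file": name, "title": title, "price": float(price)})
    return rows

rows = scrape(PAGES)
for row in rows:
    print(row)
```

Each row corresponds to one example in the resulting data table, with one column per XPath expression.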

 

Thanks, Neil, for this second great series!

Plotter 1 Apr 2011
The RapidMiner Plotters 14: Density by Ingo Mierswa Comment (0)

Overview "Density"

  • Summary: Showing dependencies between up to four dimensions in a two-dimensional space
  • Number of Dimensions: 2 plus 1 encoded by background color plus 1 encoded by point color
  • Data Types: Numerical, Nominal, Dates

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the Density plotter, we will again first have a look:

 

The density plotter is very similar to the block plotter described previously and is basically also a scatter plot with two dimensions on the x-axis and the y-axis and one dimension used for the color of the data points. Unlike the block plotter, however, the density plotter also allows the selection of an additional attribute which is used to colorize the background of the plot. Another difference from the block plotter is that the data points are actually drawn as points instead of the blocks we have seen before. The background can be interpreted as a heat map.

Like the scatter plot, the density plot is a simple two-dimensional plot with two axes: x and y. The x-axis is plotted horizontally and the y-axis vertically. If you plot a data set, each point will be located at the position which corresponds to its values with respect to those two axes.

As always, you can find two boxes where you can select the attributes (variables, dimensions) of your data set or model which should be used for the x-axis and for the y-axis. Both options have to be set; otherwise the plotter will not show anything. By the way, you can use numerical attributes as well as nominal attributes for the axes. Even date attributes are supported.

The next option is called Point Color Column. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each of the data points just as for the traditional scatter plot.

The most important setting of this plotter is Density Color. Here you can select the attribute which is used for colorizing the background. The colorization algorithm is quite simple: each data point contributes to every pixel, with a weight that depends on the distance between the data point and the pixel. The color of each pixel is then calculated as the distance-weighted average over all data points.
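To make the idea of a distance-weighted average concrete, here is a minimal sketch in Python. It is not RapidMiner's actual implementation: the post does not say which weighting function the plotter uses, so the inverse squared distance below is an assumption, and the function name `density_color` and the two sample points are invented for this example.

```python
def density_color(px, py, points, eps=1e-9):
    """Distance-weighted average of the density attribute at pixel (px, py).

    points is a list of (x, y, value) tuples. The weight 1 / (d^2 + eps)
    is an assumed weighting, chosen only for illustration.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        w = 1.0 / (d2 + eps)   # closer points get a larger weight
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total

# Two data points with density values 0 and 1, 10 units apart on the x-axis.
points = [(0.0, 0.0, 0.0), (10.0, 0.0, 1.0)]

# Evaluate the background on a small pixel grid (3 rows, 11 columns).
grid = [[density_color(x, y, points) for x in range(11)] for y in range(3)]
```

Pixels close to a data point take on almost exactly that point's value, and pixels halfway between two points end up at the average of both. The resulting numbers would then be mapped to a color scale.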

Of course, the density plot also supports zooming and panning as described here.

Other parts of the plotter series:

Fun, Data Mining 23 Mar 2011
Predictive Analytics and Cricket by Ingo Mierswa Comment (1)

I am not really deep into cricket myself. However, I found this interesting blog entry which discusses some reasons for winning cricket games that were discovered by data mining. It is not hard to tell that the author favors the Indian team :-)

The first thing to do is some basic statistics: How often did the Indian cricket team win in the past against certain other teams? For example, the Indian team won 66% of all matches against England during the last 5 years. Against Australia, however, India won only 40% of the matches.

So the important point is: what were the circumstances under which India had won those 40%?  And here is where RapidMiner was used: the matches were described by attributes like "partnership", "pace bowlers", or "slow bowlers". The resulting decision tree looks like the following:

Decision Tree for Cricket

The model was built on all existing cases between India and Australia from the last 5 years. It is easy to tell that partnerships play the most significant role. In particular, 

  • India need to have 2 significant partnerships worth at least 77 runs
  • If not, the bowlers, specifically pace bowlers, have to step into the breach and take more than 7 wickets
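The two rules above can be written down as a tiny classifier. This is only a simplified reading of the bullet points, not the actual tree from the referenced blog entry: the function and parameter names are hypothetical, and only the thresholds (two partnerships of at least 77 runs, more than 7 wickets by pace bowlers) come from the text.

```python
def predict_india_wins(partnership_runs, pace_wickets):
    """Apply the two rules read off the decision tree.

    partnership_runs: runs scored by each batting partnership in the match
    pace_wickets: wickets taken by the pace bowlers
    """
    strong_partnerships = sum(1 for runs in partnership_runs if runs >= 77)
    if strong_partnerships >= 2:
        return True
    # Otherwise the pace bowlers have to step into the breach.
    return pace_wickets > 7

print(predict_india_wins([120, 80, 15], pace_wickets=3))
print(predict_india_wins([90, 40, 10], pace_wickets=8))
print(predict_india_wins([90, 40, 10], pace_wickets=7))
```

The first two matches are predicted as wins (one by partnerships, one by pace bowling), the third as a loss.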

Without any knowledge about cricket, I have hardly any idea what this actually means. I suppose that those two strong partnerships with 77 runs or more are two sets of good batting partners playing well with each other. If you don't have those, it seems that fast bowlers taking down the wooden "goals" more than 7 times helps a lot.

This is what data mining is actually about: finding insights in data without the need for prior knowledge (of course you have to validate the findings!). The validation is actually missing in the blog post but may be part of the full report which can be downloaded from the web site. Still, it is a fun read and a nice data mining application!

RapidMiner 8 Mar 2011
RapidMiner Miscellany by Ingo Mierswa Comment (3)

Recently, I stumbled upon a new blog called RapidMiner Miscellany. This blog - written by Andrew Chisholm - is a great collection of small tips and hints about using RapidMiner for data analysis. I started reading the first couple of posts and found myself trying out his tips pretty soon :-)

 RapidMiner Miscellany

Besides usage hints for RapidMiner, Andrew also shares interesting insights into data mining, model evaluation, and series analysis with his readers. And those insights are really worth sharing as well! Andrew states himself:

"For better or worse, I am learning RapidMiner and it helps me to learn if I write things down and try things out. Using a blog forces me to structure properly since I have to pretend I am addressing the content to someone else. "

Here is a small selection of topics from his blog:

I strongly recommend Andrew's blog to all interested readers since it contains a lot of helpful recommendations and hints for practitioners at all levels.
Plotter 23 Feb 2011
The RapidMiner Plotters 13: Block by Ingo Mierswa Comment (1)

Overview "Block"

  • Summary: Showing dependencies between up to three dimensions
  • Number of Dimensions: 2 plus 1 encoded by color
  • Data Types: Numerical, Nominal, Dates

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the Block plotter, we will again first have a look:

 

 

The block plotter is basically a scatter plot with two dimensions on the x-axis and the y-axis and one dimension used for the color of the data points. The main difference from a scatter plot is the point shape, which is not a dot but a block. This is quite useful if the data set contains points on a two-dimensional grid.

Like the scatter plot, the block plot is a simple two-dimensional plot with two axes: x and y. The x-axis is plotted horizontally and the y-axis vertically. If you plot a data set, each point will be located at the position which corresponds to the values with respect to those two axes but instead of a point a block is printed.

As always, you can find two boxes where you can select the attributes (variables, dimensions) of your data set or model which should be used for the x-axis and for the y-axis. Both options have to be set; otherwise the plotter will not show anything. By the way, you can use numerical attributes as well as nominal attributes for the axes. Even date attributes are supported.

As you can see, you can also specify whether the selected attribute should be plotted on a log scale. Just check the box below the corresponding axis.

The next option is called Color Column. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each of the blocks.

The block plot also provides the Jitter option, although it is certainly not used as often as for the scatter plot. However, this option is quite useful if several data points are located at the same position in the two-dimensional space. Just move the jitter slider and watch what happens: the blocks move a bit in random directions, revealing whether points are hidden below others and which ones.
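The effect of jittering is easy to reproduce. The following Python sketch is not RapidMiner's code; it assumes a uniform random offset in both directions, and the function name `jitter` and its `amount` and `seed` parameters are made up for illustration.

```python
import random

def jitter(points, amount, seed=0):
    """Shift every (x, y) point by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]

# Three data points that lie exactly on top of each other.
stacked = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]
spread = jitter(stacked, amount=0.1)
```

With an amount of 0 the points stay where they are; with a larger amount the stacked points drift apart and become visible individually.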

The last two options are pretty simple: Rotate Labels rotates the labels of the x-axis by 90 degrees. Especially if you use a nominal attribute for the x-axis, the values can then be read easily. Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

Of course, the block plot also supports zooming and panning as described here.

Another useful example for the block plot is the visualization of a correlation matrix (here done for the data set "Sonar"). You can easily see along the diagonal that attributes close to each other are more correlated and that there are also regions of attribute combinations with a high (negative) correlation.
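A correlation matrix fits the block plot well because it is already a grid: one block per pair of attributes, colored by the correlation value. As a rough sketch of how such grid data could be prepared, the Python snippet below computes Pearson correlations for three invented toy attributes (not the Sonar data) and stores them as (x, y, color) cells. The names `pearson`, `columns`, and `cells` are chosen for this example only.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Invented toy attributes; a2 rises with a1, a3 falls with a1.
columns = {
    "a1": [1.0, 2.0, 3.0, 4.0],
    "a2": [2.0, 4.0, 6.0, 8.0],
    "a3": [4.0, 3.0, 2.0, 1.0],
}

# One cell per attribute pair: x-axis attribute, y-axis attribute, color value.
cells = [(i, j, pearson(columns[i], columns[j])) for i in columns for j in columns]
```

Plotting the first two entries of each cell on the axes and the third as the color column gives the kind of grid described above.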

 

 

Other parts of the plotter series:

reporting, RapidAnalytics, Video Tutorial 18 Feb 2011
RapidAnalytics 5: Creating Interactive Reports by Simon Fischer Comment (0)

If you follow our RapidAnalytics video series, you already know how to make simple reports with charts based on RapidMiner processes. This time, we spice that up a bit by defining interactions between report components to realize drill-downs etc.

Contest 17 Feb 2011
New Data Mining Contest by Ingo Mierswa Comment (0)

There is a new data mining contest on Research Garden in the field of nonlinear time series analysis. Come on, RapidMiners, get the 2500 Euro jackpot!

 

 

Research Garden

 

 

The goal of the contest is the development of a data analysis process which optimizes the performance of machines based on their sensor data. The task comes from the field of multivariate, non-linear time series analysis and is rewarded with total prize money of 2500 Euro. Nice.

More information can be found on the web site below (sorry, German only):

http://www.research-garden.de/web/guest/wettbewerbsdetails?cid=38180

Fun 16 Feb 2011
Go, Watson, Go: Win at Jeopardy with Basic Statistics by Ingo Mierswa Comment (2)
Well done, IBM. The new supercomputer named Watson was created and trained during the last 4 years by 25 IBM engineers in order to play (and win!) at Jeopardy. I have just watched a short video about the event and the result really looks impressive:




Watson played quite well against two of the best Jeopardy players in the world. I especially liked seeing the confidences at the bottom of the screen, since this allowed me to check the quality of their model. And they did a good job: the high-confidence cases were mostly the ones where Watson was right.

Another nice thing was the reaction of the other contestants: several times they seemed to know the answer (the question) as well, but they were simply too slow.

And this was only day 1; on the second day of this three-day contest Watson performed even better. But after digging a bit deeper, I found out that the techniques used were pretty simple. At first, I thought that Watson understood the questions by hearing them, but he gets them directly instead. This is of course a big advantage since you don't lose any time with "understanding" what has been said or written. Speaking of time, Watson has another big advantage: he does not lose any time pressing the buzzer.

The basic techniques are pretty simple as well: Watson stores about 200 million pages in a large search index - among them the complete Wikipedia - and searches for the given answer in those pages (ok, we probably all know how this works). From the top k results, Watson extracts the most important person / concept / object etc. and creates an appropriate question. Few details have leaked about that, but from the little there is I got the impression that it is merely a topic detection or a named entity recognition, and that the confidence is based more or less on the average of the topic / NER confidences. Mix those simple ideas with the power of 2800 traditional computers and you get an impressive result...
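Just to illustrate the search-then-extract idea in miniature, here is a toy sketch in Python. It has nothing to do with IBM's real system: the three "pages" are invented, the ranking is plain word overlap, the page title stands in for the topic / named entity extraction, and the overlap ratio of the best hit is used as a crude confidence. All names (`PAGES`, `search`, `respond`) are hypothetical.

```python
# Invented mini "index": page title -> page text.
PAGES = {
    "Isaac Newton": "english physicist who formulated the laws of motion and universal gravitation",
    "Albert Einstein": "physicist who developed the theory of relativity",
    "Marie Curie": "physicist and chemist who conducted pioneering research on radioactivity",
}

def search(clue, pages, k=2):
    """Rank pages by the share of clue words they contain and return the top k."""
    clue_terms = set(clue.lower().split())
    scored = []
    for title, text in pages.items():
        overlap = len(clue_terms & set(text.split()))
        scored.append((overlap / len(clue_terms), title))
    scored.sort(reverse=True)
    return scored[:k]

def respond(clue, pages, k=2):
    """Turn the best hit into a Jeopardy-style question plus a confidence."""
    confidence, title = search(clue, pages, k)[0]
    return f"Who is {title}?", confidence

question, confidence = respond("he formulated the laws of motion", PAGES)
print(question, round(confidence, 2))
```

Even this naive version picks the right page for the sample clue, which gives an idea of why a large index plus a simple extraction step can go a long way.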

Simple ideas are most often the most robust ones, and the scientific and engineering efforts are impressive. Thanks, IBM, for those efforts and also for the positive effect this show probably has on the public acceptance of data mining and business analytics.