GfK uses Rapid-I solutions for the indicative analysis of internet data
Enterprise: GfK Marktforschung, digital research department
Branch of industry: Market research
The GfK group is one of the leading market research enterprises worldwide, operates in more than 100 countries and has over 11,000 employees. GfK provides services for all major consumer goods, pharmaceutical, media and service markets. These services are divided into two sectors: Consumer Choices and Consumer Experiences. Consumer Choices supplies data reflecting consumer decisions and activities. Consumer Experiences is concerned with consumer behaviour and attitudes and how people perceive and experience the world.
Since the internet is playing an ever greater role in business life, the collection and analysis of complex online data are becoming increasingly important. Online market research in the form of surveys carried out over the internet has existed since the 1990s. Now the internet itself is being surveyed – the subject is the traces of communication that users leave behind on it, e.g. comments in a forum about a certain brand or enterprise. To benefit from this unstructured information, the data needs to be categorised, for example by means of so-called sentiment analysis, which captures the sentiments and opinions expressed on the internet.
GfK acknowledged the increasing significance of information on the internet and therefore entrusted the digital research department with the evaluation of this very data. This requires text data to be collected from the web or from surveys and evaluated. Taking a wide variety of data sources into account, pages on the internet need to be searched (crawling) and content extracted and analysed. On the basis of this data analysis, reliable statements should then be possible e.g. concerning the attitude of internet users towards certain products. An adequate analysis solution was sought in order to cope with the increasing amount of texts from online sources.
The digital research department knew exactly what it was looking for in an analysis solution for evaluating web content. The solution had to offer machine learning, e.g. in the form of automatic categorisation of texts. A generic solution with high adaptability and reusability was sought, one that could perform an indicative evaluation of the information found, such as user-generated content (UGC). Moreover, the requested data had to be accurately extracted from unstructured websites and given the necessary metadata (e.g. publication date).
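To illustrate what automatic categorisation of texts can mean in practice, here is a minimal sketch of a naive Bayes text classifier in Python. It is not GfK's or RapidMiner's actual method; the training sentences, the category names and all function names are assumptions made up for this illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_texts):
    """Count words and documents per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_texts:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the category with the highest log-probability (add-one smoothing)."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        # Prior: share of training documents in this category
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Smoothed likelihood of the word given the category
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical, hand-labelled training examples
TRAINING = [
    ("I love this phone, great battery", "positive"),
    ("excellent camera and great screen", "positive"),
    ("terrible battery, I hate it", "negative"),
    ("poor screen and awful support", "negative"),
]
MODEL = train(TRAINING)
```

Calling `classify("great camera", *MODEL)` assigns the new text to one of the trained categories. A production setup would need far more training data and a proper evaluation step.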
Another important criterion when choosing the solution was intuitive handling, since people with varying levels of knowledge (IT specialists, research consultants and analysts) had to be able to use it. In addition, it had to be possible to process large quantities of data quickly on “small” laptops and to exchange analysis processes between employees without complications. The processing of typical data formats from market research (SPSS, MS-Office, ASCII, txt) and integration with SQL databases were also essential.
Source heterogeneity was an important point that the solution had to address, and processing texts from the web made the challenge even tougher.
After a thorough market evaluation, GfK decided in 2007 to use the Rapid-I solution RapidMiner (Enterprise Edition). The analysis of data from social media sources played an increasingly important role. For this purpose GfK developed, using the RapidMiner solution, analysis processes enabling a generic content extraction from virtually any online sources.
The possibility of replication and reusability was a decisive factor for choosing RapidMiner. GfK saves a huge amount of time and effort with the generic content extraction model, which extracts the relevant data from almost any internet sources. This notably makes it possible to reuse processes or process elements and save them as a template or library, so that they do not have to be rewritten for each crawling action. In this way processes run automatically and manual interaction by the user is no longer required.
As time went on, however, enormous volumes of data needed to be evaluated, which RapidMiner could no longer manage alone. The market research institute has therefore also been using the high-performance analysis server RapidAnalytics since 2011 and benefits from greater ease of integration, interactive visualisations and higher performance. The alternative products that had previously been considered for data analysis were ultimately rejected for cost reasons or because of restrictions in functionality, a lack of support or insufficient system openness.
The introduction of RapidMiner did not involve a trial phase; decision-makers were able to try out the solution for themselves in a training event at Rapid-I instead. It was possible to install RapidMiner in just a few minutes – and no special know-how was necessary. RapidAnalytics required an IT specialist to come in for one day, who, in this time, carried out the server setup and the installation, created users and integrated the remote repository into RapidMiner.
The users, on the other hand, had to learn regular expressions (for the creation of filter criteria for text analysis). In addition, RapidMiner had to be configured for web access. A particular technological challenge is the optimisation of processes that require more RAM than is available. To free up memory, sample operators are used to divide the data volume into smaller random samples, which are then analysed step by step. A further issue was proxy support: since RapidMiner had none at the beginning, tunnelling was used as a workaround; proxies are now fully supported.
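The two ideas in this paragraph, regular expressions as filter criteria and step-by-step analysis of smaller random samples, can be sketched in Python as follows. This is only an illustration under stated assumptions, not RapidMiner's sample operator: the brand name "acme", the example comments and the function names are hypothetical, and the data here sits in a list, whereas a real process would read each sample from disk or a database.

```python
import random
import re


def random_batches(items, batch_size, seed=0):
    """Shuffle the items and yield them as disjoint random samples."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


# Hypothetical filter criterion: the brand name as a whole word, any casing
BRAND_PATTERN = re.compile(r"\bacme\b", re.IGNORECASE)


def count_matches(comments, batch_size=2):
    """Count matching comments sample by sample and sum the partial results."""
    total = 0
    for batch in random_batches(comments, batch_size):
        total += sum(1 for comment in batch if BRAND_PATTERN.search(comment))
    return total


# Made-up example data
COMMENTS = [
    "The Acme phone is great",
    "I prefer another brand",
    "acme support was slow",
    "Nothing to say about acmeology",
    "ACME battery died",
]
```

Because only one sample is processed at a time and only the partial counts are kept, peak memory use depends on the batch size rather than on the total data volume.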
At GfK, data is collected from web texts with RapidMiner. Internet sites are searched using crawling processes, and the content is extracted (advertising is weeded out and the HTML structure removed) and then stored in the data warehouse. A sentiment analysis is then carried out on the basis of this data. German websites are searched as well as international ones. RapidMiner is in use at GfK in Germany, but the projects are international.
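As a rough illustration of the step "purged of HTML structure", the following Python sketch uses only the standard library's `html.parser` to keep the visible text of a page and drop script and style content. It is an assumption-laden example rather than GfK's extraction process: the class and function names are hypothetical, and removing advertising would require additional, site-specific rules that are not shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the content of script and style elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The extracted plain text could then be stored together with metadata such as the publication date before any sentiment analysis is run on it.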
RapidMiner can be started quite simply as a program (much like Word) from a desktop PC. Used together with RapidAnalytics, RapidMiner also functions as a kind of user interface: users access RapidAnalytics via RapidMiner, and analysis processes then run around the clock in the background, so that performance is not compromised and a high number of simultaneous users can be served. Thanks to its client-server architecture, RapidAnalytics enables the use of much more powerful hardware and more working memory, and it makes collaboration easier. Since the files are stored in the repository on the server, other users can work on the data, and there is no need to transfer files from one computer to another on a USB stick (as was still the case with the local version of RapidMiner). Without this setup, the high volume of data could hardly be handled.
RapidMiner scored points with its simple user interface, reasonable price and comprehensive support. The fact that it is an open source solution was an important factor from the beginning, because it means that temporary users such as external suppliers or temps can use the system without a license and at short notice.
“The open source idea was important when choosing a high-performance and equally cost-efficient data analysis solution”, points out Thomas Eggebrecht, senior IT specialist (head of programming) at GfK Marktforschung. “We looked at a number of other products on the market, but none were as able as RapidMiner to meet the requirements for flexible and sometimes short-notice use with reliable support.”
Another advantage of RapidMiner is its high degree of flexibility: the solution runs on all systems thanks to Java, and analysis processes are exchanged between employees as XML files. There is a simple update mechanism, and processes can be executed by script under both Linux and Windows. Thanks to its openness, the solution can also be extended at any time with custom plugins or operators via the open source Java API. What is more, all file formats commonly used in market research are supported. RapidAnalytics also allows long-running processes to run unattended on a server, controlled by cron jobs. Since a virtually unlimited number of users can access the data via the analysis server, it is easier for several people to work on a project, and data is exchanged via a remote repository. Using the Rapid-I solution does not make any special demands on hardware or software either, since a standard Linux server with a Java runtime environment is used and no complicated installation routines or root rights are necessary.
“GfK can offer its customers high-quality, controllable and comprehensible methods with the Rapid-I solutions”, explains Eggebrecht. “Thanks to the simple handling, the low software and hardware requirements, the ease of integration and last but not least the possibility of worldwide collaboration, we are optimally equipped to process more or less any request quickly and competently following web content analyses.”
Eggebrecht expects high demand for text mining and data preparation in the future, including from other sectors such as retail and technology. “After the positive experience introducing the Rapid-I solution, I could definitely imagine it being used worldwide”, says Eggebrecht. “GfK is a global enterprise and our projects are international, so the company-wide use of uniform software makes good sense.”
Information about GfK
GfK is one of the world’s biggest market research enterprises. Its more than 11,000 employees research how people live, think and consume. GfK relies on continuous innovation and intelligent solutions here. In this way GfK supplies the knowledge needed by enterprises in over 100 countries to understand the people most important to them: their customers. GfK’s turnover amounted to 1.29 billion euro in 2010.