| Web Scraping, Web Mining, Videos | 5 Apr 2011 |
| Web Mining Video Series by Ingo Mierswa |
Neil McGuigan, who already made a great series of Text Mining videos, has started a new video series about web crawling and web scraping. So far, the series consists of three parts:
Part 1: Web Scraping with Google Spreadsheets and XPath
In his first video, Neil demonstrates how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath. Although RapidMiner is not used here, the explanation of XPath expressions and his list of useful XPath constructs are really helpful if you want to set up a web scraping process with RapidMiner.
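To give a feel for the kind of XPath constructs discussed in the video, here is a minimal sketch in Python. It is not taken from Neil's video: the page fragment, the element names, and the `price` class are made up for illustration, and the standard-library `xml.etree.ElementTree` module used here supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <h1>Product list</h1>
    <div class="product">
      <a href="/items/1">Widget</a>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <a href="/items/2">Gadget</a>
      <span class="price">19.50</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# ".//a" selects every <a> element anywhere below the root.
links = [a.get("href") for a in root.findall(".//a")]

# A predicate in square brackets filters on an attribute value.
prices = [s.text for s in root.findall(".//span[@class='price']")]

# findtext returns the text of the first matching element.
title = root.findtext(".//h1")

print(title, links, prices)
```

The same path expressions (descendant search with `//`, attribute predicates with `[@class='...']`) carry over to other XPath-capable tools, although each tool may support a different portion of the XPath language.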
Part 2: Web Crawling with RapidMiner
Here, Neil shows how to crawl about 500 pages from a site with a simple RapidMiner process. He also discusses user agents, crawling rules, and robot exclusion files.
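The interplay of user agent, crawling rules, and robot exclusion file can be sketched outside RapidMiner as well. The following Python snippet is only an illustration under assumed names: the user agent string, the `example.com` URLs, the follow rule, and the `robots.txt` content are all hypothetical, and nothing is fetched from the network.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string that identifies the crawler to the site.
USER_AGENT = "example-crawler"

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Hypothetical crawling rule: only follow links that stay on this one site.
FOLLOW_RULE = re.compile(r"^https://example\.com/")


def should_fetch(url):
    """Return True if the URL matches the crawling rule and robots.txt allows it."""
    return bool(FOLLOW_RULE.match(url)) and robots.can_fetch(USER_AGENT, url)


for url in (
    "https://example.com/articles/1",
    "https://example.com/private/report",
    "https://other.org/articles/1",
):
    print(url, should_fetch(url))
```

A crawler would call a check like `should_fetch` on every discovered link before requesting it, so that both its own rules and the site's exclusion file are respected.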
Part 3: Web Scraping with RapidMiner and XPath
In this video, Neil shows how to load the 500 HTML files from the previous web crawl, loop through each of them, use XPath to grab values from each page, and put them in a data table for later analysis. This is where the XPath introduction from Part 1 comes in handy.
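The load, loop, extract, and tabulate pattern described here can be sketched in a few lines of Python. This is not the RapidMiner process from the video: the function name `scrape_directory`, the column names, and the two demo pages are assumptions for illustration, and because `xml.etree.ElementTree` only parses well-formed markup, pages that fail to parse are simply skipped.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def scrape_directory(directory):
    """Return one row (a dict) per parseable .html file in the directory."""
    rows = []
    for path in sorted(Path(directory).glob("*.html")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip pages that are not well-formed
        rows.append({
            "file": path.name,
            "title": root.findtext(".//h1"),
            "price": root.findtext(".//span[@class='price']"),
        })
    return rows


# Demo with two hypothetical pages written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("Widget", "9.99"), ("Gadget", "19.50")]
    for i, (name, price) in enumerate(samples, start=1):
        html = (
            f"<html><body><h1>{name}</h1>"
            f"<span class='price'>{price}</span></body></html>"
        )
        Path(tmp, f"page{i}.html").write_text(html, encoding="utf-8")
    table = scrape_directory(tmp)

for row in table:
    print(row)
```

Each dict plays the role of one row in the data table; real crawled pages are often not well-formed, so a more lenient HTML parser would likely be needed in practice.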
Thanks, Neil, for this second great series!


