Web Mining Video Series
by Ingo Mierswa, 5 Apr 2011
Tags: Web Scraping, Web Mining, Videos
Neil McGuigan, who already made a great series of Text Mining videos, has started a new video series about web crawling and web scraping. So far, the series consists of three parts:
In his first video, Neil demonstrates how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath. Although RapidMiner is not used here, the explanation of XPath expressions and his list of useful XPath constructs are really helpful if you want to set up a web scraping process with RapidMiner.
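To give a feel for the kind of XPath constructs discussed in the video, here is a minimal Python sketch. It is not taken from the video: the page fragment, class names, and values are made up for illustration, and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <h1>Product list</h1>
    <div class="product">
      <span class="name">Widget</span><span class="price">9.99</span>
    </div>
    <div class="product">
      <span class="name">Gadget</span><span class="price">19.50</span>
    </div>
    <a href="/page/2">Next</a>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ".//tag" selects matching descendants at any depth.
heading = root.find(".//h1").text

# "[@attr='value']" keeps only elements whose attribute has that value.
names = [e.text for e in root.findall(".//div[@class='product']/span[@class='name']")]
prices = [float(e.text) for e in root.findall(".//span[@class='price']")]

# ElementTree's XPath subset cannot select attribute nodes (no ".../@href"),
# so the attribute is read from the matched element instead.
next_link = root.find(".//a").get("href")

print(heading, names, prices, next_link)
```

Full XPath engines also accept expressions such as `//a/@href` or `text()`; the subset above is only meant to show the basic path and predicate syntax.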
Here, Neil shows how to crawl about 500 pages from a site with a simple RapidMiner process. He also discusses user agents, crawling rules, and robots exclusion files (robots.txt).
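The three ideas mentioned here (a user agent, crawling rules, and robots exclusion) can be sketched outside RapidMiner as well. The following Python sketch is only an illustration under assumptions: the user agent string, the `example.com` URLs, the follow rule, and the robots.txt content are all hypothetical, and no network request is actually sent.

```python
import re
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string identifying the crawler to the site.
USER_AGENT = "example-crawler/0.1"

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Crawling rule (assumed): only follow links that stay on this site.
FOLLOW_RULE = re.compile(r"^https://example\.com/")
MAX_PAGES = 500  # stop after roughly 500 pages, as in the video


def should_fetch(url, fetched_so_far):
    """Return True if the URL passes the page limit, the rule, and robots.txt."""
    if fetched_so_far >= MAX_PAGES:
        return False
    if not FOLLOW_RULE.match(url):
        return False
    return robots.can_fetch(USER_AGENT, url)


def build_request(url):
    """Build (but do not send) a request that carries the crawler's user agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

A crawl loop would call `should_fetch` for every discovered link and only then open the request returned by `build_request`.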
In this video, Neil shows how to load the 500 HTML files from the previous web crawl, loop through each of them, use XPath to grab values from each page, and put them in a data table for later analysis. This is where the XPath introduction from the first video comes in handy.
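The same load, loop, extract, and tabulate pattern can be sketched in plain Python. This is not the RapidMiner process from the video: the file names, column names, and XPath expressions are hypothetical, and the standard-library parser used here assumes well-formed pages (real crawled HTML usually needs a more lenient parser).

```python
import csv
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical mapping of table columns to XPath expressions.
FIELDS = {"title": ".//h1", "price": ".//span[@class='price']"}


def extract_rows(folder):
    """Parse every *.html file in folder and return one dict per page."""
    rows = []
    for path in sorted(Path(folder).glob("*.html")):
        root = ET.parse(path).getroot()
        row = {"file": path.name}
        for column, xpath in FIELDS.items():
            node = root.find(xpath)
            # A missing element becomes None instead of raising an error.
            row[column] = node.text.strip() if node is not None and node.text else None
        rows.append(row)
    return rows


def write_csv(rows, target):
    """Write the extracted rows to a CSV file for later analysis."""
    with open(target, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *FIELDS])
        writer.writeheader()
        writer.writerows(rows)


# Small demonstration with two made-up pages in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "page1.html").write_text(
        "<html><body><h1>Widget</h1><span class='price'>9.99</span></body></html>",
        encoding="utf-8",
    )
    Path(folder, "page2.html").write_text(
        "<html><body><h1>Gadget</h1></body></html>", encoding="utf-8"
    )
    rows = extract_rows(folder)
    write_csv(rows, Path(folder, "pages.csv"))
    csv_text = Path(folder, "pages.csv").read_text(encoding="utf-8")
```

Keeping the column-to-XPath mapping in one dictionary makes it easy to add further values without touching the loop.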
Thanks, Neil, for this second great series!