OK, that helped in a way. At least I've started working on date extraction. Thanks. However, even here I keep running into problems: I have the string "2011/03/17 20:06:08", use the parsing format "yyyy/mm/dd hh:mm:ss" and get the result "17 January 2011 20:06:08 EET". WTF? Why January?
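A possible explanation, assuming the parsing format is interpreted with the pattern letters of Java's java.text.SimpleDateFormat: there, lower-case "mm" means minutes and upper-case "MM" means month, while "hh" is the 1-12 clock hour and "HH" the 0-23 hour. With "yyyy/mm/dd hh:mm:ss" no month is ever read, so it falls back to January; the "03" is read as minutes and later overwritten by "06". The sketch below only illustrates that behaviour in plain Java; the class and method names are made up for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DatePatternDemo {

    // Parse `text` with `pattern`, then print it back in an unambiguous form.
    // UTC is used on both sides so the result does not depend on the local zone.
    static String reparse(String pattern, String text) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ENGLISH);
        in.setTimeZone(utc);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH);
        out.setTimeZone(utc);
        try {
            Date parsed = in.parse(text);
            return out.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        String text = "2011/03/17 20:06:08";
        // Pattern from the post: "mm" is minutes, so the month is never set.
        System.out.println(reparse("yyyy/mm/dd hh:mm:ss", text)); // 2011-01-17 20:06:08
        // Month as "MM" and 24-hour clock as "HH".
        System.out.println(reparse("yyyy/MM/dd HH:mm:ss", text)); // 2011-03-17 20:06:08
    }
}
```

If that assumption holds, "yyyy/MM/dd HH:mm:ss" would be the format to try in the operator.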
To give you advice on how to parse out the information you want, it would help if you gave us the XML/HTML code of one example document.
Well, the URL of an example document is given in the code I attached to my previous post. If you go to that web page and press Ctrl+U, you'll see the source of the page. But OK, maybe I described my problem in a rather messy way. Sorry for that. Now I'll try to explain it more thoroughly.
1. I have to analyse news sites (like http://www.bbc.co.uk/). From their news pages (like http://www.bbc.co.uk/news/uk-12778022) I want to extract the story title, the story's main text and the story date.
2. To do this I use the Crawl Web and Extract Information operators. I use the "Regular Expression" query type, and it extracts the information I need, so I don't need XPath. On the page http://www.bbc.co.uk/news/uk-12778022 the date is extracted with the query <meta name="OriginalPublicationDate" content="(.*?)"/> (the original string is <meta name="OriginalPublicationDate" content="2011/03/17 20:06:08"/>), the title is extracted with the query <h1 class="story-header">(.*?)</h1> (the original string is <h1 class="story-header">Japan crisis: UK rescue team to withdraw</h1>), and the main text of the story is extracted with the query <p class="introduction" id="story_continues_1">(.*?)</div><!-- / story-body -->.
3. Now, the problem is with the last of these: the extracted main text is full of leftover HTML tags. It looks like this (http://usic.org.ua/upload/151029c6dbdb05409ca8506de206cb60485a0c27/Code.txt). I want to clean the main text, but the HTML processing operator doesn't work with data attributes. I tried Data to Documents, but it didn't help: it created a Documents Collection IOO object, which again is not accepted by HTML processing.
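For reference, the regular-expression queries from step 2 can be sketched in plain Java roughly as follows. This is only an illustration outside RapidMiner: the class and method names are made up, the DOTALL flag is my own choice so that "." also matches line breaks, and the title pattern is inferred from the quoted <h1> string, so treat it as an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractor {

    // Return the first capture group of `regex` found in `html`,
    // or null when the pattern does not match at all.
    static String firstGroup(String regex, String html) {
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // The two original strings quoted in the post.
        String html =
            "<meta name=\"OriginalPublicationDate\" content=\"2011/03/17 20:06:08\"/>\n"
            + "<h1 class=\"story-header\">Japan crisis: UK rescue team to withdraw</h1>";

        String date = firstGroup(
            "<meta name=\"OriginalPublicationDate\" content=\"(.*?)\"/>", html);
        // Assumed title pattern, inferred from the quoted <h1> element.
        String title = firstGroup("<h1 class=\"story-header\">(.*?)</h1>", html);

        System.out.println(date);
        System.out.println(title);
    }
}
```

The lazy group (.*?) stops at the first closing delimiter, which is why the main-text query returns everything in between, tags included.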
So, the question is: how do I transform the extracted data into documents that can be processed as "normal" text documents, like TXT files from the hard drive, and then combine them back into a data set?
Any ideas appreciated.
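To make the goal concrete, here is a rough sketch of the kind of cleaning meant above, written in plain Java rather than as a RapidMiner process. It is a naive regex-based approach, not what any RapidMiner operator does internally; the class name, method name and the sample fragment are made up, and a real HTML parser would be more robust on messy pages.

```java
class TagStripper {

    // Naively turn an HTML fragment into plain text:
    // drop comments and tags, decode two common entities, tidy whitespace.
    static String stripTags(String fragment) {
        String text = fragment
            .replaceAll("(?s)<!--.*?-->", " ")  // HTML comments, possibly multi-line
            .replaceAll("<[^>]*>", " ")         // any remaining tag
            .replace("&nbsp;", " ")
            .replace("&amp;", "&");             // decode &amp; last to avoid double decoding
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the extracted main text.
        String fragment =
            "<p class=\"introduction\" id=\"story_continues_1\">Japan crisis: "
            + "<b>UK rescue team</b> to withdraw</p>\n</div><!-- / story-body -->";
        System.out.println(stripTags(fragment)); // Japan crisis: UK rescue team to withdraw
    }
}
```

Because every tag is replaced by a space, a tag sitting directly before punctuation leaves a stray space there, which may or may not matter for later text processing.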