Author Topic: Processing multiple xml files for tf-idf  (Read 698 times)
Ruca
Newbie
*
Posts: 13


« on: September 27, 2013, 02:08:27 PM »

Hi all,

I have an issue with processing several news articles that are spread across multiple XML files.
The XML files have the following structure:

article_set1.xml
<article_set>
 <article id="1">
  <article_text>...</article_text>
 </article>

 <article id="2">
  <article_text>...</article_text>
 </article>
</article_set>


article_set2.xml
<article_set>
 <article id="10">
  <article_text>...</article_text>
 </article>

 <article id="11">
  <article_text>...</article_text>
 </article>
</article_set>

This means that each XML file contains different articles to be processed. Each article must be treated as a separate document to be processed by tf-idf.
My first attempt was to use the "Read XML" operator and connect it to "Process Documents from Data". It works fine, but it can process only one XML file.
My second attempt was to use a "Loop Files" iterator at the beginning of the process. With this approach, a separate tf-idf vector is created for each XML file processed.
My third attempt used only "Process Documents from Files" and processed the XML files internally. This approach treats each XML file as a single document.
My objective is for each article id to be treated as a separate document, even when multiple XML files need to be processed.

Any guidance on this issue is more than welcome.

Thank you for your support.

Regards,


Ruca
Marius
Administrator
Hero Member
*****
Posts: 1794



« Reply #1 on: September 30, 2013, 09:41:25 AM »

Hi Ruca,

If you use Process Documents from Files, you can split each file into its subdocuments via Split Document.
If you use the Loop Files operator, you can use Read XML inside the loop, append the resulting data, and then use Process Documents from Data after the loop, not inside it.
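
For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of the second approach: read every file, append all articles into one list, and only then compute tf-idf over the combined list. The function names, the sample article texts, and the plain tf * log(N/df) weighting are assumptions made for illustration; this is not RapidMiner's implementation, and its weights may be normalized differently.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def read_articles(xml_strings):
    """Collect (article_id, text) pairs from every XML document (the 'append' step)."""
    articles = []
    for xml_text in xml_strings:
        root = ET.fromstring(xml_text)
        for article in root.iter("article"):
            text = article.findtext("article_text", default="")
            articles.append((article.get("id"), text))
    return articles


def tf_idf(articles):
    """Return {article_id: {term: weight}} computed over the combined article list."""
    tokenized = {aid: re.findall(r"\w+", text.lower()) for aid, text in articles}
    n_docs = len(tokenized)

    # Document frequency: in how many articles does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    weights = {}
    for aid, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty articles
        weights[aid] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Made-up sample data standing in for article_set1.xml and article_set2.xml.
set1 = (
    '<article_set>'
    '<article id="1"><article_text>storm warning issued</article_text></article>'
    '<article id="2"><article_text>storm passes coast</article_text></article>'
    '</article_set>'
)
set2 = (
    '<article_set>'
    '<article id="10"><article_text>market rally continues</article_text></article>'
    '</article_set>'
)

articles = read_articles([set1, set2])  # one combined list across both files
vectors = tf_idf(articles)              # tf-idf runs once, after the loop
```

Because the document frequencies are counted over the combined list, every article id ends up as its own document with weights that are comparable across files. Computing tf-idf inside the loop would instead give each file its own vocabulary and its own idf values.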

Does that help?

Best regards,
Marius
Ruca
Newbie
*
Posts: 13


« Reply #2 on: September 30, 2013, 02:32:41 PM »

Hi Marius,

Thank you very much for your help. I used the "Loop Files" operator with the append step and it works fine!

My problem now is how to store the results in a MySQL database. Since the number of columns in MySQL is limited, I had to perform a transpose operation, which turns the terms into IDs.
I'm getting two different terms, "el-nino" and "el niño", which should be different terms according to the UTF-8 character set. Since the terms are now IDs, I'm not able to store these rows in a table because MySQL assumes that they are the same term.
I had to change the role of the ID column to regular. It works, but I guess it is not the right way to do it.
Does anyone have another approach for doing this?
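
One possible explanation, offered as an assumption rather than a diagnosis: MySQL's default case-insensitive collations such as utf8_general_ci compare "n" and "ñ" as equal, so two terms that differ only by an accent collide in a unique or primary key column. The Python sketch below only approximates such a comparison by stripping combining marks (it is not MySQL's actual algorithm) and can be used to find which terms would clash before exporting. The function names and the sample terms are hypothetical.

```python
import unicodedata


def accent_fold(term: str) -> str:
    """Approximate an accent- and case-insensitive comparison key.

    This is a rough stand-in for a collation such as utf8_general_ci,
    not a reimplementation of MySQL's comparison rules.
    """
    decomposed = unicodedata.normalize("NFD", term)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def find_collisions(terms):
    """Group terms that share the same folded key and keep groups with more than one term."""
    groups = {}
    for term in terms:
        groups.setdefault(accent_fold(term), []).append(term)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical term list; in practice this would be the transposed term column.
terms = ["el nino", "el niño", "storm"]
collisions = find_collisions(terms)

# If the collation is the cause, declaring the term column with a binary
# collation keeps such terms distinct, for example (table layout assumed):
#   term VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin
```

If the collision list is empty for the real terms, the collation is probably not the cause and the duplicate key error comes from somewhere else.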

Thank you for your support!

Regards,

Ruca