Author Topic: MySQL, PDFs and text mining  (Read 4440 times)
Patrick
Guest
« on: November 28, 2008, 05:41:18 PM »

Yep,
 
I know it's a bit of a weird title, but here is where it comes from. I have a rather large repository of documents (mostly PDFs, some HTML and some TXT files), and they are stored in a database (yes, a single field of every record in my table contains a whole PDF, alongside fields describing the doctype, language, origin and a label). The files live in a database rather than on a filesystem because we want to enable access to them from outside through some simple PHP scripts.
In the end, this database will also contain some properties of these PDFs gathered with RM...
 
 
However, we also want to explore all the PDFs and see which are 'similar' and which are not (yes, using RM and the WVTool, that is).
The problem now is this:
1/ Accessing a MySQL database from within RM is a piece of cake (thank you, RM developers!), as is specifying which field to pick.
2/ Using the WVTool on a directory of PDFs is easy as well.
 
But... is there a way to forward the MySQL stream (which actually contains the PDFs) to the TextInput method of the WVTool, so that the WVTool treats these fields of the different records as files? Or do I need to use a method other than TextInput for this task?
 
In summary: how can I replace the directory or URL input option with a kind of ExampleSet input option?
 
BTW, I'm working on a Linux FC8 system, if that helps to solve/circumvent the problem.
All help is greatly appreciated.
 
(A possible solution would be to start from PDFs on a filesystem and load them, together with the gathered data, into the MySQL database using RM, but I want to avoid that approach: we are working in a project with different teams, where some provide the data (PDFs, TXTs, ...), I do the text mining, and still others will use the outcome. Hence a web-accessible database is preferable to a filesystem...)
 
Best,
Patrick
Tobias Malbrecht
Global Moderator
Sr. Member
Posts: 293
« Reply #1 on: November 28, 2008, 05:44:15 PM »

Hi Patrick,

But... is there a way to forward the MySQL stream (which actually contains the PDFs) to the TextInput method of the WVTool, so that the WVTool treats these fields of the different records as files? Or do I need to use a method other than TextInput for this task?
 
In summary: how can I replace the directory or URL input option with a kind of ExampleSet input option?

well, the answer is quite simple: load the data into RM from the database, make sure that the attributes containing the texts are loaded as string attributes, and then use the StringTextInput operator.

Hope that helps,
Tobias

Tobias Malbrecht
Director of Product Marketing
RapidMiner
pdemaziere
Newbie
Posts: 2


« Reply #2 on: November 30, 2008, 10:03:39 AM »

Hello Tobias,

It still won't work; I get an error when I try what you suggest.
Here's the XML of the RM process:
<operator name="Root" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_url"   value="jdbc:mysql://localhost:3306/test_rapid"/>
        <parameter key="id_attribute"   value="ID"/>
        <parameter key="label_attribute"   value="DocName"/>
        <parameter key="query"   value="SELECT `ID`, `Doc`, `DocName` FROM `Documents`"/>
        <parameter key="string_attribute"   value="Doc"/>
        <parameter key="username"   value="root"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="default_content_encoding"   value="utf-8"/>
        <parameter key="default_content_language"   value="english"/>
        <parameter key="default_content_type"   value="pdf"/>
        <parameter key="id_attribute_type"   value="long"/>
        <parameter key="vector_creation"   value="TermFrequency"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
        <operator name="NGramTokenizer" class="NGramTokenizer">
        </operator>
    </operator>
</operator>


And here's my database scheme:
CREATE TABLE IF NOT EXISTS `Documents` (
  `ID` bigint(20) NOT NULL auto_increment,
  `Doc` longblob NOT NULL,
  `MIME` tinytext collate utf8_unicode_ci NOT NULL,
  `DocName` tinytext collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`ID`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=8 ;

The Doc field contains the PDF itself.
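The setup can be reproduced in miniature with a hedged sketch that uses Python's built-in sqlite3 as a stand-in for MySQL: the table and column names mirror the schema above, but everything else (the dummy bytes, the MIME value) is illustrative only. The point it shows is that a blob column holds the file's raw bytes, not its text.

```python
# Hedged sketch: the Documents table above in miniature, using Python's
# built-in sqlite3 as a stand-in for MySQL. The Doc column holds the
# file's raw bytes, not its readable text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Documents (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Doc BLOB NOT NULL,
    MIME TEXT NOT NULL,
    DocName TEXT NOT NULL)""")

pdf_bytes = b"%PDF-1.4\n..."   # stands in for a real PDF file's content
conn.execute("INSERT INTO Documents (Doc, MIME, DocName) VALUES (?, ?, ?)",
             (pdf_bytes, "application/pdf", "example.pdf"))

# Reading the field back returns the same raw bytes.
(doc,) = conn.execute("SELECT Doc FROM Documents WHERE ID = 1").fetchone()
print(doc[:8])   # b'%PDF-1.4'
```

What comes back from the SELECT is exactly the byte stream that DatabaseExampleSource will later hand on, which is why the loaded content starts with the PDF header rather than with words.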

The error I get is:
Error in: StringTextInput (StringTextInput)
The input example set does not contain any attributes with value type string.

Loading the data out of the database did work; that was no problem, since I was able to see the contents in a window. The only problem is that this really is raw file content, starting in my case with
%PDF-1.4-%
and then some more "cryptic" characters...
 
I hope you can help me out here.
many thanks in advance,
Patrick
Tobias Malbrecht
Global Moderator
Sr. Member
Posts: 293
« Reply #3 on: November 30, 2008, 06:26:12 PM »

Hi Patrick,

the problem here is that the data you loaded from the database does not contain a string attribute. The StringTextInput operator, however, assumes the text to be processed is given in string attributes. Probably your database columns containing the texts are loaded as nominal attributes; you can check that by adding a breakpoint after the DatabaseExampleSource operator and inspecting the meta data view. You simply have to convert the attribute containing the text to the string value type, which can be done with the Nominal2String operator. Place the following code between the data source operator and the StringTextInput operator:

Code:
    <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
        <parameter key="attribute_name_regex" value="Doc"/>
        <parameter key="condition_class" value="attribute_name_filter"/>
        <operator name="Nominal2String" class="Nominal2String">
        </operator>
    </operator>

This should do the trick. The PDF commands should be ignored automatically.

Regards,
Tobias

Tobias Malbrecht
Director of Product Marketing
RapidMiner
Rocky
Guest
« Reply #4 on: December 01, 2008, 06:36:24 AM »

Hi the forum,

I am on Debian Linux with an Intel Core 2 Duo processor and 1 GB of RAM. I have downloaded and installed RapidMiner 4.3 Community: it works, it rocks!
Now I have downloaded the plugins and put them into the "RapidMiner/lib" directory. On startup, RM says: "unable to load <text plugin related jars>: bad version number". I used the 4.3 version of the plugins you provided.

Any idea ?

Rocky.
pdemaziere
Newbie
Posts: 2


« Reply #5 on: December 01, 2008, 05:43:10 PM »

Hi Rocky,
upgrading to Java version 6 might solve your problems.
Patrick


Hello Tobias,

I still can't get the damned thing to work; I guess I made some crucial mistake somewhere or did not explain something well. In my database the whole PDF is loaded as a blob, which means the content of this database field is identical to what you get with "more filename.pdf" rather than with "acroread filename.pdf".

So, compared to reading the directory of PDFs directly, I still do not get the same results: RM treats the whole file content as one "variable", which makes it look like the string tokenization and stemming did not take place...

Here's the code for the file-based approach:
<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <parameter key="create_text_visualizer"   value="true"/>
        <parameter key="default_content_encoding"   value="utf-8"/>
        <parameter key="default_content_language"   value="english"/>
        <parameter key="default_content_type"   value="pdf"/>
        <parameter key="id_attribute_type"   value="long"/>
        <parameter key="prune_below"   value="2"/>
        <list key="texts">
          <parameter key="SEM Papers"   value="/nfs_points/Simone/data1_simone/library/papers/sem"/>
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars"   value="3"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
        <operator name="NGramTokenizer" class="NGramTokenizer">
        </operator>
    </operator>
</operator>


And here for the database-approach:
<operator name="Root" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_url"   value="jdbc:mysql://localhost:3306/test_rapid"/>
        <parameter key="id_attribute"   value="ID"/>
        <parameter key="label_attribute"   value="DocName"/>
        <parameter key="password"   value="Vi6obvUPotQ="/>
        <parameter key="query"   value="SELECT ID, DocName, DocContent FROM `Documents`;"/>
        <parameter key="username"   value="root"/>
    </operator>
    <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
        <parameter key="attribute_name_regex"   value="DocContent"/>
        <parameter key="condition_class"   value="attribute_name_filter"/>
        <operator name="Nominal2String" class="Nominal2String">
        </operator>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="create_text_visualizer"   value="true"/>
        <parameter key="default_content_encoding"   value="utf-8"/>
        <parameter key="default_content_language"   value="english"/>
        <parameter key="default_content_type"   value="pdf"/>
        <parameter key="id_attribute_type"   value="long"/>
        <parameter key="prune_below"   value="2"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars"   value="3"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
        <operator name="NGramTokenizer" class="NGramTokenizer">
        </operator>
    </operator>
</operator>


So what am I doing wrong?
PS: DocContent is exactly what you see when you do "cat filename.pdf".
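For what it's worth, one way around the raw-PDF problem would be to extract the text before it goes into the DocContent column, so the database already holds plain text and StringTextInput has something to tokenize. A minimal sketch, assuming poppler's pdftotext is installed on the Linux box; the function names here are mine, not RapidMiner's:

```python
# Hedged sketch: extract a PDF's text before loading it into MySQL,
# so the column contains tokenizable text instead of raw PDF bytes.
# Assumes poppler-utils (pdftotext) is installed.
import subprocess

def pdftotext_command(pdf_path):
    # The trailing "-" tells pdftotext to write the extracted text to
    # stdout instead of creating a .txt file next to the PDF.
    return ["pdftotext", pdf_path, "-"]

def extract_text(pdf_path):
    """Return the plain text of a PDF, ready for insertion into the database."""
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

The extracted string would then be inserted in place of the blob, and the database-based process above should behave like the file-based one.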
« Last Edit: December 01, 2008, 05:50:17 PM by pdemaziere »
rocky
Guest
« Reply #6 on: December 17, 2008, 02:05:48 PM »

Hi Patrick,

Thank you for the tip, it is working now indeed!
Rocky.