Author Topic: StringTextInput discards original ID values, replaces with different values  (Read 8214 times)
B.
Jr. Member
**
Posts: 71


« on: June 11, 2008, 07:51:34 AM »

I use custom ID values for each record in my database.  StringTextInput discards these ID values and the ID attribute name, and inserts its own attribute name and ID values.

Is there a way to keep my original ID values?  The new ID values do not match the original records, so I cannot match my records to the cluster results.  See step 3 in the screen capture to see how they change.

Code:
<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
        <parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/xxx"/>
        <parameter key="id_attribute" value="RecID"/>
        <parameter key="password" value="qqqqq"/>
        <parameter key="query" value="SELECT * FROM [tblGolfTest]"/>
        <parameter key="username" value="sa"/>
    </operator>
    <operator name="ExampleVisualizer (Step1)" class="ExampleVisualizer" breakpoints="before">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="filter_nominal_attributes" value="true"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
    <operator name="ExampleVisualizer (Step2)" class="ExampleVisualizer" breakpoints="before">
    </operator>
Logged
Tobias Malbrecht
Global Moderator
Sr. Member
*****
Posts: 293



WWW
« Reply #1 on: June 13, 2008, 09:59:40 AM »

Hi,

I just checked and reproduced your problem. The StringTextInput operator seems to simply create a new ID attribute instead of keeping the old one. As I am not that familiar with the text plugin and its implementation, I do not know whether this is intended or an accidental implementation artefact. I assume the latter. I will check this and post again when I know more about this issue.

Regards,
Tobias
Logged

Tobias Malbrecht
Director of Product Marketing
RapidMiner
B.
Jr. Member
**
Posts: 71


« Reply #2 on: June 16, 2008, 05:21:49 AM »

Tobias,

If the original ID is discarded, there is no easy way to link results back to the original data.  I think this is an error.

B..
Logged
B.
Jr. Member
**
Posts: 71


« Reply #3 on: June 20, 2008, 03:11:57 PM »

Tobias

Have you determined what the problem is with ID not carrying through the process?  thanks
Logged
jdouet
Newbie
*
Posts: 21



WWW
« Reply #4 on: June 25, 2008, 11:33:35 PM »

Hi Tobias, Hi All,

I have a problem with both DatabaseExampleSource and StringTextInput. I have PHP/MySQL blogs, and I took the dump/backup files and imported them into my local MySQL server.
Then I constructed my SQL query, expecting "post_content" and "post_title" to be of type "string". In the original DB they were "text", but after importing they are "nominal". What can I do?
I have used "filter nominal attributes", but it is refused since there is no string attribute in my resulting ExampleSet.

Cheers,
   Jean-Charles.
Logged
B.
Jr. Member
**
Posts: 71


« Reply #5 on: June 26, 2008, 04:07:49 AM »

Jean-Charles

For StringTextInput, set filter_nominal_attributes to ON/TRUE.  I am able to get text into the STI operator from my SQL database.  However, the record ID is discarded in STI, and you will not be able to match results back to the original data in your SQL database.


HTH
B.



    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_system"   value="Microsoft SQL Server (JTDS)"/>
        <parameter key="database_url"   value="jdbc:jtds:sqlserver://localhost:1433/database"/>
        <parameter key="id_attribute"   value="RecID"/>
        <parameter key="password"   value="zzz"/>
        <parameter key="query"   value="SELECT "/>
        <parameter key="username"   value="sa"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="filter_nominal_attributes"   value="true"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
Logged
B.
Jr. Member
**
Posts: 71


« Reply #6 on: June 26, 2008, 04:20:56 AM »

Jean-Charles

I forgot to mention that your SQL query in DatabaseExampleSource should select the text fields from your database.

<parameter key="query"   value="SELECT TextField1, TextField2, .......  From Table"/>

I haven't mixed text with other data types such as numbers or dates, so I can't tell you what will happen.

B.
Logged
jdouet
Newbie
*
Posts: 21



WWW
« Reply #7 on: June 26, 2008, 07:06:52 AM »

Hi B.,

I have tried with "filter nominal attributes" : nothing...
Here is my experiment :

<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource" breakpoints="after">
        <description text="see the text/nominal type problem: on the SQL side with charset, cast and convert; on the RM side by tinkering with the AML, and by asking Ingo"/>
        <parameter key="database_url"   value="jdbc:mysql://localhost:3306/installer0018843"/>
        <parameter key="id_attribute"   value="ID"/>
        <parameter key="label_attribute"   value="post_category"/>
        <parameter key="password"   value="dummy"/>
        <parameter key="query"   value="select ID, post_issue_date, post_content, post_title, post_category from evo_posts;"/>
        <parameter key="table_name"   value="evo_posts"/>
        <parameter key="username"   value="root"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter" activated="no">
        <parameter key="attribute_description_file"   value="C:\Documents and Settings\JCD\Bureau\outils de recherche\analyses statistiques\analyse_site\table_posts_blogs.aml"/>
        <parameter key="example_set_file"   value="C:\Documents and Settings\JCD\Bureau\outils de recherche\analyses statistiques\analyse_site\data\table_posts_blogs.dat"/>
    </operator>
    <operator name="texte" class="OperatorChain" expanded="yes">
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="create_text_visualizer"   value="true"/>
            <parameter key="default_content_encoding"   value="windows-1252"/>
            <parameter key="default_content_language"   value="french"/>
            <parameter key="default_content_type"   value="html"/>
            <parameter key="filter_nominal_attributes"   value="true"/>
            <list key="namespaces">
            </list>
            <parameter key="vector_creation"   value="TermOccurrences"/>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
        </operator>
    </operator>
</operator>

"post_title" and "post_content" are "text" types, while there are also "enum(published, private)" columns, which should rather be of nominal type. But it does not seem to work this way...

Jean-Charles.

PS: I have used "GuessValueType" with no effect... It would be useful if nominal attributes whose values all contain blank spaces were recognized as "string", wouldn't it?
« Last Edit: June 26, 2008, 03:42:15 PM by jdouet » Logged
B.
Jr. Member
**
Posts: 71


« Reply #8 on: June 27, 2008, 05:52:20 AM »

Jean-Charles

I notice your process structure is a little different from mine.

You read data from MySQL and save it with ExampleSetWriter, then continue to an OperatorChain with StringTextInput.  You do not read your example set back into the process.  You also select a date type (post_issue_date).

DBExampleSource
ExampleSetWriter
OperatorChain (Text)
  STI
    StringTokenizer



My structure is more direct and I don't have date or non-text fields.

DBExampleSource
STI
  StringTokenizer

These are the only differences I see between the two processes.  Can you rearrange your process to match mine (leave out ExampleSetWriter and don't put STI and the tokenizer in an OperatorChain) and use only text fields (no date or non-text fields) to see what results you obtain?


Also, I remember now there was an issue with the text operators reading from SQL databases:
http://rapid-i.com/rapidforum/index.php/topic,19.0.html

<We fixed the Text plugin and uploaded a new version at:>

Windows Installer: http://rapid-i.com/snapshot/rapidminer-text-4.1-installer.exe

Good luck.


Logged
jdouet
Newbie
*
Posts: 21



WWW
« Reply #9 on: June 27, 2008, 04:37:14 PM »

Hi B.,

About ExampleSetWriter: it was disabled. I use that option instead of deleting the operator.

Here is my new experiment:
<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource" breakpoints="after">
        <description text="see the text/nominal type problem: on the SQL side with charset, cast and convert; on the RM side by tinkering with the AML, and by asking Ingo"/>
        <parameter key="database_url"   value="jdbc:mysql://localhost:3306/installer0018843"/>
        <parameter key="id_attribute"   value="ID"/>
        <parameter key="password"   value="..."/>
        <parameter key="query"   value="select ID, post_content, post_title from evo_posts;"/>
        <parameter key="username"   value="root"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="create_text_visualizer"   value="true"/>
        <parameter key="default_content_encoding"   value="windows-1252"/>
        <parameter key="default_content_language"   value="french"/>
        <parameter key="default_content_type"   value="html"/>
        <parameter key="filter_nominal_attributes"   value="true"/>
        <list key="namespaces">
        </list>
        <parameter key="vector_creation"   value="TermOccurrences"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
</operator>

I have run the experiment, but I still have nominal attributes.
Now I am going to have a look at the bugfix, thank you!

Cheers,
  Jean-Charles.
Logged
jdouet
Newbie
*
Posts: 21



WWW
« Reply #10 on: June 27, 2008, 09:21:00 PM »

Ok, Hello all again...

I have used the corrected plugin : behavior and results are different, but a problem still remains. Here is my experiment :
<operator name="travail_sur_dump_evoblogs" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <description text="blablabla"/>
        <parameter key="database_url"   value="jdbc:mysql://localhost:3306/installer0018843"/>
        <parameter key="id_attribute"   value="ID"/>
        <parameter key="label_attribute"   value="cat_name"/>
        <parameter key="password"   value="---"/>
        <parameter key="query"   value="select evo_posts.ID, evo_posts.post_issue_date, evo_posts.post_content, evo_posts.post_title, evo_categories.cat_name, evo_categories.cat_blog_ID from evo_posts, evo_categories where evo_posts.post_category=evo_categories.cat_ID and ID &lt; 20;"/>
        <parameter key="username"   value="root"/>
    </operator>
    <operator name="ChangeAttributeRole (2)" class="ChangeAttributeRole">
        <parameter key="name"   value="post_issue_date"/>
        <parameter key="target_role"   value="id"/>
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole" breakpoints="after">
        <parameter key="name"   value="cat_blog_ID"/>
        <parameter key="target_role"   value="batch"/>
    </operator>
    <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
        <parameter key="attribute_name_regex"   value="post_content|post_title"/>
        <parameter key="deliver_inner_results"   value="true"/>
        <operator name="StringTextInput (2)" class="StringTextInput" expanded="yes">
            <parameter key="create_text_visualizer"   value="true"/>
            <parameter key="default_content_encoding"   value="windows-1252"/>
            <parameter key="default_content_language"   value="french"/>
            <parameter key="default_content_type"   value="html"/>
            <parameter key="filter_nominal_attributes"   value="true"/>
            <parameter key="id_attribute_type"   value="short"/>
            <list key="namespaces">
            </list>
            <parameter key="vector_creation"   value="TermFrequency"/>
            <operator name="StringTokenizer (2)" class="StringTokenizer">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
                <parameter key="min_chars"   value="3"/>
            </operator>
            <operator name="SnowballStemmer" class="SnowballStemmer">
            </operator>
        </operator>
    </operator>
</operator>

Now, comparing before and after the breakpoint:
- the "batch" attribute disappears...? If I create a "label_2" attribute type, it disappears too!
- If I activate "extend exampleset", there is strange behaviour: all vectors are NULL, and the old attributes from before vectorization remain.

Is that normal, doctor ?

Cheers,
   Jean-Charles.
Logged
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1226



WWW
« Reply #11 on: June 27, 2008, 10:16:18 PM »

Hi,

about the special attributes which got lost: I think there is an option like "append_to_example_set" or "extend_example_set" or something similar. I think this parameter was added in order to keep the former attributes (at least the id attribute but probably also the others like batch etc.).

Cheers,
Ingo
Logged

jdouet
Newbie
*
Posts: 21



WWW
« Reply #12 on: June 27, 2008, 11:00:30 PM »

Tobias

Have you determined what the problem is with ID not carrying through the process?  thanks

@B.: I realized that I have the same problem as you...
@Ingo: I deactivated "extend exampleset"; I have understood why all my vectors were flat (!!)

To sum up: I have lost "batch" and equivalent attributes, and my IDs have been modified...

Cheers,
   Jean-Charles.
Logged
B.
Jr. Member
**
Posts: 71


« Reply #13 on: June 30, 2008, 05:37:36 AM »

Jean-Charles, Ingo

The problem is probably in the STI operator and how it handles ID attributes.

I set id_attribute_type to short and to long; in both cases the text fields from my SQL records were merged into one field and used as the ID in place of a number generated by STI.

When I select one text field from the database, only that field is used as the ID.  So if I have several words or a sentence those words or sentence become the ID values.

I suggest expanding the functionality of STI to include a fourth type of ID, pass-through or external ID that is passed into STI and not altered.   Then we can match RM output back to original source data.
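
To make the pass-through idea concrete, here is a minimal sketch in plain Python, outside RapidMiner. It is not how the text plugin works internally; the function name `term_occurrences_by_id` and the whitespace tokenizer are assumptions made only for illustration. The point is that the ID supplied with each record is used as the key of the result and is never replaced.

```python
from collections import Counter


def term_occurrences_by_id(records):
    """Build term-occurrence vectors keyed by the caller's own record IDs.

    records: iterable of (record_id, text) pairs, e.g. rows from a SQL query.
    Returns a dict {record_id: Counter mapping token -> occurrence count}.
    The record_id is passed through untouched (hypothetical helper).
    """
    vectors = {}
    for record_id, text in records:
        if record_id in vectors:
            # An external ID is only useful for matching if it is unique.
            raise ValueError(f"duplicate record ID: {record_id!r}")
        # Naive tokenizer: lowercase and split on whitespace.
        vectors[record_id] = Counter(text.lower().split())
    return vectors
```

With vectors keyed this way, cluster results can be looked up by the original RecID directly.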

B.

Logged
jdouet
Newbie
*
Posts: 21



WWW
« Reply #14 on: June 30, 2008, 11:11:13 AM »

Hi B.,

I see the same behaviour: the "post_content" field becomes the ID field! To work around it, I have to reload the original table and "ExampleJoin" it with the vector table...
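
The same matching can be sketched outside RapidMiner. This is a hedged Python illustration, not the join operator itself: it assumes the text that ended up in the ID column is byte-for-byte identical to the text in the source table, and the function name `recover_ids` is made up for this example. Texts that occur in several source records (or in none) cannot be matched safely, so they are reported separately.

```python
from collections import defaultdict


def recover_ids(original_rows, vector_rows):
    """Map vector rows back to source record IDs via the text used as ID.

    original_rows: iterable of (record_id, text) as read from the database.
    vector_rows:   iterable of (text_id, vector), where text_id is the text
                   that replaced the ID (assumed unchanged by processing).
    Returns (matched, unmatched):
      matched   - list of (record_id, vector) for unambiguous matches
      unmatched - list of text_ids with zero or several candidate records
    """
    ids_by_text = defaultdict(list)
    for record_id, text in original_rows:
        ids_by_text[text].append(record_id)

    matched, unmatched = [], []
    for text_id, vector in vector_rows:
        candidates = ids_by_text.get(text_id, [])
        if len(candidates) == 1:
            matched.append((candidates[0], vector))
        else:
            unmatched.append(text_id)
    return matched, unmatched
```

If two posts have identical content, this approach cannot tell them apart, which is one more reason a pass-through ID would be preferable.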

Cheers,
  Jean-Charles.
Logged