Topic: KernelKMeans now produces an error when classifying text
B.
Jr. Member
**
Posts: 71


« on: July 20, 2008, 08:30:25 PM »

RM team

I have switched to RM 4.2. I began testing with an existing project that classifies text using KernelKMeans. Text is read from a database and passed through StringTextInput and StringTokenizer. This operator chain worked before. Now I receive an error message:

Error 104 - non-numeric
Error in: KernelKMeans (KernelKMeans) The example set contains non-numerical attribute #0: StockItemDesc (nominal/single_value)/values=

Using KMedoids to classify text works. Looking at the metadata with ExampleVisualizer, there are string vectors and weights.

Here is the project.

<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information  for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_system"   value="Microsoft SQL Server (JTDS)"/>
        <parameter key="database_url"   value="jdbc:jtds:sqlserver://localhost:1433/XXX"/>
        <parameter key="id_attribute"   value="IDNbr"/>
        <parameter key="password"   value="y6sa3JX9Wrc="/>
        <parameter key="query"   value="SELECT [Text], [IDNbr] FROM [Classify]"/>
        <parameter key="username"   value="sa"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="filter_nominal_attributes"   value="true"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
    <operator name="ExampleVisualizer" class="ExampleVisualizer" breakpoints="before">
    </operator>
    <operator name="KernelKMeans" class="KernelKMeans" breakpoints="after">
        <parameter key="k"   value="500"/>
        <parameter key="kernel_type"   value="KernelDot"/>
    </operator>
    <operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
        <parameter key="keep_cluster_model"   value="false"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file"   value="Example.dat"/>
        <parameter key="special_format"   value="$i $v[cluster]"/>
    </operator>
</operator>

Thanks for your help.

B

B.


« Reply #1 on: July 21, 2008, 03:56:19 AM »

After an hour or two of work, KMedoids fails with an Index Out of Bounds error message.

It does not fail immediately on starting, as KernelKMeans does now.

B.


« Reply #2 on: July 21, 2008, 05:11:38 AM »

KMeans also stops with an error: "example set contains non numerical attributes #0"
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1226



WWW
« Reply #3 on: July 22, 2008, 03:56:03 PM »

Hello,

Are you sure this process worked with RM 4.1 and earlier? I am asking because, as far as I can see, the "usual" kernel functions of RapidMiner are used, and those never supported nominal values...

However, you could of course use the operator Nominal2Numeric before the clustering; it might even be more appropriate to apply a Nominal2Binominal first.

Cheers,
Ingo

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
B.


« Reply #4 on: July 29, 2008, 06:26:04 AM »

Ingo

I  reinstalled RM 4.1 alongside RM 4.2.  I tested this project.  It runs under 4.1 and fails under 4.2.

Same SQL query to pull records and same text in the records.
+++++++++++++
<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information  for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_system"   value="Microsoft SQL Server (JTDS)"/>
        <parameter key="database_url"   value="jdbc:jtds:sqlserver://localhost:1433/SqlServer"/>
        <parameter key="id_attribute"   value="RecID"/>
        <parameter key="password"   value="y6sa3JX9Wrc="/>
        <parameter key="query"   value="SELECT [Text1], [Text2], [RecID] FROM
"/>
        <parameter key="username"   value="sa"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="filter_nominal_attributes"   value="true"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
    <operator name="ExampleVisualizer" class="ExampleVisualizer">
    </operator>
    <operator name="KernelKMeans" class="KernelKMeans">
        <parameter key="k"   value="500"/>
        <parameter key="kernel_type"   value="KernelDot"/>
    </operator>
    <operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
        <parameter key="keep_cluster_model"   value="false"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file"   value="C:\TestDataOutput.dat"/>
        <parameter key="special_format"   value="$i $v[cluster]"/>
    </operator>
</operator>

+++
4.2 error message

Error in: KernelKMeans (KernelKMeans) The example set contains non-numerical attribute #0: StockItemDesc
++++++++++++++++++
<as far as I can see the "usual" kernel functions of RapidMiner are used and those never supported nominal values>

Doesn't the FilterNominalAttributes convert the attributes to a usable format for further processing?

Thanks for your help.

B.


Ingo Mierswa
« Reply #5 on: July 29, 2008, 05:10:50 PM »

Hi,

Quote
I  reinstalled RM 4.1 alongside RM 4.2.  I tested this project.  It runs under 4.1 and fails under 4.2.

Thanks for this info. I have now found the reason for this behaviour. It actually has nothing to do with the clustering operator but with StringTextInput. There is a new parameter "remove_original_attributes" whose default is unfortunately "false" rather than "true" (which would have kept backwards compatibility), so the original nominal (or string) attributes are no longer removed. This caused the error in the clustering: the kernel cannot handle nominal values, and these are still present in the data set unless "remove_original_attributes" is set to "true". So the solution is quite simple: just set this parameter to "true" and everything should work as usual. You could add a breakpoint after the StringTextInput operator to see the difference with and without this setting.
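
As a rough sketch, the change would look something like this in the process XML. The block is the StringTextInput operator from the process posted above; only the "remove_original_attributes" line is new, and its position among the other parameters, as well as the XML comment, are assumptions for illustration:

```xml
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
    <parameter key="filter_nominal_attributes"   value="true"/>
    <!-- assumed placement: explicitly override the "false" default described above -->
    <parameter key="remove_original_attributes"   value="true"/>
    <list key="namespaces">
    </list>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
</operator>
```

With a breakpoint after this operator, the example set should then show only the numerical word-vector attributes and no longer the original nominal attribute (StockItemDesc in the error message above).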


Quote
Doesn't the FilterNominalAttributes convert the attributes to a usable format for further processing?

Yes, but with the new parameter the original attributes are also kept as part of the example set as long as "remove_original_attributes" is set to "false". Instead of removing them directly here (with the parameter setting mentioned above), you could of course also use the operator "AttributeFilter" after the text processing to filter out all nominal attributes and keep only the numerical ones.

Cheers,
Ingo

B.


« Reply #6 on: July 30, 2008, 04:52:50 AM »

Ingo

This runs successfully now.  Thanks for the help.

B.