Author Topic: RemoveDuplicates operator  (Read 1051 times)
Shubha
« on: March 20, 2009, 11:44:46 AM »

Hi,

I have a data set with several variables. Among them are two nominal variables, 'cluster' and 'cluster_from_ES2', and an id variable named 'Id'. I want to delete some observations with the 'RemoveDuplicates' operator, based on these three variables: 'cluster', 'cluster_from_ES2' and 'Id'. How do I specify this in the operator? I tried the regular expression "Id|cluster.*", but it's not working. How do I make this work?

Many thanks again,
Shubha
haddock
« Reply #1 on: March 20, 2009, 03:02:24 PM »

Normally the point of an ID tag is to provide identity through uniqueness, so if you include it in your regex there can be no duplicate matches. The rest of the regex matches any string that begins with 'cluster' and does not contain a line break, so both 'cluster' and 'cluster_from_ES2' match. So the answer to the question is ....

"Id|cluster.*"
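
As a rough illustration of which attribute names that expression selects, here is a minimal Python sketch. It assumes the operator compares the regex against each whole attribute name; Python's `re` is used only as a stand-in for the regex engine, and the names `id` and `att3` are hypothetical extra columns added for contrast.

```python
import re

# Hypothetical attribute names; "id" and "att3" are added only for contrast.
attribute_names = ["Id", "id", "cluster", "cluster_from_ES2", "att3"]

# The expression from the question.
pattern = re.compile(r"Id|cluster.*")

def selected_attributes(names, regex):
    """Return the names that the regex matches in full (assumed behaviour)."""
    return [name for name in names if regex.fullmatch(name)]

matched = selected_attributes(attribute_names, pattern)
print(matched)  # ['Id', 'cluster', 'cluster_from_ES2']
```

Matching is case-sensitive here, so 'Id' and 'id' are different names; the expression has to use whatever spelling the attribute actually has in the example set.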

To see the point, try running this and then changing the regex:

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_examples" value="200"/>
        <parameter key="target_function" value="random"/>
    </operator>
    <operator name="IdTagging" class="IdTagging">
    </operator>
    <operator name="BinDiscretization" class="BinDiscretization">
        <parameter key="number_of_bins" value="3"/>
        <parameter key="range_name_type" value="short"/>
    </operator>
    <operator name="ChangeAttributeName" class="ChangeAttributeName">
        <parameter key="new_name" value="cluster"/>
        <parameter key="old_name" value="att1"/>
    </operator>
    <operator name="ChangeAttributeName (2)" class="ChangeAttributeName">
        <parameter key="new_name" value="cluster_from_ES2"/>
        <parameter key="old_name" value="att2"/>
    </operator>
    <operator name="RemoveDuplicates" class="RemoveDuplicates">
        <parameter key="attributes" value="id|cl.*"/>
    </operator>
</operator>
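
The effect of including a unique id among the comparison attributes can also be sketched outside RapidMiner. The following Python is only an approximation under stated assumptions: `remove_duplicates` is a hypothetical helper, not the operator's implementation, it keeps the first row of each group (which row the real operator keeps is not stated here), and the four rows are made-up data.

```python
def remove_duplicates(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Made-up rows with a unique id and two binned nominal attributes.
rows = [
    {"id": 1, "cluster": "range1", "cluster_from_ES2": "range2"},
    {"id": 2, "cluster": "range1", "cluster_from_ES2": "range2"},
    {"id": 3, "cluster": "range3", "cluster_from_ES2": "range1"},
    {"id": 4, "cluster": "range1", "cluster_from_ES2": "range2"},
]

# With the unique id in the key, every row is distinct and nothing is removed.
with_id = remove_duplicates(rows, ["id", "cluster", "cluster_from_ES2"])

# Without the id, rows sharing both cluster values collapse into one.
without_id = remove_duplicates(rows, ["cluster", "cluster_from_ES2"])

print(len(with_id), len(without_id))  # 4 2
```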

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T.S.Eliot ~ Choruses from the Rock 1934
Shubha
« Reply #2 on: March 20, 2009, 04:26:23 PM »

That answers my question. Thank you very much for your time.

But I just want to explain how I come to have an id variable with the same value repeated. I joined the original data with its aggregate data (say, the cluster centroids for three clusters) using the ExampleSetCartesian operator. So if my example set had 25 examples (25 id values) and the aggregate example set had 3 examples, the resulting example set has 25*3=75 examples, and hence the IDs are repeated.

Thank you very much,
Shubha
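
The arithmetic in this reply can be checked with a small Python sketch of a Cartesian join. This only mimics the idea of pairing every example with every aggregate row; it is not the ExampleSetCartesian operator itself, and the centroid labels `cluster_0` to `cluster_2` are invented for illustration.

```python
from collections import Counter
from itertools import product

# 25 original examples, each with its own id (hypothetical data).
examples = [{"Id": i} for i in range(1, 26)]

# 3 aggregate rows, e.g. one per cluster centroid (labels are made up).
centroids = [{"cluster_from_ES2": f"cluster_{k}"} for k in range(3)]

# Cartesian join: pair every example with every centroid row.
joined = [{**example, **centroid} for example, centroid in product(examples, centroids)]

# Count how often each id occurs in the joined data.
id_counts = Counter(row["Id"] for row in joined)

print(len(joined))              # 75
print(len(id_counts))           # 25
print(set(id_counts.values()))  # {3}
```

Each of the 25 ids appears three times in the joined data, once per aggregate row, which is why the id no longer identifies a row uniquely.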