Topic: [SOLVED] Remove duplicates selecting which examples must remain
arthurgouveia
Newbie
*
Posts: 5


« on: January 16, 2014, 07:40:18 PM »

Hello.

I have a large customer dataset with some duplicate values. To be clear: the dataset has over 50 attributes for more than 200k contracts. One of these attributes is contract_status. Some of the statuses are valid and some are invalid, so I've created a boolean attribute named is_valid_status.

I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.

How can I do it?

Thanks
« Last Edit: January 17, 2014, 02:05:35 PM by arthurgouveia »
Ralf Klinkenberg
Administrator
Jr. Member
*****
Posts: 71



« Reply #1 on: January 16, 2014, 08:06:49 PM »

Hello Arthur Gouveia,

you can use the RapidMiner operator called Filter Examples.

Cheers,
Ralf

Marius
Administrator
Hero Member
*****
Posts: 1794



« Reply #2 on: January 17, 2014, 10:17:19 AM »

Hi Arthur,

in addition to Filter Examples for filtering on is_valid_status, you can use the Remove Duplicates operator. It lets you select which attributes are considered when looking for duplicates. You should try both operators on a smaller subset first to get a feeling for their settings.
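For readers who want to see the logic outside RapidMiner, here is a minimal Python sketch of the two steps, not the operators themselves. The sample rows and the attributes client_id and product are made up for illustration; only contract_status and is_valid_status come from the thread. The sketch keeps the first row of each duplicate group, which is an assumption about which example should survive.

```python
# Hypothetical sample data: client_id and product are illustrative names,
# only contract_status and is_valid_status appear in the thread.
contracts = [
    {"client_id": 1, "product": "A", "contract_status": "active", "is_valid_status": True},
    {"client_id": 1, "product": "A", "contract_status": "cancelled", "is_valid_status": False},
    {"client_id": 1, "product": "A", "contract_status": "renewed", "is_valid_status": True},
    {"client_id": 2, "product": "B", "contract_status": "active", "is_valid_status": True},
]


def filter_valid(rows):
    """Keep only rows whose is_valid_status is true (the Filter Examples step)."""
    return [row for row in rows if row["is_valid_status"]]


def remove_duplicates(rows, key_attributes):
    """Keep the first row for each combination of the chosen attributes."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[name] for name in key_attributes)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Filter first, so an invalid row can never be the one that survives.
result = remove_duplicates(filter_valid(contracts), ["client_id", "product"])
```

Filtering before deduplicating matters here: if the order were reversed, an invalid row could be kept as the representative of its group and then be dropped, leaving no row for that contract.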

Best regards,
Marius

arthurgouveia
Newbie
*
Posts: 5


« Reply #3 on: January 17, 2014, 12:24:20 PM »

Thank you! It worked almost perfectly. I can't believe I didn't think of the solution you gave me.

But now I have another problem. I found several contracts with valid statuses, and I'd like to remove the duplicates but keep only the most recent status. I have an attribute named date_processing that I can use for this, but I can't figure out how.

Is there any way to remove duplicates keeping only the most recent data?
Ralf Klinkenberg
Administrator
Jr. Member
*****
Posts: 71



« Reply #4 on: January 17, 2014, 12:37:12 PM »

Hello Arthur Gouveia,

this requires several steps:
  • You can use the RapidMiner operator Aggregate to determine the maximum of date_processing for each client (group by client ID).
  • The operator Rename can be used to rename the new attribute max(date_processing) to max_date.
  • With Join you can add the max_date column to the original data table (select the client ID as ID for both tables).
  • With Filter Examples you keep only the data lines where date_processing >= max_date and you are done.
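The four steps above can be sketched in plain Python as follows. This is only an illustration of the logic, not RapidMiner code; the sample rows and the client_id attribute name are assumptions, while date_processing and max_date are the names used in the thread.

```python
from datetime import datetime

# Hypothetical sample data; client_id is an assumed name for the client ID attribute.
rows = [
    {"client_id": 1, "contract_status": "active", "date_processing": datetime(2013, 11, 2, 9, 0)},
    {"client_id": 1, "contract_status": "renewed", "date_processing": datetime(2014, 1, 10, 14, 30)},
    {"client_id": 2, "contract_status": "active", "date_processing": datetime(2013, 12, 5, 8, 15)},
]

# Steps 1 and 2 (Aggregate + Rename): maximum date_processing per client, stored as max_date.
max_date = {}
for row in rows:
    client = row["client_id"]
    if client not in max_date or row["date_processing"] > max_date[client]:
        max_date[client] = row["date_processing"]

# Step 3 (Join): attach the per-client max_date to every original row.
joined = [dict(row, max_date=max_date[row["client_id"]]) for row in rows]

# Step 4 (Filter Examples): keep only rows whose date equals the group maximum.
latest = [row for row in joined if row["date_processing"] >= row["max_date"]]
```

Note that if two rows of the same client share the maximum date, both are kept, so a final duplicate removal may still be needed in that case.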

Cheers,
Ralf

arthurgouveia
Newbie
*
Posts: 5


« Reply #5 on: January 17, 2014, 01:39:44 PM »

It didn't work. Almost everything went OK, but I couldn't filter on max_date. The error says it cannot parse value 'max_date' with date pattern yyyy-MM-dd HH:mm:ss Z.

What is surprising is that when I look at the meta data view, the max_date attribute has type date_time. I've tried to change the type using Date to Nominal, Date to Numerical, Numerical to Date, Nominal to Date and Guess Types, but all of them either don't list max_date in the attribute list or don't make Filter Examples work.
arthurgouveia
Newbie
*
Posts: 5


« Reply #6 on: January 17, 2014, 02:05:13 PM »

I found a solution! I used Generate Attributes to create a date_dif attribute with the function date_diff(max_date,date_processing). Then I just had to filter the examples where date_dif=0.
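A rough Python sketch of this workaround, for illustration only. The parse error quoted earlier suggests the filter read max_date as a literal date value rather than as an attribute name, though that reading is an assumption. Computing the difference as a new attribute first leaves only a comparison against the constant 0. The helper date_diff_ms is hypothetical and stands in for date_diff; its unit (milliseconds) and sign convention are assumptions, and neither affects the check for zero. The sample rows are made up.

```python
from datetime import datetime, timedelta

# Hypothetical rows as they might look after the Join step (max_date already attached).
joined = [
    {"client_id": 1, "date_processing": datetime(2013, 11, 2, 9, 0),
     "max_date": datetime(2014, 1, 10, 14, 30)},
    {"client_id": 1, "date_processing": datetime(2014, 1, 10, 14, 30),
     "max_date": datetime(2014, 1, 10, 14, 30)},
    {"client_id": 2, "date_processing": datetime(2013, 12, 5, 8, 15),
     "max_date": datetime(2013, 12, 5, 8, 15)},
]


def date_diff_ms(first, second):
    """Hypothetical stand-in for date_diff: whole milliseconds from first to second."""
    return (second - first) // timedelta(milliseconds=1)


# Generate Attributes step: add date_dif to every row.
for row in joined:
    row["date_dif"] = date_diff_ms(row["max_date"], row["date_processing"])

# Filter step: the comparison is now against a plain number, not another attribute.
latest = [row for row in joined if row["date_dif"] == 0]
```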

It's working! Thanks!