Hi everyone,

I'm pretty new to data mining and RapidMiner, so take it easy on me.

I'm dealing with a binary classification problem where I'm trying to identify people at high risk for a certain condition (1 = yes, 2 = no).

I'm using data sets of various sizes, averaging around 160,000 observations. Each data set contains 22 attributes (nominal/polynominal/numerical) and the binominal class label described above. I'm comparing different classification algorithms for this problem, which are listed in the table below. All experiments used 5-fold cross-validation with a binominal classification performance operator to get the results.

THE PROBLEM

The J48 decision tree from the WEKA extension gives promising results, as seen in the results table below; however, the AUC does not seem correct. Looking at the plot of the ROC curve, at the bottom left corner of the chart the true positive rate remains at 0 for a while as the false positive rate increases along the x-axis. At about 0.5 along the x-axis the true positive rate finally increases and eventually goes above the y = x line. This is clearly why the AUC suffers, but I do not know why it is happening, and it does not occur with any other algorithm. (All data has been preprocessed to remove missing values, and under-sampling has been applied along with some additional steps.)
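To make the effect on the AUC concrete, here is a minimal Python sketch (not part of my RapidMiner process) that computes the area under a ROC curve with the trapezoidal rule. The two sets of ROC points are made up purely to mimic the shape I described; they are not my actual results, and the function name `trapezoid_auc` is just an illustration.

```python
def trapezoid_auc(points):
    """Area under a ROC curve given (false positive rate, true positive rate) points."""
    pts = sorted(points)  # order by false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Trapezoid between two neighbouring ROC points
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical curve that rises right away from the origin
rises_immediately = [(0.0, 0.0), (0.4, 0.7), (1.0, 1.0)]

# Hypothetical curve that stays at TPR = 0 for a stretch before rising
flat_at_start = [(0.0, 0.0), (0.05, 0.0), (0.4, 0.7), (1.0, 1.0)]

auc_normal = trapezoid_auc(rises_immediately)
auc_flat = trapezoid_auc(flat_at_start)

print(f"rises immediately: {auc_normal:.4f}")
print(f"flat at start:     {auc_flat:.4f}")
```

With these made-up points the second curve ends up with the smaller area, even though both curves are identical from the third point onward. That matches what I see for J48, but it only shows the symptom, not the cause.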

If anyone knows why this could be occurring, your help would be greatly appreciated. Thank you.

Algorithm                        AUC    Sensitivity  Specificity  F-Measure  Accuracy
Logistic Regression (WEKA LR)    0.715  65.70%       65.28%       65.56%     65.49%
C4.5 Decision Tree (WEKA J48)    0.678  67.99%       63.58%       66.52%     65.78%
Random Forest (WEKA RF)          0.704  63.89%       65.17%       64.30%     64.53%
Support Vector Machine           0.710  70.49%       59.87%       66.94%     65.18%
Neural Network                   0.713  72.25%       57.14%       66.81%     64.70%
Radial Basis Function Network    0.654  62.96%       59.07%       61.67%     61.01%
K-NN                             0.500  52.27%       52.71%       52.38%     52.49%
Naïve Bayes                      0.689  59.14%       68.41%       62.01%     63.77%
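
For anyone checking the columns, the Python sketch below shows the standard textbook definitions of sensitivity, specificity, F-measure and accuracy computed from a confusion matrix. I'm assuming RapidMiner's performance operator uses the same definitions with class 1 (yes) as the positive class. The counts are hypothetical and are not taken from my experiments.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from confusion-matrix counts (positive class = 1 / yes)."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only
metrics = binary_metrics(tp=70, fn=30, fp=40, tn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

None of these four metrics depends on the ranking of the confidence scores, only on the final predicted labels. The AUC does depend on that ranking, which may be why J48 looks fine in the other columns while its AUC is off.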