CTR data is "unbalanced", i.e. there is only a ~1% chance of a click. So subsampling is good, but I have to do it only on the "non-click" class and then reweight that class in the training algorithm [e.g. the data contains 100 clicks and 100000 non-clicks, and I am happy to subsample the non-clicks].
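To make the subsample-then-reweight idea concrete, here is a minimal sketch in plain Python/NumPy (not a RapidMiner process). The function name `subsample_negatives` and the `keep_rate` parameter are made up for illustration. It assumes labels are 1 for click and 0 for non-click, keeps every click, keeps each non-click with probability `keep_rate`, and gives the kept non-clicks a weight of `1 / keep_rate` so that they still stand in for the full non-click class.

```python
import numpy as np

def subsample_negatives(y, keep_rate, seed=0):
    """Keep all clicks, keep each non-click with probability keep_rate.

    Returns the indices of the kept rows and one weight per kept row.
    (Hypothetical helper, for illustration only.)
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    # Clicks are always kept; non-clicks are kept at random.
    keep = (y == 1) | (rng.random(y.shape[0]) < keep_rate)
    idx = np.flatnonzero(keep)
    # Each kept non-click represents 1 / keep_rate original non-clicks.
    weights = np.where(y[idx] == 1, 1.0, 1.0 / keep_rate)
    return idx, weights

# Toy data matching the example above: 100 clicks, 100000 non-clicks.
y = np.array([1] * 100 + [0] * 100_000)
idx, w = subsample_negatives(y, keep_rate=0.01)
print(len(idx), "rows kept out of", len(y))
```

The weights would then be handed to whatever learner is used, provided it accepts per-example weights (for instance the `sample_weight` argument of `fit` in many scikit-learn estimators).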
The feature data is JUST IDs: WebsiteID, AdID etc. [e.g. google.com=1, yahoo.com=2, cnbc.com=3, ...], so there is no description of the website.
So yes, I want to do NominaltoBinominal, but then/at the same time/before I want to FILTER out some of those Binominals [e.g. certain websites for which there is little training data].
(see e.g. http://www.kaggle.com/about/papers
... click through rate)
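Roughly what I mean by filtering the binominals, as a sketch in plain Python/NumPy rather than RapidMiner operators: count how often each ID occurs, and only create an indicator column for IDs seen at least `min_count` times. The function name `one_hot_frequent` and the threshold `min_count` are hypothetical, and in practice the counts should presumably come from the training data only. Rows with a rare ID simply get all zeros for that feature.

```python
from collections import Counter
import numpy as np

def one_hot_frequent(ids, min_count):
    """One indicator column per ID that occurs at least min_count times.

    Rare IDs get no column, so their rows are all zeros.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter(ids)
    kept = sorted(v for v, c in counts.items() if c >= min_count)
    col = {v: j for j, v in enumerate(kept)}
    X = np.zeros((len(ids), len(kept)), dtype=np.int8)
    for i, v in enumerate(ids):
        j = col.get(v)
        if j is not None:  # skip IDs that were filtered out
            X[i, j] = 1
    return X, kept

# Toy WebsiteID column: ID 3 occurs only once and is filtered out.
website_ids = [1, 1, 1, 2, 2, 3, 1, 2]
X, kept = one_hot_frequent(website_ids, min_count=2)
print(kept)
print(X)
```

With millions of rows a sparse matrix would be a better fit than the dense array used here; the dense version is only meant to show the filtering step.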