I have a data set from a web access log file, and I'm interested in finding clusters of similar records in it. (I'm an absolute beginner in data mining.) So far I have read several research papers in the same problem domain:

An Efficient Approach for Clustering Web Access Patterns from Web Logs

http://www.sersc.org/journals/IJAST/vol5/1.pdf

Classifying the user intent of web queries using k-means clustering

http://faculty.ist.psu.edu/jjansen/academic/jansen_user_intent_kmeans.pdf

I want to use k-means clustering to cluster web pages. Although these papers discuss the algorithm, they do not specify how the input data set should be prepared. k-means measures similarity between data points using Euclidean distance, and URLs cannot be used directly as input. So how should I normalize my dataset so that it can be mined with k-means? Any help or a good reference on this would be appreciated.

Example dataset (p1..pn are different web pages):

p1,p2,p3,p4

p1,p2

p1,p5,p6,p7

p1,p2,p3,p5
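
To make the question concrete, here is a minimal sketch of one encoding I could imagine. It is my own assumption, not something taken from either paper: each row becomes a binary vector with one column per distinct page, so that Euclidean distance is at least defined. The helper names `to_vector` and `euclidean` are hypothetical, and I do not know whether this representation is appropriate for k-means on this kind of data.

```python
import math

# Example rows from the dataset above; each row lists the pages in one record.
sessions = [
    ["p1", "p2", "p3", "p4"],
    ["p1", "p2"],
    ["p1", "p5", "p6", "p7"],
    ["p1", "p2", "p3", "p5"],
]

# Fixed column order: one column per distinct page (p1..p7 here).
pages = sorted({page for row in sessions for page in row})


def to_vector(row):
    """Binary indicator vector: 1.0 if the page occurs in the row, else 0.0."""
    present = set(row)
    return [1.0 if page in present else 0.0 for page in pages]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = [to_vector(row) for row in sessions]

# Pairwise distances between the first row and every other row.
for i in range(1, len(vectors)):
    print(f"row 1 vs row {i + 1}: {euclidean(vectors[0], vectors[i]):.3f}")
```

With this encoding, rows that share more pages end up closer together, but the order in which pages were visited is lost. Is that acceptable, or is there a better-established representation?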