Sunday, December 30, 2012

System for the analysis of web traces and clustering using the k-means algorithm


As an academic project, I worked on the development of a system written in Java, with the Swing library for the user interface. It takes as input a log file in the “Combined Log” format, from which we read the user id of each request. We used an interval of 30 minutes to delimit a session: consecutive requests from the same user that are less than 30 minutes apart belong to the same session.
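The session rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `Sessionizer`, the `Request` type, and the regular expression are hypothetical, and the sketch assumes the user id is the third field (authenticated user) of a Combined Log line and that the requests arrive in chronological order.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups Combined Log requests into per-user sessions. */
class Sessionizer {
    /** A gap of 30 minutes or more between two requests starts a new session. */
    static final Duration TIMEOUT = Duration.ofMinutes(30);

    // Captured groups: 1 = host, 2 = user id, 3 = timestamp, 4 = requested path.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"\\S+ (\\S+)[^\"]*\" \\d{3} \\S+.*$");

    // Timestamp layout used by the format, e.g. 10/Oct/2000:13:55:36 -0700
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    static final class Request {
        final String userId;
        final OffsetDateTime time;
        final String url;

        Request(String userId, OffsetDateTime time, String url) {
            this.userId = userId;
            this.time = time;
            this.url = url;
        }
    }

    /** Parses one log line; returns null when the line does not match. */
    static Request parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new Request(m.group(2), OffsetDateTime.parse(m.group(3), STAMP), m.group(4));
    }

    /** Splits chronologically ordered requests into sessions, one user at a time. */
    static List<List<Request>> sessions(List<Request> requests) {
        Map<String, List<Request>> open = new HashMap<>(); // latest session per user
        List<List<Request>> result = new ArrayList<>();
        for (Request r : requests) {
            List<Request> current = open.get(r.userId);
            boolean expired = current != null
                && Duration.between(current.get(current.size() - 1).time, r.time)
                           .compareTo(TIMEOUT) >= 0;
            if (current == null || expired) {
                current = new ArrayList<>();
                open.put(r.userId, current);
                result.add(current);
            }
            current.add(r);
        }
        return result;
    }
}
```

Each inner list returned by `sessions` is one session. Comparing every request with the previous request of the same user (rather than with the first request of the session) matches the "interval between them" wording above.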

To cluster the sessions I used the k-means algorithm, with the number of clusters and the type of distance as parameters. For comparing sessions I implemented three measures: the Euclidean distance, the cosine measure, and the Jaccard distance.
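As a rough sketch of this step, the code below shows the three measures and a basic k-means loop that takes the distance as a parameter. Several things here are assumptions rather than details of the original system: each session is represented as a numeric vector with one entry per page (non-zero if the page was visited), the class name `SessionClustering` is hypothetical, the initial centroids are simply the first k sessions, and the centroid is always the arithmetic mean, which is only a heuristic when the distance is not Euclidean.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch: k-means over session vectors with a pluggable distance. */
class SessionClustering {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** 1 - cosine similarity; defined as 1 when either vector is all zeros. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) {
            return 1.0;
        }
        return 1.0 - dot / Math.sqrt(na * nb);
    }

    /** 1 - |A and B| / |A or B|, treating every positive entry as "page present". */
    static double jaccardDistance(double[] a, double[] b) {
        int inter = 0, union = 0;
        for (int i = 0; i < a.length; i++) {
            boolean x = a[i] > 0, y = b[i] > 0;
            if (x && y) inter++;
            if (x || y) union++;
        }
        return union == 0 ? 0.0 : 1.0 - (double) inter / union;
    }

    /** Returns the cluster index of each point after a fixed number of iterations. */
    static int[] kMeans(double[][] points, int k, int iterations,
                        ToDoubleBiFunction<double[], double[]> distance) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // assumption: first k points as seeds
        }
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach every point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = distance.applyAsDouble(points[p], centroids[0]);
                for (int c = 1; c < k; c++) {
                    double d = distance.applyAsDouble(points[p], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each non-empty centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep the old centroid if a cluster is empty
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```

A call such as `SessionClustering.kMeans(vectors, 3, 20, SessionClustering::cosineDistance)` would then cluster the session vectors into three groups; swapping the method reference changes the distance without touching the loop.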

In this way I clustered the sessions into groups that share common requests, so that we could determine statistics such as the sites with the lowest and highest traffic, predictions about which links users will follow, and relations between links.

Paris - France, December 2012