As an academic project, I worked on the development of a system written in Java,
with the Swing library for the user interface. The system takes as input a log
file in the “Combined log” format, from which we extract the user ID of each
request. We used a 30-minute interval to define a session: consecutive requests
from the same user that are less than 30 minutes apart make up one session.
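The 30-minute rule can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Sessionizer` and `Request` names and their fields are hypothetical, and the requests are assumed to be already parsed from the log and sorted by timestamp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Sessionizer {
    // A gap of 30 minutes or more between two requests starts a new session.
    static final long MAX_GAP_SECONDS = 30 * 60;

    // Hypothetical holder for one parsed log line (names are assumptions).
    static final class Request {
        final String userId;
        final long epochSeconds;
        final String url;

        Request(String userId, long epochSeconds, String url) {
            this.userId = userId;
            this.epochSeconds = epochSeconds;
            this.url = url;
        }
    }

    // Groups requests into per-user sessions.
    // Assumes the input list is sorted by timestamp, oldest first.
    static List<List<Request>> sessionize(List<Request> requests) {
        List<List<Request>> sessions = new ArrayList<>();
        // Most recent session of each user, keyed by user ID.
        Map<String, List<Request>> latest = new HashMap<>();

        for (Request r : requests) {
            List<Request> current = latest.get(r.userId);
            boolean startNew = current == null
                    || r.epochSeconds - current.get(current.size() - 1).epochSeconds
                       >= MAX_GAP_SECONDS;
            if (startNew) {
                current = new ArrayList<>();
                sessions.add(current);
                latest.put(r.userId, current);
            }
            current.add(r);
        }
        return sessions;
    }
}
```

Each inner list returned by `sessionize` is one session, so two requests from the same user that are 20 minutes apart land in the same list, while a request arriving 30 minutes or more after the previous one opens a new list.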
To cluster the sessions, I used the k-means algorithm, with the number of
clusters and the distance measure as parameters. I implemented three measures
for comparing sessions: the Euclidean distance, the cosine measure, and the
Jaccard distance.
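As an illustration of the three measures, here is a minimal sketch under some assumptions that are not stated in the original project: a session is represented either as a numeric vector (for example, visit counts per page) or as a set of requested URLs, and the cosine measure is turned into a distance as one minus the cosine similarity. The `SessionDistances` class and its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class SessionDistances {
    // Euclidean distance between two vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Cosine distance, defined here as 1 - cosine similarity.
    // Convention (an assumption): an all-zero vector is at distance 1.0.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Jaccard distance between two sets of URLs: 1 - |A ∩ B| / |A ∪ B|.
    // Convention (an assumption): two empty sets are at distance 0.0.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return 1.0 - (double) intersection.size() / unionSize;
    }
}
```

Passing the chosen measure to k-means as a parameter then amounts to selecting which of these functions is used when assigning a session to its nearest cluster center. Note that the classic k-means centroid update (the mean) is tied to the Euclidean distance, so with the other two measures the algorithm behaves as a heuristic variant.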
In this way, I clustered the sessions into groups that share common requests,
so that we could derive statistics such as the sites with the lowest and highest
traffic, link predictions for users, and relations between links.