Clustering is an unsupervised classification (learning) technique, where the objective is to maximize inter-cluster distance while minimizing the intra-cluster distance. By unsupervised, we mean clustering or segmenting or classifying data based on all the available attributes and specifically there is no availability of class information. A supervised classification on other hand uses class information.
As usual, before we jump into ‘how’ let’s answer the ‘why’. Clustering is applied to solve variety of problems ranging from biological systems to using it for exploratory analysis of data ( as a pre-processing technique). Many of the predictive analytics algorithms use clustering solutions as one of their components. It is used in all major brands for CRM, to understand their customer better. Another use of clustering is in outlier detection or fraud transaction identification. If you have heard about a site called www.similarsites.com, it extensively works on clustering algorithms where the sites are segmented/clustered based on website attributes like category of domain, number of users, traffic, content type, corporate or personal, blog, image blog, video blog,etc. For example, if you entered INMOBI, you would get a list of companies which are in this space mainly its competitors – mojiva, Millenialmedia, Admob, Quattro, Mobclix,etc. If you are looking for image hosting site and want to know alternatives/options, this will be helpful.
We talk about similarity in terms of distance measures like
(i) Euclidean Distance
(ii) Manhattan Distance