August 30, 2013
My Interactive & Visual CV http://bit.ly/prdeepakbabu
May 26, 2012
Clustering is an unsupervised classification (learning) technique in which the objective is to maximize inter-cluster distance while minimizing intra-cluster distance. By unsupervised, we mean that the data is clustered, segmented, or classified using all the available attributes, with no class information available. Supervised classification, on the other hand, uses class information.
As usual, before we jump into the ‘how’, let’s answer the ‘why’. Clustering is applied to a variety of problems, ranging from the study of biological systems to exploratory analysis of data (as a pre-processing technique). Many predictive analytics algorithms use clustering solutions as one of their components. Major brands use it in CRM to understand their customers better. Another use of clustering is outlier detection, for example identifying fraudulent transactions. If you have heard of the site www.similarsites.com, it relies extensively on clustering algorithms: sites are segmented/clustered based on website attributes such as category of domain, number of users, traffic, content type, corporate or personal, blog, image blog, video blog, etc. For example, if you enter InMobi, you get a list of companies in the same space, mainly its competitors: Mojiva, Millennial Media, AdMob, Quattro, Mobclix, etc. If you are looking for an image hosting site and want to know the alternatives, this is helpful.
We talk about similarity in terms of distance measures like
(i) Euclidean Distance
(ii) Manhattan Distance
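To make the two measures above concrete, here is a minimal Python sketch. The function names and the sample points are my own choices for illustration, and each point is assumed to be a sequence of numeric attributes of equal length.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """City-block distance: sum of the absolute differences per attribute."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(x - y) for x, y in zip(a, b))


# Two illustrative points with two attributes each.
p = (1.0, 2.0)
q = (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

A smaller distance means the two records are more similar. In practice the attributes are usually scaled first so that no single attribute dominates the distance.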
March 6, 2012
Hadoop is an open source framework for writing and running distributed applications that process huge amounts of data (more famously called Big Data). The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
It has two components:
- Distributed Storage (uses HDFS – Hadoop Distributed File System)
Distributes the data across the nodes of the Hadoop cluster. Data can be replicated across nodes (redundancy) so that the cluster can recover from failures.
- Distributed Computing (uses MR – the MapReduce paradigm)
Once the data is available on the Hadoop cluster, the MR code (typically written in Java or C++) is moved to each of the nodes to compute on the data stored there. MapReduce has two phases: map and reduce.
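The two phases can be illustrated with a small word count. This is only a single-process Python simulation of the idea, not the Hadoop API: the names mapper, shuffle, reducer and run_job are hypothetical, and on a real cluster the framework does the grouping step and runs the phases on the nodes that hold the data.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


lines = ["big data on hadoop", "hadoop moves code to data"]
counts = run_job(lines)
print(counts["hadoop"], counts["data"])  # 2 2
```

Because each mapper call looks at one line only and each reducer call looks at one key only, both phases can be spread over many machines without the calls needing to talk to each other.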
One early example of distributed computing is the SETI@home project, in which volunteers offered the CPU time of their personal computers for research on radio telescope data, in search of intelligent life beyond Earth. It differs from Hadoop MR in that SETI@home moves the data to the place where the computation happens, while Hadoop moves the code to the place where the data lives. Other distributed computing projects include finding the largest prime numbers, sorting petabytes of data in the shortest time, etc.
Applications of Hadoop MR – Big data
Hadoop MR Wrapper applications include
October 13, 2011
This blog post analyses the implementation of the helmet rule in various Indian states and its effect on bringing down accidental deaths involving two-wheelers. Here we focus specifically on one state, Karnataka.
National mandatory helmet legislation is included in the Indian Motor Vehicles Act, 1988. However, implementation of this law has been left to the individual states. The Karnataka government enforced a mandatory helmet rule for all two-wheeler riders in 2007. Traffic police started fining violators, and within no time good compliance with the rule was observed. Accidental Deaths & Suicides in India publishes data on accidental deaths broken down by type of vehicle and by state (from 1967 to 2009). Taking the state of Karnataka and plotting the trend of accidental deaths due to two-wheelers, we see a clear decline after 2007, specifically an 8% decline in accidental deaths as of 2009. But can we attribute this drop solely to the helmet rule? Let’s explore.
July 3, 2011
Turning raw data into insights often involves integrating data from multiple disparate sources (not limited to structured ones), analyzing the data, visualizing it, and sharing the results with the broader audience to whom they are of interest. In this cycle of turning data into insights, visualization plays a vital role, and hence it is the topic of this blog post. First, visualization aids in analyzing huge data sets by revealing patterns that are easier to interpret visually than in a tabular layout of numbers. Second, visualization represents the numbers using visuals that are easy for everyone to read and understand. Insights that can be grasped from a visual in a minute or two might take 3-4 minutes to absorb from text or a table of numbers. This is an important factor to consider, especially when you are delivering findings to the CEO/CFO/CXO/CIO of a company, as they often have limited time.
Going back to the history of visualization, the most famous early example of mapping epidemiological data is Dr. John Snow’s map of deaths from a cholera outbreak in London in 1854, in relation to the locations of public water pumps. The original (high-res PDF copies from UCLA) spawned many imitators, including this simplified version by Gilbert in 1958. Tufte (1983, p. 24) says, “Snow observed that cholera occurred almost entirely among those who lived near (and drank from) the Broad Street water pump. He had the handle of the contaminated pump removed, ending the neighborhood epidemic which had taken more than 500 lives.”
November 13, 2010
In my previous post, I discussed association rule mining in some detail. Here I show an implementation of the concept in the open source tool R, using the package arules. Market Basket Analysis is a specific application of association rule mining, in which retail transaction baskets are analysed to find the products that are likely to be purchased together. The output of the analysis forms the input for recommendation engines and marketing strategies.
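The post itself uses R and arules. As a separate, language-neutral illustration of what "likely to be purchased together" means, here is a small Python sketch that computes support and confidence for pairs of items only. The function name pair_rules, the min_support parameter and the toy baskets are invented for this example; it is not the Apriori algorithm that arules provides.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with the antecedent that also hold the consequent
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeats, fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be interesting
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Invented toy baskets, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
for antecedent, consequent, support, confidence in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket containing butter also contains bread, so the rule butter -> bread has confidence 1.0, while bread -> butter only reaches about 0.67.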