I have spent most of my life working with data and its applications to problems. Now, when I look at patterns in the algorithms we use to analyze data, one thing that emerges is the increased use of meta-algorithms. Boosting (for example, AdaBoost) is one such meta-algorithm: it combines multiple weak learners (classifiers) to improve prediction accuracy. Random forests, a very prominent ...
We are in the era of big data, with new sources of data emerging at an exponential rate: sensor data, EHR, social network and social media data, and machine-generated data. In this blog post, I will discuss social network data specifically: its applications to data science problems, solutions and visualizations. In simple terms, a network is a group of nodes interconnected by links (also called edges). In a social network, users are the nodes and connections are the links. Consider a Facebook user’s network: by adding friends, the user creates links. Before getting into more technical details of networks, let’s spend some time on a more interesting area: their applications to data science problems.
LinkedIn, Facebook and other social networks use network information to predict “People you may know” and offer people recommendations. Product companies like Microsoft and Oracle use network analytics to identify key influencers in leading tech forums and online communities, and then rely on the greater reach of those influencers to help market their products. The WWW is another example of a network. Its pages are interconnected, and analyzing that network helps us understand how information flows across the web. The “People you may know” feature generally works by triangulation: if A knows B, and B and C are connected, then it is likely that A knows C. Most people recommendations work on this principle.
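The triangulation idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how LinkedIn or Facebook actually implement the feature: the network is a small hypothetical dictionary of friend sets, and the function name `people_you_may_know` is made up for this example. It ranks people a user is not yet connected to by the number of mutual connections.

```python
from collections import Counter


def people_you_may_know(graph, user):
    """Rank non-friends of `user` by how many mutual friends they share."""
    friends = graph.get(user, set())
    mutual_counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user and anyone the user already knows.
            if candidate != user and candidate not in friends:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(mutual_counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical undirected network: each edge appears on both endpoints.
graph = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D", "E"},
    "D": {"A", "B", "C"},
    "E": {"C"},
}

suggestions_for_a = people_you_may_know(graph, "A")
suggestions_for_e = people_you_may_know(graph, "E")
print(suggestions_for_a)
print(suggestions_for_e)
```

Here A knows B and D, and both of them know C, so C is suggested to A with two mutual friends. A real system would likely weigh many more signals than mutual connections alone.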
Clustering is an unsupervised classification (learning) technique, where the objective is to maximize inter-cluster distance while minimizing intra-cluster distance. By unsupervised, we mean that the data is clustered, segmented or classified based on the available attributes alone, with no class information available. Supervised classification, on the other hand, uses class information.
As usual, before we jump into the ‘how’, let’s answer the ‘why’. Clustering is applied to a variety of problems, ranging from the study of biological systems to exploratory analysis of data (as a pre-processing technique). Many predictive analytics algorithms use clustering solutions as one of their components. Many major brands use it in CRM to understand their customers better. Another use of clustering is in outlier detection or identifying fraudulent transactions. If you have heard of a site called www.similarsites.com, it relies extensively on clustering algorithms, where sites are segmented or clustered based on website attributes like category of domain, number of users, traffic and content type (corporate or personal, blog, image blog, video blog, etc.). For example, if you entered InMobi, you would get a list of companies in the same space, mainly its competitors: Mojiva, Millennial Media, AdMob, Quattro, Mobclix, etc. If you are looking for an image hosting site and want to know the alternatives, this will be helpful.
We talk about similarity in terms of distance measures such as:
(i) Euclidean Distance
(ii) Manhattan Distance
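As a minimal sketch of the two measures listed above, assuming each data point is a tuple of numeric attributes (the function names and the sample points are made up for illustration):

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """City-block distance: sum of absolute differences per attribute."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(p, q))


# Two hypothetical points with two attributes each.
p = (1.0, 2.0)
q = (4.0, 6.0)

print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

The smaller the distance, the more similar two points are considered to be. In practice, attributes are usually scaled first so that one attribute with a large range does not dominate the result.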
Hadoop is an open source framework for writing and running distributed applications that process huge amounts of data (more famously called Big Data). The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
It has two components:
– Distributed storage (uses HDFS, the Hadoop Distributed File System)
Ensures the data is distributed evenly across all the nodes of the Hadoop cluster. Data can optionally be replicated across nodes (redundancy) so that the cluster can recover from failures.
– Distributed computing (uses MR, the MapReduce paradigm)
Once the data is available on the Hadoop cluster, the MR code (typically written in Java or C++) is moved to each of the nodes to compute on the data stored there. MapReduce has two phases: map and reduce.
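To make the two phases concrete, here is a minimal single-process sketch of the map and reduce steps for a word count. It is plain Python, not the Hadoop API: the names `mapper`, `shuffle`, `reducer` and `run_job` are invented for this example, and the grouping step that Hadoop performs between the two phases across the cluster is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input: on a real cluster each node would map its own block.
counts = run_job(["big data big cluster", "data moves to code"])
print(counts)
```

On a real cluster, many mappers run in parallel on the nodes that hold the data, and the framework routes all pairs with the same key to the same reducer.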
One early example of distributed computing is the SETI@home project, in which volunteers offered CPU time on their personal computers for research on radio telescope data, in the search for intelligent life beyond Earth. It differs from Hadoop MR in one respect: in SETI@home the data is moved to the place where the computing happens, while in Hadoop the code is moved to the place where the data lives. Other distributed computing projects include finding the largest prime numbers and sorting petabytes of data in the shortest time.
Applications of Hadoop MR – Big data
- Weblog analysis
- Fraud detection
- Text Mining
- Search Engine Indexing
- LinkedIn uses it for “Who viewed your profile” and “People you may know” recommendations
- Amazon.com uses it for book recommendations
Hadoop MR Wrapper applications include
- Pig : A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
- Hive : A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
- Mahout : Machine Learning implementation in Map Reduce.
In the recent past, a variety of new social media sites have emerged: location-based services (like Foursquare and Gowalla), group deals (like Groupon) and microblogging (like Twitter and Facebook). These sites offer integration features with other social media sites; for instance, a Foursquare check-in can be configured to automatically publish a tweet with the URL of the checked-in location. Since most of this data is openly accessible, there are various business opportunities in integrating it with internal customer data.
A retailer’s major concern is understanding its customers better, to gain a 360-degree view of the customer. Most companies have a strategy to integrate internal customer behavior data across POS, e-commerce, mail order, etc. By integrating this with social media data from location-based services (like Foursquare) and microblogging services (like Twitter), retailers now have the ability to track customers.
This blog post analyzes the implementation of the helmet rule in various Indian states and the effect it had on bringing down accidental deaths involving two-wheelers. Here we focus on one particular state, Karnataka.
National mandatory helmet legislation is included in the Indian Motor Vehicles Act, 1988. However, implementing this law has been left to the individual states. The Karnataka government enforced a mandatory helmet rule for all two-wheeler riders in 2007. Traffic police started fining violators, and within no time good compliance with the rule was observed. The Accidental Deaths & Suicides in India report publishes data on accidental deaths broken down by type of vehicle and by state (from 1967 to 2009). Plotting the trend of accidental deaths involving two-wheelers in Karnataka, we see a clear decline after 2007: specifically, an 8% decline in accidental deaths as of 2009. But can we attribute this drop to the helmet rule alone? Let’s explore.
This blog post compares amazon.com and linkedin.com in terms of their analytic maturity and their use of the data shared by their customers. As Thomas Davenport mentions in his book “Competing on Analytics”, amazon.com is one of the few companies that was built on a foundation of data, a so-called “analytically mature” company. LinkedIn has joined the list, with a lot of new features available to its users.
As customers interact with the site, they generate data about their liking for certain products or features. Companies like amazon.com and LinkedIn clearly understand how to leverage this information to make the interaction between the customer and the site even more valuable and relevant. The more data users are willing to share with a site about their likes and dislikes, the better the site’s recommendations for them will be. Companies need to instil this confidence in their customers’ minds, so that users share data willingly.
Turning raw data into insights often involves integrating data from multiple disparate sources (not limited to structured ones), analyzing the data, visualizing it and socializing the results with the broader audience to whom they are of interest. In this cycle of turning data into insights, visualization plays a vital role, and hence it is the topic of this blog post. First, visualization can aid in analyzing huge volumes of data by revealing patterns that are easier to interpret visually than in a tabular layout of numbers. Second, visualization can represent the numbers using visuals that are easy for everyone to read and understand. Insights conveyed visually can be grasped in a minute or two, where a textual aid or a table of numbers might have taken 3-4 minutes. This is an important factor to consider, especially when you are delivering the findings to the CEO/CFO/CXO/CIO of a company, as they often have limited time.
Let’s go back to the history of visualization. The most famous early example of mapping epidemiological data is Dr. John Snow’s map of deaths from a cholera outbreak in London in 1854, plotted in relation to the locations of public water pumps. The original (high-res PDF copies from UCLA) spawned many imitators, including this simplified version by Gilbert in 1958. Tufte (1983, p. 24) says, “Snow observed that cholera occurred almost entirely among those who lived near (and drank from) the Broad Street water pump. He had the handle of the contaminated pump removed, ending the neighborhood epidemic which had taken more than 500 lives.”
In this blog post, I talk about three scenarios in which highly valuable insights were derived from analysis that remained simple.
1. Customers shopped online, returned via stores
Randy Lea, VP of product & service marketing at Teradata, talks about one of their clients, which had tagged its e-commerce customers as best customers based on the web sales they generated and was reaching out to them with various promotions. However, on integrating the web data with enterprise (store) data, the client found that most of these customers were buying items online in multiple units and returning them through stores.
For example, some customers bought 4-5 shirts in different colors, kept the one they liked most and returned the rest at a store. Effectively, customers were buying through one channel (web) and returning through another (store). Hence the web customers the client believed to be its best were actually average shoppers and shouldn’t have been sent offers.
Source: Teradata (video)
2. In the United States, if you live more than two miles from a pharmacy store, you probably don’t shop there!
In the book Data-Driven Marketing, Mark Jeffery describes how Walgreens optimized its marketing spend using a simple geospatial visualization. The picture on the right shows three stores of the Walgreens pharmacy chain on a map. Walgreens is a $59 billion annual revenue pharmacy company with 6,850 stores throughout the United States.
This geospatial picture shows dots that are the customers and where they live, coded by shape depending on which of the three Walgreens stores they shop at. The “diamond” customers shop at Store 1; the “square” customers, at Store 2; and the “star” customers, at Store 3. This pharmacy retail chain predominantly markets using flyers in newspapers. The way they pay for the marketing is by zip code, denoted by the dashed line, for example, in the picture. Mike Feldner, the marketing manager who first created these pictures, noticed something interesting: the circle on the picture is two miles in radius, and after looking at many pictures throughout the United States, he noticed that there are no dots (customers) for a store more than two miles from the store. He concluded that in the United States, if you live more than two miles from a pharmacy store, you probably don’t shop there. At that time, Walgreens treated each U.S. locale equally, allocating equal dollar amounts for newspaper advertising in each zip code across the United States. But the data show that if there is no store within two miles of the zip code, customers do not shop at the store. Based on these data, Walgreens ultimately stopped spending advertising dollars in all zip codes without a store within two miles of the zip code. As you might guess, the impact to sales revenues was exactly zero. The impact to marketing, however, was a cost saving of more than $5 million, for a total cost of collecting the data and creating the plots of approximately $200,000. This multimillion-dollar saving in marketing did not require a lot of money, and the analysis was done on a personal computer (PC). This is yet another example of a simple approach making a big impact.
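The two-mile rule can be sketched as a simple distance filter. This is an illustration only, not the analysis Walgreens ran: the coordinates are hypothetical, each zip code is reduced to a single centroid point (an assumption; the source does not say how the distance to a zip code was measured), and the function names are invented. Distances use the standard haversine great-circle formula.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def zip_codes_worth_advertising(zip_centroids, stores, radius_miles=2.0):
    """Keep zip codes whose centroid lies within the radius of any store."""
    keep = []
    for zip_code, (lat, lon) in zip_centroids.items():
        if any(haversine_miles(lat, lon, s_lat, s_lon) <= radius_miles
               for s_lat, s_lon in stores):
            keep.append(zip_code)
    return sorted(keep)


# Hypothetical data: one store and two zip code centroids.
stores = [(41.8800, -87.6300)]
zip_centroids = {
    "zip_near": (41.8900, -87.6300),  # roughly 0.7 miles north of the store
    "zip_far": (41.9500, -87.6300),   # roughly 4.8 miles north of the store
}

print(zip_codes_worth_advertising(zip_centroids, stores))
```

With these made-up points, only `zip_near` is kept. Against the figures quoted above, the saving of more than $5 million on a cost of about $200,000 works out to a return of at least 25 times the spend.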
3. We won because we understood the science of incentivizing people to cooperate
Late last year the Pentagon’s mad-scientist research wing, Darpa, announced the Network Challenge, a $40,000 prize for the first group to find and report the locations of ten red weather balloons that the agency would set aloft one day in secret locations around the country. Most of the thousands of groups that signed up quickly realized that crowdsourcing was the way to find the 8-foot spheres. So, naturally, they offered bounties to balloon hunters. But Pentland’s crew at MIT’s Human Dynamics Lab, part of the MIT Media Lab, took their crowd control a step further. “It was trivial for us to slap together the balloon thing,” says the 58-year-old Pentland. That’s because other groups’ tactics were based on guesswork, he argues. His were based on lessons learned through data-mining research. “We won because we understood the science of incentivizing people to cooperate.”
Read the entire article here: Mining Human Behavior at MIT