Big Data: Hadoop Map Reduce

Hadoop is an open source framework for writing and running distributed applications that process huge amounts of data (more famously called Big Data). The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”

It has two components:

-          Distributed Storage (uses HDFS, the Hadoop Distributed File System)
Ensures the data is distributed evenly across all the nodes of the Hadoop cluster. Data can optionally be replicated across nodes (redundancy) so that the cluster can recover from failures.

-          Distributed Computing (uses MR, the Map Reduce paradigm)
Once the data is available on the Hadoop cluster, the MR code (typically written in Java or C++) is moved to each of the nodes for computation on the data. Map Reduce has two phases: map and reduce.
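To make the two phases concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop at all; the function names (mapper, shuffle, reducer, run_job) are my own illustrative choices, and the shuffle step simply imitates the grouping a MapReduce framework performs between the map and reduce phases.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, imitating what the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all the values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data is big", "hadoop stores big data"]))
```

On a real cluster the mapper and reducer would run on the nodes holding the data; only the overall shape of the computation carries over from this sketch.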

One of the early examples of distributed computing is the SETI@home project, where volunteers offer the CPU time of their personal computers for research on radio telescope data in the search for intelligent life beyond Earth. It differs from Hadoop MR in that, with SETI@home, data is moved to the place where the computing happens, whereas with Hadoop the code is moved to the place where the data lives. Other distributed computing projects include finding the largest prime numbers and sorting petabytes of data in the shortest time.

Applications of Hadoop MR – Big data

  •           Weblog analysis
  •           Fraud detection
  •           Text Mining
  •           Search Engine Indexing
  •           LinkedIn uses it for “Who viewed your profile” and “People you may know” recommendations
  •           Amazon.com uses it for book recommendations

Hadoop MR Wrapper applications include

  •           Pig : A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  •           Hive : A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
  •           Mahout : Machine learning implementations built on Map Reduce.

Location Intelligence using Social Media Data: Customer Location Aware Systems

In the recent past, a variety of new social media sites have emerged: location-based services (like Foursquare and Gowalla), group deals (like Groupon) and microblogging (like Twitter and Facebook). These sites integrate with one another; for instance, a Foursquare check-in can be configured to automatically publish a tweet with the URL of the location checked into. Because most of this data is publicly available, there are business opportunities in combining it with internal customer data.
             A retailer’s major concern is understanding its customers better, to gain a 360-degree view of each customer. Most companies have a strategy to integrate internal customer behavior data across POS, e-commerce, mail order, etc. By combining that data with social media data from location-based services (like Foursquare) and microblogging services (like Twitter), retailers now have the ability to track where their customers are.
           When a customer checks in at a location adjacent to the store, the retailer could identify this event and target the customer in real time with offers or product recommendations likely to bring him or her into the store, increasing footfall and sales. The channel could be SMS, email or voicemail. The offer could be linked to the customer’s overall sentiment score, derived by analyzing his or her tweets over time. That is, a dissatisfied customer could be given a higher-value offer than a neutral or satisfied customer.
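The check-in rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0.5 km radius, the sentiment scale (negative means dissatisfied), the -0.2 cut-off and the offer names are all hypothetical values chosen for the example, and the distance uses the standard haversine great-circle formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def choose_offer(checkin, store, sentiment, radius_km=0.5):
    """Return an offer tier for a check-in, or None if it is too far from the store.

    checkin and store are (lat, lon) tuples; sentiment is an assumed score
    in [-1, 1] where negative values mean a dissatisfied customer.
    """
    distance = haversine_km(checkin[0], checkin[1], store[0], store[1])
    if distance > radius_km:
        return None
    # Hypothetical policy: dissatisfied customers get the higher-value offer.
    return "high_value" if sentiment < -0.2 else "standard"


if __name__ == "__main__":
    store = (12.9720, 77.5950)
    print(choose_offer((12.9716, 77.5946), store, sentiment=-0.5))
```

A production system would receive check-ins as a stream and look up nearby stores with a spatial index, but the decision itself stays this simple.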
            Pharmacy retail stores could leverage this platform for refill reminders, to increase drug adherence. When a customer is identified as being close to the store, a message could go out via email, SMS, voicemail, etc. This builds goodwill, since customers feel cared about, which in turn increases loyalty.
          The rise of social media sites has raised concerns about user privacy. I feel users should be well educated about their privacy settings on these sites, and any targeting should be done only with the user’s approval. We need to instill the message in customers: “The more data you share with us, the better the service you receive”.
Shown below is a five-slide deck about this idea, “Location Intelligence using Social Media data: Customer Location Aware Systems”, which won the Social Media Analytics event at an analytics expo.

Did the use of helmets reduce deaths due to two-wheeler accidents?

This blog post analyzes the implementation of the helmet rule in Indian states and its effect on bringing down accidental deaths involving two-wheelers. Here we focus on one particular state, Karnataka.
               Mandatory helmet legislation is included at the national level in the Indian Motor Vehicles Act, 1988. However, implementing this law has been left to the individual states. The Karnataka government enforced a mandatory helmet rule for all two-wheeler riders in 2007. Traffic police started fining violators, and within no time good compliance was observed. The report Accidental Deaths & Suicides in India publishes data on accidental deaths broken down by type of vehicle and by state (from 1967 to 2009). Plotting the trend of accidental deaths due to two-wheelers in Karnataka, we see a clear decline after 2007, specifically an 8% decline as of 2009. But can we attribute this drop to the helmet rule alone? Let’s explore.
               Let’s identify another state to act as a control: one that had a similar pattern of accidental deaths in the years before 2007 but did not strictly enforce the helmet rule. If we saw a similar dip in accidental deaths in this state, we could conclude with some confidence that the helmet rule was not what brought the deaths down. Analyzing accidental death patterns across states, the neighbouring state Tamil Nadu (TN) turns out to fit: its pattern before 2007 was similar to Karnataka’s, and it was not a strict enforcer of the helmet rule (violators were not penalized often).
              The figure below shows accidental deaths plotted for both states from 2001 to 2009. Until 2007, accidental deaths increased year on year in both states. After 2007, however, Tamil Nadu continued to see deaths from two-wheeler accidents increase for the next two years up to 2009, while Karnataka showed a decline. Taking these points together, strict enforcement of the helmet rule in Karnataka appears to have helped bring down accidental deaths due to two-wheelers.
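The comparison against a control state is essentially a difference-in-differences calculation, which can be sketched as below. The counts in the example are made-up placeholders, not figures from the report: the treated state's numbers are chosen only so that they reproduce the 8% decline mentioned above, and the control state's 10% rise is purely hypothetical.

```python
def percent_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated state minus change in the control state,
    in percentage points. A negative value means the treated state did
    better (fewer deaths) than the control trend would predict."""
    return (percent_change(treated_before, treated_after)
            - percent_change(control_before, control_after))


if __name__ == "__main__":
    # Hypothetical counts for illustration only (NOT taken from the report).
    karnataka_2007, karnataka_2009 = 1000, 920      # an 8% decline
    tamil_nadu_2007, tamil_nadu_2009 = 1000, 1100   # an assumed 10% rise
    effect = difference_in_differences(karnataka_2007, karnataka_2009,
                                       tamil_nadu_2007, tamil_nadu_2009)
    print(f"Estimated effect: {effect:.1f} percentage points")
```

The estimate is only as good as the assumption that both states would have followed the same trend without the rule, which is why the similarity of the pre-2007 patterns matters.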

Helmet Rule - Effectiveness Analysis

Socializing Insights with end users: Analytics for masses – Amazon vs. LinkedIn

This blog post compares amazon.com and linkedin.com in terms of their analytic maturity and their use of data shared by customers. As Thomas Davenport mentions in his book “Competing on Analytics”, amazon.com is one of the few companies built on the foundation of data, a so-called “analytically mature” company. LinkedIn has joined the list, with a lot of new features available to its users.

              As customers interact with a site, they generate data about their liking for certain products or features. Companies like amazon.com and LinkedIn clearly understand how to leverage this information to make the interaction between the customer and the site even more valuable and relevant. The more data users are willing to share with the site about their likes and dislikes, the better the site’s recommendations for them. Companies need to instil this confidence in customers’ minds, so that users share data willingly.

               Amazon.com and LinkedIn make available every little fact about consumers’ behaviour and their interactions with other users or products, to help change their behaviour in the decisions they make: whether or not to buy a product, whether to look for a change of employer, etc.

LinkedIn has amazing insights about companies and profiles, all freely available to users. In an interview with LinkedIn CEO Reid Hoffman by Andreas Weigend, Reid talks about every individual as a small business: individuals think of their reputation in terms of the number of new connections, who viewed their profile, how many times their profile came up in search results, and similar stats. Andreas Weigend, a social data expert, talks about the behavior change that features like ‘who viewed your profile in the last 15 days’ bring about in end users and in the way companies like LinkedIn treat their users.

(i)                  Insights about companies (let’s say we are researching the company mu-sigma):

  • Employee switching patterns between companies: employees who moved from “xyz” to mu-sigma.
  • Employee switching patterns between companies: employees who moved from mu-sigma to “abc”.
  • Gender distribution: M to F ratio at mu-sigma.
  • How mu-sigma differs from other companies by years of experience. Similar statistics are available by job function, educational qualification and university, along with benchmarks against similar companies.
  • People who looked at “mu-sigma” also viewed which other companies?
  • Where employees of mu-sigma call home.
  • Most recommended at mu-sigma.
  • Time trend of employees who got a change in title.

(ii)                Insights about profiles/users

    • Who viewed my profile in the last 15 days?
    • How many times did your profile show up in search results?
    • Recommendations of other profiles/users you might know.
    • Companies the user might be interested in following.
    • Relevant jobs for every user, with the functionality to apply for them.
    • Work recommendations by colleagues and customers.
LinkedIn

    Here’s a look at what amazon.com offers. When purchasing a product at amazon.com, the user is presented with stats related to:

  • How many users who searched for a book, say “Outliers” by Malcolm Gladwell, ended up purchasing it, or ended up purchasing “The Tipping Point”, “Blink”, “What the Dog Saw”, etc., in that order. However, I feel it would help to quantify this, for example by calling out that 80% of people who searched for “A” ended up purchasing “B”, or that 80% of people who searched for “A” ended up purchasing “A”.
  • “Frequently Bought Together” items for a given product.
  • Review statistics: how many rated 5-star, 4-star and so on, as a bar chart.
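The quantification suggested in the first bullet (what share of people who searched for “A” went on to buy “A” or “B”) could be computed along these lines. This is a sketch on an assumed input format, a list of (searched item, purchased item) pairs with None when nothing was bought; it says nothing about how Amazon actually computes its statistics.

```python
from collections import Counter, defaultdict


def purchase_shares(sessions):
    """For each searched item, return the share of searches that ended in
    the purchase of each item.

    sessions: iterable of (searched_item, purchased_item_or_None) pairs.
    """
    searches = Counter()
    purchases = defaultdict(Counter)
    for searched, purchased in sessions:
        searches[searched] += 1
        if purchased is not None:
            purchases[searched][purchased] += 1
    return {
        searched: {item: count / searches[searched]
                   for item, count in bought.items()}
        for searched, bought in purchases.items()
    }


if __name__ == "__main__":
    # Hypothetical sessions: five searches for "A", three bought "A", one bought "B".
    log = [("A", "A")] * 3 + [("A", "B"), ("A", None)]
    for item, share in purchase_shares(log)["A"].items():
        print(f"{share:.0%} of people who searched for A purchased {item}")
```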

We are moving towards an era of socializing data with end users, so that every little decision they make can be data driven. WordPress, Netflix, Glassdoor, etc. are some of the other companies geared towards this trend. The intention of collecting data has truly gone beyond marketing.

“Integrate, Analyze, Visualize & Socialize” – Visualization Tools & Techniques

Turning raw data into insights often involves integrating data from multiple disparate sources (not limited to structured ones), analyzing the data, visualizing it and socializing the results with the broader audience to whom they are of interest. In this cycle of turning data into insights, visualization plays a vital role, and hence it is the topic of this blog post. First, visualization aids the analysis of huge data sets by revealing patterns that are easier to interpret visually than in a tabular layout of numbers. Second, visualization helps represent the numbers in a form that is easy for everyone to read and understand. Insights conveyed visually can be grasped in a minute or two, where text or a table of numbers might have taken 3-4 minutes. This is an important factor to consider, especially when you are delivering findings to the CEO/CFO/CXO/CIO of a company, as they often have limited time.

London Cholera Outbreak visualized

Going back to the history of visualization, the most famous early example of mapping epidemiological data was Dr. John Snow’s map of deaths from a cholera outbreak in London in 1854, in relation to the locations of public water pumps. The original (high-res PDF copies from UCLA) spawned many imitators, including this simplified version by Gilbert in 1958. Tufte (1983, p. 24) says, “Snow observed that cholera occurred almost entirely among those who lived near (and drank from) the Broad Street water pump. He had the handle of the contaminated pump removed, ending the neighborhood epidemic which had taken more than 500 lives.”
The following pointers should help anyone analyze data and socialize findings through newer, effective visualization techniques:

1. Fusion Charts – covers the basic chart types; all it needs is a data file and a configuration file that links the chart to the data. Flash based, web supported, with interactive charts.
2. Fusion Maps – contains maps of all countries and major cities worldwide; interactive, Flash based and web supported, driven by a data file and a configuration file.
3. Fusion Widgets – covers some of the coolest visualization techniques, like angular gauge, spark line/column, Gantt chart, pyramid, cylindrical and thermometer gauges and bulb gauge. Some of these charts support real-time streaming, generally used in stock market analysis.
4. Power Charts – contains some of the rarer chart types, like node chart, heat map, waterfall chart, multilevel pie chart, candlestick chart, etc. Again Flash based and hence web supported.
5. R (including the Revolution Computing distribution) – a powerful open source data mining/statistics language which can generate stacked multi-combinatorial charts using a single line of command.
6. Google Visualization – JavaScript based and web supported; includes some of the coolest viz techniques, like the motion chart, which can display data in 5 dimensions, geomap, word cloud, money pile, 3D chart, QR code, etc.
7. Google Charts – contains all the basic chart types, from Google.
8. Custom Flex Charts – using custom-written Flex and ActionScript code.
9. Microsoft Excel – famous for quick and easy chart creation; the latest version now has spark line chart support.
10. Tableau – data exploration. I would recommend this tool for rapid-fire analytics involving various dimensions; changing views of the metrics by dimension hierarchy is as easy as drag and drop.
11. BI Report Tools – BOXI, Cognos – commercial BI tools with support for creating various report types based on charts and tabular layouts.

Industry trends include real-time streaming charts (used in supply chain analytics), interactive charts, mobile-supported charts, alerts built into charts (for example, emailing business users when the sales of any product fall below $x on three consecutive days), and video- and audio-supported charts.
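The alert rule given as an example above (sales below $x on three consecutive days) is easy to express in code. The sketch below is only illustrative: the function name and the input, a plain list of daily sales figures, are my own assumptions, and sending the actual email is left out.

```python
def first_alert_day(daily_sales, threshold, run_length=3):
    """Return the index of the day on which an alert should fire, i.e. the
    day that completes `run_length` consecutive days of sales below
    `threshold`, or None if that never happens."""
    streak = 0
    for day, sales in enumerate(daily_sales):
        # Extend the streak on a bad day, reset it on a good one.
        streak = streak + 1 if sales < threshold else 0
        if streak >= run_length:
            return day
    return None


if __name__ == "__main__":
    # Hypothetical daily sales for one product, with a threshold of 100.
    sales = [120, 90, 80, 130, 70, 60, 50]
    print(first_alert_day(sales, threshold=100))
```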

Some of the insights have stood best, because they were simple!

 In this blog post, I talk about 3 scenarios in which highly valuable insights were derived from analysis that remained simple.

1. Customers shopped online, returned via stores

Randy Lea, VP of product & service marketing at Teradata, talks about one of their clients, which had tagged its e-commerce customers as its best customers based on the web sales they generated, and was reaching out to them with various promotions. However, on integrating the web data with enterprise data (store data), the client found that most of these customers were buying things online in multiple units and returning them through stores.

        For example, some customers bought 4-5 shirts of different colors, retained the one they liked most and returned the rest by visiting the stores. Effectively, customers were buying through one channel (web) and returning through another (store). Hence the web customers believed to be the best were actually average shoppers, and shouldn’t have been sent offers.

Source: Teradata (Video)
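The insight in this first scenario comes from netting sales and returns per customer across channels instead of looking at web sales alone. A minimal sketch of that integration follows; the transaction layout (customer id, channel, kind, amount) and the sample figures are hypothetical.

```python
from collections import defaultdict


def web_sales_by_customer(transactions):
    """Gross web sales per customer: the single-channel view."""
    totals = defaultdict(float)
    for customer_id, channel, kind, amount in transactions:
        if channel == "web" and kind == "sale":
            totals[customer_id] += amount
    return dict(totals)


def net_sales_by_customer(transactions):
    """Sales minus returns per customer across all channels."""
    net = defaultdict(float)
    for customer_id, channel, kind, amount in transactions:
        if kind not in ("sale", "return"):
            raise ValueError(f"unknown transaction kind: {kind!r}")
        net[customer_id] += amount if kind == "sale" else -amount
    return dict(net)


if __name__ == "__main__":
    # Hypothetical data: c1 buys five shirts online, returns four in store.
    transactions = [
        ("c1", "web", "sale", 250.0),
        ("c1", "store", "return", 200.0),
        ("c2", "store", "sale", 80.0),
    ]
    print(web_sales_by_customer(transactions))
    print(net_sales_by_customer(transactions))
```

Ranking customers on the second view rather than the first is what reveals the apparent best web customers as average shoppers.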

2. In the United States, if you live more than two miles from a pharmacy store, you probably don’t shop there!

In the book “Data-Driven Marketing”, Mark Jeffery describes how Walgreens optimized its marketing spend using simple geospatial visualization. The picture on the right shows three stores of the Walgreens pharmacy chain on a map. Walgreens is a $59 billion annual revenue pharmacy company with 6,850 stores throughout the United States.

Source: “Data-Driven Marketing” by Mark Jeffery

Geo spatial visualization of Walgreens stores

This geospatial picture shows dots that are the customers and where they live, coded by shape depending on which of the three Walgreens stores they shop at. The “diamond” customers shop at Store 1; the “square” customers, at Store 2; and the “star” customers, at Store 3. This pharmacy retail chain predominantly markets using flyers in newspapers. The way they pay for the marketing is by zip code, denoted by the dashed line, for example, in the picture. Mike Feldner, the marketing manager who first created these pictures, noticed something interesting: the circle on the picture is two miles in radius, and after looking at many pictures throughout the United States, he noticed that there are no dots (customers) for a store more than two miles from the store. He concluded that in the United States, if you live more than two miles from a pharmacy store, you probably don’t shop there. At that time, Walgreens treated each U.S. locale equally, allocating equal dollar amounts for newspaper advertising in each zip code across the United States. But the data show that if there is no store within two miles of the zip code, customers do not shop at the store. Based on these data, Walgreens ultimately stopped spending advertising dollars in all zip codes without a store within two miles of the zip code. As you might guess, the impact to sales revenues was exactly zero. The impact to marketing, however, was a cost saving of more than $5 million, for a total cost of collecting the data and creating the plots of approximately $200,000. This multimillion-dollar saving in marketing did not require a lot of money, and the analysis was done on a personal computer (PC). This is yet another example of being simple in approach, yet making an impact.

Source: “Data-Driven Marketing” by Mark Jeffery

3. We won because we understood the science of incentivizing people to cooperate

Late last year the Pentagon’s mad-scientist research wing, Darpa, announced the Network Challenge, a $40,000 prize for the first group to find and report the locations of ten red weather balloons that the agency would set aloft one day in secret locations around the country. Most of the thousands of groups that signed up quickly realized that crowdsourcing was the way to find the 8-foot spheres. So, naturally, they offered bounties to balloon hunters. But Pentland’s crew at MIT’s Human Dynamics Lab–part of the MIT Media Lab–took their crowd control a step further. “It was trivial for us to slap together the balloon thing,” says the 58-year-old Pentland. That’s because other groups’ tactics were based on guesswork, he argues. His were based on lessons learned through data-mining research. “We won because we understood the science of incentivizing people to cooperate.”

Read the entire article here: Mining Human Behavior at MIT

 

Market Basket Analysis/Association Rule Mining using R package – arules

In my previous post, I discussed association rule mining in some detail. Here I show an implementation of the concept in the open source tool R, using the package arules. Market Basket Analysis is a specific application of association rule mining, in which retail transaction baskets are analysed to find the products likely to be purchased together. The output of the analysis forms the input for recommendation engines and marketing strategies.

Beyond BI & Analytics

For the last 6 months, I have been closely following trends in information management. Below are a few of my observations.

  • Data source explosion: Business problems are growing in complexity day by day, so there is a huge demand for analyzing data from a multitude of sources to help companies frame strategies for growth. GPS data accumulated by telecom companies offers insight into customers’ current location and enables context-aware recommendations. In fact, some telecom companies have introduced location-based pricing. Sensor data helps identify security threats to secure networks. Social network data has opened up a channel for marketing services and products. Analysis of such closely knit data leads to behavioral and contextual targeting. Traditional data analysis tools and algorithms fail to perform efficiently because such data is huge and needs newer data structures for efficient analysis.
  • Databases that go beyond the relational model are gaining popularity: NoSQL databases and graph/tree/XML-based databases.
  • Open source tools continue to emerge (R, RapidMiner, Weka).
  • Growing need for massive dataset analysis.
  • Artificial Intelligence (AI) and NLP are gaining popularity among data analysts (in addition to ML techniques).
  • Multimedia analytics: the need to gather critical metrics like customer footfall, or to quantify customer satisfaction from facial expressions. All these applications demand high-end signal processing (both image and video). There is a lot of scope for innovation in this area.
  • Privacy-preserving techniques for data analysis. These in turn encourage companies to outsource some critical data analysis to third parties.
  • Agile methodologies for analytics projects, to cope with rapidly changing customer/business needs.
  • Bio-inspiration/bio-imitation: learning from nature and natural processes to develop analogous techniques that could solve real-world problems. Some classic examples are neural networks, inspired by the working of the human brain; path optimization, inspired by ant colonies; the 280 degree view of the honey bee (vision), etc.
  • More and more data is being made publicly available.
  • Real-time data integration, insight generation and business decisions.
  • Complex visualization techniques through newer technologies like Adobe Flex, MS Silverlight, etc., which are known for generating RIAs (Rich Internet Applications).

And I am sure these are just a few items in the list, which is by no means exhaustive. Feel free to share your comments.

Datamining Video Lectures – Best way to learn

  Do you find analytics/data mining a difficult topic to understand and learn? To a certain extent that is true if books are your only source. Friends, I found these two very valuable, high-quality sources for learning topics related to data mining, and above all they are free.

  (i) From David Mease, who teaches DM at Google: You can access approximately 11 hours of video (11 parts) on the semester course “Statistical Aspects of Data Mining” here http://video.google.com/videosearch?q=mease+stats+202&sitesearch=# and you can also get PDF versions of the lecture slides and assignments; try to solve them and master them. I believe the author also has some blogs discussing problems in this topic. The best thing about this video tutorial is that David demonstrates the implementation of each technique using the open source data mining tool R.

Videos: http://video.google.com/videosearch?q=mease+stats+202&sitesearch=#
Lecture notes (PDF): http://www.stats202.com/original_index.html
Course Home: http://www.stats202.com/

  (ii) From Stanford University, where Andrew Ng teaches “Machine Learning”: This is another very useful video course. The semester course is covered in 20 parts, hence approximately 20 hours of quality knowledge. The best thing about Andrew is that he teaches the mathematics so well that you start visualizing the equations, which is a good way to learn maths. It’s not just about maths; he also shows video demos of machine learning projects implemented by his students, such as autonomous car driving, autonomous flying, converting a picture into a 3-D experience, etc., so you never get bored during the lectures. I loved it a lot. Hope you enjoy it too.

Videos: http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599&search_query=stanford+%2B+machine+learning
Lecture Notes:  http://www.stanford.edu/class/cs229/materials.html
Course Home: http://cs229.stanford.edu/

   I am sure you will find more content than what I have mentioned here. Feel free to explore the course pages. I personally believe anything is best learnt by first learning its applications, which gets you motivated, and the rest follows. I would like to thank Andrew Ng and David Mease for sharing their expertise. A good initiative by Stanford. Expecting more from top educational institutions.

Association Rule Mining

Association Rule Mining [ Implementation using R here]

Association rule mining is one of the classical DM techniques and a very powerful way of finding patterns in a data set. It is an unsupervised learning technique: we feed the association algorithm a data set (called experience E in the machine learning context), with no labelled target, and it formulates hypotheses (H) in the form of rules. The input data to an association rule mining algorithm requires a particular format, which will be detailed shortly.
Let me first introduce readers to some of the application areas of this DM technique and the motivation for studying association analysis. The classic application of association rule mining is analysing the market basket data of a retail store. For example, retail stores like Wal-Mart, Reliance Fresh and Big Bazaar gather data about customer purchase behaviour, and they have complete details of the goods purchased as part of a single bill. This is called market basket data, and its analysis is termed “market basket analysis”.
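As a small illustration of what such an analysis measures, the sketch below computes the three standard metrics for a single candidate rule (support, confidence and lift) over a handful of made-up baskets. It is written in plain Python rather than R, it only scores a rule you hand it, and it does not search for rules the way an algorithm such as Apriori does.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    cons = support(transactions, consequent)
    if ante == 0 or cons == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = both / ante      # estimate of P(consequent | antecedent)
    lift = confidence / cons      # above 1 means bought together more than by chance
    return {"support": both, "confidence": confidence, "lift": lift}


if __name__ == "__main__":
    # Hypothetical bills, each one a set of items bought together.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(rule_metrics(baskets, {"bread"}, {"butter"}))
```

Each basket here corresponds to one bill, which is the input format market basket analysis works on.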
