Big Data: Hadoop Map Reduce

Hadoop is an open source framework for writing and running distributed applications that process huge amounts of data (more famously called Big Data). The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”

It has two components:

- Distributed Storage (uses HDFS – the Hadoop Distributed File System)
Distributes the data across the nodes of the Hadoop cluster. Data can optionally be replicated across nodes (redundancy) so that the cluster can recover from failures.

- Distributed Computing (uses MR – the Map Reduce paradigm)
Once the data is available on the Hadoop cluster, the MR code (typically written in Java or C++) is moved to each of the nodes to compute on the data stored there. Map Reduce has two phases: map (the mapper) and reduce (the reducer).
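
To make the two phases concrete, here is a minimal single-machine sketch of the Map Reduce idea using a word count. It is plain Python, not the Hadoop API: the function names (`mapper`, `shuffle`, `reducer`, `word_count`) are my own, and the grouping step that Hadoop performs between the phases is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (simulates what the framework does between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = ["big data big cluster", "code moves to data"]
    print(word_count(sample))
```

On a real cluster each mapper would run on the node that holds its slice of the data, and the reducers would receive the grouped pairs over the network; the sketch only shows the shape of the computation.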

One of the early examples of distributed computing is the SETI@home project, where volunteers offered CPU time on their personal computers for research on radio telescope data, in the search for intelligent life outside Earth. This differs from Hadoop MR in one important way: in the case of SETI@home, data is moved to the place where the computing happens, while in Hadoop the code is moved to the place where the data resides. Other distributed computing projects include finding the largest prime numbers, sorting petabytes of data in the shortest time, etc.

Applications of Hadoop MR – Big data

  • Weblog analysis
  • Fraud detection
  • Text Mining
  • Search Engine Indexing
  • LinkedIn uses it for “Who viewed your profile” and “People you may know” recommendations
  • Amazon.com uses it for book recommendations

Hadoop MR Wrapper applications include

  • Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  • Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
  • Mahout: Machine learning algorithms implemented on Map Reduce.

Socializing Insights with end users: Analytics for masses – Amazon vs. LinkedIn

This blog post compares amazon.com and linkedin.com in terms of their analytic maturity and their use of the data shared by their customers. As Thomas Davenport mentions in his book “Competing on Analytics”, amazon.com is one of the few companies built on the foundation of data, a so-called “analytically mature” company. LinkedIn has joined the list, with a lot of new features available to its users.

As customers interact with the site, they generate data about their liking for certain products or features. Companies like amazon.com and LinkedIn clearly understand how to leverage this information to make the interaction between the customer and the site even more valuable & relevant. The more data users are ready to share with the site about their likes and dislikes, the better the site’s recommendations for them will be. The companies need to instil this confidence in the customer’s mind, so that users share their data willingly.

Amazon.com & LinkedIn make available every little fact about consumers’ behaviour and their interaction with other users or products, to help shape the decisions they make: whether or not to buy a product, whether to look for a change of employer, etc.

LinkedIn has amazing insights about companies and profiles, all of which are freely available to its users. In an interview with Andreas Weigend, LinkedIn CEO Reid Hoffman talks about every individual as a small business: every individual thinks of their reputation in terms of the number of new connections, who viewed their profile, how many times their profile came up in search results, and similar stats. Andreas Weigend, a social data expert, talks about the behavior change that features like ‘who viewed your profile in the last 15 days’ bring about in end users, and in the way companies like LinkedIn treat their users.

(i) Insights about companies (let’s say we are researching the company mu-sigma):

  • Employee switching patterns between companies. Employees moved from ‘xyz’ to mu-sigma.
  • Employee switching patterns between companies: Employees  moved from mu-sigma to “abc”.
  • Gender distribution: M to F ratio at mu-sigma.
  • By years of experience, how mu-sigma differs from other companies. Similar statistics are available by job function, educational qualification & university, along with a benchmark against similar companies.
  • Other companies viewed by people who looked at “mu-sigma”.
  • Where employees of mu-sigma call home
  • Most recommended at mu-sigma.
  • Time trend of employees who got a change in title.

(ii) Insights about profiles/users

    • Who viewed my profile in the last 15 days?
    • How many times did your profile show up in search results?
    • Recommendations of other profiles/users you might know.
    • Companies which the user might be interested in following.
    • Relevant jobs for every user with functionality to apply for it.
    • Work recommendations by colleagues and customers.
LinkedIn

Here’s a look at what amazon.com offers. When purchasing a product at amazon.com, the user is presented with stats such as:

  • How many users who searched for the book “Outliers” by Malcolm Gladwell (say) ended up purchasing it, or ended up purchasing “The Tipping Point”, “Blink”, “What the Dog Saw”, etc., listed in that order. However, I feel that quantifying this would help: for instance, calling out that 80% of people who searched for “A” ended up purchasing “B”, or that 80% of people who searched for “A” ended up purchasing “A”.
  • “Frequently bought together” items for a given product.
  • Review statistics: How many rated 5-star, 4-star and so on, as a bar chart.
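
The kind of quantified statistic suggested above (“80% of people who searched for A ended up purchasing B”) can be sketched as follows. This is only an illustration: the function name `purchase_rates`, the input format and the sample data are my own assumptions, and say nothing about how amazon.com actually computes its numbers.

```python
from collections import Counter, defaultdict


def purchase_rates(sessions):
    """Return {searched_item: {purchased_item: share of searches for searched_item}}.

    `sessions` is a list of (searched_item, purchased_item) pairs; purchased_item
    is None when the search did not lead to a purchase. Searched items that never
    led to any purchase do not appear in the result.
    """
    searches = Counter()
    purchases = defaultdict(Counter)
    for searched, purchased in sessions:
        searches[searched] += 1
        if purchased is not None:
            purchases[searched][purchased] += 1
    return {
        searched: {item: count / searches[searched] for item, count in bought.items()}
        for searched, bought in purchases.items()
    }


if __name__ == "__main__":
    # Made-up sessions: five searches for "A"; four bought "B", one bought "A".
    sessions = [("A", "B"), ("A", "B"), ("A", "B"), ("A", "B"), ("A", "A")]
    for item, share in purchase_rates(sessions)["A"].items():
        print(f"{share:.0%} of people who searched for A purchased {item}")
```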

We are moving towards an era of socializing data with end users, so that every little decision they make can be data driven. WordPress, Netflix, Glassdoor, etc. are some of the other companies geared towards this trend. The intention behind collecting data has truly gone beyond marketing purposes.

“Integrate, Analyze, Visualize & Socialize” – Visualization Tools & Techniques

Turning raw data into insights often involves integrating data from multiple disparate sources (not limited to structured ones), analyzing the data, visualizing it, and socializing the results/insights with the broader audience to whom they are of interest. In this cycle of turning data into insights, visualization plays a vital role, and hence is the topic of this blog post. First, visualization can aid in analyzing huge data sets by revealing patterns that are easier to interpret visually than in a tabular layout of numbers. Second, visualization can represent the numbers using visuals that are easy for everyone to read and understand. Insights conveyed through visuals can be grasped in a minute or two, where a textual aid or table of numbers might have taken 3-4 minutes. This is an important factor to consider, especially when you are delivering the findings to the CEO/CFO/CXO/CIO of a company, as they often have limited time.

London Cholera Outbreak visualized

Going back into the history of visualization, the most famous early example of mapping epidemiological data was Dr. John Snow’s map of deaths from a cholera outbreak in London, 1854, in relation to the locations of public water pumps. The original (high-res PDF copies from UCLA) spawned many imitators, including this simplified version by Gilbert in 1958. Tufte (1983, p. 24) says, ”Snow observed that cholera occurred almost entirely among those who lived near (and drank from) the Broad Street water pump. He had the handle of the contaminated pump removed, ending the neighborhood epidemic which had taken more than 500 lives.”

The following tools should help anyone analyze data and socialize findings using newer, effective visualization techniques:

1. Fusion Charts – covers the basic chart types; all it needs is a data file and a configuration file that links the chart to the data file. Flash based, web supported, and supports interactive charts.
2. Fusion Maps – contains maps of all countries and major cities worldwide. Interactive, Flash based, web supported; uses a data file and a configuration file.
3. Fusion Widgets – includes some of the coolest visualization techniques, like angular gauge, spark line/column, Gantt chart, pyramid, cylindrical & thermometer gauges, and bulb gauge. Some of these charts support real-time streaming, generally used in stock market analysis.
4. Power Charts – contains some of the rarer chart types, like node chart, heat map, waterfall chart, multilevel pie chart, candlestick chart, etc. Again Flash based and web supported.
5. R – Revolution Computing – a powerful open source data mining/statistics language which can generate stacked multi-combinatorial charts using a single line of command.
6. Google Visualization – JavaScript based, web supported; includes some of the coolest viz techniques, like the motion chart (which can display data in 5 dimensions), geomap, word cloud, money pile, 3D chart, QR code, etc.
7. Google Charts – contains all the basic chart types, from Google.
8. Custom Flex Charts – built using custom-written Flex and ActionScript code.
9. Microsoft Excel – famous for quick and easy chart creation; the latest version now has spark line chart support.
10. Tableau – data exploration. I would recommend this tool for rapid-fire analytics involving various dimensions; changing the view of metrics by dimension hierarchy is as easy as drag and drop.
11. BI Report Tools – BOXI, Cognos – commercial BI tools with support for creating various report types based on charts and tabular layouts.

Industry trends include real-time streaming charts (used in supply chain analytics), interactive charts, mobile-supported charts, alerts in charts (for example, alerting business users by email when the sales of any product go below $x on three consecutive days), and video & audio supported charts.
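
The alert rule mentioned above (sales below $x on three consecutive days) is simple enough to sketch. This is an illustration under my own assumptions: the function name `first_alert_day`, its parameters and the sample figures are hypothetical, and the email step is left out; the code only finds the day on which the alert would fire.

```python
def first_alert_day(daily_sales, threshold, run_length=3):
    """Return the index of the day that completes the first run of `run_length`
    consecutive days with sales strictly below `threshold`, or None if no such
    run exists."""
    streak = 0
    for day, sales in enumerate(daily_sales):
        # Extend the streak on a low-sales day, reset it otherwise.
        streak = streak + 1 if sales < threshold else 0
        if streak >= run_length:
            return day
    return None


if __name__ == "__main__":
    # Made-up daily sales figures for one product, in dollars.
    sales = [120, 90, 80, 110, 70, 60, 50]
    day = first_alert_day(sales, threshold=100)
    if day is not None:
        print(f"Alert: sales below threshold for 3 consecutive days, ending on day {day}")
```

In a real dashboard the check would run as each new day’s figure arrives, and the `print` would be replaced by whatever notification channel the BI tool provides.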
