In this blog post, I am going to discuss about mobile advertising networks, ecosystem & industry trends. This understanding would serve as base for the future posts I plan to write around data science problems in advertising industry. Lets start in top down fashion. Advertising is a multi-billion dollar industry (estimated $500Bn a year globally as of 2010). Traditional advertising relied on TV, Radio, Prints: fliers & Billboards. With the advent of digital technologies, the advertising dollars has been fast shifting to Web and Mobile. Specifically, mobile growth has outgrown desktop growth due to availability of inexpensive mobile devices in the form of smartphones and tablets; and reduced mobile data rates in the recent days. We know Google, Facebook & Twitter makes money majorly through Ads. It is important to realize that these companies put data at the centre of all its products, which helps create targeted ads relevant to users resulting in great CTRs (Click-thru-rate: a key metric in advertising to measure ad effectiveness/relevance. More on metrics in my next blogpost)
So, what exactly is an Ad Network?
An Advertising Network is a marketplace for ads, which connects advertisers(on demand side) with publishers(on supply side). It is merely an entity Read More
We talk about big data quite often these days, wanted to put some fundas about basics around data. Do you know the singular form of data ? How data differs from information vs. knowledge? How insights convert to actions? Here is my attempt towards answering some of these.
Data is often raw in binary represented as a number or a character or a string. Data is a plural version of datum. Information is anything which puts context to data. For example: The number 89 itself doesn’t mean anything unless we fit a context to say – the car’s speed is 89 kmph. Knowledge on other hand is about knowing how things around us work & the larger world is interconnected. It is obtained based on experiences, experiment, research, etc. In the below infographic I have tried to explain these terms using a simplified version of intelligence that can be embedded in cars. More on the infographic follows.
I have tried to take a very simplified version here. Now just imagine, when we talk about intelligent cars – we are usually talking about 100s of such parameters instead of just a single varaible (speed of car) ,all collected from multiple sensors obtained in realtime streaming format to make such decisions. Knowledge is obtained through predictive algorithms which continuously learns [in AI terms to say “adapts”] from data and help in making recommendation about safety of the vehicle. Now imagine these 100s of parameters collected every millisecond from 1000s of connected cars around you – this is what forms “Big Data”.
Hope you liked this post, I will be writing up more articles – as have been getting requests from friends around the globe. Stay tuned !
I have spent most of my life into data and its applications to problems. Now, when i look at some patterns in algorithms we use in analyzing data, one thing that emerges is increased use of meta-algorithms. Boosting techniques(like AdaBoost) is one such meta-algorithm which uses multiple weak learners(classifiers) to improve prediction accuracy. Random forests,a very prominent Read More
We are in the era of big data, with newer sources of data emerging at an exponential rate involving sensor data, EHR, social network/media data & machine generated data. In this blog post, I will be discussing specifically about social network data, its applications in data science problems, solutions & visualizations. In simple terms, a network is a group of nodes interconnected by links (also called edges). In a social network, users are the nodes and connections are the links/edges. Consider a Facebook user’s network, by adding friends, we are creating the links. Before getting into a little more of technical details of a network, let’s spend some time on more interesting area – its applications to data science problems.
Linkedin, Facebook & other social network uses the network information, to predict “People you may know” & offer people recommendations. Product companies like Microsoft, Oracle uses network analytics to identify key influencers in leading tech forum/online community networks to help market their products by utilizing the greater reach of the identified influencers. WWW is another example of networks. The pages are interconnected in the form of network & its analysis helps understand information flow across the WWW. “People you may know” feature generally works using triangulation. i.e If B and C are connected. If A knows B, then it is likely that A knows C. Most of the people recommendation work based on this principle.
Clustering is an unsupervised classification (learning) technique, where the objective is to maximize inter-cluster distance while minimizing the intra-cluster distance. By unsupervised, we mean clustering or segmenting or classifying data based on all the available attributes and specifically there is no availability of class information. A supervised classification on other hand uses class information.
As usual, before we jump into ‘how’ let’s answer the ‘why’. Clustering is applied to solve variety of problems ranging from biological systems to using it for exploratory analysis of data ( as a pre-processing technique). Many of the predictive analytics algorithms use clustering solutions as one of their components. It is used in all major brands for CRM, to understand their customer better. Another use of clustering is in outlier detection or fraud transaction identification. If you have heard about a site called www.similarsites.com, it extensively works on clustering algorithms where the sites are segmented/clustered based on website attributes like category of domain, number of users, traffic, content type, corporate or personal, blog, image blog, video blog,etc. For example, if you entered INMOBI, you would get a list of companies which are in this space mainly its competitors – mojiva, Millenialmedia, Admob, Quattro, Mobclix,etc. If you are looking for image hosting site and want to know alternatives/options, this will be helpful.
We talk about similarity in terms of distance measures like
(i) Euclidean Distance
(ii) Manhattan Distance
Hadoop is an open source framework for writing and running distributed application that process huge amounts of data ( more famously called Big Data). The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term”
It has two components
– Distributed Storage ( uses HDFS – Hadoop file system)
Ensures the data is distributed evenly across all the nodes of Hadoop cluster. There is option of replicate data across nodes (redundancy) to provide capabilities to recover from failures.
– Distributed Computing ( uses MR – Map Reduce Paradigm)
Once the data is available on Hadoop cluster. The MR codes ( typically return in Java,C++) is moved to each of the nodes for computation on the data. Map Reduce has two phases mapper and Reducer.
One of the early examples of a distributed computing include SETI@home project, where a group of people volunteered to offer CPU time of their personal computer for research on radio telescope data to find intelligent life outside earth. However this differs from Hadoop MR is in the fact that, data is moved to place where computing takes place in case SETI, while code is moved to the place of data in latter case. Other projects include finding the largest prime numbers, sorting Pet bytess of data in shortest time,etc.
Applications of Hadoop MR – Big data
- Weblog analysis
- Fraud detection
- Text Mining
- Search Engine Indexing
- LinkedIn uses for “Who viewed your profile” and “People you may know – recommendations”
- Amazon.com uses for book recommendation
Hadoop MR Wrapper applications include
- Pig : A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
- Hive : A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the
runtime engine to MapReduce jobs) for querying the data.
- Mahout : Machine Learning implementation in Map Reduce.
In the recent past, a variety of new social media sites have emerged – location Based Services( like foursquare, gowala), Group Deals( like groupon), microblogging( like twitter, fb). These social media sites have provided integration features with other SM sites, for instance A foursquare checkin can be configured to automatically publish a tweet with the URL of location checked-in. All these data being primarily open source, we have various business opportunities to leverage the data by integrating this with the internal customer data.
A retailer’s major concern is the need to understand their customers better, to gain the 360 degree view of the customer. Most of the companies, have strategy to integrate the internal customer behavior data acorss POS, Ecomm, Mail order,etc. By leveraging the social media data and integrating with location based services(like foursquare) and microblogging services(twitter), the retailers now have the ability to track customers.