We are in the era of big data, with newer sources of data emerging at an exponential rate: sensor data, electronic health records (EHR), social network/media data and machine-generated data. In this blog post, I will discuss social network data specifically: its applications in data science problems, solutions and visualizations. In simple terms, a network is a group of nodes interconnected by links (also called edges). In a social network, users are the nodes and connections are the links/edges. Consider a Facebook user’s network: by adding friends, we create the links. Before getting into the technical details of a network, let’s spend some time on a more interesting area: its applications to data science problems.
LinkedIn, Facebook and other social networks use network information to predict “People you may know” and offer people recommendations. Product companies like Microsoft and Oracle use network analytics to identify key influencers in leading tech forums and online communities, and then market their products through the greater reach of those influencers. The WWW is another example of a network: its pages are interconnected by links, and analysing that network helps us understand how information flows across the web. The “People you may know” feature generally works by triangulation, i.e. if A knows B, and B and C are connected, then it is likely that A knows C. Most people recommendations work on this principle.
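To make the triangulation idea concrete, here is a minimal sketch in Python. It assumes the network is stored as a plain dictionary mapping each user to the set of their friends. The function name `recommend_people` and the toy users are made up for illustration. Real recommenders at LinkedIn or Facebook are certainly more involved than counting mutual friends.

```python
from collections import Counter


def recommend_people(friends, user):
    """Rank people `user` is not yet connected to by number of mutual friends.

    `friends` maps each user to the set of users they are connected to
    (an undirected network, so every link appears in both directions).
    """
    direct = friends.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for stable output.
    return sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy network: A knows B and D, and both B and D know C.
friends = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}

print(recommend_people(friends, "A"))  # C is suggested, via 2 mutual friends
```

Counting mutual friends is just one simple way to score a candidate. The more triangles a new link would close, the stronger the suggestion.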
Herbert A. Simon wrote, “Solving a problem simply means representing it so as to make the solution transparent”. Visualization is a powerful analysis tool and we have seen many use cases of it, including the famous identification of the source of a diarrhoea outbreak using geo-visualization. Network visualization helps identify patterns hidden in complex network data, using open source tools like Gephi (a GUI application) and R (a programming language for statistics, data mining and analytics). LinkedIn Labs provides the LinkedIn InMaps application, which visualizes one’s professional LinkedIn network. It identifies groups in the network; some of these it can label based on user profile data, while others it cannot, because labelling them requires additional context (refer fig.). I have visualized my FB social network using Gephi here, and some very interesting findings emerged just from looking at the visualization. If you are interested in visualizing your FB data, use http://snacourse.com/getnet to download it in the form of a GML data file, which can then be visualized in Gephi or with the SNA packages available in R.
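(i) Four main sub-networks/subgroups exist in the network (named below). Strictly speaking, these are communities (clusters) rather than connected components, because some of them are linked to each other through bridging users.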
(a) School Friends
(b) Training Batch
(c) College Friends
(ii) Users who connect sub-networks: As you can see, the users marked A and B connect sub-networks. These are the users who bridge subgroups. People recommendations are based on such links.
(iii) Hierarchy in groups: The concentric circles in the graph represent a hierarchy. The inner subgroup represents college friends with the same major (in my case Electrical Eng.), while the bigger one represents all of my college friends.
(iv) A set of users who are not connected to any other user in the network. These are likely to be strangers who were added as friends by mistake.
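The same kind of observations can be checked in code rather than by eye. The sketch below, in plain Python, finds the connected components of a small made-up network and lists users with no connections at all. The user names and the helper functions `connected_components` and `isolated_users` are hypothetical, and Gephi or the R SNA packages offer equivalent features out of the box.

```python
def connected_components(adjacency):
    """Return a list of sets, one per connected component of an undirected network."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting everyone reachable from `start`.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def isolated_users(adjacency):
    """Users with no links at all, as in observation (iv)."""
    return sorted(user for user, links in adjacency.items() if not links)


# Hypothetical network: two school friends, two college friends,
# a bridging user "A" who knows someone in each group, and a stranger "x".
adjacency = {
    "s1": {"s2"},
    "s2": {"s1", "A"},
    "A": {"s2", "c1"},
    "c1": {"A", "c2"},
    "c2": {"c1"},
    "x": set(),
}

components = connected_components(adjacency)
print(len(components))            # number of components
print(isolated_users(adjacency))  # users with no connections
```

Because the bridging user joins the school and college groups, they land in the same component here; only the unconnected stranger forms a component of their own. Separating groups that are still linked by a few bridges needs a community detection method, which this sketch does not attempt.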
LinkedIn Labs lets you visualize your network without the pain of downloading the data and visualizing it with external tools; try out LinkedIn Labs InMaps. However, if you would like to find out who is most active in your network, a social index, or the number of hops needed to reach a specific user, it’s worth trying out Gephi.
The network considered above is undirected (non-directional), meaning that saying A is connected to B is the same as saying B is connected to A. However, consider this network, where the ‘likes’ relationship is visualized among a set of users. In this case the network would be directed, because A likes B means something different from B likes A.
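In code, the difference comes down to whether a link is stored in both directions or only one. Here is a minimal Python sketch, assuming a dictionary-of-sets representation; the helper names `add_friendship` and `add_like` are made up for illustration.

```python
def add_friendship(graph, a, b):
    """Undirected link: store it in both directions."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def add_like(graph, a, b):
    """Directed link: a likes b, which says nothing about b liking a."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set())  # make sure b exists as a node


friendships = {}
add_friendship(friendships, "A", "B")

likes = {}
add_like(likes, "A", "B")

print("A" in friendships["B"])  # True: friendship is mutual
print("A" in likes["B"])        # False: B has not liked A
```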
Now, what kind of questions can the graph/network model answer? “What is the shortest path for me to reach user A, so as to have maximum reach?” and “Who is the most social/influential person in my network?”, based on comment patterns, number of connections, number of likes, etc.
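Both questions have simple first-cut answers on a small network. The Python sketch below uses breadth-first search to find a shortest path (fewest hops), and takes the number of connections as a rough stand-in for “most social”. The network and the function names `shortest_path` and `most_connected` are hypothetical. Influence can be measured in many other ways (comments, likes, more refined centrality measures), which this sketch ignores.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one path with the fewest hops from source to target, or None."""
    if source == target:
        return [source]
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == target:
                # Walk back from the target to the source, then reverse.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


def most_connected(adjacency):
    """User with the most connections (ties broken alphabetically)."""
    return max(sorted(adjacency), key=lambda user: len(adjacency[user]))


# Hypothetical undirected network around "me".
network = {
    "me": {"B", "C"},
    "B": {"me", "C", "D"},
    "C": {"me", "B"},
    "D": {"B", "A"},
    "A": {"D"},
}

path = shortest_path(network, "me", "A")
print(path)                     # ['me', 'B', 'D', 'A']
print(len(path) - 1)            # 3 hops
print(most_connected(network))  # B
```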
For everyone interested in network analysis, I would recommend the Coursera course “Social Network Analysis” by Lada Adamic from the University of Michigan.