
Mechanistic interpretability is a field of study concerned with analyzing neural networks (and ML models more generally) with the intent of understanding and explaining the inner workings of a learned model. It has become particularly important with the rise of deep learning and neural-network-based techniques that learn from large datasets through supervised and self-supervised means, in many cases beating human baselines. More recently, Transformer-based LLMs have revolutionized AI as general reasoning machines that not only learn from text but also understand multi-modal data (vision and speech included), vaguely mimicking how humans sense the world and respond to stimuli. That alone is enough justification to invest in interpretability research, so that we have sufficient oversight of the benefits and harms these models can have on society. LLMs are notorious for hallucinating facts in QA; imagine if we could peek into the activations and identify the node(s) that are active when a model is hallucinating, we could then intervene to prevent hallucinations in the first place.

Mechanistic interpretability is akin to reverse-engineering the model to comprehend its “thought” processes. It is an active area of research in AI, with techniques like layer-wise relevance propagation, saliency maps, and attention mechanisms being explored to provide insights into the inner workings of complex models.

The rest of the article is roughly divided into two parts:
(i) reasons why we need this research, and
(ii) recent work and findings from this research in the context of LLMs. Specifically, we discuss recent work from DeepMind and MIT/Meta.

Why do we need to study interpretability?

  • Bias and Alignment with Human Values: The pursuit of mechanistic interpretability is essential in ensuring that AI systems operate without bias. It is paramount that AI decisions are aligned with human values, particularly concerning sensitive characteristics such as age and gender. For example, in the recruitment domain, it is crucial that job applicants are evaluated based on their qualifications rather than discriminatory factors — a goal that requires careful scrutiny of the training data to prevent inherent biases, such as a model that has learned to prefer one gender over another due to biased historical hiring practices.
  • Trustworthiness of AI Systems: We need to study mechanistic interpretability to cement the trustworthiness of AI systems. By understanding the internal mechanics of AI decision-making, we ensure that these systems can be relied upon without skepticism. Trust in AI is not a given; it is built on the clarity and justifiability of its processes and outputs.
  • Factuality Over Fabrication: AI systems must be designed to adhere to facts and resist the temptation to ‘hallucinate’ or fabricate information. Mechanistic interpretability studies the AI’s decision-making process to guard against the creation of convincing yet untrue facts, which could otherwise mislead individuals and propagate misinformation.
  • Regulatory Compliance in High-Stakes Fields: In sectors where the stakes are particularly high, such as finance, the demand for explainable AI is not just practical but regulatory. For instance, if a loan application is denied, regulatory frameworks mandate that the decision be transparent and explainable to the customer. This requirement ensures that a deep learning model’s high-confidence prediction of a loan default is accompanied by a rational and understandable explanation.
  • Enhancing Human Efficacy and Discovery: Finally, studying mechanistic interpretability has the potential to amplify human intellect by unraveling complex concepts that have yet to be discovered. This can lead to significant advancements in various fields, such as improving healthcare treatments and interventions, by leveraging AI’s ability to identify patterns and correlations beyond human discernment.

Recent Work on Mechanistic Interpretability

“Bridging the Human–AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero” is a paper from Google DeepMind in which the authors study the behavior of AlphaZero to uncover novel concepts in the way it plays chess. AlphaZero is an AI system that mastered chess via self-play, without human supervision. The authors deliberately chose an AI that plays chess, as the game is well understood and ground truth is much easier to validate than in science or medicine. There is also a quantitative measure of the quality of play, for both human experts and machines, known as the Elo rating.

They motivate the need for humans to learn from machines with an example: a system capable of producing more accurate cancer diagnoses or more effective personalised treatments than human experts is certainly useful, but transferring the rationale behind its decisions to human doctors could not only bring advances in medicine, it could also leverage doctors' strengths and generalisation ability to enable new breakthroughs.

The human representational space (H) has some overlap with the machine representational space (M). A representational space forms the basis of, and gives rise to, knowledge and abilities, which are of ultimate interest. The authors use representational space and knowledge roughly interchangeably: H represents what humans know and M what a machine knows. There are things both AI and humans know (M ∩ H), things only humans know (H − M), and things only machines know (M − H). Most existing research focuses only on M ∩ H. The authors posit that the knowledge gap represented by M − H holds the key to empowering humans, by identifying new concepts and new connections between existing concepts within highly performant AI systems.

One prominent example in the history of AI is move 37, which AlphaGo played in its match against Lee Sedol. The move came as a complete surprise to the commentators and to the player, and it is still discussed to this day as an example of machine-unique knowledge. The vision of pursuing super-human knowledge is ultimately about human-centered AI, and a world where human agency and capability do not come second.

The authors present a methodology to mine novel concepts from AlphaZero, ensure that these concepts are teachable to humans, and verify the knowledge transfer by working with four chess grandmasters, among the best chess players in the world.

Another popular recent paper in this space is “Language Models Represent Space and Time” from MIT, where the authors study Llama 2 models on spatial (world, US, NYC) and temporal (historical figures, news, artwork) datasets and find that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). They further conclude that modern LLMs acquire structured knowledge about fundamental dimensions such as space and time, supporting the view that they learn not merely superficial statistics but literal world models.

Below are some highlights of the paper:

  • Train linear regression probes on the internal activations of the names of these places and events at each layer to predict their real-world location or time.
  • These representations are linear (R² does not improve with non-linear probes), robust to prompting, and unified across entities (cities and natural landmarks).
  • The probe learns the mapping from model coordinates to human interpretable coordinates.
  • What is probing? Probing fits a simple linear ridge regression model on the network activations to predict a target label associated with labeled input data; here, given an activation dataset A, the target Y contains either the time or the two-dimensional latitude/longitude coordinates (a minimal sketch follows this list).
  • The early layers (roughly the first 25%) appear responsible for recalling this information.
  • Adding non-linearity to the fitted probe does not improve the explained variance over the linear model, indicating an essentially linear relationship.
  • Robust to prompting strategies
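
To make the probing setup concrete, here is a minimal sketch of such a linear probe; the randomly generated activation matrix A, target matrix Y, and dimensions are placeholders for illustration, not the paper's actual data or code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Placeholder: A holds hidden activations for N place names at one layer (N x d_model),
# Y holds their true (latitude, longitude) pairs (N x 2).
N, d_model = 5000, 4096
A = np.random.randn(N, d_model)   # placeholder activations
Y = np.random.randn(N, 2)         # placeholder lat/long targets

A_train, A_test, Y_train, Y_test = train_test_split(A, Y, test_size=0.2, random_state=0)

# Linear ridge-regression probe: maps model coordinates to human-interpretable coordinates.
probe = Ridge(alpha=1.0)
probe.fit(A_train, Y_train)

# Held-out explained variance; a high R^2 from a purely linear probe is the kind of
# evidence used to argue for linear spatial/temporal representations.
print("R^2:", r2_score(Y_test, probe.predict(A_test)))
```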

In conclusion, the imperative to study and advance mechanistic interpretability is clear. By unpacking the ‘black box’ of AI, we ensure that the technology we increasingly rely on operates in a manner that is ethical, understandable, and beneficial. MI is an active area of research, studied extensively by OpenAI, Anthropic, Google and other teams building world models that serve as a backbone for future AGI solutions.

LLMs are evolving fast: a new model shows up on leaderboards[1] nearly every week, beating the previous SoTA on multiple NLP benchmarks. There are multiple architectures, with nuances in training and dataset generation. This post is an attempt to broadly categorize LLMs, paint a picture of the different ways to adopt them for use-cases, and discuss the pros and cons of each approach. Note that the distinction among these categories is not crystal-clear and the boundaries are blurred; for example, you could have a parametric LLM acting as an agent, or a parametric LLM trained in the instruct format.

Parametric Memory LLMs: Self-Reliant Titans of Information

Parametric memory LLMs are akin to colossal knowledge repositories, encapsulating a world of information within their intricate neural network structures. These self-reliant models, devoid of external memory dependencies, store and retrieve knowledge through fixed weights within the network. This trait enables them to scale seamlessly with increasing parameter counts, achieving state-of-the-art accuracies on many NLI/NLU tasks.

Prominent examples of these titan models include Google’s PaLM (Pathways Language Model) with a staggering 540B parameters, GPT-3 with 175B, and the comparatively lighter Chinchilla[2] at 70B. Though their monumental size poses challenges for inference, the AI community has responded with dexterity: we are seeing a promising trend of smaller yet powerful models that match their larger counterparts in performance, thanks to techniques like self-instruct and parameter-efficient training. Nonetheless, these models have their quirks, such as a propensity to ‘hallucinate’, i.e. fabricate compelling yet false facts, a challenge yet to be fully overcome.

👍 No dependency on external modules. Knowledge stored in the parameters benefits reasoning.

👎 Tendency to hallucinate, since knowledge is hidden in the parameters and there is no way to verify it.

👎 Updating the model with recent events and facts requires full pretraining, which is expensive.

Non-Parametric or External Memory LLMs: Harnessing External Memory for Freedom

In contrast to their parametric counterparts, non-parametric LLMs leverage external memory resources, liberating themselves from the constraints of internal memory. This approach allows the models to remain streamlined and current without constant retraining and gradient updates, a significant advantage that reduces hallucinations and yields more reliable outputs.

However, every innovation brings its own set of challenges; here, the added complexity of maintaining a supplementary retrieval model is the inevitable trade-off. Various paradigms are being explored to manage this complexity, including ‘frozen’ LLMs with plug-and-play KBs and the Retrieval Augmented Generation (RAG) approach.
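
To make the external-memory idea concrete, below is a minimal retrieval-augmented generation sketch in Python; the toy in-memory index, the embed() stub, and the generate() stub are illustrative assumptions, not any particular framework's API.

```python
import numpy as np

# Toy external memory: a few passages with precomputed embeddings.
# In practice this would be a vector DB populated by a real embedding model.
passages = [
    "The Chinchilla model has 70B parameters.",
    "PaLM was trained on Google's Pathways system.",
    "ONNX is an open format for representing ML models.",
]

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hash tokens into a fixed-size bag-of-words vector."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

index = np.stack([embed(p) for p in passages])

def retrieve(query: str, k: int = 2):
    """Return the k passages most similar to the query (cosine similarity)."""
    scores = index @ embed(query)
    return [passages[i] for i in np.argsort(-scores)[:k]]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (hosted or local)."""
    return f"<answer conditioned on: {prompt[:60]}...>"

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

print(rag_answer("How many parameters does Chinchilla have?"))
```

In a production system the toy index would be replaced by a vector DB and the stubs by real embedding and generation models; the decoupling of knowledge from the generator is the point of the pattern.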

👍 Reduced hallucination. Can be smaller LLMs, with knowledge outsourced to an external memory fetched at inference time.

👍 LLMs can preserve freshness without retraining, since knowledge is decoupled in the form of an external index.

👎 Makes the architecture more complex, as it needs retrieval from an external index.

LLM as Agents: A New Age of Reasoning and Control

The AI research landscape is abuzz with an emerging and exciting prospect—LLMs as autonomous agents proficient in planning and control. The concept of an LLM agent capable of breaking down complex tasks into component questions and actions has unleashed a world of possibilities.

Innovations such as ReACT’s[3] LLM agent, or the various interpretations presented by Toolformer, ReACT, WebGPT, and DSP, are illuminating the way forward. These trailblazers are setting the stage for LLMs that can emulate human-like reasoning, invoke complex tools like Python code interpreters or mathematical calculators, and align with human values for a more dependable and meaningful response.

👍 Human-like. Outsources things that are hard for an LLM to do, significantly reducing hallucinations: calculators for math, a Python interpreter or other models for puzzles.

👍 Improved reasoning gives the LLM the ability to accomplish tasks with super-human performance, pushing LLMs towards AGI.

👎 Needs all tools to be available as APIs. With a large number of tools, you can run into context-length limitations, forcing SFT, for which data can be expensive to collect.

Instruct Models: Charting New Courses with Human Instructions

Instruct models[4], though not a distinct architecture, are causing quite a stir. They represent a novel data paradigm that is revolutionizing our interaction with LLMs: by formulating tasks as human instructions and fine-tuning LLMs to heed these directives, we can create versatile models that generalize across tasks without explicit programming while staying aligned with human expectations and values, an exciting prospect for the future of AI.
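
To make the data paradigm concrete, here is a minimal sketch of what a single instruction-formatted training record might look like; the field names and the example task are illustrative assumptions, not the schema of any particular dataset.

```python
# A single instruction-formatted training example (hypothetical fields).
example = {
    "instruction": "Classify the sentiment of the review as positive or negative.",
    "input": "The battery died within a week and support never replied.",
    "output": "negative",
}

# At fine-tuning time the record is flattened into a prompt/target pair.
prompt = f"{example['instruction']}\n\n{example['input']}\n\nAnswer:"
target = example["output"]
print(prompt, target)
```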

Needless to say, LLMs are changing the reference architectures for search/recommendation engines, databases, and web and app development, with components like vector DBs, ensembles of LLMs, tools/APIs and prompt engines becoming central pieces in the software stack of a production system. We will see this accelerate in the next few months as new reference architectures evolve. I anticipate new areas emerging around latency optimization, cost-per-token optimization, and training and inference efficiency, topics typically treated as an afterthought, taking centre stage.

What are you working on? What LLM and software architectures are you considering? Feel free to share in the comments.

[1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
[2] https://arxiv.org/abs/2203.15556
[3] https://arxiv.org/abs/2210.03629
[4] https://arxiv.org/abs/2109.01652

Emergent abilities refer to the capabilities that arise spontaneously from the complex interactions of simpler components. They are properties that can’t be predicted solely based on the individual parts, but only become apparent when these parts start to interact as a whole. This fascinating concept has its roots in various fields such as biology, physics, sociology, and more recently, artificial intelligence (AI).

Defining Emergent Abilities

The term “emergent” is derived from the Latin word “emergere,” meaning “to rise out” or “to come forth.” Emergent abilities, therefore, are those capabilities that ‘come forth’ or ‘rise out’ from a system. They are not explicitly programmed or designed into the system but instead arise organically from the interactions of the system’s components.

To better understand this, imagine a flock of birds. Each bird follows simple rules: stay close to your neighbors, avoid collisions, and move in the same direction. Yet, when these individual behaviors combine, they create a mesmerizing, dynamic pattern known as murmuration. This collective behavior is an emergent property of the system, as it cannot be predicted or explained by examining a single bird’s actions.

In the context of LLMs (Large Language Models), emergent abilities refer to the unexpected behaviors that models exhibit when they interact with their environment or are trained on large amounts of data. For instance, GPT-3, a language model developed by OpenAI, has shown an impressive ability to generate human-like text that was not explicitly programmed into it; this ability ‘emerges’ from the model’s training on a diverse range of internet text. Reasoning, low-resource language translation and math problem solving are considered emergent properties of LLMs. Jason Wei maintains an excellent blog listing 137 emergent abilities in LLMs. Mathematical abilities have been observed to be essentially non-existent in models up to 13B parameters; similarly, reasoning abilities are known to appear in 100B+ parameter models. If you are comfortable reading scientific papers, I highly recommend giving the emergent-abilities paper (from the same author, Jason Wei of Google) a read.

How do emergent abilities like reasoning arise?

This is an important question for making a step change in deploying LLMs in production-grade systems at scale. Today, reasoning abilities are seen at 100B+ parameters, which works for most offline scenarios that are not latency-sensitive. The billion-dollar question is: can we get emergent abilities like reasoning in much smaller models (7B, 13B) that are more practical to deploy in online, latency-sensitive use-cases?

Emergent abilities are generally attributed to scale, i.e. bigger datasets and more parameters. However, a recent line of research finds empirical evidence linking training on coding tasks to LLMs’ reasoning ability. Yao Fu argues that coding involves decomposing complex problems into objects (as in OOP), which roughly maps to high-level problem solving, and that code requires keeping track of state (e.g. opening and closing braces), which forces attention over longer context windows; together these may give the model the ability to reason about the world. As of this writing, most of the research and developer community has treated coding LLMs as separate from general-purpose LLMs. If this insight holds true, we could build smaller LLMs with reasoning abilities similar to the GPT-3s and GPT-4s of the world.

Nevertheless, studying the origins of emergent abilities is an active area of NLP/LLM research. Unlocking this insight has major repercussions for how we adopt LLMs in production-grade systems (by analogy to physics, much like the importance of solving room-temperature superconductivity).

I would like to conclude by saying that while reasoning is not the sole emergent ability, it is believed to be one of the key abilities that will unlock AGI in LLMs. Feel free to share your thoughts and comments on this topic.

The growing capabilities of Large Language Models (LLMs) have led to an increasing demand for their integration into production systems. However, implementing LLMs with low-latency requirements like Ads and Search (think bidding models, relevance engine, CTR prediction, etc.) can be challenging due to their massive size and complexity. In this blog post, we will present a strategy for deploying LLMs in latency-sensitive systems by using a combination of knowledge distillation, model compression, and optimized deployment techniques.

The traditional separation of teams into software engineering (SWE) and modeling (ML) presents challenges in the world of large language models (LLMs), as it leads to complexity in deployment. Some organizations have addressed this by introducing MLE (Machine Learning Engineering) roles to bridge scientists and engineers. We also discuss a mental model for meaningfully innovating and making progress without excessively overstepping the boundaries of individual roles.

The definition of “large” in Large Language Models (LLMs) has evolved over time, with GPT-1 models having 117M parameters, GPT-2 with 1.5B, and the recent GPT-3 model boasting 175B parameters. The increase in size has been driven by the improved state-of-the-art accuracies achieved on various NLP benchmarks as models scale up. Furthermore, there are significant differences in architectures, such as parametric models, non-parametric models with external knowledge bases or retrieval augmentation, and instruction-tuned models. The larger parameter size has enabled LLMs to act as world models capable of reasoning and planning, which has led to substantial progress in solving complex problems. However, these gains observed in offline experiments can be challenging to transfer to practical applications with tight latency constraints. The following sections will address the strategy for transitioning LLM gains from experimentation to production, with broad applicability to any class of LLMs.

Let’s take the example of a company interested in building CTR prediction models for Ad bidding using LLMs.

  1. Engineers/MLEs work backwards from the use-case to determine the acceptable latency for the system, leading to a particular choice of parameter size, a suitable architecture (e.g. T5, BART, GPT) and input/output sequence lengths. Let’s call this the student model.
  2. In parallel, the science/ML team iterates on LLMs with different architectures (RAG, parametric, LLM agents, Instruct LLMs, etc.) to find one or more approaches that provide good CTR prediction accuracy. Let’s call these the teacher models, as they are expected to be superior CTR predictors compared to directly training the smaller-parameter student LLM chosen for run-time serving.

Following are some popular techniques used to productionize LLMs:

  • Knowledge distillation [Gou et al.] [Hinton et al.] is a technique in which a smaller, runtime-ready LLM (the “student”) is trained to mimic the behavior of a larger LLM (the “teacher”). This is achieved by using the larger LLM to generate training data for the smaller student LLM to fine-tune on. To control the quality of the machine-generated training data, we can use model confidence, among other domain constraints, to selectively sample the data generated by the teacher. The effectiveness of distillation is measured by distillation efficiency, the ratio of student model accuracy to teacher model accuracy.
    There are variants of knowledge distillation that jointly train teacher and student with a loss that combines the cross-entropy between the student’s output and the ground truth with the KL divergence between the student’s and teacher’s probability distributions for the same input (a sketch of this combined loss follows this list). However, the former method is preferred for its simplicity and for being a generic framework applicable to multiple teacher models.
  • Model Compression [Dettmers et al.] – Model compression utilizes a collection of techniques, including pruning and quantization, to reduce model size and complexity. Pruning typically involves removing encoder/decoder layers or blocks and decreasing the number of hidden units per layer, thus lowering the model’s parameter count. Quantization, on the other hand, involves representing weights and activations with fewer bits, such as converting weights from FP32 to lower precision formats like FP16 or even INT8, which decreases computational overhead during inference. For example, a BART large model with 12 encoder and 12 decoder layers, totaling 400M parameters, can be compressed by removing 4 encoder and 4 decoder blocks, resulting in a sub-100M parameter model more suitable for real-time inference
  • Model Deployment – Optimizing deployment involves selecting open formats such as ONNX to benefit from hardware acceleration and runtime optimization. ONNX, or Open Neural Network Exchange, is a widely-used open-source standard for representing deep learning models, which supports cross-platform deployment (including desktops, servers, and mobile devices) while abstracting hardware optimization from the platform. Models in ONNX format are interoperable between various deep learning frameworks, such as PyTorch and TensorFlow.
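
For the joint-training variant of distillation mentioned above, the combined objective can be sketched roughly as follows in PyTorch; the shapes and hyperparameters (temperature, alpha) are illustrative assumptions, not a production recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Weighted sum of (i) cross-entropy against ground truth and
    (ii) KL divergence between softened student and teacher distributions."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1 - alpha) * kl

# Toy shapes: batch of 8 examples, 2 classes (e.g. click / no-click for CTR).
student_logits = torch.randn(8, 2, requires_grad=True)
teacher_logits = torch.randn(8, 2)   # produced by the frozen teacher in practice
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```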

In addition to these, there are considerations on hosting infrastructure i.e whether to use GPU or CPU and appropriate choice of hardware that is beyond the scope of this post. As we conclude this blog post, we’d love to hear your thoughts, suggestions, or experiences related to deploying Large Language Models in production. Have you encountered any unique challenges or discovered innovative solutions during your own journey? Please feel free to share your insights in the comments section below. Your input can greatly benefit the community as we continue to explore the vast potential of LLMs in real-world applications.

Large Language Models (LLMs) have caused a paradigm shift in the way NLP (anything to do with text) is traditionally done. I have been working in this domain for 10+ years and have seen n-gram approaches to processing text replaced by word vectors or word embeddings (word2vec), and now transformer-based models like BERT, GPT and T5 have taken this to a whole new level. Around 2022, I started talking about this nascent technology, giving talks to groups small and big to share my excitement about emerging foundational LLM technology that works strikingly similarly to how humans grow and learn new concepts. Now it is 2023 and my excitement has multiplied 10X, to say the least.

I believe we have reached a critical juncture, a turning point, in the development of AGI, and we are living in exciting times. The next Google or Facebook of the decade ahead will be built by individuals and companies (small and large) taking baby steps today towards using these LLMs in innovative ways to build products and features. In terms of ML/AI advancements and the speed at which they are progressing, it seems that a single day in 2023 is equivalent to a month in 2020, and a year in 2010.

I still remember reading the book The Master Algorithm by Pedro Domingos back in 2018. Pedro talks about a master algorithm: a single AI model to rule all tasks. Today we have regression models for predicting continuous quantities and different classes of models for discrete or categorical data, while complex data types like image, text and audio have their own algorithms: audio involves Fourier decomposition, image data requires filter banks, and so on. Alternatively, these complex data types are often collapsed into low-dimensional representations that can be used with traditional learning algorithms. My initial reaction to the book in 2018 was, “As an insider, well, this is too far from where we are and unlikely to happen within a decade!” Fast forward to 2021: I was amazed when Google released T5, showing how it could do varied tasks like regression, classification, reasoning and language generation purely as text-to-text. What Pedro talked about in his book was starting to happen!

Since then, generative LLMs, like the most recent GPT-4 model by OpenAI, have grown in popularity, showing human-level performance on various tasks and even superhuman performance on some tasks.

LLMs like GPT-4 are seen to have emergent properties that are absent in smaller models of similar architecture. Studies have shown how GPT-4 can learn to translate low-resource languages unseen during training from just a few examples given as in-context learning. Similarly, these models have shown the ability to write complex code from natural language instructions and to plan and reason about tasks outside their data regime. Jason Wei maintains an excellent repository of emergent behaviors in LLMs like GPT and Chinchilla.

LLMs like GPT-3/4 are capable of complex reasoning, which has opened up avenues for RL-like sequential decision making with LLMs as agents; this is more efficient than training agents on large-scale datasets of simulated events and rewards as in traditional RL. It has led to the development of role-playing LLM agents that continuously self-reflect and refine, giving them autonomous capabilities. They can hold very human-like conversations and engage in constructive debate on ideas and problems.

On the engineering side, open-source tools like LangChain are building abstractions that make SOTA research on LLMs and agents instantly available to developers and scientists, and we have seen innovative applications like Auto-GPT and BabyAGI that are making AI beneficial for the masses. On a related note, LLMs are getting increasingly interdisciplinary, with connections to psychology involving the “Theory of Mind”.

While there are many open scientific questions about how LLMs accomplish their tasks, I feel the following topics are important to understand:

  • Can we separate “language understanding” from “knowledge”? Today’s LLMs like GPT and PaLM do not distinguish the two, and to some extent the distinction is hard and ambiguous. If we can solve this, we can build much smaller LLMs that are good at language understanding and give them plug-and-play KBs to reason over effectively. Without this distinction, we have seen that simply shrinking a 175B model to 7B or 11B makes the emergent properties disappear.
  • How can we leverage LLMs like GPT or Chinchilla in latency-sensitive applications like Ads and Search in a cost-effective manner?
  • Does “data” beat model “size”? Studies of scaling laws attempt to find the sweet spot between data, size and compute.

We are surely living in interesting times, motivated by all the progress and the tireless follow-up needed to stay relevant in LLMs. I will be starting an LLM series with a plan to publish interesting topics: research, applications and trends to watch closely. This will include reviews of promising papers, tools, research, applications and ideas. If you have not already, consider subscribing to the NLP 2.0 blog, and let me know in the comments what topics you would like to hear about.


Machine Learning – Linear Regression using gradient descent vs. neural network

Machine learning, or more specifically supervised learning, broadly encompasses two classes of problems: regression and classification. Regression deals with predicting a continuous variable, while classification deals with categorical prediction. For example, predicting a house price is a regression problem, since the output can be any real number, while email spam detection is a classification problem, as the outcome is a special kind of categorical variable, i.e. binary (spam 1 / non-spam 0). Numerous algorithms built over the last few decades fall into one of these two classes. The focus of this post is regression techniques; I will reserve classification techniques for another post. Feel free to subscribe/bookmark this page for upcoming posts.

  • Introduction – Regression
  • Algorithms
    • Ordinary Least Squares (OLS)
    • Gradient Descent
    • Neural Networks
  • Comparison of Algorithms
  • Conclusions & Inferences

Regression techniques help us understand the relationship of a continuous variable as a function of one or more independent (aka explanatory) variables. It can be considered a curve-fitting problem where we are interested in learning a function y = f(X), with X = (x1, x2, ..., xn), that best fits y. Now, how do we quantify “best fit”? Most techniques use some measure of error, such as SSE (sum of squared errors), also called the cost function or objective function, to quantify the quality of fit. We want to learn the parameters that give the least error, so formally we can define this as an optimization problem: minimize the SSE over the parameters of f(X).
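
Written out explicitly, the optimization described above (with theta denoting the parameters of f, a notation introduced here for clarity) is:

```latex
\min_{\theta} \; \mathrm{SSE}(\theta) = \sum_{i=1}^{N} \left( y_i - f(X_i;\, \theta) \right)^2
```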

Linear vs. Non-Linear: when y = f(x) takes the form y = a0 + a1*x1 + a2*x2 + ... + an*xn, we call it linear regression: we are fitting a straight line (hyperplane) to the data. In non-linear regression, we learn a y = f(x) of more complex forms involving logs, exponents or higher-order polynomials of the independent variables, for example y = a0 + a1*log(x1) + a2*e^x + a3*x^3.

Single vs. Multiple Regression: when y = f(x) has more than one explanatory variable, it is called multiple regression.

Let’s take a simple linear regression problem (dataset), apply a couple of algorithms, and compare accuracy, complexity and run times. We will conclude this post with an in-depth understanding of the techniques. The following three techniques have been explored in detail using R libraries. Feel free to download the notebook from my GitHub and try running it yourself, playing with the parameters and plots. The notebooks are also available in HTML so you can explore the charts and documentation. A high-level summary is presented below; I highly recommend checking the notebook.

Link to jupyter notebook (code, charts & data) here

  1. Ordinary Least Squares
    A closed-form solution that uses matrix multiplication and inversion to find the parameters that minimize the error (SSE). This analytic solution always finds the minimum of the error. Looking at the (x, y) plot, we attempt to fit a non-linear model using linear regression by exploiting variable-transformation techniques.
  2. Gradient Descent
    An open-form solution that uses mathematical optimization: parameters are initialized randomly and iteratively adjusted along the gradient that reduces the SSE. We start with a learning rate of 0.01 and iterate 120,000 times (see the sketch after this list).
  3. Neural Networks
    Now generally popular among AI (Artificial Intelligence) practitioners, neural networks loosely mimic the working of the human brain, modelling hidden layers with weights propagating across layers between input and output nodes. Internally, they use gradient descent with back-propagation to learn the weights. We train a network with 3 hidden layers of 3 nodes each; the input and output are single nodes, as we have a single explanatory variable and a single response.
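
As a rough illustration of approaches 1 and 2 (the linked notebooks use R; this Python sketch on made-up data is only illustrative):

```python
import numpy as np

# Synthetic quadratic data, mirroring the x^2 feature transformation used in the notebook.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 0.5 + 1.5 * x + 2.0 * x**2 + rng.normal(0, 0.05, 200)
X = np.column_stack([np.ones_like(x), x, x**2])   # design matrix [1, x, x^2]

# 1. Ordinary Least Squares: closed-form normal equations.
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Gradient descent: iteratively step against the gradient of the (mean) squared error.
theta = np.zeros(3)
lr = 0.01
for _ in range(120_000):                           # same iteration budget as the notebook
    grad = 2 * X.T @ (X @ theta - y) / len(y)      # gradient of SSE / n
    theta -= lr * grad

rmse = lambda t: np.sqrt(np.mean((X @ t - y) ** 2))
print("OLS params:", theta_ols, "RMSE:", rmse(theta_ols))
print("GD  params:", theta, "RMSE:", rmse(theta))
```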

[Figure: neural network visualization (nnet_vis_1)]

[Figure: fitted models (models_1)]

Let’s compare which of the models best fits the data. Please note, I am not using a train/test split here, as the emphasis is on the techniques; so we are talking about training error. Let’s use RMSE (root mean squared error) to compare the three models:

Ordinary Least Squares: RMSE = 0.000534
Gradient Descent: RMSE = 0.000984
Neural Network: RMSE = 0.000278

Looking at the RMSE, the neural network seems to do a great job of learning the data. NNs are known to have good memorization capability, which causes overfitting and leads to a high-variance system. Also note that the neural network did not involve any feature engineering: we just passed in x, unlike the other methods where we added x^2 as a feature. NNs are known to be great at identifying latent features without the explicit feature engineering usually done by domain experts of the respective problem space.

The comparison below explains more about each of the three techniques, how they differ, and when to use what.

  • Solution type: OLS is a closed-form analytic solution; gradient descent and neural networks are open-form, iterative optimization.
  • Speed: OLS is slow as it involves matrix inverse computation; gradient descent is fast for large datasets; neural networks are slow to train, usually done on GPUs which handle the matrix computations easily.
  • Hyperparameters: OLS has none; gradient descent has epochs (iterations) and the learning rate alpha; neural networks have hidden layers, nodes, activation functions and the learning rate.
  • Feature engineering: required for OLS and gradient descent; little or none needed for neural networks.
  • Data: more data is good for all three, but neural networks require a lot of data for good generalization.
  • Interpretability: OLS and gradient descent offer good interpretability and the findings are easy to communicate; neural networks are black boxes and suffer from poor interpretability.
  • Typical use: OLS is most commonly used for offline models on small datasets; gradient descent for large-scale learning involving thousands of features; neural networks in text/vision/speech, mostly via CNN and RNN architectures.
  • Tools: OLS (R, SAS, Python); gradient descent (Python, R); neural networks (Deeplearning4j, Theano, Python, R, TensorFlow).
  • Local minima: gradient descent and neural networks can get stuck in local minima due to bad initialization or learning rate.

What we did not talk about, but is important in this context:

  • Cross Validation
  • Feature engineering and reduction (PCA)
  • Hyperparameter tuning
  • Sampling
  • Objective Functions
  • Regularization L1/L2
  • Interaction Effects

That is a lot of content to digest; feel free to share any feedback or comments about any part of this post, I would love to chat. I will be back with a similar post on classification in the coming days, comparing logistic regression, decision trees, random forests (ensembles) and neural networks.

We talk about big data quite often these days, so I wanted to lay out some fundamentals about data. Do you know the singular form of data? How does data differ from information and knowledge? How do insights convert to actions? Here is my attempt at answering some of these questions.
Data is often raw: binary represented as a number, a character or a string. Data is the plural of datum. Information is anything that puts context around data. For example, the number 89 by itself doesn’t mean anything until we add context and say the car’s speed is 89 kmph. Knowledge, on the other hand, is about knowing how things around us work and how the larger world is interconnected; it is obtained through experience, experiments, research, etc. In the infographic below I have tried to explain these terms using a simplified version of the intelligence that can be embedded in cars. More on the infographic follows.
[Infographic: Data 2 Actions]
I have taken a very simplified view here. Now just imagine: when we talk about intelligent cars, we are usually talking about hundreds of such parameters instead of a single variable (the speed of the car), all collected from multiple sensors in a real-time streaming format to make such decisions. Knowledge is obtained through predictive algorithms that continuously learn (in AI terms, “adapt”) from data and help make recommendations about the safety of the vehicle. Now imagine these hundreds of parameters collected every millisecond from thousands of connected cars around you: this is what forms “Big Data”.
Hope you liked this post. I will be writing more articles, as I have been getting requests from friends around the globe. Stay tuned!

Clustering is an unsupervised classification (learning) technique where the objective is to maximize inter-cluster distance while minimizing intra-cluster distance. By unsupervised, we mean clustering or segmenting data based on all available attributes, with no class information available; a supervised classification, on the other hand, uses class information.
As usual, before we jump into the ‘how’, let’s answer the ‘why’. Clustering is applied to a variety of problems, from biological systems to exploratory data analysis (as a pre-processing technique). Many predictive analytics algorithms use clustering as one of their components. It is used by all major brands in CRM to understand their customers better. Another use of clustering is outlier detection or fraudulent transaction identification. If you have heard of www.similarsites.com, it works extensively on clustering algorithms: sites are segmented/clustered based on attributes like category of domain, number of users, traffic, content type, corporate or personal, blog, image blog, video blog, etc. For example, if you entered InMobi, you would get a list of companies in the same space, mainly its competitors: Mojiva, Millennial Media, AdMob, Quattro, Mobclix, etc. If you are looking for an image hosting site and want to know the alternatives, this is helpful.

We talk about similarity in terms of distance measures like:
  (i) Euclidean distance
  (ii) Manhattan distance
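
As a quick illustration of these two measures on a pair of toy feature vectors (not tied to any particular clustering library):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0])   # feature vector of point A
b = np.array([4.0, 1.0, 5.0])   # feature vector of point B

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute coordinate differences

print(euclidean)  # sqrt(4 + 4 + 0) ≈ 2.83
print(manhattan)  # 2 + 2 + 0 = 4.0
```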


Turning raw data into insights often involves integrating data from multiple disparate sources (not just structured ones), analyzing the data, visualizing it, and socializing the results with the broader audience to whom they matter. In this cycle of turning data into insights, visualization plays a vital role, and hence it is the topic of this post. Visualization can aid the analysis of huge datasets by surfacing patterns that are easy to interpret visually but hard to see in a tabular layout of numbers. Second, visualization can represent the numbers in visuals that are easy for everyone to read and understand: insights that might take three or four minutes to convey with text or a table of numbers can be grasped in a minute or two. This is an important factor, especially when you are delivering the findings to a company’s CEO/CFO/CXO/CIO, as they often have limited time.

[Figure: London Cholera Outbreak visualized]

Going back to the history of visualization: the most famous early example of mapping epidemiological data was Dr. John Snow’s map of deaths from the 1854 cholera outbreak in London, plotted in relation to the locations of public water pumps. The original (high-res PDF copies from UCLA) spawned many imitators, including a simplified version by Gilbert in 1958. Tufte (1983, p. 24) says, “Snow observed that cholera occurred almost entirely among those who lived near (and drank from) the Broad Street water pump. He had the handle of the contaminated pump removed, ending the neighborhood epidemic which had taken more than 500 lives.”

For the last 6 months, I have been closely following trends in information management. Below are a few of my observations.

  • Data source explosion: Business problems are gaining complexity day by day, hence there is a huge demand for analyzing data from a multitude of sources to help companies frame strategies for growth. GPS data accumulated by telecom companies offers insight into a customer’s current location and enables context-aware recommendations; in fact, some telecom companies have introduced location-based pricing. Sensor data helps identify security threats to secure networks. Social network data has opened up as a channel for marketing services and products. Analysis of such closely knit data leads to behavioral and contextual targeting. Traditional data analysis tools and algorithms fail to perform efficiently because such data is huge and needs newer data structures for efficient analysis.
  • Databases going beyond relational are gaining popularity: NoSQL DBs and graph/tree/XML-based databases.
  • Open-source tools continue to emerge (R, RapidMiner, Weka).
  • Growing need for massive dataset analysis.
  • Artificial Intelligence (AI) and NLP gaining popularity among data analysts (in addition to ML techniques).
  • Multimedia analytics: the need to gather critical metrics like customer footfall and to quantify customer satisfaction from facial expressions. These applications demand high-end signal processing (both image and video); there is a lot of scope for innovation in this area.
  • Privacy-preserving techniques for data analysis, which in turn encourage companies to outsource some of their critical data analysis to third parties.
  • Agile methodologies for analytics projects, to cope with rapidly changing customer/business needs.
  • Bio-inspiration/bio-imitation: learning from nature and natural processes to develop analogous techniques that can solve real-world problems. Classic examples are neural networks inspired by the working of the human brain, path optimization from ant colonies, and the roughly 280-degree vision of the honey bee.
  • More and more data being made publicly available.
  • Real-time data integration, insight generation and business decisions.
  • Complex visualization through new technologies like Adobe Flex and MS Silverlight, which are known for generating RIAs (Rich Internet Applications).

And I am sure these are just a few items on the list, by no means exhaustive. Feel free to share your comments.