TensorFlow Speech Recognition Challenge


I. Definition

Project Overview

This project is closely related to automatic speech recognition (ASR), except that instead of recognizing continuous speech made up of many different words and sentences, we recognize a small set of words and label everything else as unknown or silence. The training and test data consist of wave files about one second long, each containing a spoken word, noise, or just silence.

The datasets used to train this project are provided by Google as part of a Kaggle speech recognition challenge.

In an age when digital products increasingly communicate with their customers through speech rather than typing, ASR and NLP have become interesting and important fields of study. You can find a brief history of research in the field of ASR in this link.

Problem Statement

The problem is to classify all the sound wave files into twelve broad categories of [‘down’, ‘go’, ‘left’, ‘no’, ‘off’, ‘on’, ‘right’, ‘stop’, ‘up’, ‘yes’, ‘unknown’, ‘silence’]

To achieve this,

  1. We first classify whether a wave file contains any voice or is just a silent file.
  2. For all non-silent files, we build a model that classifies whether the sound in the wave file is one of the words listed above or some unknown word or sound.
  3. The final model should predict all 12 categories as accurately as possible on a large test set.


Kaggle provides a test set of nearly 150,000 wave files, for which I will predict the labels and create a submission file. Kaggle reports the accuracy of my predictions based on that submission file.

Apart from this, we split the training data into training and validation subsets and check the model’s performance on both using training and validation accuracy.

Since this is a classification problem with a small set of possible labels, accuracy is a good metric for the model: we only care whether each prediction is exactly right or wrong, and all classes are equally important.

II. Analysis

Data Exploration

Our training set contains .wav files covering 30 class labels, where each class label is a word. Each word has around 2,000 training .wav files on average, which is enough to train a good model. Out of approximately 64,000 .wav files, nearly 4,000 were detected by the VAD (Voice Activity Detector) as silence or noise, and we excluded these from the training set. A few word pairs in the dataset, such as (go, no), are pronounced very similarly, and their predictions compete closely with each other. Each wave has some silence or noise padded before and after the word, so we used the VAD to strip those frames. An example of such a wave is shown in the Exploratory Visualization section.


Initially we assumed that for test-set words outside these 30 words (of which 10 must be recognized as is and the other 20 labelled unknown), the model would probably not return a probability above the threshold for any label, so we could treat such words as unknown. However, since a suitable threshold is hard to find, if one exists at all, we introduced a new label called “unknown unknowns” for words outside these 30. It is probably hard for the model to learn a single class that aggregates all other words with their varied features. We also needed to prepare data with random words for this unknown unknowns class.

The procedure we followed to create data for the unknown unknowns class is:

  1. Take a wave file from the existing training set and find the index of the maximum amplitude among its samples.
  2. Take the wave from the beginning up to that maximum sample and store it as the first half of the final wave.
  3. Take another random wave and take the samples from its maximum sample to the end as the second half of the final wave.
  4. Merge the first half and the second half to get a new wave with some random sound, which might not even be an actual word.
  5. Use this data for the unknown unknowns class.
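The splicing procedure for the unknown unknowns class can be sketched in a few lines. This is a minimal illustration, not the original code: it assumes each wave is already loaded as a sequence of samples, takes the "maximum amplitude" to mean the largest absolute sample value, and uses hypothetical function names.

```python
import random


def peak_index(samples):
    """Index of the sample with the largest absolute amplitude (an assumption)."""
    return max(range(len(samples)), key=lambda i: abs(samples[i]))


def splice_unknown(wave_a, wave_b):
    """Join wave_a up to its peak with wave_b from its peak to the end."""
    first_half = list(wave_a[:peak_index(wave_a)])
    second_half = list(wave_b[peak_index(wave_b):])
    return first_half + second_half


def make_unknown_unknowns(waves, count, seed=0):
    """Build `count` synthetic waves from random pairs of training waves."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(count):
        wave_a, wave_b = rng.sample(waves, 2)  # two distinct source waves
        synthetic.append(splice_unknown(wave_a, wave_b))
    return synthetic


example = splice_unknown([0, 1, 5, 2, 0], [0, -7, 3, 1])
print(example)
```

Because the two halves come from different words, the result usually does not sound like any of the 30 training words, which is the point of the class.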


Exploratory Visualization

Plot of raw wave and its spectrogram:


Plot of sample MFCC for a wave file:



We plotted the mean FFT and spectrogram of the different words to be classified, to decide which features to pick for classification.


We then drew violin plots to examine the distribution of frequencies for each word label in the training set.


Algorithms and Techniques

Since speech is a time series in which sentences are sequences of words and words are sequences of phonemes, we chose long short-term memory (LSTM) networks, which can retain a memory of recent phonemes and use it to predict the next phoneme and, subsequently, the word.

Although we could have tackled this problem with a convolutional neural network, since it is restricted to single words, an LSTM makes more sense if we eventually need to scale to sentences. RNNs are good for sequential data, but when long-term dependencies matter it is better to use LSTMs than vanilla RNNs, since LSTMs retain information over comparatively longer periods.


A standard LSTM cell consists of:

  1. A forget gate layer, which decides what to keep from the previous cell state and what to forget.
  2. An input gate layer, which decides which parts of the input are added to the cell state.
  3. An output gate layer, which decides which parts of the cell state are given as output.

A slightly modified version of the LSTM is the gated recurrent unit (GRU).
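To make the role of the three gates concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the standard textbook formulation, which also includes a tanh candidate layer that proposes new cell content; the weight layout, gate ordering and sizes are assumptions chosen for illustration and do not reproduce how Keras stores its weights.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * units, input_dim), U: (4 * units, units), b: (4 * units,).
    Rows are ordered [forget, input, candidate, output]; this ordering is
    an arbitrary choice for the sketch.
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    forget = sigmoid(z[0:units])                 # what to keep from c_prev
    inp = sigmoid(z[units:2 * units])            # how much new content to add
    candidate = np.tanh(z[2 * units:3 * units])  # proposed new content
    output = sigmoid(z[3 * units:4 * units])     # what part of the cell to expose
    c = forget * c_prev + inp * candidate
    h = output * np.tanh(c)
    return h, c


# Run a random 16-frame sequence of 26 features through the cell.
rng = np.random.default_rng(0)
input_dim, units = 26, 8
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

final_h = np.zeros(units)
final_c = np.zeros(units)
for frame in rng.normal(size=(16, input_dim)):
    final_h, final_c = lstm_step(frame, final_h, final_c, W, U, b)
print(final_h.shape)
```

The hidden state after the last frame plays the role of the vector that a final LSTM layer hands to the dense output layer.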



We trained the model with binary cross-entropy as the loss and categorical accuracy as the metric, since predicting our labels is a multi-class classification problem. The labels are one-hot encoded before being passed to the model for training.

The model outputs a probability for each label. We pick the label with the maximum probability, provided it exceeds a threshold value; if no label’s probability exceeds the threshold, we predict the word as unknown.

For silence labels, we pass the wave through the VAD before passing it to the model for prediction, and label it accordingly.
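The decision rule described above can be sketched as follows. This is an illustrative sketch only: the function name, the default threshold and the `is_silent` flag (standing in for the VAD result) are assumptions, and mapping any non-target word to unknown follows the problem statement rather than the original code.

```python
TARGET_WORDS = {"down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"}


def predict_label(probabilities, labels, threshold=0.1, is_silent=False):
    """Turn per-label probabilities into one of the 12 competition labels."""
    if is_silent:
        # The VAD found no voice, so the model output is ignored.
        return "silence"
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] <= threshold:
        # No label is more probable than the threshold.
        return "unknown"
    word = labels[best]
    # Words outside the ten target words are reported as unknown.
    return word if word in TARGET_WORDS else "unknown"


labels = ["yes", "no", "bed"]
print(predict_label([0.6, 0.3, 0.1], labels, threshold=0.5))
print(predict_label([0.3, 0.3, 0.4], labels, threshold=0.5))
```

Lowering the threshold makes the rule accept less confident predictions instead of falling back to unknown.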


The model takes an input NumPy array of shape (16, 26), which is fed to an input LSTM layer; 16 is the number of frames in the sequence and 26 is the number of MFCC and delta MFCC features per frame.

I picked the results of my first trained model as the benchmark. It had 12 labels, with one LSTM input layer, one LSTM hidden layer, and a dense layer at the output. The accuracy of that model on the test set was 0.62.


Test accuracy: 0.62


Layer (type)                 Output Shape              Param #
lstm_1 (LSTM)                (None, 16, 39)            8268
lstm_2 (LSTM)                (None, 26)                6864
dense_1 (Dense)              (None, 12)                810

Total params: 15,942
Trainable params: 15,942
Non-trainable params: 0
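The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard LSTM layout of four gate blocks, each with input weights, recurrent weights and a bias; the helper names are illustrative. Under that assumption, the 8,268 parameters of the first layer correspond to 13 input features per frame, and the 810 parameters of the dense layer correspond to 30 output units (12 units would give 324), so the output shape printed for dense_1 may have been taken from a different run than the count.

```python
def lstm_params(input_dim, units):
    """Four gate blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


first_lstm = lstm_params(input_dim=13, units=39)
second_lstm = lstm_params(input_dim=39, units=26)
dense_30 = dense_params(input_dim=26, units=30)
dense_12 = dense_params(input_dim=26, units=12)

print(first_lstm, second_lstm, dense_30, dense_12)
print(first_lstm + second_lstm + dense_30)
```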


III. Methodology

Data Preprocessing

After considering spectrogram, FFT, and MFCC features for training the model, we chose features based on the mel scale, which is inspired by how the human ear processes speech. We therefore built MFCC features for each wave after stripping silence with voice activity detection.


We used the Librosa library to build MFCC features from the raw sound wave. This involves:


  1. Splitting the wave into smaller frames.
  2. Finding the power spectrum of each frame.
  3. Applying the mel filter bank to the spectra and summing the power inside each filter.
  4. Taking the logarithm of the filter bank energies.
  5. Taking the DCT of the log energies and keeping a few important coefficients.


These coefficients are called mel-frequency cepstral coefficients (MFCCs) and are widely used in automatic speech recognition systems.


Later, since speech information can also be present in the dynamics across spectral frames rather than just in the spectral envelope of each frame, we added delta MFCC features as well.


Stacking the MFCC and delta MFCC features for each frame, across all 16 frames of the wave, gives features of shape (16, 26).
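The stacking step can be illustrated with NumPy. This sketch assumes 13 MFCCs per frame and approximates the delta features with simple frame-to-frame differences; Librosa provides its own delta routine, whose smoothing differs from this simplification, and the function names here are hypothetical.

```python
import numpy as np


def simple_delta(features):
    """Approximate time derivative along the frame axis (axis 0).

    np.gradient uses central differences in the interior and one-sided
    differences at the first and last frame.
    """
    return np.gradient(features, axis=0)


def stack_mfcc_and_delta(mfcc):
    """Concatenate MFCCs and their deltas frame by frame."""
    return np.concatenate([mfcc, simple_delta(mfcc)], axis=1)


mfcc = np.random.default_rng(0).normal(size=(16, 13))  # 16 frames, 13 coefficients
features = stack_mfcc_and_delta(mfcc)
print(features.shape)
```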


Machine Learning Pipeline:

  1. Pass each wave file through the VAD (Voice Activity Detector) to filter out and label all silence files.
  2. Pass the remaining wave files through the chosen feature extractor, MFCCs in this case.
  3. Pass the features through a stacked LSTM model, which can detect patterns across a long range of phonemes/frames in the wave file and outputs 32 labels.
  4. Train the model with different parameters for enough epochs to converge. Each epoch took around 30 minutes on my CPU, and the model took around 15 to 20 epochs to converge, so every training experiment was very time consuming.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_9 (LSTM)                (None, 16, 52)            16432
_________________________________________________________________
dropout_9 (Dropout)          (None, 16, 52)            0
_________________________________________________________________
lstm_10 (LSTM)               (None, 16, 45)            17640
_________________________________________________________________
dropout_10 (Dropout)         (None, 16, 45)            0
_________________________________________________________________
lstm_11 (LSTM)               (None, 45)                16380
_________________________________________________________________
dense_5 (Dense)              (None, 45)                2070
_________________________________________________________________
dropout_11 (Dropout)         (None, 45)                0
_________________________________________________________________
dense_6 (Dense)              (None, 30)                1380
=================================================================
Total params: 53,902
Trainable params: 53,902
Non-trainable params: 0
_________________________________________________________________



The initial solution had a model with 11 labels, two LSTM layers, and one dense output layer. It used only MFCC features; delta MFCC features were added later. All words other than the ten words to be predicted were grouped under one label.


Accuracy on test set : 0.63


Layer (type)                 Output Shape              Param #
lstm_1 (LSTM)                (None, 16, 39)            8268
lstm_2 (LSTM)                (None, 26)                6864
dense_1 (Dense)              (None, 11)                810

Total params: 15,942
Trainable params: 15,942
Non-trainable params: 0



We then added delta features and dropout layers for regularization to avoid overfitting, and changed the number of labels in the output dense layer to 31. This increased the number of trainable parameters to approximately 53k and significantly increased the training time per epoch.

Initially we worked with a threshold value of 0.5, which skipped many words whose maximum probability was around 0.25 to 0.3. After reducing the threshold to 0.1, the score crossed 0.7 on the leaderboard test set.


Later, I split the background noise wave files into multiple one-second waves and used them as additional training data for another model with one more label, silence, which makes it a 32-label model.
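Splitting a long noise recording into one-second pieces can be sketched as below. This is a minimal illustration that assumes the samples are already loaded into a sequence and that the sample rate is 16 kHz; both are assumptions, the function name is hypothetical, and any leftover tail shorter than one chunk is dropped.

```python
def split_into_chunks(samples, sample_rate=16000, seconds=1.0):
    """Cut a recording into consecutive, non-overlapping fixed-length chunks."""
    size = int(sample_rate * seconds)
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, size)]


# Toy example: 10 samples at a pretend rate of 4 samples per second.
chunks = split_into_chunks(list(range(10)), sample_rate=4)
print(chunks)
```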

Setting the batch size parameter when fitting the model to the training data helped reduce training time.

Final model:

Layer (type)                 Output Shape              Param #
lstm_32 (LSTM)               (None, 16, 96)            41856
lstm_33 (LSTM)               (None, 16, 96)            74112
lstm_34 (LSTM)               (None, 96)                74112
dense_18 (Dense)             (None, 32)                3104

Total params: 193,184
Trainable params: 193,184
Non-trainable params: 0



IV. Results

Free Form Visualization:

Since I recorded the accuracy and loss of the models per epoch, here is the accuracy/loss graph of the model with batch normalization.


As the diagram shows, the training accuracy is near 100% and the training loss is near 0. The validation accuracy is near 95%, while the validation loss is around 0.2% near the end of the 30 epochs. The validation loss also reaches its lower bound before the training loss does. Clearly, this model is overfitting the training data.

Model Evaluation and Validation


The final model was tested against the test set of about 150,000 (1.5 lakh) samples with varied sounds, including normal words, meaningless sounds, different kinds of noise, and silence. The score on this test set therefore indicates that the model is of reasonably good quality.

The model is reasonably robust because we pass each sound wave through voice activity detection, strip the unwanted silent frames, and pass only the valid sound frames to the model for prediction. I also created some wave files with words spoken over different background noises, and the model predicted those words correctly.

I also hand-picked a few predictions from the test set and verified them manually to validate the model’s predictions.

The final model reported an accuracy of 0.75 on the test set of 150,000 samples and close to 0.95 on the training set.

The accuracy of 0.75 on the test set produced by the final model is much better than the 0.63 score of the initial benchmark model.

The final solution has a stacked LSTM of three layers; the first two return full sequences, while the last returns a single vector. These layers are followed by a dense layer of 32 nodes, equal to the number of labels to predict.

Although there is a lot of scope for improvement and experimentation, as described in the section below, this score shows that the model is robust enough to do speech recognition for the specified labels.

V. Conclusion


After trying different models and feature data, the best model obtained includes the unknown unknowns class (words not in the training set) and was trained on 32 labels with MFCC features. One more binary classification step is used to detect silence in the wave file. We stripped silent frames from the audio before training. A model with LSTM layers, which consider sequences and memory states, combined with dropout layers for generalization, proved fruitful.


One difficult yet interesting part of this problem is classifying words that are pronounced similarly, and classifying words under dominating background noise, where multiple interpretations are possible. For example, go and no sound alike and can easily be mistaken for each other against a noisy background. Even humans might not be accurate in such situations, but in real-world situations the sentence context can carry information that hints at a particular word.


Possible improvements:

  1. The dimensionality of the features is large; try using PCA to remove dimensions of negligible variance before passing them to the model for training.
  2. Although MFCC features are state of the art in speech recognition according to the literature, we should try other features such as the raw wave, FFT, and log mel features, as well as combinations of different features.
  3. We should try a few other neural network architectures on the MFCC features and see how the model performs. These might include:
    1. Convolutional LSTM.
    2. DenseNet201 etc.
  4. I should try other models such as gradient-boosted trees (XGBoost, or LightGBM, which is computationally efficient) and random-forest-style ensemble models, and see how they perform on this multi-class classification problem.
  5. I should either set up or rent a GPU to run these models, as they would take days to run on a normal CPU. I need to make use of FloydHub, Paperspace or Vectordash in future.
  6. I believe that experimenting with all these options, given proper computational power, will improve on my final model.
  7. Listen to mislabeled samples in the validation set.
  8. Data augmentation: add heavy noise augmentation while keeping the noise-to-signal ratio below 2.



References:


The data required for this project is available at https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data

Some of the blog posts followed:

  1. https://ideasforeversite.wordpress.com/2018/03/18/automatic-speech-recognition/
  2. http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
  3. Voice Activity Detection : https://pypi.python.org/pypi/webrtcvad , https://github.com/wiseman/py-webrtcvad/blob/master/example.py



Libraries used:

  1. For extracting MFCC features : https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html
  2. Keras : https://keras.io/layers/recurrent/#lstm

Boilerplate: https://www.kaggle.com/davids1992/speech-representation-and-data-exploration






Automatic Speech Recognition

How do humans hear speech? An introduction to how evolution did it:

An organ in our ear called the cochlea plays a specialized role in our auditory system. It responds to frequency: different frequencies of sound move different areas along the basilar membrane. Depending on which area of the basilar membrane moves, different nerve impulses are triggered and sent to the brain. A step in the process of extracting mel-frequency cepstral coefficients (popular features for ASR), called periodogram estimation, does a very similar thing.

Mel frequency cepstral coefficients:

Steps to prepare MFCCs:

  1. Split the audio signal into small frames of 20-40 ms; the standard is 25 ms.
  2. Calculate the periodogram estimate (power spectrum) of each frame.
  3. Take clumps of periodogram bins and sum the spectrum inside them to get the energy levels around different frequencies. We use the mel filterbank to do this; the mel scale tells us exactly how to space the filterbanks.
  4. Take the logarithm of the filterbank energies, since humans do not perceive loudness on a linear scale either.
  5. Compute the DCT of the log filterbank energies. This decorrelates the filterbank energies, which are quite correlated. We then keep only 12 or 13 coefficients.
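The five steps above can be followed end to end in a short NumPy sketch. This is an illustration under stated assumptions, not a replacement for a library implementation: the 16 kHz sample rate, 10 ms frame step, Hamming window, 512-point FFT, 26 filters and the common 2595 * log10(1 + f / 700) mel formula are all choices made here for the example, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):       # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - centre)
    return fbank


def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along the last axis, keeping the first n_coeffs."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    return x @ basis.T


def mfcc(signal, sample_rate=16000, frame_ms=25, step_ms=10,
         n_fft=512, n_filters=26, n_coeffs=13):
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // step
    # Step 1: split into overlapping frames and apply a window.
    index = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[index] * np.hamming(frame_len)
    # Step 2: periodogram estimate of the power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # Step 3: sum the power inside each mel filter.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # Step 4: logarithm (small constant avoids log of zero).
    log_energies = np.log(energies + 1e-10)
    # Step 5: DCT, keeping only the first few coefficients.
    return dct_ii(log_energies, n_coeffs)


t = np.arange(16000) / 16000.0
coefficients = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(coefficients.shape)
```

Each row of the result holds the coefficients for one frame, so a one-second clip becomes a short sequence of feature vectors that a sequence model can consume.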

Python libraries to extract MFCCs:

  1. scikits.talkbox
  2. librosa
  3. python_speech_features

Learning algorithms for speech recognition:

  1. Using state of the art LSTM recurrent neural networks



Detailed LSTM tutorial






Markov Decision processes

[Figure: Markov decision process]

A tuple – (S,s1,A,P,R)

S – finite set of states.

s1 – initial state.

A – finite set of actions.

P – Given a state s and an action a, what is the probability of ending up in a particular state s'? This information is provided by P, called the state transition probability matrix, of dimension S × A × S.

R – What is the reward for taking an action a at the state s?

Pi – Policy: what action to take in which state? A mapping from state space to action space. The optimal policy is the policy with the maximum expected cumulative reward.

Value function – Expected cumulative reward starting from state s and following policy pi.

Bellman equations –


[Figure: Bellman equation]
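For reference, in the notation of the tuple above, the Bellman equations for a fixed policy π and for the optimal value function are usually written as follows. The discount factor γ in [0, 1) is an assumption here; the text above does not mention one.

```latex
\begin{aligned}
V^{\pi}(s) &= R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) \, V^{\pi}(s') \\
V^{*}(s)   &= \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V^{*}(s') \Big]
\end{aligned}
```

The first line says the value of a state under π is the immediate reward plus the discounted expected value of the next state; the second replaces the policy's action with the best available action.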

Ways to find an optimal policy:

The optimal policy when starting in state s is the policy whose value function is maximum starting from state s.

Consider the case where one policy is at least as good as another for all initial states. The first policy is then greater than or equal to the second; this defines a partial ordering over policies.


There always exists a policy that is greater than or equal to all other policies under this partial ordering. Such a policy (there may be more than one) is called an optimal policy.

Iterative methods:

  1. Policy iteration:
    1. Evaluate the current policy, then at each state pick the action with the maximum value.
    2. This gives the new policy.
    3. Go back to step 1 and loop until the new policy and the old policy are the same.
  2. Value iteration:
    1. Find an optimal value function rather than an explicit policy.
    2. At every iteration, improve the value vector.
    3. Once it converges to a particular value vector, use it to find the optimal policy.
    4. [Figure: value iteration update]
    5. Each iteration is cheap compared to an iteration of policy iteration.
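Value iteration as outlined above can be sketched on a tiny made-up MDP. The two states, two actions, rewards and the discount factor of 0.9 are all assumptions chosen so the answer is easy to verify by hand; the dictionary layout for P and R is one possible representation, not a standard one.

```python
def action_value(s, a, V, P, R, gamma):
    """Immediate reward plus discounted expected value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())


def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """P[s][a] maps next states to probabilities; R[s][a] is the reward."""
    V = {s: 0.0 for s in states}
    while True:
        # Improve the whole value vector in one sweep.
        new_V = {s: max(action_value(s, a, V, P, R, gamma) for a in actions)
                 for s in states}
        change = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if change < tol:
            break
    # Read the greedy policy off the converged values.
    policy = {s: max(actions, key=lambda a: action_value(s, a, V, P, R, gamma))
              for s in states}
    return V, policy


states = ["A", "B"]
actions = ["stay", "move"]
P = {
    "A": {"stay": {"A": 1.0}, "move": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "move": {"A": 1.0}},
}
R = {
    "A": {"stay": 0.0, "move": 0.0},
    "B": {"stay": 1.0, "move": 0.0},  # only staying in B is rewarded
}

V, policy = value_iteration(states, actions, P, R)
print(V, policy)
```

With a reward of 1 per step for staying in B and a discount of 0.9, the value of B is the geometric series 1 / (1 - 0.9), and the value of A is that amount discounted by one step.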

Model-free methods:

Reinforcement learning is a Markov decision process with an unknown transition model and/or reward distribution.

The agent can observe samples and induce the transition model and reward distribution from them.


Model-free methods use Q-values instead of the value function.
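As one example of a model-free method built on Q-values, here is a minimal tabular Q-learning sketch. The agent never sees P or R directly; it only calls a `step` function and observes the next state and reward. The two-state environment, the purely random exploration, and the learning rate and discount values are assumptions made for this illustration.

```python
import random


def step(state, action):
    """Hidden environment: 'move' switches state, staying in B pays 1."""
    next_state = state if action == "stay" else ("B" if state == "A" else "A")
    reward = 1.0 if (state == "B" and action == "stay") else 0.0
    return next_state, reward


def q_learning(step_fn, states, actions, start, n_steps=20000,
               alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    state = start
    for _ in range(n_steps):
        action = rng.choice(actions)  # explore uniformly at random
        next_state, reward = step_fn(state, action)
        best_next = max(Q[(next_state, a)] for a in actions)
        # Move the estimate towards the sampled one-step target.
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q


Q = q_learning(step, ["A", "B"], ["stay", "move"], start="A")
greedy = {s: max(["stay", "move"], key=lambda a: Q[(s, a)]) for s in ["A", "B"]}
print(greedy)
```

The greedy policy is read directly from the learned Q-table, without ever building an explicit transition model.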

To be continued……….


Utility: The utility of a policy at a state is the long-term outcome of starting at that state and running the policy from there.

Reward gives us immediate feedback, but utility gives us long-term feedback. Utilities allow you to accept short-term negatives for long-term positives.

Credit assignment problem:


Decision trees

What are decision trees?

Decision trees are classification models built using a tree where a particular feature is selected for branching at different levels down the tree.

They help you create non-linear boundaries with a series of linear questions.

Decision tree

The primary choice to make is which feature to use for splitting first.

The objective in choosing the order of splits is to narrow down the possibilities as fast as possible. For example: Animal? Person? Famous? Living? Singer? Female? Michael Jackson?

If you had started with “Michael Jackson?” and I said yes, that would be good, but if I said no, you would have narrowed down almost nothing.

We use information gain or the Gini index to make that decision.

Info gain:


But what does the info gain represent?

The info gain of a particular feature (work experience in the above example) represents the gain in predictability as we go down one level using that feature.

So, how is this predictability gain calculated?

It is measured by the difference between the predictability at the parent node and the expected (weighted) sum of the predictabilities at the child nodes.

What is predictability at a node?

If number of positive and negative examples at a particular node is almost equal, then it is equally likely for a new example at that node to be either positive or negative.

If number of positive examples is far more than the number of negative examples at a particular node, then we can say with much more confidence that a new example at this node is more likely to be positive.

Second case is more predictable and contains less surprise, whereas first case is less predictable and has more surprise.

Now we use Entropy to quantify surprise at any node. How?


If all the samples belong to the same class, the entropy is 0. If the samples are evenly split between two classes, the entropy is 1 (using base-2 logarithms).
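Entropy and information gain can be computed in a few lines. This sketch uses the usual Shannon entropy with base-2 logarithms and weights each child node by its share of the parent's samples; the function names and the toy labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = len(parent_labels)
    weighted_children = sum(len(group) / total * entropy(group)
                            for group in child_label_groups)
    return entropy(parent_labels) - weighted_children


parent = ["+", "+", "-", "-"]
perfect_split = [["+", "+"], ["-", "-"]]   # children are pure
useless_split = [["+", "-"], ["+", "-"]]   # children look like the parent

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that produces pure children removes all the surprise, while a split whose children have the same class mix as the parent gains nothing.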

  1. How to train?
    1. How to select the next best feature for branching?
      1. Calculate info-gain for all the features and select the one with maximum.
  2. How to predict?
    1. Just pass the test example down the tree and you have your classification at the leaf node.
  3. How to write a decision tree in python using scikit?
    1. Writing the classifier: dtclf
    2. Visualizing the decision tree: vizdt
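A minimal sketch of both steps with scikit-learn follows. `DecisionTreeClassifier` and `export_text` are part of `sklearn.tree`; the toy dataset and the feature names `f0` and `f1` are made up for illustration, and the text rendering stands in for a graphical visualization.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: the class equals the first feature, the second feature is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

# Train: at each node the feature with the best criterion score is chosen.
clf = DecisionTreeClassifier(criterion="entropy", min_samples_split=2, random_state=0)
clf.fit(X, y)

# Predict: each test example is passed down the tree to a leaf.
predictions = clf.predict([[1, 0], [0, 1]])
print(predictions)

# Visualize: a plain-text rendering of the learned splits.
tree_text = export_text(clf, feature_names=["f0", "f1"])
print(tree_text)
```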

When to use decision trees?

  1. well-suited for categorical data.

How much training data do you need?

Nature of decision boundary

Decision trees are interpretable, unlike neural networks. Here you can exactly understand why a classifier makes a decision.

They are weak learners and hence good for ensemble methods.

They can handle unbalanced data sets, 1 % positive examples and 99 % negative examples.

They are univariate: each split uses a single feature and does not combine features.

Expressiveness among all possible decision trees in n-attributes:

How to deal with continuous attributes? Age,weight,distance

It does not make sense to repeat an attribute along a path of a tree if the attribute is discrete. If an attribute is continuous, you can ask different questions about the same attribute further down the path.

What to do to avoid overfitting in decision tree?

  1. As usual, use cross-validation.
  2. Check the cross-validation score every time you expand the tree, and stop the expansion when the score falls below a particular threshold.
  3. Or first expand the whole tree, then prune back the leaves, checking the cross-validation score at each step.


ID3: Top down

Tuning parameters in sklearn:

  1. min_samples_split (default=2): the minimum number of samples a node must contain to be split; a node with fewer samples is not split any further.
  2. criterion: gini or entropy.

Decision trees are generally prone to overfitting.

You can build bigger classifiers like ensemble methods using decision trees.







Computational combinatorics vs heuristics

In most AI algorithms, you enumerate the possible states in some way and search through them, checking whether you have reached the goal state or whether particular constraints are satisfied for different combinations of states. This looks much like brute force and does not seem intelligent, though it can solve a huge number of problems thanks to the computational power available to us. Because of computational limits, there is a hard upper bound on the kind of problems we can tackle this way.

To make the search more efficient and more directed, we build human-devised heuristics into the search algorithms to guide the search. This is a crucial part of human-machine interaction: combining the benefits of computational power and accuracy with uniquely human heuristic intuition.
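The difference between enumerating states and guiding the search with a heuristic can be seen on a toy problem. The sketch below, a best-first search on an empty square grid, is an illustration chosen for this note rather than anything from a specific system: with a zero heuristic it behaves like uniform-cost search, and with the Manhattan distance it behaves like A*. The grid, the tie-breaking rule and the function names are assumptions.

```python
import heapq


def no_heuristic(node, goal):
    return 0


def manhattan(node, goal):
    return abs(goal[0] - node[0]) + abs(goal[1] - node[1])


def search(size, heuristic):
    """Best-first search from one corner of an empty grid to the opposite one.

    Returns (path cost, number of states expanded). Ties on cost + heuristic
    are broken in favour of states that look closer to the goal.
    """
    start, goal = (0, 0), (size - 1, size - 1)
    h0 = heuristic(start, goal)
    frontier = [(h0, h0, 0, start)]
    best_cost = {start: 0}
    expanded = set()
    while frontier:
        _, _, cost, node = heapq.heappop(frontier)
        if node in expanded:
            continue
        expanded.add(node)
        if node == goal:
            return cost, len(expanded)
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < size and 0 <= nxt[1] < size:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    h = heuristic(nxt, goal)
                    heapq.heappush(frontier, (new_cost + h, h, new_cost, nxt))
    return None, len(expanded)


blind_cost, blind_expanded = search(5, no_heuristic)
guided_cost, guided_expanded = search(5, manhattan)
print(blind_cost, blind_expanded)
print(guided_cost, guided_expanded)
```

Both searches find a path of the same cost, but the guided one examines far fewer states, which is the effect a good heuristic is meant to have.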

But the main question is: will we ever be able to make a computer think up its own heuristics, appropriate to the problem domain, to guide its search?

Another question: how do we come up with heuristics in the first place? Do we develop them by copying the practices of our peers and mentors through our experience in the domain? Or did we develop an innate sense of what to do in regular day-to-day settings through millions of years of evolution? Is the ability to think up the right heuristics what true intelligence is?

Architecture for AGI


This article mainly runs on following dimensions.

  1. Perceiving the state of the world and state of yourself in the world.
    1. The power of discriminating and encapsulating different material and immaterial concepts, which are often loosely defined, fuzzy, context-based and have no rigid boundaries.
      1. Natural language processing.
      2. Abstractions: the clustering problem. Identifying the level of clustering needed for the current problem domain. The importance of the similarity measure in clustering. A rudimentary kind of clustering is based on how different objects in the world are useful to the agent: things to eat clustered as one, things to talk to, things to use as tools, or which body part to use for which object. K-means algorithm. Naive Bayes: inferring classes from the models and models from the classes. EM algorithm.
    2. pattern recognition.
    3. proprioception and self-referential systems.
    4. object recognition (https://www.youtube.com/watch?v=5yeusVF42K4)
    5. Donald Hoffman’s theory of perception: perceptual systems tuned to fitness outcompete those that are tuned to truth/reality.
  2. Goals and priorities in them.
    1. Switching between the goals based on the timeframes and dependencies for each goal.
    2. Inbuilt goals like getting sex, food, and staying alive; goals we choose for ourselves through culture and experience; and the interaction between the two kinds. Ethical issues of giving AI systems the power to create their own goals.
    3. Decision-making.
  3. Relevance measure for objects perceived in the world to the current goal.
    1. Prioritising perception or attention based on relevance to goal.
    2. Adaptivity of attention based on the skill level(acquired by repetition of the same process)
  4. Identifying actions or a sequence of actions to impact objects for achieving our goals.(planning)
    1. The element of stochasticity or randomness in finding the sequence is necessary to be creative when the sequence is unknown or not obvious.
    2. This can happen from analogies, empathy and copying, pattern recognition.
    3. Considering optimisation in face of multiple paths to a goal.
    4. related to imagination.
  5. A parallel Emotion machine with continuous feedback of utility or happiness on your present state in the world.
    1. Decision-making.
    2. operant conditioning.
  6. Platform knowledge: what information do we have at our birth about the world we are surrounded by and how to build it for a machine?
    1. Evolution.
    2. Jealous baby.
    3. A young baby doesn’t get surprised if a teddy bear passes behind a screen and reemerges as an aeroplane, but a one-year-old does.
    4. No categorical classification at the very beginning.
    5. As a child grows, the number of neurones decreases but the number of synapses increases.
  7. Knowledge representation, search and weights for retention(what you don’t use, you lose)
    1. Brain plasticity.
    2. Repetition and skill.
    3. Priming and classical conditioning.
    4. The DRM effect
    5. Do elements containing similar structure trigger each other? (Hofstadter’s Danny in the Grand Canyon.) Pattern recognition by analogy.
    6. Reconstruction of data as memory at hippocampus with already existing memory.
    7. Procedural , episodic, semantic
    8. Representation of ideas that could be independent of language, which could allow for generalized reasoning to draw conclusions from combining ideas.
  8. Imagination at the core of knowledge, emotion machine and goals.
    1. Demis Hassabis’s research on imagination and memory.

What this article doesn’t talk about is:

  1. Whether a machine can have qualia even though it acts as if it has general intelligence. Mary’s room. How do we test a machine for qualia? Is the Turing test enough for that?
  2. When did consciousness first evolve, and how?
  3. Moral implications of AGI and singularity.

Organising the interaction of all the above:

Why do I think I am talking about AGI? Because we should be able to deduce any human endeavour from the interactions of the above dimensions, at least from infinitely long interactions and their orderings. Connections missed in the diagram below:

  1. Goals to emotion machine.


Deep dive (one at a time):

    1. Why is object recognition difficult?
      1. Objects are defined by their purpose rather than by their look or structure. We need to co-ordinate with modules 3 and 7 to overcome this.
      2. We need to identify the object even when the viewpoint changes. When the viewpoint changes, we face the problem of dimension hopping in training neural networks to recognize the object. Usually the inputs to a neural network for image recognition are pixels, but when the viewpoint changes, information that appeared at one pixel in one training instance appears at a different pixel in another. This is dimension hopping. #viewpoint_invariance
    2. True language understanding is impossible without internally modelling the world.
  1.  Goals and priorities
  2. Relevance
  3. Planning (identifying action sequences):
  4. Emotion machine and utility:
  5. Platform knowledge: Jean Piaget, a child development psychologist argued that children create a mental model of the world around them and continuously modify that model as they interact with reality and receives feedback. He called this phenomenon as Equilibration.
    1. Infant reflexes .
    2. A child brain is less focussed with more learning rate on multitude of things, whereas an adult is more focussed, having better self control but with less learning rate.
    3. It seems true that humans come into the world with some innate abilities. some examples by steven pinker,
      1. when a human and a pet are both exposed to speech, human acquires the language whereas pet doesn’t, presumably because of some innate difference between them.
      2. Sexual preferences of men and women vary.
      3. Experimental studies on identical twins who are separated at birth and examined at later stages of life showing astonishing similarities.
      4. Do we have moral sense at birth?
    4. The above covers some facts about human nature at birth, but our main question is: is there a seed meta-program that we are endowed with at birth which makes all other learning and behavior possible? If it exists, what is it? Systems of this kind are called constructivist systems (Thórisson).

Existing Cognitive Architectures: Two types

  1. Uniformity first:
    1. Soar Architecture
  2. Diversity first:
    1. ACT-R : intelligent tutoring systems are its application. John Anderson.

      Autocatalytic Endogenous Reflective Architecture (AERA)

Sigma architecture:


DeepMind (Demis Hassabis):

Deep learning + Reinforcement learning


  1. Art is partly skill and partly stochastic.
  2. Poetry at the intersection of analogy and utility.
  3. How much of general and/or human intelligence has to do with reason? If intelligence is fully correlated with reason, and morality is built on reason alone, then we and AGI have no conflict of interest, as more intelligence only means more morality. #moralitynintelligence
  4. Value system guiding attention, attention setting the problem domain, problem domain looking for heuristics, heuristics guiding the search path.


  1. Distributed representation: Many neurones are involved in representing one concept and one neurone is involved in representing many concepts.

Artificial Intelligence

Definition: The study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.

Agent and environment:

Rational agent: The agent which selects the action that maximizes its performance measure given its inputs or perceptions and built-in knowledge.

PEAS: Performance, Environment, Actuators, and Sensors

Types of Environment:

  1. Fully vs partially observable:
  2. Deterministic vs stochastic: In a deterministic environment, the next state is completely determined by the current state and the agent’s action.
  3. Episodic vs sequential: In an episodic environment, each perception-action pair can be considered an independent episode.
  4. static vs dynamic:
  5. discrete vs continuous:
  6. single-agent vs multi-agent:
  7. known vs unknown:

Types of Agents in the order of increasing generality:

  1. Simple reflex agents: based on current state only. Look up the table of percepts and actions.
  2. Model-based reflex agents: based on percept history, maintains a model of the world, condition-action rules decide the action of the agent.
  3. Goal-based agents: contain goal information in addition to what model-based agents have. Goals decide the actions of the agent.
  4. Utility-based agents: The utility function is the agent’s performance measure. They try to increase their utility (“happiness”).
  5. Learning agents:

Agent’s organisation:

  1. Atomic representation: Consider each state as a black box.
  2. Factored representation: Each state has attribute value properties.
  3. Structured representation: Relationship between objects inside a state.

Search Agents

General example problems from AI textbook:

  1. Travelling salesman problem.
  2. Route finding problem
  3. Robot navigation
  4. protein design

State space, search space and search tree.


  1. Completeness: Can the algorithm find a solution when there exists one?
  2. Optimality: Can the algorithm find the least cost path to the solution?
  3. Time complexity: How long does it take to complete the search.
  4. Space complexity: How much memory is needed to complete the search.

Compare search

Two types of search:

  1. Uninformed: No information about the domain.
    1. Breadth-first: FIFO
      1. bfs
      2. Applications: computing shortest path when all the edge lengths are same.
    2. Depth first: LIFO
      1. dfs
      2. Depth-first search is not optimal because it might return a high-cost path in the left subtree without considering lower-cost paths in the right subtree.
      3. Applications: Topological ordering of directed acyclic graphs.
    3. Depth limited: Depth first with depth limit
    4. Iterative deepening: Depth limited with an increasing limit; combines the benefits of depth first and breadth first. First do a depth-first search to a depth limit n, then increase the limit and search again.
    5. Uniform cost: Expand the least-cost node first (priority queue ordered by path cost).
      1. Dijkstra’s algorithm is a variant of UCS: UCS finds the shortest path from the start node to a goal node, whereas Dijkstra finds the shortest path from the start node to every other node. Both compute shortest paths when edge lengths differ. Use a heap data structure for speed.
      2. I doubt Google uses UCS for its route finding in Maps. The cost between nodes can be a mix of distance, traffic, weather etc.
        1. ucs
  2. Informed: Some information about the domain and goals.
    1. Heuristic search (h(n)): Estimate how close we are to the goal with a heuristic function. An admissible heuristic never overestimates the actual cost to reach the goal.
      1. Greedy search expands the node that appears to be closest to the goal. It uses a heuristic function that estimates the cost from a particular node to the target node.
        1. greedy search.PNG
      2. A* search: Considers both the heuristic estimate of the cost from a node to the target node and the cost already paid to reach that node. A* will revise its decision, whereas greedy search will not. This property makes greedy incomplete and A* complete.
      3. IDA*:
  3. Local search: iterative improvement algorithms by an optimization/fitness function.
    1. Hill climbing (steepest ascent/descent, greedy local search). Some variants:
      1. sideways moves – escape flat lines.
      2. Random restart – to overcome local maxima.
      3. Stochastic –
    2. Simulated annealing : from statistical physics.
    3. Local beam search – k states instead of single state.
    4. Genetic algorithms: from evolutionary biology
      1. genetic algo.PNG
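
The uninformed and informed strategies above differ mainly in how the frontier is ordered. As a rough sketch (the function name, the toy graph and the heuristic values are all made up for illustration), the following keeps the frontier in a heap: with no heuristic it behaves like uniform cost search, and with an admissible heuristic it behaves like A*.

```python
import heapq

def best_first_search(graph, start, goal, heuristic=None):
    """Uniform cost search when heuristic is None, A* otherwise.

    graph maps a node to a dict {neighbour: edge_cost}.
    Returns (cost, path), or (None, []) if the goal is unreachable.
    """
    h = heuristic or (lambda node: 0)
    # Frontier entries are (priority, cost_so_far, node, path).
    frontier = [(h(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for neighbour, step in graph.get(node, {}).items():
            new_cost = cost + step
            if new_cost < best_cost.get(neighbour, float("inf")):
                best_cost[neighbour] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + h(neighbour), new_cost, neighbour, path + [neighbour]),
                )
    return None, []

# Hypothetical toy graph and an admissible heuristic towards "D".
GRAPH = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}
H_TO_D = {"A": 3, "B": 2, "C": 1, "D": 0}

ucs_result = best_first_search(GRAPH, "A", "D")
astar_result = best_first_search(GRAPH, "A", "D", heuristic=H_TO_D.get)
```

Both calls find the route A, B, C, D with cost 4; the heuristic only changes the order in which nodes are expanded.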

Adversarial search and games:

Multiagent environment, uncontrollable opponent.

Not a sequence of actions, but a strategy or policy.

Approximation is the key ingredient in the face of combinatorial explosion.

Types of games.PNG

(perfect information, deterministic) – Zero sum games.

At any stage, one agent will try to pick the move with maximum utility out of all legal moves and the opponent will pick the move with minimum utility.

Embedded thinking or backward reasoning:

Thinking through the consequences and potential actions of the opponent before making the decision.

Minimax : Two opponents playing optimally, one trying to maximize and other trying to minimize the utility.

Alpha-Beta pruning:

When a MAX node evaluates its children, start with alpha = -infinity and update alpha whenever a child returns a utility better than the current alpha. Each child is a MIN node. While evaluating one of these first-generation children, if any of its own children (the second generation) has a utility less than or equal to alpha, you don’t need to look at that child’s remaining second-generation children. Why? That first-generation child is trying to find a minimum, so once its value drops to alpha or below, the MAX node at the zeroth generation already knows this child can no longer improve on what it has. The ordering of child nodes therefore matters a lot: the right ordering can save you from exploring many unnecessary children/paths.
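
A small sketch of the pruning rule described above, assuming the game tree is given as nested Python lists with numeric utilities at the leaves (this representation and the `visited` bookkeeping are only for illustration):

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Return the minimax value of a tree given as nested lists.

    A leaf is a number (its utility); an internal node is a list of children.
    visited, if given, collects the leaves that were actually evaluated.
    """
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the MIN parent will never let play reach this node
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the MAX parent already has something at least as good
    return value

# MAX root with three MIN children, each with three leaf utilities.
TREE = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
seen = []
root_value = alphabeta(TREE, True, visited=seen)
```

Here the root value is 3, and the leaves 4 and 6 are never evaluated: once the second MIN child sees a 2, it can no longer beat alpha = 3.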

Stochastic/Non-deterministic games:

A chance/random element involved.


Machine Learning:

Data collection -> Data preparation -> Exploratory Data analysis(picture the data using different graphs and charts) -> Machine learning techniques -> Visualization -> Decision

  1. Decision trees:
  2. Rule induction:
  3. Neural networks:
  4. SVMs
  5. Clustering method
    1. K-means
    2. Gaussian mixtures
    3. Hierarchical clustering
    4. Spectral clustering.
  6. Association rules
  7. Feature selection
  8. Visualization
  9. Graphical models
  10. Genetic algorithm


  1. Hypothesis testing
  2. Experimental design
  3. Anova
  4. Linear regression
  5. Logistic regression:
    1. It is a linear classifier, why?
      1. Because the decision boundary where the probability=0.5 is linear.
  6. GLM
  7. PCA

Definition of Machine learning:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

By its very definition, machine learning algorithms are generally confined to a particular class of tasks and are not generic enough for general intelligence.

K-nearest neighbors:

Prediction costs O(n*d) per query – n is the number of training examples and d is the number of features.

K is usually chosen as an odd value, to avoid ties in binary classification.

z-scaling – Each feature has a mean of 0 and a standard deviation of 1.

Kernel regression is a flavor of k-nearest neighbors where we weight the contribution of each neighbor, whereas in k-nearest neighbors we just take the average.
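
To make the difference between the plain average and the weighted version concrete, here is a minimal sketch. The Gaussian weighting and the `bandwidth` parameter are one common choice and are an assumption here, not something fixed by the notes above.

```python
import math

def knn_regress(train, query, k=3, bandwidth=None):
    """train is a list of (features, target) pairs; features are tuples.

    With bandwidth=None the k nearest targets are averaged (plain k-NN).
    Otherwise each neighbour is weighted by a Gaussian kernel of its distance.
    """
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    if bandwidth is None:
        return sum(y for _, y in nearest) / len(nearest)
    weights = [math.exp(-((d / bandwidth) ** 2)) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical one-dimensional data.
TRAIN = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((10.0,), 10.0)]
plain = knn_regress(TRAIN, (0.5,), k=3)
weighted = knn_regress(TRAIN, (0.5,), k=3, bandwidth=1.0)
```

The plain average treats the neighbour at distance 1.5 the same as the two at distance 0.5, so it returns 1.0; the kernel version pulls the estimate back towards the closer neighbours.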


Recommender systems (collaborative filtering).

Breast cancer diagnosis

Handwritten character classification

information retrieval


To Avoid over fitting:

Reduce the number of features manually or do feature selection (Algorithms like random forest and XGBoost can give feature importance that you can use to prune less important features).

Do a model selection

Use regularization (keep the features but reduce their importance by setting small parameter values)

  1. Lasso regularization (L1): used in case if you want to prune some unimportant features.
  2. Ridge regularization (L2): works better than L1 usually in practice.

Do a cross-validation to estimate the test error.

Training set used for learning the model.

The validation set is used to tune the parameters, usually to control overfitting.

Linear Models:

  1. Linear regression.
    1. Using least square loss
  2. Linear classification:
    1. Perceptron: Data has to be linearly separable.

Classification: Logistic regression

Decision Trees: Tree classifiers

Start with a set of all our examples at the top root node.

Create branches considering splitting criteria and repeat this for creating other branches as you go down.

Write the final decision at the bottom of the tree based on the training examples, and use this tree to classify any new test data.

First, we calculate the entropy at a node and then compare the information gain of splitting that node on each attribute. Then we split on the attribute with the highest information gain.
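
The entropy and information gain calculation can be sketched as follows. The tiny dataset and the attribute names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy at the node minus the weighted entropy after splitting on attribute.

    rows is a list of dicts mapping attribute names to values.
    """
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical training examples.
ROWS = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": True},
    {"outlook": "rain", "windy": False},
]
LABELS = ["yes", "yes", "no", "no"]

gains = {a: information_gain(ROWS, LABELS, a) for a in ("outlook", "windy")}
best_attribute = max(gains, key=gains.get)
```

In this toy data "outlook" separates the classes perfectly (gain of 1 bit) while "windy" tells us nothing (gain of 0), so the node is split on "outlook".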

Pruning strategies to escape overfitting:

  1. Stop growing the tree earlier.
  2. Grow a complex tree and then prune it back. Remove a subtree and check it against a validation set, if the performance doesn’t degrade, remove the subtree from the original.

CART method:

Another method to build a decision tree.

  1. Adopt same greedy, top-down algorithm.
  2. Binary splits instead of multiway splits.
  3. uses Gini index instead of information entropy.
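
A rough sketch of the two CART ingredients listed above, binary splits scored by the Gini index, on a single numeric feature. The helper names and the data are hypothetical, and a real CART implementation does considerably more (multiple features, recursion, pruning).

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Try each threshold 'value <= t' and return (threshold, weighted_gini)."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini_index(left) + len(right) * gini_index(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

node_impurity = gini_index(["a", "a", "b", "b"])
split = best_binary_split([1, 2, 3, 4], ["a", "a", "b", "b"])
```

The unsplit node has impurity 0.5; splitting at `value <= 2` gives two pure children, so the weighted Gini drops to 0.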


p-proportion of positive examples

P-proportion of negative examples.

Bayes Rule:

Bayes rule.PNG
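
For reference, the standard statement of Bayes rule, written here with y for the label and x for the observed features (the symbols are chosen to match the p(y|x) notation used in these notes, not taken from the picture):

```latex
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)},
\qquad
P(x) = \sum_{y'} P(x \mid y')\, P(y')
```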

Discriminative Algorithms:

Idea: Model p(y|x), the conditional distribution of y given x.

Find a decision boundary that separates positive from negative examples.

Generative Algorithms:

modelling for Bayes rule.PNG

Naive Bayes assumes that feature values are conditionally independent given the label.

Naive Bayes classifier is linear.

Naive Bayes.PNG

Ensemble Methods:

An Ensemble method combines the predictions of many individual classifiers by majority voting.

Such individual classifiers, called weak learners, are required to perform slightly better than random.

  1. Majority voting:
    1. Condorcet’s Jury theorem:
      1. Assumptions:
        1. Each individual makes the right choice with a probability p.
        2. The votes are independent.
      2. If p>0.5, then adding more voters increases the probability that the majority decision is correct. If p<0.5, then adding more voters makes things worse.
  2. Boosting:
    1.  adaboost
    2. One of the most popular Machine learning methods.
    3. AdaBoost.M1-popular algorithm.
    4. Weak learners can be trees, perceptrons, decision stumps etc.
    5. The predictions of all weak learners are combined with a weighted majority voting.
    6. Idea: Train the weak learners on weighted training examples. AdaBoost.PNG
      1. AdaBoost with decision stumps leads to a form of feature selection.
      2. Bootstrapping is a resampling technique for training weak learners from m samples of original training data.
    7. Iteratively give more weight to examples that are misclassified and less weight to examples that are classified properly (refer to weight update step d in the picture above).
    8. Boosting.PNG
    9. Boosting tends to avoid overfitting because the margin increases as you add more and more weak learners.
    10. Boosting tends to overfit if the weak learners themselves are overfitting.
  3. Bagging: Bagging.PNG
    1. Bagging on trees works better than building a big tree on whole data.
  4. Random forests: Build many decision trees, each trained on a random subset of the training data and considering a random subset of the features. Use the majority vote of the trees to determine the final outcome. This relies on the principle that the majority of trees will compensate for the trees that misclassify a given instance.
    1. Quora uses random forests for determining whether two questions are duplicates.
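
Condorcet’s jury theorem from the majority voting item above can be checked numerically. This is a small sketch under the theorem’s own assumptions (independent voters, each right with probability p); the function name is made up.

```python
from math import comb

def majority_correct_probability(n_voters, p):
    """Probability that a strict majority of n independent voters is right,
    when each voter is right with probability p (n_voters should be odd)."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

good_voters = [majority_correct_probability(n, 0.6) for n in (1, 3, 5)]
bad_voters = [majority_correct_probability(n, 0.4) for n in (1, 3, 5)]
```

With p = 0.6 the probability of a correct majority grows as voters are added; with p = 0.4 it shrinks, which is why weak learners need to be at least slightly better than random.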

A practical guide to applying Support Vector Classification:

Clustering examples:

  1. Clustering of the population by their demographics.
  2. Audio signal separation.
  3. Image segmentation.
  4. Clustering of geographic objects
  5. Clustering of stars.

Desirable clustering properties:

  1. Richness: For any assignment of objects to clusters, there is some distance matrix D such that the clustering scheme returns that clustering. For example, the stopping rule “stop when clusters are x units apart” is rich because varying x can produce every possible clustering. On the other hand, “stop when a fixed number of clusters is reached” is not rich because it can’t produce every possible clustering.
  2. Scale invariance: Scaling distances by any value shouldn’t change the clustering of the objects.
  3. Consistency: Shrinking(making similar things more similar) intra cluster distances and expanding(making non-similar things more non-similar) inter clustering distances shouldn’t change the clustering.

Impossibility theorem: No clustering algorithm can achieve all three of richness, scale invariance, and consistency.


  1. Based on initial conditions of the centroid, you may end up with totally different clusters. To deal with this, we take the ensemble of clusterings with different initializations. n_init parameter determines that.
  2. sklearn.cluster.KMeans(n_clusters=8,max_iter=300, n_init=10).
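
To show what the restarts buy you, here is a plain-Python sketch of k-means with several random initializations, keeping the run with the lowest inertia (sum of squared distances to the nearest centroid). It borrows the parameter names `n_init` and `max_iter` from the sklearn call above, but it is only an illustration and not sklearn’s implementation; the `seed` argument and the sample points are assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, n_init=10, max_iter=300, seed=0):
    """Run Lloyd's algorithm n_init times; return (inertia, centroids) of the best run."""
    rng = random.Random(seed)
    best_inertia, best_centroids = None, None
    for _ in range(n_init):
        centroids = rng.sample(points, k)  # random initial centroids
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Keep the old centroid if a cluster ends up empty.
            updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
            if updated == centroids:
                break  # assignments are stable
            centroids = updated
        inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
        if best_inertia is None or inertia < best_inertia:
            best_inertia, best_centroids = inertia, centroids
    return best_inertia, best_centroids

POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
inertia, centroids = kmeans(POINTS, k=2)
```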

Limitations of k-means:

  1. Output for any given training set would always be same as long as we do not randomly pick initial centroids. But we usually choose centroids randomly in the beginning.
  2. Hill climbing algorithm.

K-means fails for:


Solution for How to select K.

G-means:G-Means betterment of k-means.PNG

How to evaluate?

internal evaluation: High intra-cluster similarity, Low inter-cluster similarity.

External evaluation: Mutual information, entropy, adjusted Rand index.

Other clustering methods:

  1. Spectral clustering
  3. BIRCH

1. Single linkage clustering:

Keep finding the two closest clusters (initially, each point is its own cluster) and join them until you end up with the k required clusters. This is a hierarchical agglomerative clustering structure.

single linkage cluster.PNG

Association Rules:

Supp(A->C): P(A^C)

Probabilistic interpretation:

R: A->C


Though this looks like a valid measure, it doesn’t take into account P(C).


interest(R)=1 then A and C are independent.

interest(R)>1 then A and C are positively dependent.
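
A small sketch of these rule measures on a toy set of transactions. The support formula is the one given above; the formulas for confidence, P(C|A), and interest, P(A^C) / (P(A) P(C)), are the usual textbook definitions and are assumed here, since they are not written out in these notes. The basket contents are invented.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and interest (lift) of the rule antecedent -> consequent.

    transactions is a list of sets of items.
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    p_a = sum(a <= t for t in transactions) / n
    p_c = sum(c <= t for t in transactions) / n
    p_ac = sum((a | c) <= t for t in transactions) / n
    return {
        "support": p_ac,                  # P(A ^ C)
        "confidence": p_ac / p_a,         # P(C | A), ignores P(C)
        "interest": p_ac / (p_a * p_c),   # equals 1 when A and C are independent
    }

BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]
metrics = rule_metrics(BASKETS, {"bread"}, {"milk"})
```

Here the rule bread -> milk has a confidence of 2/3, which looks decent, but its interest is below 1 because milk is already in 3 of the 4 baskets.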



Collaborative filtering, customer behavior analysis, web organization, affinity promotion, cross-selling

Multidimensional rules:

Read more about Quantitative Association rules as well.


Care about goal rather than the path.


Three elements:

  1. A set of variables.
  2. A set of domains for each variable.
  3. A set of constraints that specify allowable combinations of values.

Types of consistency:

  1. Node consistency – unary constraints
  2. Arc-consistency (Constraint propagation) – binary constraints – X->Y is arc consistent if, for every value that can be assigned to X, there exists a value that can be assigned to Y that satisfies the constraint between them. AC-3 is the algorithm that makes a CSP arc consistent. O(n^2*d^3)
  3. Path-consistency – n-ary constraints

Forward checking propagates information from assigned to unassigned variables only and does not check the interaction between unassigned variables. Arc consistency checks all arcs and keeps doing so whenever the domain of a variable is reduced. In this sense, AC is more general and an improvement over FC.
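
A compact sketch of AC-3, assuming the CSP is given as a dict of domains plus a dict of directed arcs with a predicate each (this encoding and the X < Y < Z example are made up for illustration):

```python
from collections import deque

def ac3(domains, constraints):
    """Make a binary CSP arc consistent, pruning domains in place.

    domains: {variable: set of values}
    constraints: {(x, y): predicate(value_of_x, value_of_y)} for each directed arc.
    Returns False if some domain becomes empty, True otherwise.
    """
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        allowed = constraints[(x, y)]
        # Values of x with no supporting value in y's domain.
        unsupported = {vx for vx in domains[x]
                       if not any(allowed(vx, vy) for vy in domains[y])}
        if unsupported:
            domains[x] -= unsupported
            if not domains[x]:
                return False
            # x changed, so arcs pointing at x must be rechecked.
            queue.extend(arc for arc in constraints if arc[1] == x and arc[0] != y)
    return True

DOMAINS = {v: {1, 2, 3} for v in "XYZ"}
CONSTRAINTS = {
    ("X", "Y"): lambda x, y: x < y,
    ("Y", "X"): lambda y, x: y > x,
    ("Y", "Z"): lambda y, z: y < z,
    ("Z", "Y"): lambda z, y: z > y,
}
consistent = ac3(DOMAINS, CONSTRAINTS)
```

For X < Y < Z over {1, 2, 3}, propagation alone narrows every domain to a single value, so no search is needed afterwards.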

Two approaches to solving CSPs:

  1. Search based
    1. BFS
    2. DFS
    3. BTS (backtracking search)
  2. Inference

Choose the one with minimum remaining values to fill first.

Fill with least constraining value first – the one that rules out fewest values in remaining variables.

Can work on independent subproblems separately based on the problem structure.


d-size of domain, n-number of variables.


Reinforcement learning:

Learning behaviour or actions based on the rewards/feedback from environment.

Science of sequential decision making.

  1. Markov decision process – Formal model for sequential decision making.
    1. Markov property – Future is independent of the past given the current state.
    2. Applying reinforcement framework to Markov process.
    3. Theorem – Due to the Markov property, a discounted MDP always has a stationary optimal policy (pi).
    4. A policy is better than another policy if its value is greater than or equal to the other’s from every starting state.
    5. One useful fact is that there always exists an optimal policy which is at least as good as all other policies.
    6. Bellman equations.
    7. There is no closed form solution to the Bellman equations for the optimal policy.
    8. Iterative solution approaches are:
      1. Policy iteration.
      2. Value iteration.
    9. Think of different policies to traverse through MDP. Calculate value of each policy.
    10. Q-value iteration.
    11. Epsilon greedy exploration – with probability epsilon, take random action, with probability (1-epsilon) , take greedy action.
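
Two of the items above, value iteration and epsilon greedy exploration, are short enough to sketch. The two-state MDP, its rewards and the discount of 0.5 are invented for the example.

```python
import random

def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-9):
    """Repeatedly apply the Bellman optimality backup until values stop changing.

    transition(s, a) returns a list of (probability, next_state);
    reward(s, a) returns the immediate reward.
    """
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward(s, a) + gamma * sum(p * values[s2] for p, s2 in transition(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values maps each action to its estimated value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)  # exploit

# Hypothetical MDP: "stay" keeps the state, "move" switches it;
# every action taken in state B earns a reward of 1.
def transition(state, action):
    other = "B" if state == "A" else "A"
    return [(1.0, state if action == "stay" else other)]

def reward(state, action):
    return 1.0 if state == "B" else 0.0

values = value_iteration(["A", "B"], ["stay", "move"], transition, reward, gamma=0.5)
```

Staying in B forever is worth 1 + 0.5 + 0.25 + ... = 2, and the best plan from A is to move to B, which is worth 0.5 * 2 = 1.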

Logical agents or knowledge-based agents:

Mainly based on knowledge representation(domain specific) and inference(domain independent).

  1. Using propositional logic.
  2. Using first-order logic – used in personalized assistants.

Pros and cons of logical agents:

  1. Doesn’t handle uncertainty, probability does.
  2. Rule-based doesn’t use data, ML does.
  3. It is hard to model every aspect of the world.
  4. Models are encoded explicitly.

Resolution algorithm

Natural Language Processing:

Text classification (spam filtering)- Naive Bayes classifier.

Sentiment analysis

Naive bayes classifier – Assumes features are independent.

Classifies the new example based on the probabilities of different classes given the features of that particular example.

m-estimate of the probability:

  1. Augment the sample size by m virtual examples, distributed according to prior p.
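
In code, the m-estimate is a one-liner. This sketch uses the usual form (n_c + m*p) / (n + m), where n_c is the number of examples with the feature value among the n examples of the class; the argument names are chosen for the example.

```python
def m_estimate(n_c, n, prior, m):
    """Probability estimate after adding m virtual examples distributed by prior."""
    return (n_c + m * prior) / (n + m)

unseen_word = m_estimate(n_c=0, n=10, prior=0.5, m=2)   # no longer exactly zero
no_smoothing = m_estimate(n_c=3, n=10, prior=0.5, m=0)  # plain relative frequency
```

With m = 0 this is just the observed frequency; with m > 0 a feature value that was never seen with a class still gets a small non-zero probability, so it cannot zero out the whole Naive Bayes product.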

Chain rule of probability:

  1. To estimate joint probability

N-gram model: Look N-1 words in the past to predict next word.

Bigrams capture syntactic dependencies, such as a noun coming after “eat” and a verb coming after “to”.

3-grams and 4-grams are common in practice.

Perplexity – The higher the conditional probability, the lower the perplexity.
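
A minimal bigram model (N = 2) with a perplexity function, to make the relationship above concrete. The corpus is a made-up toy sentence, and there is no smoothing, so the perplexity of a sequence containing an unseen bigram is undefined in this sketch.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Return prob(prev, word) estimated from bigram counts (no smoothing)."""
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

def perplexity(prob, tokens):
    """2 to the power of the average negative log2 probability per bigram."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log2(prob(prev, word)) for prev, word in pairs)
    return 2 ** (-log_sum / len(pairs))

CORPUS = "i like to eat rice i like to eat bread".split()
prob = train_bigram(CORPUS)
certain = perplexity(prob, "i like to eat".split())
less_certain = perplexity(prob, "i like to eat rice".split())
```

Every bigram in "i like to eat" has probability 1 in this corpus, so its perplexity is 1; adding "rice", which follows "eat" only half the time, raises it.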

Great progress:

Tagged text – which word is what type – noun, adjective,verb

Named entity recognition – yesterday (time), I (person), five (quantity)

Good progress:

Parsing – Exhibit the grammatical structure of a sentence.

Sentiment analysis – Does it identify sarcasm?

Machine translation

Information extraction.

Work in progress:

Text summarization


Dialog systems: Siri, Echo etc.

Deep learning :

  1. Less or no feature engineering required.
  2. Need to choose number of layers in the stack and number of nodes in each layer.
  3. Needs to be sensitive to minute details to distinguish between two breeds of a dog and, at the same time, invariant to large irrelevant variations like background, lighting and pose.
  4. ReLU usually learns much faster in networks with many layers compared to sigmoid.
  5. With the revival of deep learning, pre-training turned out to be needed only for small data sets.
  6. Convnet combined with recurrent net to generate image captions. Vision to language interface.
  7. Handwritten characters, spoken words and faces were the most successful applications.
  8. Theorem: No more than 2 hidden layers are needed to represent any arbitrary region (assuming a sufficient number of neurons or units).
  9. The number of neurons in each hidden layer is found by cross validation.
  10. The Mammalian visual cortex is hierarchical with 5-10 levels just for the visual system.
  11. Examples of Deep neural networks – Restricted Boltzmann machines, convolutional NN, autoencoders etc.
  12. Software for Deep learning:
    1. Tensorflow – Python
    2. Theano – Python
    3. Torch – Lua
    4. Caffe – Suited for vision


Path planning:

  1. Visibility graph: VGRAPH – A-star search through the nodes between obstacles. Guaranteed to give the shortest path in 2D. Doesn’t scale well to 3D. The path runs close to obstacles (nodes at vertices).
  2. Voronoi path planning:
  3. Potential field path planning: Consider both robot and obstacles as positive charges repelling each other and the goal as a negative charge attracting the robot. Take a random walk in case you reach a local minimum.
  4. Probabilistic roadmap planner: First sample the space and detect collision-free nodes. Connect each node to its k nearest neighbors. Then do a graph search from start to goal over these connected nodes. There is no guarantee that all nodes will be connected.


  • Discriminative models learn the (hard or soft) boundary between classes
  • Generative models model the distribution of individual class.
  • Confusion matrix.
  • ply – each move by one of the players.
  • policy – State to action mapping
  • stationary policy – Particular action for a particular state all the time.
  • Representation learning – automatically discovers the representations required to classify or predict; not much feature engineering required. Ex: Deep learning.
  • Saddle points/minimax points – where the derivative of the function is zero, but the point is a maximum along one axis and a minimum along another. Shaped like the saddle on a horse.
  • Unsupervised pre-training – Making features before feeding into the learning network.
  • Configuration space(C – space): Set of parameters that completely describe the robot’s state. Proprioception.

CREDITS: Columbia University

Data structures and applicability


Sets – Doesn’t maintain order and doesn’t allow repeated elements.

Stacks – Depth-first search, behave similar to recursion and can be a replacement for it.

Queues – Breadth-first search

Heaps – (Priority queue)

  1. Supports
    1. Insertion – O(logn)
    2. Extract min/max – O(logn)
    3. Heapify – O(n)
    4. Delete – O(logn)
  2. Applications:
    1. Scheduled event managing.
    2. Median maintenance – using two heaps, one max heap, and one min heap.
    3. Dijkstra Algorithm – O(m log n)
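
Median maintenance with two heaps, as mentioned in the applications above, can be sketched with the standard `heapq` module. Since `heapq` only provides a min heap, the max heap stores negated values; the class name is made up.

```python
import heapq

class RunningMedian:
    """Keep the lower half in a max heap and the upper half in a min heap."""

    def __init__(self):
        self.low = []   # max heap of the smaller half (values stored negated)
        self.high = []  # min heap of the larger half

    def add(self, x):
        # Push onto the lower half, then move its largest element up.
        heapq.heappush(self.low, -x)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so that low has as many elements as high, or one more.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2

tracker = RunningMedian()
medians = []
for value in [5, 2, 8, 1, 9]:
    tracker.add(value)
    medians.append(tracker.median())
```

Each insertion costs O(log n) and reading the median is O(1), because the median always sits at the top of one or both heaps.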

Search trees –

Trees :

  1. All the nodes should be connected.
  2. should not have cycles.


  1. Balanced binary search trees:
    1. Red-black trees.
    2. AVL trees.
    3. Splay trees.

Hash tables –

  1. When you don’t need to remember ordering, minimum or maximum.
  2. To check whether an element exists – O(1)
  3. Insertions and deletions – O(1)
  4. Applications:
    1. 2-sum problem.
  5. Hash function:
  6. Resolving collisions:
    1. Chaining: Linked lists inside each bucket.
    2. open addressing (probe sequence) – one object per bucket.
  7. Choose num of buckets as a prime number.
  8. Load factor: Num of objects/ Num of buckets.
  9. Every hash function has a pathological data set.
  10. Randomization from a family of hash functions at runtime.
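
The 2-sum application listed above is a good illustration of O(1) membership checks. A minimal sketch using Python’s built-in hash set:

```python
def has_two_sum(numbers, target):
    """Return True if two entries at different positions add up to target."""
    seen = set()
    for x in numbers:
        if target - x in seen:  # expected O(1) lookup in a hash set
            return True
        seen.add(x)
    return False

found = has_two_sum([3, 9, 14, 1], 10)
missing = has_two_sum([3, 9, 14, 1], 6)
```

One pass over the list is enough, so the whole check is O(n) on average instead of the O(n^2) of comparing every pair.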

Hashing – Use a hash function that derives the index at which to store a value from the value itself (for example, from its last few digits), and store the value at that index. Lookup is then O(1), since reading from an indexed array takes constant time.

Bloom filters –

  1. More space efficient than hash tables.
  2. No deletions.
  3. Small false positive probability – might say x has been inserted even though it hasn’t been.
  4. Operations:
    1. Fast inserts and lookups.
  5. Applications:
    1. Spell-checker.
    2. List of forbidden passwords.
    3. Network routers – Limited memory, need to be super fast.
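
A toy Bloom filter, to show why there are no deletions and why false positives (but never false negatives) are possible. Deriving the k hash functions from salted SHA-256 digests, and using one byte per bit, are simplifications chosen for readability; the sizes are arbitrary.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Simulate several hash functions by salting one hash with an index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # All positions set: "probably inserted". Any position clear: definitely not.
        return all(self.bits[pos] for pos in self._positions(item))

forbidden = BloomFilter()
for password in ["password", "123456", "qwerty"]:
    forbidden.add(password)
```

Bits set by different items overlap, so clearing bits to delete one item could silently remove others; that is why the basic structure supports only inserts and lookups.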





  1. The code should be closed for modification and open for extension, so that it is easily maintainable for future change requests.
  2. Prefer composition (change behavior in runtime, encapsulate family of algorithms) over inheritance (your behavior is stuck). (why?)
  3. All design patterns have pros and cons. The decision of whether I should use a design pattern in a situation should be taken after analyzing whether the product you are building can afford to have cons offered by the pattern.

Situation (when):

Design pattern solution (what): Class diagram

Pros and Cons (why):

Decorator pattern (follows open-closed principle):


When there are a lot of combinations to deal with in different categories.

Normal solution (stupid way): Enumerate all the combinations possible and write a method in subclass specific to each combination.

Maintenance problems with this solution:

  1. If the price of milk goes up, they have to change the code in all the places.
  2. If they add a new item in one category, the number of combinations explodes even further.

One more solution:

Use a Boolean instance variable for the presence of each specific condiment, compute the cost of all condiments from those variables in the parent class, then add the cost of the specific beverage in the subclass after calling super.cost().

Some problems with this solution:

  1. We have to alter existing code if there is a change in the price of any condiment.
  2. New condiments in future means, new instance variables, new setters and getters and we also need to modify the cost method in the superclass to account for this new condiment as well.
  3. Some of the beverage and condiment combinations may not be appropriate and we have no way of restricting them.
  4. What if the customer wants a double mocha?

Time for One more solution:

When inheriting behavior by sub-classing, the behavior is set statically at compile time.

Design pattern solution:

Inheriting behavior at runtime through composition and delegation. More flexibility.

If you want a dark roast with mocha and whip condiments, take a dark roast object, decorate it with a mocha object, decorate that with a whip object, then call the cost function and rely on delegation.

Decorator objects are kind of wrappers.

The concrete decorator has an instance variable for the thing it decorates.

We can implement new decorators any time to add new behavior instead of changing existing code which is thoroughly tested.

Decorators are usually created by using Factory and Builder patterns.

Alternative to sub-classing for extending functionality.

Enables a lot of combinations with a minimal number of classes possible.
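
The dark roast example can be sketched in a few lines. This is written in Python for brevity (the notes themselves lean on Java terms such as super.cost()), and the class names and prices in cents are invented.

```python
class Beverage:
    def cost(self):
        raise NotImplementedError

    def description(self):
        raise NotImplementedError

class DarkRoast(Beverage):
    def cost(self):
        return 99  # hypothetical price in cents

    def description(self):
        return "Dark roast"

class CondimentDecorator(Beverage):
    """Wraps another Beverage and adds its own price on top by delegation."""
    name, price = "", 0

    def __init__(self, beverage):
        self.beverage = beverage  # the thing being decorated

    def cost(self):
        return self.price + self.beverage.cost()

    def description(self):
        return f"{self.beverage.description()}, {self.name}"

class Mocha(CondimentDecorator):
    name, price = "mocha", 20

class Whip(CondimentDecorator):
    name, price = "whip", 10

order = Whip(Mocha(DarkRoast()))
double_mocha = Mocha(Mocha(DarkRoast()))
```

A double mocha is just the same decorator applied twice, and a new condiment is one new small class; no existing class has to change.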


  1. Introduces complexity to the code and reduces readability.
  2. If you have some logic that is specific to a component type like the discount in this case, then this would not be applicable.
  3. You end up adding a lot of small classes.

Adapter pattern:


Facade Pattern:


  1. To simplify the interface of a group of classes.

Iterator Pattern:

This along with composite pattern helps you deal with a collection of Objects better.


  1. If you want to iterate across collectibles of varied types like ArrayList, Hashmap, etc, provided the component type remains same (check).
  2. To provide a way to access elements of an aggregate object sequentially without worrying about its underlying implementation.

Composite Pattern:


  1. If you want to treat tree-like structures, leaf nodes, and composites uniformly.
  2. If you want your sub-categories to behave in the same way as your categories.
  3. The difference from iterator pattern is, here the component type itself varies.

Design pattern:

Looks similar to depth-first search.

Class Diagram:

How to use it:

Pros and cons:

  1. The abstract Menu component class acts as an interface for both leaf node and composite node, thus breaking single responsibility principle.

Before closing out discussion on the collection of objects, touch upon Bounded generics.

Force not to add improper types into your collectibles in the first place. Archiver case.

Below three patterns change the behavior at runtime using composition.

The State Pattern:


Encapsulate state-based behavior and delegate behavior to the current state.

Pros and cons:

  1. Lot of classes, one for each state.

The Strategy Pattern:


Encapsulate interchangeable behaviors and use delegation to decide which behavior to use.

Configures context classes with a behavior.
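
A minimal sketch of the idea, using plain functions as the interchangeable behaviors (a common shortcut in Python; in Java each behavior would be a class implementing an interface). The duck example and all names are hypothetical.

```python
def fly_with_wings():
    return "flying with wings"

def fly_no_way():
    return "cannot fly"

class Duck:
    """Context class configured with a fly behavior at construction time."""

    def __init__(self, fly_behavior):
        self.fly_behavior = fly_behavior

    def perform_fly(self):
        # Delegate to whichever behavior is currently plugged in.
        return self.fly_behavior()

duck = Duck(fly_with_wings)
before = duck.perform_fly()
duck.fly_behavior = fly_no_way  # swap the behavior at runtime
after = duck.perform_fly()
```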

Template Method Pattern:


Subclasses decide how to implement steps in an algorithm.

Factory pattern:

  1. Tying your class to a concrete implementation (new) is very fragile and less flexible.

Situation: You have several closely related concrete classes and you choose which one to instantiate based on some conditional logic.

Lame solution:

Put if, else blocks and use the new operator to instantiate an appropriate object in each block.

Maintenance problems:

If there is a new object (new duck type), that has to be added to this conditional logic and you have to figure out and add it in all places wherever it is applicable. This is error-prone.

Other solution:

Code to an interface.

Here, we have to figure out which implementation to instantiate for what type. This requires a piece of conditional logic code. This has to change if new implementations are added or any of the existing implementations are to be removed. So, this is not closed for modification.

Factory method lets subclasses decide which class to instantiate.
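
As an illustration of that last sentence, here is a small factory method sketch. The store and duck classes are invented; the point is only that `order_duck` never names a concrete duck class.

```python
class Duck:
    def describe(self):
        return type(self).__name__

class MallardDuck(Duck):
    pass

class RubberDuck(Duck):
    pass

class DuckStore:
    def create_duck(self):
        """Factory method: subclasses decide which Duck to instantiate."""
        raise NotImplementedError

    def order_duck(self):
        duck = self.create_duck()  # no if/else and no concrete class here
        return f"Packed a {duck.describe()}"

class PondStore(DuckStore):
    def create_duck(self):
        return MallardDuck()

class ToyStore(DuckStore):
    def create_duck(self):
        return RubberDuck()

receipts = [store.order_duck() for store in (PondStore(), ToyStore())]
```

Adding a new duck type means adding a new store subclass, while the shared `order_duck` logic stays closed for modification.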

Observer Pattern:


  1. When a group of objects need to be notified of some state changes.


  1. Code to an interface, not to an implementation.
  2. Use bounded generics instead of collections of Object for better validation and fewer type-casting issues.
  3. Usually, an anti-pattern to modify objects coming in as arguments. Use final modifiers.

You may or may not use any of these patterns in your daily life, but understanding these and going through pros and cons of them will significantly impact your thought process and help you make better decisions.



Naive Bayes

Bayes’ theorem helps us incorporate new evidence/information into our model.

Code from sklearn

from sklearn.naive_bayes import GaussianNB



Bayesian Learning:

Bayesian learning.PNG

Naive bayes


Linear in the number of variables.

Naive Bayes assumes conditional independence across attributes and doesn’t capture inter-relationship among attributes.

Gaussian naive Bayes assumes that the continuous values associated with each class follow a Gaussian distribution.

Even if the probability of one of the attributes given label becomes zero, the whole thing ends up being zero.

Maximum Likelihood :

Machine learning

Pandas and numpy:

numpy.mean(df[df.gold > 0]["bronze"])

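import numpy
from pandas import DataFrame, Series

# Build a DataFrame with one column per medal type
olympic_medal_counts = {'country_name': countries,
                        'gold': Series(gold),
                        'silver': Series(silver),
                        'bronze': Series(bronze)}
df = DataFrame(olympic_medal_counts)

# Drop the non-numeric column, then average each medal column
del df['country_name']
avg_medal_count = df.apply(lambda x: numpy.mean(x))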





df['points'] = df[['gold', 'silver', 'bronze']].dot([4, 2, 1])
olympic_points_df = df[['country_name', 'points']]


For highly skewed feature distributions such as 'capital-gain' and 'capital-loss', it is common practice to apply a logarithmic transformation to the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Using a logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation, however: the logarithm of 0 is undefined, so we must translate the values by a small amount above 0 to apply the logarithm successfully.

Taking the log of values has the effect of spreading out small values and pulling large values closer together.
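A minimal sketch of the transformation described above, using log(x + 1) as the shift so that zeros stay defined. The sample values are invented for illustration and are not taken from the actual dataset.

```python
import math

# Invented, highly skewed values (many zeros, a few very large numbers).
capital_gain = [0, 0, 150, 2_000, 99_999]

# log1p(v) computes log(v + 1), so a raw value of 0 maps to 0 instead of failing.
log_transformed = [math.log1p(v) for v in capital_gain]

raw_range = max(capital_gain) - min(capital_gain)
log_range = max(log_transformed) - min(log_transformed)
```

The raw range here is 99,999, while the transformed range is about 11.5.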


In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature’s distribution (such as 'capital-gain' or 'capital-loss' above); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, the data no longer carries its original meaning when observed in raw form.

We will use sklearn.preprocessing.MinMaxScaler for this.

If data is not normally distributed, especially if the mean and median differ significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is a Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach which can work in most cases is applying the natural logarithm.

Identifying Outliers:

Tukey’s method
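A short sketch of Tukey's method as it is commonly stated: a point is flagged as an outlier if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The data and the function name tukey_outliers are made up for illustration, and the quartile convention (the "inclusive" method of statistics.quantiles) is an assumption; other conventions give slightly different fences.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = tukey_outliers(data)
```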



from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

R2 score:

r2 score.PNG

If \bar{y} is the mean of the observed data:

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

then the variability of the data set can be measured using three sums of squares formulas:

SS_{\text{tot}} = \sum_{i} (y_i - \bar{y})^2
SS_{\text{reg}} = \sum_{i} (f_i - \bar{y})^2
SS_{\text{res}} = \sum_{i} (y_i - f_i)^2 = \sum_{i} e_i^2

The most general definition of the coefficient of determination is

R^2 \equiv 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
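The definition above can be sketched directly in plain Python. Here y holds the observed values and f the predicted values; the numbers and the function name r2_score_manual are invented for illustration.

```python
def r2_score_manual(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    return 1 - ss_res / ss_tot


observed = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
r2 = r2_score_manual(observed, predicted)
```

For these numbers SS_tot = 29.1875 and SS_res = 1.5, so R^2 is roughly 0.949.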


Accuracy = (true positives + true negatives) / total

from sklearn.metrics import accuracy_score


Accuracy is not the right metric when the data is skewed (the classes are heavily imbalanced).
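A small made-up example of why this matters: with 1 positive among 100 samples, a model that always predicts the negative class still reaches 99% accuracy while finding none of the positives.

```python
# Invented, heavily skewed labels: 1 positive, 99 negatives.
y_true = [1] + [0] * 99
# A useless "model" that always predicts the majority class.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives
```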







f beta.PNG
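For reference, the F-beta score is commonly defined as (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The sketch below assumes that standard definition; the function name f_beta is chosen for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1_balanced = f_beta(0.5, 0.5, beta=1)
f2_high_recall = f_beta(0.5, 1.0, beta=2)
f2_high_precision = f_beta(1.0, 0.5, beta=2)
```

With beta = 2, the high-recall case scores higher than the high-precision case, because recall is weighted more.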



Underfitting: error due to bias. The model is oversimplified and performs badly on both training and testing data.

Overfitting: error due to variance. The model is overcomplicated and too specific; it performs well on training data but badly on testing data.

Model complexity graph:

Model complexity graph.PNG

Training set: for training model.

Cross validation set: for choosing the right parameters, like the degree of the polynomial. Useful to check whether the trained model is overfitting. If the trained model performs well on the training set but poorly on this cross validation set, then the model is overfitted.

Testing set: For final testing.


Divide all your data into k buckets and iteratively train k models, each time using one bucket as the testing set and the remaining buckets for training.

Average the results of these k models to get the final score.
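A rough sketch of this bucket scheme in plain Python. The helper k_fold_indices, the toy targets, and the toy "model" (predict the training mean) are all assumptions made for illustration; a real model would take their place.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each bucket is the test set exactly once."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + fold_size + (1 if fold < remainder else 0)
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
fold_errors = []
for train_idx, test_idx in k_fold_indices(len(targets), k=3):
    # Toy "model": always predict the mean of the training targets.
    prediction = sum(targets[i] for i in train_idx) / len(train_idx)
    error = sum(abs(targets[i] - prediction) for i in test_idx) / len(test_idx)
    fold_errors.append(error)

# Average the per-fold errors to get one overall estimate.
average_error = sum(fold_errors) / len(fold_errors)
```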



If you increase training points on different models:


Grid Search CV:


Plot learning curves to identify when to stop collecting data.

Supervised learning:

If there is an order in the output data, then go for a continuous model. Ex: income, age.

If there is no order, go for a discrete model. Ex: phone numbers, persons.

Algorithms to minimize sum of squared errors:

  1. Ordinary least squares: sklearn LinearRegression.
  2. Gradient descent.

There can be multiple lines that minimize |error|, but only one that minimizes error^2.
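The two approaches listed above can be compared on a tiny made-up data set. This sketch fits a line y = w*x + b by the closed-form least-squares solution and by gradient descent on the mean squared error; the learning rate and step count are arbitrary choices that happen to converge for this data.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
n = len(xs)

# 1. Ordinary least squares (closed form for a single feature).
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# 2. Gradient descent on the mean squared error.
w, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_b = 2 / n * sum(residuals)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

Both routes end at approximately the same line (slope about 1.94, intercept about 1.15), which matches the point that the squared error has a single minimizer.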

Instance based learning:


  1. Remembers.
  2. Fast and doesn’t learn.
  3. Simple.
  4. No generalization.
  5. Sensitive to noise.


  1. Look up:

In k-nn all features matter equally because when we calculate distance, all features are treated equally.
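A minimal k-NN classifier sketch in plain Python, using Euclidean distance so that every feature contributes equally to the distance. The points, labels, and the function name knn_predict are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(points, labels, (2, 2))
near_b = knn_predict(points, labels, (8.5, 8.5))
```

There is no training step: the "model" is just the stored data, and all the work happens at query time.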


Locally weighted regression (evolved from k-nearest neighbors):

Naive bayes:

Powerful tools for creating classifiers for incoming labeled data.

Expectation maximization:

expectation maximization.PNG

This is very similar to k-means clustering.

EM is for soft clustering when there is any ambiguity regarding which data point to move to which cluster.

Which supervised classifiers are suitable for numerical as well as categorical data?

The data you have is called ‘mixed data’ because it has both numerical and categorical values, and since you have class labels, it is a classification problem. One option is to go with decision trees, which you already tried. Another possibility is naive Bayes, where you model numeric attributes with a Gaussian distribution or similar. You can also employ a minimum-distance or KNN-based approach; however, the cost (distance) function must be able to handle both types of data together. If these approaches don’t work, try ensemble techniques: bagging with decision trees, or Random Forest, which combines bagging and random subspaces. With mixed data, choices are limited and you need to be cautious and creative with your choices.

Feature scaling:

To give equal importance to all the different features, we can normalize them to the range 0 to 1 before applying the learning algorithm.


from sklearn.preprocessing import MinMaxScaler
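A small usage sketch for the import above, assuming scikit-learn is installed. The two-column data is invented; each column is rescaled independently to the range 0 to 1.

```python
from sklearn.preprocessing import MinMaxScaler

# Invented data: the two features live on very different scales.
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 40.0]]

scaler = MinMaxScaler()          # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X).tolist()
```

Each value becomes (x - min) / (max - min) for its column, so the second column's 20.0 maps to one third.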



Feature rescaling is useful in k-means clustering and RBF SVM, where we calculate distances, but not much in decision trees and linear regression.

Feature selection:


  1. Knowledge discovery, interpretability, and insight. To identify which features actually matter among all of them.
  2. Curse of dimensionality – The amount of data that you need to train grows exponentially with the number of features that you have.


Feature selection:

Filtering (fast) and wrapping (slow):

k-NN suffers from the curse of dimensionality because it doesn’t know which features are important. So, we can use decision trees as a filtering mechanism to determine the important features and then pass them on to k-NN for learning.

Wrapper methods evaluate subsets of variables, which, unlike filter approaches, allows them to detect possible interactions between variables.


  1. A feature is said to be strongly relevant if the Bayes optimal classifier’s performance is strongly affected by the absence of this feature.
  2. A feature is weakly relevant if there exists some other feature which can serve the purpose of this feature.
  3. Depends on how much information a feature provides.


  1. Depends on error/model/learner.

Composite feature:

composite feature.PNG

When to use PCA?

  1. When some latent features are driving the patterns in the data.
  2. Dimensionality reduction, reduce noise, better visualization.

Feature transformation:

Independent Component analysis: cocktail party problem.


Other feature transformations:

  1. Random component analysis – Fast and it usually works.
  2. Linear discriminant analysis –

The lower the cross entropy, the better the model.

Cross entropy is the sum of the negative logarithms of the probabilities that the model being evaluated assigns to the events that actually occurred.
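A short sketch of this calculation in plain Python. Each list holds the probability a hypothetical model assigned to the outcome that actually happened; the numbers are invented for illustration.

```python
import math


def cross_entropy(probabilities_of_actual_events):
    """Sum of negative log probabilities assigned to what actually occurred."""
    return -sum(math.log(p) for p in probabilities_of_actual_events)


# A model that was fairly confident in the right outcomes...
good_model = cross_entropy([0.8, 0.7, 0.9])
# ...versus one that gave the right outcomes low probability.
bad_model = cross_entropy([0.2, 0.3, 0.1])
```

The first model scores about 0.69 and the second about 5.1, so the first is the better model by this measure.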


Difference between RMSE and RMSLE:

RMSLE measures the ratio between actual and predicted.


The log error log(pi + 1) - log(ai + 1) can be written as log((pi + 1)/(ai + 1)).

It can be used when you don’t want to penalize huge differences when both the values are huge numbers.

Also, this can be used when you want to penalize underestimates more than overestimates.

Let's have a look at the example below.

Case a) : Pi = 600, Ai = 1000

RMSE = 400, RMSLE ≈ 0.5102

Case b) : Pi = 1400, Ai = 1000

RMSE = 400, RMSLE ≈ 0.3362

The difference between actual and predicted is the same in both cases. RMSE treats them equally, whereas RMSLE penalizes the underestimate more than the overestimate.
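The two cases can be checked with a few lines of plain Python. This sketch assumes the usual RMSLE definition with the +1 shift inside the logarithm; the function names are chosen for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, using log(x + 1)."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


under = (rmse([600], [1000]), rmsle([600], [1000]))    # case a: underestimate
over = (rmse([1400], [1000]), rmsle([1400], [1000]))   # case b: overestimate
```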

Design patterns


Design that is more automated and requires less maintenance is the standard.

Getters and setters:

Setters can be used for validation and constraining the value to be set to a particular range.

Getters can be used for returning default value, if the variable is not set yet or for lazy instantiation.

You can’t instantiate an abstract class. An abstract class can have some non-abstract members.

An interface has only abstract methods (default and static methods were added in Java 8).




  1. Singleton:
    1. Can only have one instance of that particular class.
    2. President of a country, System in java.
    3. Private constructor, singleton using enum.
    4. @Singleton annotation.
    5. Difficult to unit test – why?
  2. Factory:
    1. Logic that returns a particular subclass object when asked for an object of the base class.
  3. Abstract Factory:
  4. Builder:
    1. Separates object construction from its representation.
    2. interfaces.
  5. Prototype:
    1. Chess game initial setup.
    2. Copying/cloning the initial setup rather than creating it every time you need it. Reduces redundant work.
    3. Copy a fully initialized instance.
    4. Link to code.
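As an illustration of the Prototype item in the list above, here is a minimal Python sketch of the chess example: the initial setup is built once and every new game starts from a deep copy of it. The ChessBoard class, its clone method, and the board encoding are hypothetical.

```python
import copy


class ChessBoard:
    """Builds the initial position; rows 0-1 are black, rows 6-7 are white."""

    def __init__(self):
        back_rank = ["R", "N", "B", "Q", "K", "B", "N", "R"]
        self.squares = (
            [[piece.lower() for piece in back_rank], ["p"] * 8]
            + [[None] * 8 for _ in range(4)]
            + [["P"] * 8, list(back_rank)]
        )

    def clone(self):
        # Deep copy so that moves on the clone never touch the prototype.
        return copy.deepcopy(self)


INITIAL_SETUP = ChessBoard()      # the fully initialized prototype

game = INITIAL_SETUP.clone()
# Move a white pawn two squares forward on the copy only.
game.squares[4][4], game.squares[6][4] = "P", None
```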

How to create objects?


Inheritance? Interface? etc.

How are different classes related?

How are objects composed?

  1. Adapter:
    1. Matches the interfaces of different classes so that they can communicate.
  2. Composite:
  3. Proxy:
    1. An object representing another object, like credit card as a proxy of bank account.
    2. Remote object and Home object(proxy).
  4. Flyweight:
    1. Reuse same objects by resetting values of the objects appropriately instead of creating new objects every time.
  5. Facade:
    1. Event managers, process, execute, group many steps into a single step.
  6. Bridge:
  7. Decorator:
    1. Add responsibilities to objects dynamically.
    2. Ex: adding different toppings to different pizzas, adding discounts to different orders.
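The Decorator item in the list above can be sketched with the pizza example. The class names Pizza and ToppingDecorator and the prices are invented for illustration.

```python
class Pizza:
    """Base component."""

    def description(self):
        return "Plain pizza"

    def cost(self):
        return 8.0


class ToppingDecorator:
    """Wraps any pizza-like object and adds one topping to it at runtime."""

    def __init__(self, pizza, name, price):
        self._pizza = pizza
        self._name = name
        self._price = price

    def description(self):
        return f"{self._pizza.description()} + {self._name}"

    def cost(self):
        return self._pizza.cost() + self._price


# Responsibilities are stacked dynamically, one wrapper per topping.
order = ToppingDecorator(ToppingDecorator(Pizza(), "cheese", 1.5), "olives", 1.0)
summary = (order.description(), order.cost())
```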


Interactions between different objects.

  1. Template method:
  2. Mediator:
    1. Instead of applications talking to each other directly, we use an enterprise service bus.
  3. Chain of responsibility:
    1. Passing a request through different objects.
  4. Observer:
    1. A way of notifying a change to a number of classes.
    2. Java ships an implementation of this pattern (java.util.Observable and java.util.Observer).
    3. Subject extends Observable.
    4. Whoever wants to listen implements Observer and registers with the subject.
  5. Strategy:
    1. change the implementation/strategy of an interface at a later point in time.
    2. Pass whatever implementation needs to be used as an argument.
  6. Command:
    1. Encapsulate a command request as an object.
    2. Threads built on java.lang.Runnable are implemented like this.
  7. State:
  8. Visitor:
    1. Adding new operations to a particular class without inheritance and wi
  9. Iterator:
    1. Sequentially access the elements of a collection.
  10. Interpreter:
  11. Memento:
    1. Saving states of something as objects so they can be restored at a future point in time if necessary.
    2. Undo/Redo operations.
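As an illustration of the Strategy item in the list above, here is a minimal Python sketch in which the implementation to use is passed in as an argument and can be swapped later. The Report class and the two ordering functions are hypothetical.

```python
from typing import Callable, List


def ascending(items: List[int]) -> List[int]:
    return sorted(items)


def descending(items: List[int]) -> List[int]:
    return sorted(items, reverse=True)


class Report:
    """Delegates the ordering decision to whatever strategy it was given."""

    def __init__(self, ordering: Callable[[List[int]], List[int]]):
        self.ordering = ordering

    def render(self, values: List[int]) -> str:
        return ", ".join(str(v) for v in self.ordering(values))


report = Report(ascending)
first_render = report.render([3, 1, 2])

# Swap the strategy at a later point without touching Report itself.
report.ordering = descending
second_render = report.render([3, 1, 2])
```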

Strategy pattern:



Strategy design pattern.PNG


Strategy pattern - when.PNG

Observer pattern:


Observer pattern - when


Observer pattern

Factory pattern :

Factory pattern - when

Factory pattern

Abstract Factory pattern:

Singleton pattern:

Singleton pattern.PNG

Builder pattern:

Builder pattern.PNG

Prototype pattern:

Try to put state and behaviors in different classes.