Value system and rules

Let’s say we have a machine that understands all the high-level abstractions and patterns in data. Let’s say it has built on its own models to predict what’s coming next. Data science ends here presenting models and predictions to Entrepreneurs or executives to make a decision. But Is it intelligent at this stage? No, it has to be able to make decisions on its own to be truly intelligent.

So, what guides decision making process?

Value system.

A company makes its decision to maximize its profit. Profit is their value and that guides their decision-making process.

A man’s value of his survival deeply coded in his DNA motivates him to make proper decisions that could help him get proper food, shelter, clothes, sex etc.

Evolution has managed to deduce very rich and distinct emotions that could help us identify reinforcements in the real world. This makes it an imperative for any machine to be hard coded with these rules of value system to make use of reinforcement learning or other latest tech meme of that kind.

Whether it is chess or Go, if there is something it should definitely know before starting the game, it has to be the winning state and possibly evaluation functions for all different states possible in the Game. But that’s a small world to code valuations/reinforcements. How would you go about writing such codes for a machine to evaluate the humongous set of states, instances, objects in the real world? This makes me believe that modeling the value system or giving to a machine ability to build its own value system is at the core of AI. Too early to ignore Minsky’s rule-based systems.


Markov Decision processes

Markov decision process.PNG

A tuple – (S,s1,A,P,R)

S – finite set of states.

s1 – initial state.

A – finite set of actions.

P – Given a state s1 and action a, what is the probability of ending up at a particular state s2? This information is provided by P. This is called a State transition probability matrix. S*A*S

R – What is the reward for taking an action a at the state s?

Pi – Policy, what action to take at which state? A mapping from state space to action space. The optimal policy is the policy with the maximum reward.

Value function – Expected reward starting from state s and following policy pi.

Bellman equations –


Ways to find an optimal policy:

Optimal policy when starting in state s is the policy whose value function is maximum starting from state s.

Think about the case where a particular policy is always better than another policy for all initial states. Then first policy is greater than second policy and this is called partial ordering.


There always exists a policy such that the partial ordering of it with all other policies is greater or equal. Such a policy/policies is called optimal policy.

Iterative methods:

  1. Policy iteration:
    1. At each state, pick the action with max value function.
    2. you get the new policy.
    3. Again go to step 1 and loop until the new policy and old policy are same.
  2. Value iteration:
    1. Finding an optimal value function rather than explicit policy.
    2. For every iteration improve value vector.
    3. once you converge to a particular value vector, use it to find the optimal policy.

Model-free methods:

Reinforcement learning is Markov decision process with unknown transition model and/or reward distribution.

Agent can observe samples and induce transition model and reward distribution from that.


Uses Q-values instead of the value function.

To be continued……….

Credit assignment problem:

Decision trees

What are decision trees?

Decision trees are classification models built using a tree where a particular feature is selected for branching at different levels down the tree.

They help you to create non-linear boundaries with series of linear questions.

Decision tree

The primary choice to make is which feature to use for splitting first?

The objective in choosing the order of split is to further narrow down the possibilities faster. For example, Animal? Person? famous? living? singer? female? Michael jackson?

If you have started with Michael, if I said yes, then that’s good, but if I said no, you effectively narrowed down almost nothing.

We use info gain or Gini to make that decision.

Info gain:


but what does the info gain represent?

Info gain of a particular feature(work experience in above example) represents the predictability gain as we go down one level using that feature.

So, How is this predictability gain calculated?

This is measured by difference in predictability at parent node and expected sum of predictabilities in child nodes.

What is predictability at a node?

If number of positive and negative examples at a particular node is almost equal, then it is equally likely for a new example at that node to be either positive or negative.

If number of positive examples is far more than the number of negative examples at a particular node, then we can say with much more confidence that a new example at this node is more likely to be positive.

Second case is more predictable and contains less surprise, whereas first case is less predictable and has more surprise.

Now we use Entropy to quantify surprise at any node. How?


If all the samples belong to the same class, entropy is 0. If the samples are evenly distributed among all classes, then the entropy is 1.

  1. How to train?
    1. How to select the next best feature for branching?
      1. Calculate info-gain for all the features and select the one with maximum.
  2. How to predict?
    1. Just pass the test example down the tree and you have your classification at the leaf node.
  3. How to write a decision tree in python using scikit?
    1. writing the classifierdtclf
    2. visualizing the decision treevizdt

When to use decision trees?

  1. well-suited for categorical data.

How much training data do you need?

Nature of decision boundary

Decision trees are interpretable, unlike neural networks. Here you can exactly understand why a classifier makes a decision.

They are weak learners and hence good for ensemble methods.

They can handle unbalanced data sets, 1 % positive examples and 99 % negative examples.

This is Univariate – does not combine features.

Expressiveness among all possible decision trees in n-attributes:

How to deal with continuous attributes? Age,weight,distance

It does make sense to repeat an attribute along a path of a tree if attributes are discrete. If attributes are continuous, then you ask different questions for same attributes.

What to do to avoid overfitting in decision tree?

  1. like usually, using cross validation.
  2. Checking with cross validation at every time you expand the tree and the cv score is less than a particular threshold, then stop the expansion.
  3. Or first, expand the whole tree and start pruning back the leaves and check cv score at each step.


ID3: Top down

Tuning parameters in sklearn:

  1. min_samples_split(default=2):  If node contains only n samples, then it will not be split any further.
  2. criterion=gini,entropy

Decision trees are generally prone to overfitting.

You can build bigger classifiers like ensemble methods using decision trees.





Induction is key for AGI

Induction -> models -> prediction -> decision -> action

For example, finding a number in the telephone directory.

Inducts that numbers are sorted by alphabetical order of names from the data.

Models or imagines a straight line domain for all data.

Predicts at what part of the book the number can be found.

Takes decision and action to open that part.

Again infers from data and updates its predictions, decisions, and acts accordingly with the new information.


It is not necessary for a theory to be computable in general, at least in most of the sciences. But as the Artificial intelligence sits as a branch of computer science, most of AI researchers looks for the theory of AI from the perspective of its computability. MARCUS HUTTER suggests that at least a small chunk of researchers should look for theory of intelligence without considering computational resource constraints and then we can approximate that theory to computability once we find it.

MML principle

Solomonoff’s theory of induction – Algorithmic probability and Kolmogorov complexity.

Artificial Intelligence

Definition: The study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.

Agent and environment:

Rational agent: The agent which selects the action that maximizes its performance measure given its inputs or perceptions and built-in knowledge.

PEAS: Performance, Environment, Actuators and Sensors

Types of Environment:

  1. Fully vs partially observable:
  2. Deterministic( vs stochastic): The next state of the environment is completely determined by the current state.
  3. Episodic vs sequential: Each perception-action pair can be considered as an episode.
  4. static vs dynamic:
  5. discrete vs continuous:
  6. single-agent vs multi-agent:
  7. known vs unknown:

Types of Agents in the order of increasing generality:

  1. Simple reflex agents: based on current state only. Look up the table of percepts and actions.
  2. Model-based reflex agents: based on percept history, maintains a model of the world, condition-action rules decide the action of the agent.
  3. Goal-based agents: contains goal information additional to Model-based agents.Goals decide the actions of the agent.
  4. Utility-based agents:  Utility function is the agent’s performance measure. try to increase the happiness.
  5. Learning agents:

Agent’s organisation:

  1. Atomic representation: Consider each state as a black box.
  2. Factored representation: Each state has attribute value properties.
  3. Structured representation: Relationship between objects inside a state.

Search Agents

General example problems from AI textbook:

  1. Travelling salesman problem.
  2. Route finding problem
  3. Robot navigation
  4. protein design

State space, search space and search tree.


  1. Completeness: Can the algorithm find a solution when there exists one?
  2. Optimality: Can the algorithm find the least cost path to the solution?
  3. Time complexity: How long does it take to complete the search.
  4. Space complexity: How much memory is needed to complete the search.

Compare search

Two types of search:

  1. Uninformed: No information about the domain.
    1. Breadth-first: FIFO
      1. bfs
      2. Applications: computing shortest path when all the edge lengths are same.
    2. Depth first: LIFO
      1. dfs
      2. Depth limited is not optimal because it might return us a high-cost path on the left sub-tree without considering other low-cost paths in the right subtree.
      3. Applications: Topological ordering of directed acyclic graphs.
    3. Depth limited: Depth first with depth limit
    4. Iterative deepening: Depth limited with increasing limit, combines benefits of depth first and breadth first. First do depth first search to a depth limit n and then increase the depth and do depth first search again. diff of BFS and IDS
    5. Uniform cost: Expand least cost node first, FIFO
      1. Dijkstra’s algorithm is a variant of ucs, except that ucs finds shortest path from start node to goal node, whereas Dijkstra finds shortest path from start node to every other node.ucs and Djikshtra.Computing shortest path when edge lengths are different. Use heap data structure for speed.
      2. I doubt Google uses UCS for it’s route finding in maps. Cost between the nodes can be a mix of distance, traffic, weather etc.Maps algorithm
        1. ucs
  2. Informed: Some information about the domain and goals.
    1. Heuristics search (h(n)): Determine how close we are to the goal based on a heuristic function. Admissible heuristic thinks that the cost to the solution is less than the actual cost.
      1. Greedy search expands the node that appears to be closest to the goal. uses a heuristic function that has the information of cost from a particular node to the target node.
        1. greedy search.PNG
      2. A* search:  Consider both the heuristic that says about the cost from a node to the target node and cost to reach that node. A* search will revise it’s decision, whereas greedy search would not. This property makes greedy incomplete and A* complete.
      3. IDA*:
  3. Local search: iterative improvement algorithms by an optimization/fitness function.
    1. Hill climbing (steepest descent/ascent,greedy local search):some variants of this are:hill-climbing
      1. sideways moves – escape flat lines.
      2. Random restart – to overcome local maxima.
      3. Stochastic –
    2. Simulated annealing : from statistical physics.
    3. Local beam search – k states instead of single state.
    4. Genetic algorithms: from evolutionary biology
      1. genetic algo.PNG

Adversarial search and games:

Multiagent environment, uncontrollable opponent.

Not a sequence of actions, but a strategy or policy.

Approximation is the key ingredient in the face of combinatorial explosion.

Types of games.PNG

(perfect information, deterministic) – Zero sum games.

At any stage, one agent will try to pick the move with maximum utility out of all legal moves and the opponent will pick the move with minimum utility.

Embedded thinking or backward reasoning:

Thinking through the consequences and potential actions of the opponent before making the decision.

Minimax : Two opponents playing optimally, one trying to maximize and other trying to minimize the utility.

Alpha-Beta pruning:

When the agent that is trying to find Max utility of it’s children, you start with an alpha as (-infinity), and update alpha if you find any child nodes with utility better than present alpha. During this process, if you come across any 2nd gen childs with utility less than alpha, then you don’t need look for utilities of other 2nd gen childs of that particular 1st child. Why? because that 1st gen child is trying to find minimum, once you go down below alpha, that 1st gen child is out context at the zeroth generation which is trying to find max has found that this 1st gen child can no longer contribute to me. In this context, ordering of child nodes matters a lot. Right ordering can save you from exploring lot of unnecessary childs/paths.

Stochastic/Non-deterministic games:

A chance/random element involved.


Machine Learning:

Data collection -> Data preparation -> Exploratory Data analysis(picture the data using different graphs and charts) -> Machine learning techniques -> Visualization -> Decision

  1. Decision trees:
  2. Rule induction:
  3. Neural networks:
  4. SVMs
  5. Clustering method
    1. K-means
    2. Gaussian mixtures
    3. Hierarchical clustering
    4. Spectral clustering.
  6. Association rules
  7. Feature selection
  8. Visualization
  9. Graphical models
  10. Genetic algorithm


  1. Hypothesis testing
  2. Experimental design
  3. Anova
  4. Linear regression
  5. Logistic regression:
    1. It is a linear classifier, why?
      1. Because the decision boundary where the probability=0.5 is linear.
  6. GLM
  7. PCA

Definition of Machine learning:

A computer program is said to learn from experience E with respect to some class of tasks T and performance P, if its performance at tasks in T , as measured by P, improves with experience E.

BY it’s very definition, Machine learning algorithms are generally subdued to a particular class of tasks and not generic enough for general intelligence.

K-nearest neighbors:

O(n*d) – n is the number of training examples and d is the number of features.

K is usually chosen as odd value.

z-scaling – Each feature has a mean of 0 and a standard deviation of 1.

kernel regression is a flavour of k-nearest neighbors where we weight the contribution of each neighbor, where as in k-nearest neighbor we just take the average.


Recommender systems (collaborative filtering).

Breast cancer diagnosis

Handwritten character classification

information retrieval


To Avoid over fitting:

Reduce number of features manually or do feature selection.

Do a model selection

Use regularization (keep the features but reduce their importance by setting small parameter values)

Do a cross validation to estimate the test error.

Training set used for learning the model.

Validation set is used to tune the parameters, usually to control overfitting.

Linear Models:

  1. Linear regression.
    1. Using least square loss
  2. Linear classification:
    1. Perceptron: Data has to be linearly separable.

Classification: Logistic regression

Decision Trees: Tree classifiers

Start with a set of all our examples at the top root node.

Create branches considering a splitting criteria and repeat this for creating other branches as you go down.

write the final decision at the bottom of the tree based on training examples and used this tree to classify any new test data.

First, we calculate entropy at a node and then compare info gains for extending that node for different dimensions. Then we choose to extend based on the category of high info gain.

Pruning strategies to escape overfitting:

  1. Stop growing the tree earlier.
  2. Grow a complex tree and then prune it back. Remove a subtree and check it against a validation set, if the performance doesn’t degrade, remove the subtree from original.

Cart method:

Another method to build a decision tree.

  1. Adopt same greedy, top-down algorithm.
  2. Binary splits instead of multiway splits.
  3. Uses Gini index instead of information entropy.


p-proportion of positive examples

P-proportion of negative examples.

Bayes Rule:

Bayes rule.PNG

Discriminative Algorithms:

Idea: Model p(y/x), conditional distribution of y given x.

Find a decision boundary that separates positive from negative example.

Generative Algorithms:

modelling for Bayes rule.PNG

Naive Bayes assumes that feature values are conditionally independent given the label.

Naive Bayes classifier is linear.

Naive Bayes.PNG

Ensemble Methods:

An Ensemble method combines the predictions of many individual classifiers by majority voting.

Such individual classifiers, called weak learners, are required to perform slightly better than random.

  1. Majority voting:
    1. Condorcet’s Jury theorem:
      1. Assumptions:
        1. Each individual makes the right choice with a probability p.
        2. The votes are independent.
      2. If p>0.5, then adding more voters increases the probability that the majority decision is correct. If p<0.5, then adding more voters makes things worse.
  2. Boosting:
    1. One of the most popular Machine learning methods.
    2. AdaBoost.M1-popular algorithm.
    3. Weak learners can be trees, perceptrons, decision stumps etc.
    4. The predictions of all weak learners are combined with a weighted majority voting.
    5. Idea: Train the weak learners on weighted training examples.AdaBoost.PNG
      1. AdaBoost with Decision stumps lead to a form of feature selection.
      2. Bootstrapping is a resampling technique for training weak learners from m samples of original training data.
    6. Iteratively give more weight on examples that are misclassified and less weight for examples that are classified properly.(Refer weight updation step d in above picture).
    7. Boosting.PNG
    8. Boosting tends to avoid overfitting because of increasing margin as you add more and more weak learners.
    9. Boosting tends to overfit if weak learners themselves are overfitting.
  3. Bagging:Bagging.PNG
    1. Bagging on trees works better than building a big tree on whole data.
  4. Random forests: Make a lot of decision trees with each tree trained on a random subset of training data and a random subset of all features considered. use the majority votes of different decision trees to determine the final outcome for a random forest. This is relying on the principle that majority of decision trees will compensate for some wrongly classifying decision trees at different instances.
    1. Quora uses random forests for determining whether two questions are duplicates.

A practical guide to applying Support Vector Classification:

Clustering examples:

  1. Clustering of the population by their demographics.
  2. Audio signal separation.
  3. Image segmentation.
  4. Clustering of geographic objects
  5. Clustering of stars.

Desirable clustering properties:

  1. Richness: For any assignment of objects to clusters, there is some distance matrix D such that a clustering scheme returns that clustering. For example, Stop when clusters are x units apart reached is rich because it can address all possible clusterings with variations of x. On the other hand, stop when a fixed number of clusters reached, is not rich because it can’t address all possible clusters.
  2. Scale invariance: Scaling distances by any value shouldn’t change the clustering of the objects.
  3. Consistency: Shrinking(making similar things more similar) intra cluster distances and expanding(making non-similar things more non-similar) inter clustering distances shouldn’t change the clustering.

Impossibility theorem: No clustering algorithm can achieve all three of richness, scale invariance, and consistency.


  1. Based on initial conditions of the centroid, you may end up with totally different clusters. To deal with this, we take the ensemble of clusterings with different initializations. n_init parameter determines that.
  2. sklearn.cluster.KMeans(n_clusters=8,max_iter=300, n_init=10).

Limitations of k-means:

  1. Output for any given training set would always be same as long as we do not randomly pick initial centroids. But we usually choose centroids randomly in the beginning.
  2. Hill climbing algorithm.

K-means fails for:


Solution for How to select K.

G-means:G-Means betterment of k-means.PNG

How to evaluate?

internal evaluation: High intra-cluster similarity, Low inter-cluster similarity.

External evaluation: Mutual information, entropy, adjusted random index.

Other clustering methods:

  1. Special clustering
  3. BIRCH

1.Single linkage clustering:

Keep on finding two closest points among all points and join them as long as you end up with k required clusters. This is a hierarchical agglomerative cluster structure.

single linkage cluster.PNG

Association Rules:

Supp(A->C): P(A^C)

Probabilistic interpretation:

R: A->C


Though this looks like a valid measure, it doesn’t take into account P(C).


interest(R)=1 then A and C are independent.

interest(R)>1 then A and C are positively dependent.



Collaborative filtering, customer behavior analysis, web organization, affinity promotion, cross-selling

Multidimensional rules:

Read more about Quantitative Association rules as well.


Care about goal rather than the path.


Three elements:

  1. A set of variables.
  2. A set of domains for each variable.
  3. A set of constraints that specify allowable combinations of values.

Types of consistency:

  1. Node consistency – unary constraints
  2. Arc-consistency (Constraint propagation)- binary constraints – For every value that can be assigned to X, if there exists a value that can be assigned to Y, then X->Y is arc consistent. AC-3 is the algorithm that makes a CSP arc consistent. O(n^2*d^3)
  3. Path-consistency – n-ary constraints

Forward checking propagates the information from assigned to unassigned variables only and does not check the interaction between unassigned variables. Arc consistency checks for all arcs and keep doing it whenever the domain of a variable is reduced. In this sense, AC is more general and an improvement of FC.

Two types of CSPs:

  1. Search based
    1. BFS
    2. DFS
    3. BTS
  2. Inference

Choose the one with minimum remaining values to fill first.

Fill with least constraining value first – the one that rules out fewest values in remaining variables.

Can work on independent subproblems separately based on the problem structure.


d-size of domain, n-number of variables.


Reinforcement learning:

Learning behaviour or actions based on the rewards/feedback from environment.

Science of sequential decision making.

  1. Markov decision process – Formal model for sequential decision making.
    1. Markov property – Future is independent of the past given the current state.
    2. Applying reinforcement framework to Markov process.
    3. Theorem – Due to the Markov property, A discounted MDP always has the stationary optimal policy (pi).
    4. A policy is better than another policy if it’s value is greater than the other one calculated from all starting states.
    5. One useful fact is there always exist an optimal policy which is better than all other policies.
    6. Bellman equations.
    7. No closed form solution to Bellman equations to solve for optimal policy.
    8. other iterative solution approaches are
      1. Policy iteration.
      2. Value iteration.
    9. Think of different policies to traverse through MDP. Calculate value of each policy.
    10. Q-value iteration.
    11. Epsilon greedy exploration – with probability epsilon, take random action, with probability (1-epsilon) , take greedy action.

Logical agents or knowledge-based agents:

Mainly based on knowledge representation(domain specific) and inference(domain independent).

  1. Using propositional logic.
  2. Using First – Order Logic – used in personalized assistants.

pros and cons of Logical agents:

  1. Doesn’t handle uncertainty, probability does.
  2. Rule-based doesn’t use data, ML does.
  3. It is hard to model every aspect of the world.
  4. Models are encoded explicitly.

Resolution algorithm

Natural Language Processing:

Text classification (spam filtering)- Naive Bayes classifier.

Sentiment analysis

Naive bayes classifier – Assumes features are independent.

Classifies the new example based on the probabilities of different classes given the features of that particular example.

m-estimate of the probability:

  1. Augment the sample size by m virtual examples, distributed according to prior p.

Chain rule of probability:

  1. To estimate joint probability

N-gram model: Look N-1 words in the past to predict next word.

Bigrams capture synctactic dependencies such as noun comes after eat and verb comes after to etc.

3-grams and 4-grams are common in practice.

Perplexity – Hiigher the conditional probability, lower the perplexity.

Great progress:

Tagged text – which word is what type – noun, adjective,verb

Name entity recognition – yesterday(time), I(person), five (quantity)

Good progress:

Parsing – Exhibit the grammatical structure of a sentence.

Sentiment analysis – Does it identify sarcasm?

Machine translation

Information extraction.

work in progress:

Text summarization


Dialog systems: Siri, echo etc.

Deep learning :

  1. Less or no feature engineering required.
  2. Need to choose number of layers in the stack and number of nodes in each layer.
  3. Needs to be sensitive to minute details to distinguish between two breeds of a dog and at the same time, it should be invariant to large irrevelant variations like background, lighting and pose etc.
  4. ReLU usually learns much faster in networks with many layers compared to sigmoid.
  5. Pre-training was only needed for small data sets with the revival of deep learning.
  6. Convnet combined with recurrent net to generate image captions. Vision to language interface.
  7. Handwritten characters, spoken words, faces were most succesfull applications.
  8. Theorem: No more than 2 hidden layers can represent any arbitrary region (assuming sufficient number of neurons or units).
  9. The number of neurons in each hidden layer are found by cross validation.
  10. The Mammalian visual cortex is hierarchical with 5-10 levels just for the visual system.
  11. Examples of Deep neural networks – Restricted Boltzmann machines, convolutional NN, autoencoders etc.
  12. Software for Deep learning:
    1. Tensorflow – Python
    2. Theano – Python
    3. Torch – LUA
    4. Caffe – Suited for vision


Path planning:

  1. Visibility graph: VGRAPH – A-star search through the nodes between obstacles. guaranteed to give shortest path in 2d. Doesn’t scale well to 3d. Path close to obstacles (nodes at vertices).
  2. Voronoi path planning:
  3. Potential field path planning: Consider both robot and obstacles as positive charges repelling each other and Goal as negative charge attracting the robot. Take a Random walk in case you reach a local minimum.
  4. Probabilistic roadmap planner: First needs to sample the space entirely and detect some collision free nodes. connect k nearest neighbor nodes. then do graph search to go from start to goal with these connected nodes. Can’t be sure that all nodes will be connected.


  • Discriminative models learn the (hard or soft) boundary between classes
  • Generative models model the distribution of individual class.
  • Confusion matrix.
  • ply – each move by one of the players.
  • policy – State to action mapping
  • stationary policy -Particular action for a particular state all the time.
  • Representation learning – automatically discovers representations required to classify or predict, Not much feature engineering required. Ex: Deep learning.
  • Saddle points/ minimax point – where derivative of the function is zero, but because of max in one ax and min in other axes. In the shape of the saddle on the horse.
  • Unsupervised pre-training – Making features before feeding into the learning network.
  • Configuration space(C – space): Set of parameters that completely describe the robot’s state. Proprioception.

CREDITS: Columbia University

Neural Networks


Image result for neuron

Inputs from the dendritic tree

Outputs at axon terminal.

The effectiveness of the synapse can be changed.

  1. vary the number of vesicles of the transmitter.
  2. vary the number of receptor molecules.

With more practice, Myelin sheath gets thicker and acts as a stronger insulator to reduce the loss of electrical signals during transmission.

Need to model Myelin sheath effect in neural networks to simulate addictive, habitual effects of human behavior. The parameter representing Myelin sheath should change with each instance of travel through that path.

If we introduce this effect in Multi layer neural networks after each feature learning with a different set of weights for the understanding importance of this particular feature in learning the output and train these weights in the same way as that of back propagation.

Models of the Neurones:

Binary threshold neurones:


Rectified linear neuron:

rectified linear neuron.png

Sigmoid Neurons:

sigmoid-neuron(differentiable)Image result for sigmoid equation

derivative flattens out for very large and very small x values.

Stochastic Binary Neurons:

The graph of this neuron is same except that the y-axis is the probability of output strength instead of original strength.

Different types of Machine Learning:

Supervised Learning:

  •  Learn to predict an output when given an input vector.
  • Two types:
    • Regression: when output is continuous data, like change in housing prices over the years
    • Classification: Outputs are discrete class labels.

Reinforcement learning:

  •  Learn to select an action to maximize payoff.
  • The goal in selecting each action is to maximize the expected sum of the future rewards.

Unsupervised learning:

  • Discover a good internal representation of the input.
  • Principal component analysis.
  • Clustering.

Types of Neural Network architectures:

Feed forward neural networks:


if we have more than one hidden layer, then that is called deep neural network.

Recurrent Networks:


These are difficult to train but biologically realistic.

Recurrent neural networks for modeling sequences:


To model sequential data.

They are able to remember information in their hidden state for a long time.

One of the applications developed by IIya sutskever is predicting the next character in a sequence.

Symmetrically connected networks:

recurrent nets with same weights in both directions between two nodes.

Symmetrically connected networks with hidden units(Boltzmann machines):



The objective is to choose weights and bias value so that it can rightly classify the classes of our requirement.


If the output unit is correct, don’t change the weights.

if the output unit is zero instead of one, add input vector to weight vector.

if the output unit is one instead of zero, subtract input vector to weight vector.

The limitations of perceptrons:

Can only learn linear boundaries.

XOR gate can’t be trained by perception.

Minsky and Papert’s group invariance theorem.

If it is nonlinear, number iterations to converge doesn’t end and keeps on going.

Human coded feature detection is the key part of pattern recognition but not the learning procedure.

The long term conclusion of the study on Perceptrons is neural networks without hidden layers are very limited or needs to be fed with features for the proper result on complicated pattern recognition. The presence of hidden layers can learn features themselves if we can find a way to update weights across all layers appropriately.

The backpropagation learning(overkill of chain rule):

In linear neurons:


Iterative method:

Not efficient but generalizable.

To appropriately modify a particular weight, we first calculate, rate of change of error across all training cases with respect to change in this weight. we use this quantity and a learning rate that we define to calculate the change of that weight.


In delta-rule, we increment or decrement the weight vector by the input vector scaled by the residual error and the learning rate.

Convergence of weights depends upon the correlation between input dimensions. It is hard to decide upon the weights (wi), when corresponding inpurs (xi) are same and highly correlated.

Online learning: With delta rule, you don’t need to collect all the training cases and then train them. We can train the network with one training example at a time as we get them.

For linear neuron, error surface looks like a quadratic bowl.Weights on horizontal axes and error on vertical axes.

Steepest descent: It is not effective in cases of elongated surfaces.

Non-linear neuron: output as logistic function of logit i.e . X*W+b.

Backpropogation comes into picture when we need to learn the weights of the hidden units.

The main objective is to find rate of change of Error w.r.t change of a particular weight (w_ij) in hidden unit for anyone training case.

This quantity can be expressed by chain rule as derivative of logit w.r.t weight multiplied by derivative of error w.r.t logit.

Derivative of error w.r.t logit is derivative of a particular output unit w.r.t logit multiplied by derivative of error w.r.t particular output unit.

Optimisation issues: How do we use the error derivatives on individual cases to discover a good set of weights?

How often?

  1. Online – good when there is redundancy in data.
  2. Full batch
  3. Mini-batch

How much?

  1. fixed learning rate
  2. adapt the global learning rate.
  3. adapt the learning rate on each connection separately?
  4. Don’t use the steepest descent?

Generalization issues: How do we ensure that learned weights work for cases for which we have not trained them as well?

Ways to reduce overfitting:

  1. Weight decay.
  2. Weight sharing.
  3. Early stopping.
  4. Model averaging.
  5. Bayesian fitting of neural nets.
  6. Dropout.
  7. Generative pre-training.
  8. Cross-validation

For logistic neurons, we have dy/dz=y*(1-y) term in residual error while calculating dw for learning. In cases of y=0.000001 and target  output is 1,  we are having the biggest error we can have. yet, our learning would be very less because of y term in dy/dz.


If we use softmax function instead of logistic function at the output, we will have outputs as probability distribution between 0 and 1 over mutually exclusive alternatives. Probability distribution for all the classes would sum up to be 1.


Cross entropy cost function:



Object recognition:

  1. Why object recognition is difficult?
    1. Objects defined based on purpose rather than its look or structure. We need to co-ordinate with module 3 and 7 to overcome this.
    2. We need to identify the object even though the viewpoint of it changes. When viewpoint changes, we have this problem of dimension hopping in training neural networks to recognize the object.Usually, inputs of neural networks for image recognition would be pixels, but when viewpoint changes, the input at one pixel at one training instance will be same at another pixel during different training instance. This is dimension hopping.#viewpoint_invariance.
  2. Solutions to achieve view point invariance:
    1. redundant invariant features:
    2. Put a box around the object.
    3. Convolutional neural nets:
    4. Use hierarchy of parts that have explicit poses relative to the camera.

Mini-batch learning:

  1. It is better than online learning where you update the weights for each case, because of the computational efficiency in dealing with multiple training cases at once by matrix manipulations.

Initializing weights:

  1. To break symmetry, initialize weights to random values.
  2. If you start with a very big learning rate, weights will become very large or very small. Error derivative will become tiny and you might confuse plateau with local minimum.

Use principal component analysis to decorrelate inputs. This achieves some dimensionality reduction after removing the components with small eigen values.

Stochastic gradient descent with mini-batches comined with momentum is the most common recipe used for learning big neural nets.

when the gradient descent gets stuck in local optima, we might need to seek out for other advanced methods like:

  1. Momentum in gradient:
  2. higher order derivatives.
  3. randomized optimization – initialize weights with random values.
  4. the penalty for complexity (overfitting) – by reducing layers, by reducing nodes, by reducing the range of weights.

All things being equal, simpler explanation is preferred – similar to Occam’s razor

My Thoughts:

  1. Usually, we design the hierarchical structure of layers of neurons for learning a particular task, classification or prediction. But, Brain has already evolved some hierarchical neural network structures which are very good at what they do. What if we try to emulate this evolution by applying genetic algorithms to neurons and their connections to see whether we can come up with a beautiful, stable structure of neurons that is efficient in learning lots of tasks. What should be the fitness function to validate our network structures at any point of simulation?
  2. For object recognition, instead of training a network to classify a specific set of objects, can we train a network just to classify all the different objects spatially, even though it doesn’t know and not yet trained to name or a class of what it is? For every new image, it just has to be able to differentiate all the elements in the object irrespective of whether they have encountered them in previous training sets or not. If you show humans an object in an image, even though they haven’t seen that object before, they identify as a unique object, whatever it may be.


  •  Model class (f) : The function used to map inputs to outputs like models of neurones discussed above.
  • Binary representation:  Many neurones are involved in representing one concept and one neurone is involved in representing many concepts.
  • Bottle-neck layers: The layer for which the number of nodes is less than the input nodes.
  • Drop-out: Half of the hidden units in a layer are randomly removed, Regularization technique.
  • Fan-in: It’s number of inputs.

A directed graph, the neural network itself can be an output of a neural network.


  1. How to apply Neural networks to time series data? With recurrent network.