The Master Algorithm







No free lunch theorem.

Curses of machine learning:

1. Overfitting.

2. High dimensionality – Often you need to learn from data of very high dimension: the pixels in an image, the clicks or ratings that reveal your preferences, the words on a page, etc. Most of these dimensions or attributes are irrelevant to what you are currently trying to learn, and taking them into account can make predictions worse. We don't have an automated way of finding which attributes are relevant to a particular learning task. This hurts nearest-neighbor methods in particular.

As the number of features/dimensions grows, the amount of data we need to generalize accurately grows exponentially.

Rationalism vs empiricism

How can we ever be justified in generalizing from what we have seen to what we haven't?


Credit assignment problem

Boltzmann machine.

Backpropagation algorithm. Doesn't know whether it has reached the global optimum.

Gradient ascent and descent.

Autoencoder: a multilayer perceptron whose output is the same as its input, with the hidden layer made much smaller than the input and output layers so it must learn a compressed encoding.

Stacked sparse autoencoders learn high-level concepts like faces from low-level concepts like edges and shades, hierarchically.

Human intelligence boils down to a single algorithm. – Andrew Ng

Convolutional neural networks are modeled on the visual cortex.

Optimal learning is the Bayesians' central goal.

Laplace’s Rule of Succession:

The probability that an event will occur after it has occurred n times successively

= \frac{n+1}{n+2}
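As a quick sanity check, the rule is easy to compute exactly (a minimal sketch, not from the book):

```python
from fractions import Fraction

def rule_of_succession(n):
    # Laplace's estimate that an event occurs again after n successive occurrences.
    return Fraction(n + 1, n + 2)

print(rule_of_succession(0))   # 1/2 -- one success and one failure assumed a priori
print(rule_of_succession(9))   # 10/11
```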

A controversy in the definition of probability:

  1. Prior probability is a subjective degree of belief.
  2. Probability is the frequency with which a subset event occurs in the sample space.

Bayesian learning is computationally costly.

A learner that assumes different effects are independent given the cause is called the Naive-Bayes classifier.
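A minimal sketch of that independence assumption, with made-up symptom data (the feature values and counts are hypothetical, and Laplace smoothing is used to avoid zero counts):

```python
from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_tuple, label) pairs."""
    label_counts = defaultdict(int)
    feat_counts = defaultdict(int)   # (label, position, value) -> count
    for feats, label in examples:
        label_counts[label] += 1
        for i, v in enumerate(feats):
            feat_counts[(label, i, v)] += 1
    return label_counts, feat_counts

def predict(model, feats):
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    best, best_p = None, -1.0
    for label, lc in label_counts.items():
        p = lc / total                       # prior P(cause)
        for i, v in enumerate(feats):        # independent effects given the cause
            p *= (feat_counts[(label, i, v)] + 1) / (lc + 2)  # Laplace smoothing
        if p > best_p:
            best, best_p = label, p
    return best

data = [(("sneeze", "fever"), "flu"),
        (("sneeze", "no_fever"), "cold"),
        (("no_sneeze", "fever"), "flu"),
        (("sneeze", "fever"), "flu")]
model = train_naive_bayes(data)
print(predict(model, ("sneeze", "fever")))  # flu
```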

PageRank uses the idea of a Markov chain: web pages with many incoming links are probably more important than pages with few.

Hidden Markov models are used for inference in speech recognition.

The continuous version of the HMM is the Kalman filter.

Bayesian networks

Markov chain Monte Carlo is used to make the distributions of a Bayesian network converge.

MCMC is a random walk on a Markov chain; in the long run, the number of times each state is visited is proportional to its probability.
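A toy illustration of that long-run property, using a hypothetical 2-state chain whose stationary distribution works out to (5/6, 1/6); the seed is fixed so the run is reproducible:

```python
import random

random.seed(0)
# Transition probabilities of a made-up 2-state Markov chain.
P = {0: [(0, 0.9), (1, 0.1)],
     1: [(0, 0.5), (1, 0.5)]}

def step(state):
    r = random.random()
    cum = 0.0
    for nxt, p in P[state]:
        cum += p
        if r < cum:
            return nxt
    return state

visits = [0, 0]
state = 0
for _ in range(100_000):
    state = step(state)
    visits[state] += 1

# Visit frequencies approach the stationary distribution (5/6, 1/6).
freq0 = visits[0] / sum(visits)
print(round(freq0, 2))  # ~0.83
```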

Applying probability to medical diagnosis.


Markov networks


Analogizers can work with less data.

They don’t form a model.

Analogy is behind many scientific advances.

If two things are similar, the thought of one will tend to trigger the thought of the other. – Aristotle

Algorithms in the analogy domain:

1. K-nearest neighbor (weighted)

Doesn't work well with lots of dimensions (hyperspace).

Can't identify relevant attributes.

Discovering the "blanket" (a low-dimensional surface the data lies on) folded up in hyperspace.
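A bare-bones nearest-neighbor sketch (unweighted, with made-up 2-D points) showing the basic vote:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label); query: a point. Plain (unweighted) k-NN."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.5, 0.5)))  # a
print(knn_predict(train, (5.5, 5.5)))  # b
```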

2. Support vector machines

Weights have a single optimum instead of many local ones – an advantage over the multilayer perceptron.

Only one layer.

Extends the data into new dimensions (where the classes become separable).

3. Full blown analogical reasoning.
Structure mapping
Learning cross problem domains.


An algorithm that could spontaneously group together similar objects or different images of the same object – clustering problem.

K-means, naive Bayes, EM algorithms.


Dimensionality reduction


Isomap – for nonlinear dimensionality reduction.


Reinforcement learning – optimal control in an unknown environment.

DeepMind sits at the intersection of reinforcement learning and multilayer perceptrons.


Meta learning: combining all learning algorithms.

Stacking – a metalearner

Bagging – generate multiple training sets by random sampling with replacement from the original one, and apply the learning algorithm to each. This decreases variance and increases accuracy.

Boosting – a meta learner.

The Master Algorithm resides in the circles of Optimization Town, among the towers of representation and evaluation.

Optimization techniques in different tribes:

Inverse deduction – symbolists

Gradient descent – connectionist

Genetic search ( cross over and mutation) – evolutionaries

Constrained optimization – analogizers

What are to be combined?

Decision trees, multilayer perceptrons, classifier systems, naive Bayes, SVMs.

Popular lines.

The most important thing in an equation is all the quantities that don’t appear in it.

The universe maximises entropy subject to keeping energy constant.

PCA is to unsupervised learning what linear regression is to supervised learning.

The five most important personality traits to look for: extroversion, agreeableness, conscientiousness, neuroticism, and openness to experience.

The law of effect

Children explore and adults exploit.

Snippets of reinforcement learning, also known as habits, make up most of what you do.

You don't try to outrun a horse, you ride it. It's not computers vs. humans, it's humans with computers vs. humans without computers.

Whatever is true of everything we have seen is true of everything in the universe – Newton

Time is the principal component of memory.


Fractal dimension, roughness

The scaling factor(SF):

Scale a shape to one-half (or any other fraction) of its original size.

The mass scaling factor(MSF):

Find the ratio of the mass of the new shape to the original shape, or the ratio of the number of pixels covered by the new shape to the original.


The two are related by MSF = SF^D, where D is the fractal dimension. This is used as a measure of the roughness of the shape.

Usually, natural objects have fractional (fractal) dimensions, while man-made objects have integer dimensions.
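Putting SF and MSF together, a small sketch (the Sierpinski triangle numbers are standard, not from the text):

```python
import math

def fractal_dimension(sf, msf):
    """Mass scales as (scaling factor)**D, so D = log(MSF) / log(SF)."""
    return math.log(msf) / math.log(sf)

# Sierpinski triangle: halving the shape (SF = 1/2) leaves a piece
# with 1/3 of the mass (MSF = 1/3).
print(round(fractal_dimension(0.5, 1/3), 3))  # 1.585
# A filled square: halving gives 1/4 of the mass -> integer dimension 2.
print(fractal_dimension(0.5, 0.25))           # 2.0
```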


Techniques in Math

When you see a curve that looks like a power law and you don't know the exponent, take the logarithm on both sides and plot: the slope of the resulting straight line is your exponent.
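A sketch of this trick with synthetic data generated from y = 3·x^2.5, pretending the exponent is unknown:

```python
import math

xs = [1, 2, 4, 8, 16]
ys = [3 * x**2.5 for x in xs]   # hypothetical measurements

# Take logs of both sides: log y = log 3 + 2.5 * log x,
# a straight line whose slope is the exponent.
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# Least-squares slope of the line through (lx, ly).
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / \
        sum((a - mx) ** 2 for a in lx)
print(round(slope, 3))  # 2.5
```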

The logarithms of a number in two different bases differ by a constant factor: log_a(x) = log_b(x) / log_b(a).

If we want to find the extremum of a function subject to constraints, Lagrange multipliers are useful. They let us fold the constraints and the function into a new function, so we no longer need to think about the constraints: finding the extremum of this new function suffices.
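A standard textbook example (not from the text): maximize f(x, y) = xy subject to x + y = 1.

```latex
\text{Maximize } f(x,y) = xy \quad \text{subject to } g(x,y) = x + y - 1 = 0.

\Lambda(x, y, \lambda) = xy - \lambda\,(x + y - 1)

\frac{\partial \Lambda}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \Lambda}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \Lambda}{\partial \lambda} = -(x + y - 1) = 0

\Rightarrow\ x = y = \lambda = \tfrac{1}{2}, \qquad f_{\max} = \tfrac{1}{4}.
```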

Architecture for AGI


This article mainly runs along the following dimensions.

  1. Perceiving the state of the world and state of yourself in the world.
    1. The power of discriminating and encapsulating different material and immaterial concepts, which are often indefinitely defined, fuzzy, context-based, and have no rigid boundaries.
      1. Natural language processing.
      2. Abstraction – the clustering problem. Identifying the level of clustering needed for the current problem domain; the importance of the similarity measure in clustering. A rudimentary kind of clustering groups objects by how they are useful to us: things to eat clustered as one, things to talk to, things to use as tools, or which body part to use for which object. Relevant algorithms: k-means; naive Bayes (inferring classes from the models and models from the classes); EM.
    2. pattern recognition.
    3. proprioception and self-referential systems.
    4. object recognition (
    5. Donald Hoffman's theory of perception: perceptual systems tuned to fitness outcompete those that are tuned to truth/reality.
  2. Goals and priorities in them.
    1. Switching between the goals based on the timeframes and dependencies for each goal.
    2. Inbuilt goals like getting sex, food, and staying alive; goals we choose for ourselves through culture and experience; and the interaction between the two kinds. Ethical issues of giving AI systems the power to create their own goals.
    3. Decision-making.
  3. Relevance measure for objects perceived in the world to the current goal.
    1. Prioritising perception or attention based on relevance to goal.
    2. Adaptivity of attention based on the skill level(acquired by repetition of the same process)
  4. Identifying actions or a sequence of actions to impact objects for achieving our goals.(planning)
    1. The element of stochasticity or randomness in finding the sequence is necessary to be creative when the sequence is unknown or not obvious.
    2. This can happen from analogies, empathy and copying, pattern recognition.
    3. Considering optimisation in face of multiple paths to a goal.
    4. related to imagination.
  5. A parallel Emotion machine with continuous feedback of utility or happiness on your present state in the world.
    1. Decision-making.
    2. operant conditioning.
  6. Platform knowledge: what information do we have at our birth about the world we are surrounded by and how to build it for a machine?
    1. Evolution.
    2. Jealous baby.
    3. A young baby doesn’t get surprised if a teddy bear passes behind a screen and reemerges as an aeroplane, but a one-year-old does.
    4. No categorical classification at the very beginning.
    5. As a child grows, the number of neurones decreases but the number of synapses increases.
  7. Knowledge representation, search and weights for retention(what you don’t use, you lose)
    1. Brain plasticity.
    2. Repetition and skill.
    3. Priming and classical conditioning.
    4. The DRM effect
    5. Do elements with similar structure trigger each other? (Hofstadter's Danny in the Grand Canyon); pattern recognition by analogy.
    6. Reconstruction of data into memory in the hippocampus, integrated with already existing memory.
    7. Procedural, episodic, semantic.
    8. Representation of ideas that could be independent of language, which could allow for generalized reasoning to draw conclusions from combining ideas.
  8. Imagination at the core of knowledge, emotion machine and goals.
    1. Demis Hassabis's research on imagination and memory.

What this article doesn’t talk about is:

  1. Whether a machine can have qualia even though it acts as if it has general intelligence (Mary's room). How do we test qualia in a machine? Is the Turing test enough for that?
  2. When did consciousness first evolve, and how?
  3. Moral implications of AGI and singularity.

Organising the interaction of all the above:

Why do I think I am talking about AGI? Because we should be able to deduce any human endeavour from the interactions of the above dimensions, at least from sufficiently long interactions and their orderings. Connections missed in the diagram below:

  1. Goals to emotion machine.


Deep dive (one at a time):

    1. Why is object recognition difficult?
      1. Objects are defined by their purpose rather than their look or structure. We need to coordinate with modules 3 and 7 to overcome this.
      2. We need to identify an object even when our viewpoint of it changes. Viewpoint changes cause the problem of dimension hopping when training neural networks to recognize objects: the inputs are usually pixels, but when the viewpoint changes, the information that appeared at one pixel in one training instance appears at a different pixel in another. #viewpoint_invariance
    2. True language understanding is impossible without internally modeling the world.
  1.  Goals and priorities
  2. Relevance
  3. Planning (identifying action sequences):
  4. Emotion machine and utility:
  5. Platform knowledge: Jean Piaget, a child development psychologist, argued that children create a mental model of the world around them and continuously modify that model as they interact with reality and receive feedback. He called this phenomenon equilibration.
    1. Infant reflexes .
    2. A child's brain is less focused, with a higher learning rate across a multitude of things, whereas an adult's is more focused, with better self-control but a lower learning rate.
    3. It seems true that humans come into the world with some innate abilities. Some examples from Steven Pinker:
      1. When a human and a pet are both exposed to speech, the human acquires the language whereas the pet doesn't, presumably because of some innate difference between them.
      2. Sexual preferences of men and women vary.
      3. Experimental studies on identical twins who are separated at birth and examined at later stages of life showing astonishing similarities.
      4. Do we have moral sense at birth?
    4. The above discusses some facts about human nature at birth, but our main focus is: is there a seed meta-program that we are endowed with at birth which makes all other learning and behavior possible? If it exists, what is it? Systems of this kind are called constructivist systems (Thórisson).

Existing Cognitive Architectures: Two types

  1. Uniformity first:
    1. Soar Architecture
  2. Diversity first:
    1. ACT-R : intelligent tutoring systems are its application. John Anderson.

Autocatalytic Endogenous Reflective Architecture (AERA)

Sigma architecture:


DeepMind (Demis Hassabis):

Deep learning + Reinforcement learning


  1. Art is partly skill and partly stochastic.
  2. Poetry at the intersection of analogy and utility.
  3. How much does general and/or human intelligence have to do with reason? If it is fully correlated with reason, and morality is built on reason alone, then we and AGI have no conflict of interest, since more intelligence only means more morality. #moralitynintelligence
  4. Value system guiding attention, attention setting the problem domain, problem domain looking for heuristics, heuristics guiding the search path.


  1. Distributed representation: many neurones are involved in representing one concept, and one neurone is involved in representing many concepts.

Robust software development process

At scratch:

Document the requirements.

Divide and conquer: Break the implementation of the project into the small chunks of independent interacting software modules.

Design the proper architecture for these interacting modules.

Document the design.


  1. Build a module.
  2. write any simulators needed to give it input and see the output (optional).
  3. Test the module completely.

Repeat above three steps until all the modules are completed.

Put modules together in a proper interaction flow.

Write test cases and automate testing.


Read new requirements and add them to the requirements document; track and identify any reported scenarios as either bugs or unexpected behaviour (requiring new implementation).

Implement the requirement and test for implementation.

Do regression testing and update the regression test suite with the test cases of new implementation and modify the already existing test cases if necessary as per the requirement.

The process is complete.

Let's talk about the main elements of debugging for a developer:

  1. search (ctrl+f).
  2. breakpoints.
  3. call stack.
  4. call hierarchy.
  5. watch.
  6. data breakpoints and conditional breakpoints.
  7. logs.

Learn system design:


Sorting a list of tuples by multiple conditions:

sorted_by_length = sorted(list_,
                         key=lambda x: (x[0], len(x[1]), float(x[1])))

To get the lower bound integer in python:

int(1.6) == 1 (int() truncates toward zero)
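Note that truncation toward zero equals the lower bound only for non-negative numbers; math.floor always gives the lower bound:

```python
import math

print(int(1.6))          # 1
print(int(-1.6))         # -1, not -2: int() truncates toward zero
print(math.floor(-1.6))  # -2: floor always rounds down
```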



def function_name(i):


Write a Python file with the .py extension and add definitions and statements to it.

Use 'import filename' in another file.

Access the definitions inside that file using filename.nameofdef.


Concatenating to a tuple: tup = tup + (a[i],)

list1 = [] – an ordered sequence.


Can't mutate a string: item assignment raises an error (strings are immutable).

Aliasing problem in Python:

If you have an object, say a list, bound to the name list1, and you write list2 = list1, then list2 points to the same object as list1.

If you then add an element through list2, the change is reflected in list1 as well, which you might not have intended.

So how do you take a copy of list1 so that changes to the copy have no impact on list1?

list2 = list1[:]              cloned

Sort the original list in place -> list1.sort()

Take a sorted copy of the original list into another object: list2 = sorted(list1)

Beware: [[1, 2]] * 3 makes all three inner lists refer to the same object. This doesn't match intuition at first.
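The aliasing notes above, as a runnable sketch (the values are arbitrary):

```python
list1 = [1, 3, 2]
list2 = list1        # alias: both names point at the same object
list2.append(4)
print(list1)         # [1, 3, 2, 4] -- the "copy" changed too

list3 = list1[:]     # clone: a new list with the same elements
list3.append(5)
print(list1)         # [1, 3, 2, 4] -- unchanged this time

# The same trap with repetition: all three inner lists are one object.
grid = [[1, 2]] * 3
grid[0].append(9)
print(grid)          # [[1, 2, 9], [1, 2, 9], [1, 2, 9]]
```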

Tuples and strings don't support item assignment (e.g. s[1] = 5); only lists do.

Functions as objects:

sending functions as arguments to other functions.

Higher order programming: Map

Dictionaries: collections of key-value pairs.


'Vivek' in my_dict  -> True




Memoization: storing already-computed values in a dictionary and looking them up when we need them again. This saves us from recomputing something that has already been computed.
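A classic memoization sketch, Fibonacci with a dictionary cache:

```python
def fib(n, cache={}):
    """Naive recursion plus a dictionary of already-computed values."""
    if n in cache:
        return cache[n]
    result = n if n < 2 else fib(n - 1) + fib(n - 2)
    cache[n] = result
    return result

print(fib(10))  # 55
print(fib(50))  # 12586269025 -- instant instead of exponential time
```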


  1. Testing:
    1. Unit testing-
      1. validate each piece of the program.
      2. testing each function separately.
    2. Regression testing-
      1. After each fix, retest to confirm that already-tested modules are not broken by the latest fix.
    3. Integration testing-
      1. Does overall program work?
    4. Black box testing-
      1. without looking at the code.
      2. done by someone other than the implementer, to avoid implementer biases.
    5. Glass box testing:
      1. Path complete.
      2. Design of test cases based on code.
  2. Defensive programming:
    1. specifications for functions.
    2. Modularize programs.
    3. check conditions on inputs/outputs.
  3. Eliminate the source of bugs:





raise Exception('message')        to signal an error explicitly.

assert condition                  to check whether the assumptions are met or not.

classes and objects:

Class data attributes are associated with the class itself, not with an instance/object.

Generator: a function containing yield. It runs up to the yield and stops; calling funct.__next__() again runs up to the next yield and stops.
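A minimal generator sketch:

```python
def count_up():
    yield 1      # execution runs up to here, then pauses
    yield 2      # the next call resumes here
    yield 3

g = count_up()
print(g.__next__())  # 1
print(next(g))       # 2 -- next(g) is the idiomatic spelling
print(next(g))       # 3
```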

Lisp and python programming languages allow you to implement reflection.

Using Pylab:

import pylab as plt

plt.figure('figurename') – to plot in a new figure

plt.xlabel('sample points')

plt.ylabel('linear function')

plt.plot(list_of_x_values, list_of_y_values, 'b-', label='linear')

model = pylab.polyfit(observedX, observedY, n)

Finds the coefficients of a polynomial of degree n that provides the best least-squares fit for the observed data. It returns n+1 values.
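For the degree-1 case, polyfit reduces to ordinary least squares for a line; a pure-Python sketch of that special case (the data points are made up):

```python
# Hypothetical observations, roughly following y = 2x + 1.
observedX = [0, 1, 2, 3, 4]
observedY = [1.1, 2.9, 5.2, 7.1, 8.8]

n = len(observedX)
mx = sum(observedX) / n
my = sum(observedY) / n

# Ordinary least-squares line: slope and intercept,
# the two values polyfit would return for degree 1.
slope = sum((x - mx) * (y - my) for x, y in zip(observedX, observedY)) / \
        sum((x - mx) ** 2 for x in observedX)
intercept = my - slope * mx
print(round(slope, 2), round(intercept, 2))  # 1.96 1.1
```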



Anonymous function – lambda:

f1 = lambda x, y: x + y


Credits: Jake Vanderplas

An abstract class in python can’t be instantiated.



Using both Python 2.7 and 3.5 in different environments:

They won't act weird if you do it in a TensorFlow environment. Follow these steps:

Install Python 3.5.

Then you can use Python 3.5 in a new conda environment called "tensorflow":

conda create --name tensorflow python=3.5
activate tensorflow
conda install jupyter
conda install scipy
pip install tensorflow
pip install tensorflow-gpu

You can also uninstall Python 2.7, but if you follow the steps above you can keep both Python 2.7 and 3.5; they won't act weird. In this case Anaconda defaults to Python 2.7, but the tensorflow environment will run on Python 3.5.

Viewing Keras models:

import os
os.environ["PATH"] += os.pathsep + 'C:/Anaconda2/envs/tensorflow/Library/bin/graphviz'

from keras.utils.vis_utils import plot_model
plot_model(model, to_file='model.png')


Statistics: the science and art of collecting, analyzing, and interpreting data.

Qualitative data:




Quantitative data:




Summarizing data (Summary statistics):

  1. by a typical value (Averages)
    1. Median – more robust than the mean: one extreme element in the data can have a large impact on the mean but not a significant impact on the median.
    2. Mean

The mean is affected by outliers(skewing) while the median isn’t

The average doesn’t efficiently summarize the data set in case of bimodal distribution.

  2. How different the values of the data set are from this typical value (variability):
    1. Range (max − min) – outliers have a big impact.
    2. Interquartile range – resistant to outliers.
    3. Variance and standard deviation.

Visualizing data:

Law of Large numbers/Bernoulli’s law:


For a random variable, if you take a large number of samples and average them, then as the number of samples n tends to infinity, the average approaches the expectation of that random variable.
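A quick simulation sketch (fair coin flips, with a fixed seed so it is reproducible):

```python
import random

random.seed(42)

def average_of_flips(n):
    """Average of n fair-coin flips (1 = heads); the expectation is 0.5."""
    return sum(random.randint(0, 1) for _ in range(n)) / n

small = average_of_flips(10)
large = average_of_flips(100_000)
# The large-sample average lands very close to the expectation 0.5.
print(small, large)
```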

Gambler’s fallacy (wrong):

The law of large numbers doesn't imply that if a deviation from expected behavior occurs, these deviations will be evened out by opposite occurrences in the future.

Example: He is due for a hit because he hasn’t had any.

Regression to the mean (correct):

Following an extreme random event, the next random event is likely to be less extreme.

Central Limit Theorem:

You have a random variable with a random distribution.

Pick n samples from that random distribution and average them.

Do this multiple times and plot those averages of n samples of original distribution.

This new plot/ distribution will be Normal distribution irrespective of the shape of the original distribution.

If you increase the sample size n, the variance of the new distribution decreases by a factor of n: it equals σ²/n, where σ² is the variance of the original distribution.
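A simulation sketch of the variance claim, averaging n = 25 uniform draws (Uniform(0, 1) has variance 1/12; the seed is fixed for reproducibility):

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    """Mean of n draws from Uniform(0, 1); its variance should be (1/12)/n."""
    return sum(random.random() for _ in range(n)) / n

n = 25
means = [sample_mean(n) for _ in range(20_000)]
var_of_means = statistics.pvariance(means)
print(round(var_of_means, 5))  # close to (1/12)/25, about 0.00333
```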

Monte Carlo simulation:

Stratified sampling:

Bigger sample size -> smaller standard deviation -> smaller confidence interval -> narrower error bar.


Statistically significant:

When confidence intervals don’t overlap.

When you conduct an experiment, you may come up with a hypothesis from the data. How do you know this result or hypothesis didn't come about just by chance?

This is what we do to find out that.

We form a null hypothesis, under which the intended element has no effect. Then we simulate the experiment under the null hypothesis many times and plot the results.


What is the probability of the original hypothesis's results appearing in this null-hypothesis data/plot? If the original result falls outside the 95% confidence interval of the null-hypothesis plot, then we can say the experiment is statistically significant and favor the original hypothesis that the element does have an effect.

Standard error:

The standard deviation of the sampling distribution of the sample mean (the sampling distribution of the mean) is also called the standard error of the mean.

Always take independent random samples.

Statistical fallacies and morals:

Statistics about the data is not the same as data.

Use visualization tools to look at the data.

Look at the axes' labels and scales.

Are things being compared really comparable?

Covariance: Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive.

The sign of the covariance, therefore, shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
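These definitions in a short pure-Python sketch (population covariance, with made-up data):

```python
def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance normalized by both standard deviations: always in [-1, 1]."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]        # ys = 2 * xs: a perfect linear relation
print(covariance(xs, ys))    # positive, but its magnitude depends on the units
print(correlation(xs, ys))   # 1.0 regardless of units
```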