Induction is key for AGI

Induction -> models -> prediction -> decision -> action

For example, finding a number in a telephone directory:

Induce from the data that numbers are sorted alphabetically by name.

Model (imagine) a straight-line distribution over all the data.

Predict in which part of the book the number can be found.

Decide and act: open the book at that part.

Infer again from the new data and update predictions, decisions, and actions accordingly.
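This induce-model-predict-act loop is essentially interpolation search. A minimal sketch in Python (the function name and data are illustrative):

```python
def interpolation_search(sorted_keys, target):
    """Find target in a sorted list by predicting its position."""
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi and sorted_keys[lo] <= target <= sorted_keys[hi]:
        if sorted_keys[hi] == sorted_keys[lo]:
            return lo  # remaining keys are identical, and they equal target
        # Model the keys as a straight line and predict a position.
        pos = lo + (hi - lo) * (target - sorted_keys[lo]) // (
            sorted_keys[hi] - sorted_keys[lo]
        )
        if sorted_keys[pos] == target:
            return pos  # prediction confirmed
        # Update with the new information and search again.
        if sorted_keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1  # not in the directory

print(interpolation_search([10, 20, 30, 40, 50], 40))  # -> 3
```

Each probe is a prediction from the straight-line model; each comparison is new information that narrows the search range.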


It is not necessary for a theory to be computable in general, at least in most of the sciences. But since artificial intelligence sits within computer science, most AI researchers look for a theory of AI from the perspective of its computability. Marcus Hutter suggests that at least a small group of researchers should look for a theory of intelligence without considering computational resource constraints; once such a theory is found, it can be approximated down to something computable.

MML (Minimum Message Length) principle

Solomonoff’s theory of induction – Algorithmic probability and Kolmogorov complexity.


Which ML technique to use for what problem.

Microsoft Azure Machine Learning Algorithm Cheat Sheet



Use the DeepDive framework for extracting structured data from unstructured data.

Use bagging or boosting techniques for classification. Try XGBoost, my personal favorite.

Computational combinatorics vs heuristics

In most AI algorithms, you enumerate and search through all the possible states, checking whether you have reached the goal state or whether particular constraints are satisfied for different combinations of states. This looks like brute force and doesn't seem intelligent, though it can solve a huge number of problems thanks to the computational power we are endowed with. Still, there is a hard upper bound on the kinds of problems we can tackle this way because of computational limitations.

To make the search more efficient and more directed, we build human-thought-up heuristics into the search algorithms to guide them. This is the crucial part of human-machine interaction: leveraging computational power and accuracy together with uniquely human heuristic intuition.

But the main question is: could a computer ever think up its own heuristics, appropriate to the problem domain, to guide its search?

Another question: how do we come up with heuristics in the first place? Do we develop them by copying the practices of our peers and mentors through experience in the domain? Or did we develop an innate sense of what to do in regular day-to-day settings over millions of years of evolution? Is the ability to think up the right heuristics what true intelligence is?

Introduction to Computational thinking and Data science

Credits: MIT 6002

Computational models:

  1. Optimization models
  2. Statistical models.
  3. Simulation models.

Optimization models:

The itertools module is an extremely useful toolkit for optimization problems.

The general form of an optimization model:

Maximize or minimize an objective function,

subject to a set of constraints.

An example is the knapsack problem; its two variants are:

  1. 0/1 – Discrete items: you either take an item whole or not at all, like gold bars which can't be split.
  2. Continuous – Items like rice or milk that can be measured on a continuous scale, so you can take fractions.

Formalizing the discrete (0/1) knapsack problem:

Maximize sum(x_i * v_i) over all items i, where x_i is 1 if item i is taken and 0 otherwise and v_i is its value, subject to the constraint that sum(x_i * w_i) <= W, where w_i is the item's weight and W is the maximum weight the knapsack can hold.


If we take the brute-force approach of listing all possible combinations (the power set), filtering those that satisfy the constraint, and choosing the one with the maximum value, it is computationally very expensive: computing the power set itself is exponential, O(2^n).

Indeed, if you want the provably optimal solution to the knapsack problem, the worst-case complexity is inherently exponential.
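A brute-force sketch using itertools to enumerate the power set (the item names, values, and weights are made up):

```python
from itertools import combinations

items = [("a", 6, 3), ("b", 7, 3), ("c", 8, 2), ("d", 9, 5)]  # (name, value, weight)
max_weight = 5  # knapsack capacity

best_value, best_combo = 0, ()
# Power set: every combination of every size, O(2^n) in total.
for r in range(len(items) + 1):
    for combo in combinations(items, r):
        weight = sum(w for _, _, w in combo)
        value = sum(v for _, v, _ in combo)
        # Keep the best combination that satisfies the weight constraint.
        if weight <= max_weight and value > best_value:
            best_value, best_combo = value, combo

print(best_value, [name for name, _, _ in best_combo])  # -> 15 ['b', 'c']
```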

How about an approximate solution which is still good?

Greedy Algorithm:

O(n log n) – computationally efficient.

Define a metric by which to prioritize the items, then keep adding the best remaining items into your knapsack until it is full.

This doesn’t guarantee the globally optimal solution.

Easy to implement.
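A greedy sketch, prioritizing by value-to-weight ratio (one possible metric; the item data is made up):

```python
items = [("a", 6, 3), ("b", 7, 3), ("c", 8, 2), ("d", 9, 5)]  # (name, value, weight)
max_weight = 5  # knapsack capacity

taken, total_value, remaining = [], 0, max_weight
# Priority metric: value per unit weight, highest first -- O(n log n) for the sort.
for name, value, weight in sorted(items, key=lambda it: it[1] / it[2], reverse=True):
    if weight <= remaining:  # add the item if it still fits
        taken.append(name)
        total_value += value
        remaining -= weight

print(taken, total_value)  # -> ['c', 'b'] 15
```

On this instance the greedy choice happens to match the optimum, but in general it can miss the globally optimal solution.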

Let’s go to algorithms which give us optimal solutions:

Brute Force Algorithm:

Enumerate the power set and keep the best feasible combination; guaranteed optimal but exponential.

Dynamic programming:

Memoization (lookup) is the key idea behind dynamic programming, trading space for time.

In cases of problems which have Optimal substructure and overlapping subproblems, dynamic programming works great.
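A memoized sketch of the 0/1 knapsack; `lru_cache` does the lookup, and each subproblem is "the best value achievable with the remaining items and capacity" (the data is illustrative):

```python
from functools import lru_cache

values = [6, 7, 8, 9]
weights = [3, 3, 2, 5]

@lru_cache(maxsize=None)  # memoization: each (i, capacity) subproblem is solved once
def best(i, capacity):
    """Best value achievable using items i..n-1 within the given capacity."""
    if i == len(values):
        return 0
    result = best(i + 1, capacity)  # option 1: skip item i
    if weights[i] <= capacity:      # option 2: take item i if it fits
        result = max(result, values[i] + best(i + 1, capacity - weights[i]))
    return result

print(best(0, 5))  # -> 15
```

The optimal substructure is the recursive `best` call; the overlapping subproblems are the repeated (i, capacity) pairs that the cache absorbs.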

Plotting in Python:

import matplotlib.pyplot as plt  # pyplot is the modern interface; pylab is deprecated

plt.xlabel('sample points')



Stochastic processes:

The processes which have some element of randomness in them.

Causal nondeterminism: per the Copenhagen interpretation of quantum mechanics, the behavior of the physical world cannot be predicted exactly.

Argument: Irrespective of the fact whether the world is predictable or not, our lack of knowledge about the world does not allow us to make accurate predictions. Therefore we might as well treat the world as inherently unpredictable.

import random
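A sketch of a simple stochastic simulation with the random module; the seed makes the run reproducible (the function name is illustrative):

```python
import random

random.seed(42)  # fix the seed so this "random" run is reproducible

def flip_coins(n):
    """Simulate n fair coin flips and return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

# The fraction drifts toward 0.5 as the number of flips grows.
for n in (10, 100, 10_000):
    print(n, flip_coins(n))
```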


Simulation models:

Simulation models are descriptive: they tell us what might happen, in contrast to prescriptive models, which tell us what to do to reach a particular goal. Optimization is prescriptive.

They are used to model systems that are mathematically intractable.

Inferential statistics:

If the sample is chosen at random, it tends to reflect the characteristics of the population from which it is drawn.

Confidence intervals:

Confidence in our estimate depends on two things:

  1. The sample size.
  2. The variance of the sample.


A confidence interval provides a range that is likely to contain the unknown value, along with a confidence level that the value lies in that range.

The way we state it: the value lies between this and this, at this particular confidence level.

As the sample size grows, the confidence interval shrinks.
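A minimal sketch, assuming an approximately normal sampling distribution so that 1.96 standard errors gives roughly a 95% interval (the sample values are made up):

```python
import math
import statistics

sample = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.95, 5.05]  # made-up measurements

mean = statistics.mean(sample)
stderr = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
margin = 1.96 * stderr  # 1.96 standard errors ~ 95% confidence (normal approximation)

print(f"95% CI: {mean:.3f} +/- {margin:.3f}")  # -> 95% CI: 5.000 +/- 0.085
```

Both drivers from the list above are visible here: a larger len(sample) or a smaller sample variance shrinks the margin.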

The coefficient of determination:

To determine how good a curve fit is:

R^2 = 1 - sum_i (y_i - p_i)^2 / sum_i (y_i - mean(y))^2

y_i = measured values, p_i = predicted values.

if R^2=1, the model explains all of the variability in the data.

A higher R^2 generally means a better model, but not always, especially in the case of overfitting.
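The definition can be computed directly; the measured and predicted values below are made up:

```python
def r_squared(measured, predicted):
    """R^2 = 1 - SS_res / SS_tot: the fraction of variability the model explains."""
    mean_y = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in measured)                # total sum of squares
    return 1 - ss_res / ss_tot

ys = [1.0, 2.0, 3.0, 4.0]   # measured values
ps = [1.1, 1.9, 3.2, 3.9]   # predicted values
print(round(r_squared(ys, ps), 3))  # -> 0.986
```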


A key question: does our model generalize well outside the training data? Checking generalization guards against overfitting.

Everything should be made as simple as possible but not simpler.

Machine learning methods require:

  1. Representation of features.
  2. Distance metric for feature vectors.
  3. Objective function and constraints.
  4. Optimization method for learning the model.
  5. Evaluation method.

How to establish causation:

Randomized controlled studies.




Residual: difference between observed and predicted.

Constraint Satisfaction Problems

Combines guessing with deduction.

Guesses can walk you into blind alleys; avoiding these dead ends is the main challenge.

P vs NP problem – It essentially asks whether every problem whose answer can be checked quickly by a computer can also be quickly solved by a computer.

Constraint solving has found most success with scheduling problems.

A CSP is a search problem where we don't care about the path, only about reaching the goal itself.

Example: Map coloring problem

  • No two adjacent states can have same color.

Heuristic: pick the variable with the fewest remaining legal values and fill it first (the minimum-remaining-values heuristic).

Application: Paraphrasing with rhyming constraint to generate poetry.


For example, in Sudoku the variables are all the boxes in the 9x9 grid, the domain is the set of values each variable can take (1-9 in Sudoku), and the constraints are specific to the problem (no repeats in any row, column, or 3x3 block).
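A minimal backtracking sketch of the map-coloring CSP (the map, colors, and function name are made up):

```python
# Tiny made-up map: adjacent regions may not share a color.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
colors = ["red", "green", "blue"]

def solve(assignment):
    """Guess a color for the next region; backtrack out of dead ends."""
    unassigned = [r for r in neighbors if r not in assignment]
    if not unassigned:
        return assignment  # every region colored: goal state
    region = unassigned[0]
    for color in colors:
        # Deduction: the guess must not clash with already-colored neighbors.
        if all(assignment.get(n) != color for n in neighbors[region]):
            result = solve({**assignment, region: color})
            if result is not None:
                return result
    return None  # blind alley: no color works, undo the last guess

solution = solve({})
print(solution)  # -> {'A': 'red', 'B': 'green', 'C': 'blue', 'D': 'red'}
```

The guess is the color choice, the deduction is the neighbor check, and returning None is how the search climbs back out of a dead end.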

Support Vector Machines

The objective is to find the widest street (margin) between the two classes and use its median line as the decision boundary.

Imagine a vector w perpendicular to the median line of the street.




The optimization: minimize 1/2 * ||w||^2 while classifying everything correctly; this is equivalent to maximizing the width of the street, 2/||w||.






SVM optimization doesn't get stuck in local optima: the objective is convex, unlike neural nets, which can often get stuck in local optima.

For linearly separable points, SVM works fine, but for nonlinearly separable points you need a transformation that projects the points into a higher-dimensional space where they become linearly separable. For this, you can use a Gaussian (RBF) or polynomial kernel.

Kernels represent similarity measure for different points and impart domain knowledge.

Adding a nonlinear feature often makes the data linearly separable.

Mercer condition: a kernel corresponds to an inner product in some feature space only if it is symmetric and positive semi-definite.

Writing svm in sklearn:

from sklearn import svm


Parameters of svm.SVC:

  1. kernel – e.g. kernel="linear", kernel="rbf", kernel="poly".
  2. C – Controls the tradeoff between a smooth decision boundary and classifying training points correctly. A large C favors classifying more training points correctly.
  3. gamma – Reach of each training example: low values -> far reach, high values -> close reach.
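A minimal usage sketch, assuming scikit-learn is installed (the toy data is made up):

```python
from sklearn import svm

# Two well-separated 2-D clusters.
X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="linear", C=1.0)  # try kernel="rbf" or kernel="poly" as well
clf.fit(X, y)

print(clf.predict([[0, 0.5], [2, 2.5]]))  # one query point near each cluster -> [0 1]
```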


  1. A general method that is convex and guaranteed to produce a global solution.
  2. A small sigma in the Gaussian kernel can cause overfitting, because the classification boundary then shrinks tightly around the sample points.
  3. For handwritten character recognition, a polynomial kernel with n=2 (nonlinear) works well.
  4. SVMs don't perform very well on large datasets with many features, because training time grows roughly cubically with the number of samples. A Naive Bayes classifier would be better than an SVM when there is a lot of overlap and noise.
  5. SVMs don't work well with lots and lots of noise; when the classes are very overlapping, you have to count independent evidence.
  6. Compute/training time can be prohibitively high for large problems.