Introduction to Computational Thinking and Data Science

Credits: MIT 6.0002

Computational models:

  1. Optimization models
  2. Statistical models.
  3. Simulation models.

Optimization models:

The itertools module is an extremely useful toolkit for optimization problems.

The general form of an optimization model is:

Maximize or minimize an objective function,

subject to a set of constraints.

An example is the knapsack problem. The two variants of the knapsack problem are

  1. 0/1 (discrete) – items are indivisible: you either take an item whole or leave it, like gold bars which can’t be split.
  2. Continuous – items are not discrete, more like rice, milk, or something of that sort, which you can quantify on a continuous scale.

Formalizing the discrete (0/1) knapsack problem:

Maximize the total value, sum(x_i * v_i) over all items i, where x_i is 1 if item i is taken and 0 otherwise and v_i is its value, subject to the constraint that the total weight of the taken items, sum(x_i * w_i), does not exceed the maximum weight W the knapsack can hold.


The brute-force approach is to list all possible combinations (the power set), filter out those that violate the constraint, and choose the one with maximum value. This is computationally very expensive, because computing the power set itself is exponential, O(2^n).

In fact, if you want a guaranteed optimal solution to the knapsack problem, the complexity is inherently exponential.

How about an approximate solution which is still good?

Greedy Algorithm:

O(n log n) – computationally efficient (dominated by the sort).

Define a metric by which to prioritize your items (e.g., value, weight, or value density), sort by it, and keep adding items to your knapsack until it gets full.

This doesn’t guarantee the globally optimal solution.

Easy to implement.
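The greedy approach above can be sketched as follows. This is a minimal sketch: the item tuples and the value-density metric are illustrative assumptions, not from the notes.

```python
def greedy_knapsack(items, max_weight, key=lambda it: it[1] / it[2]):
    """items: list of (name, value, weight) tuples.
    key: priority metric -- here value density (value/weight)."""
    taken, total_value, total_weight = [], 0, 0
    # Consider items in decreasing priority order, O(n log n) for the sort.
    for name, value, weight in sorted(items, key=key, reverse=True):
        if total_weight + weight <= max_weight:
            taken.append(name)
            total_value += value
            total_weight += weight
    return taken, total_value

items = [('gold', 60, 10), ('silver', 100, 20), ('bronze', 120, 30)]
print(greedy_knapsack(items, 50))  # → (['gold', 'silver'], 160)
```

On this input, greedy-by-density yields a value of 160, while the true optimum (silver + bronze) is 220, illustrating that greedy does not guarantee the globally optimal solution.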

Let’s go to algorithms which give us optimal solutions:

Brute Force Algorithm:
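A minimal brute-force sketch, enumerating the power set with itertools as mentioned above (the item tuples are illustrative assumptions). This is O(2^n): fine for tiny n, intractable beyond that.

```python
from itertools import combinations

def brute_force_knapsack(items, max_weight):
    """items: list of (name, value, weight) tuples."""
    best_value, best_combo = 0, ()
    # Enumerate every subset of every size (the power set).
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            weight = sum(it[2] for it in combo)
            value = sum(it[1] for it in combo)
            # Keep the best subset that satisfies the weight constraint.
            if weight <= max_weight and value > best_value:
                best_value, best_combo = value, combo
    return [it[0] for it in best_combo], best_value

items = [('gold', 60, 10), ('silver', 100, 20), ('bronze', 120, 30)]
print(brute_force_knapsack(items, 50))  # → (['silver', 'bronze'], 220)
```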

 

Dynamic programming:

Memoization (looking up previously computed results) is the key idea behind dynamic programming, trading space for time.

Dynamic programming works great on problems that have optimal substructure and overlapping subproblems.
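A classic illustration of overlapping subproblems (a standard example, not specific to these notes) is memoized Fibonacci: the naive recursion recomputes the same values exponentially often, while a memo table computes each one once.

```python
def fib(n, memo=None):
    """Fibonacci with memoization: each fib(k) is computed once,
    then looked up -- O(n) time instead of O(2^n)."""
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]
    if n <= 1:
        result = n
    else:
        result = fib(n - 1, memo) + fib(n - 2, memo)
    memo[n] = result  # trade space (the table) for time
    return result

print(fib(100))  # instant; the naive recursion would be hopeless here
```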

Plotting in Python:

import pylab as plt

plt.figure('nameoffig')

plt.clf()

plt.xlabel('sample points')

plt.ylabel('ylabel')

plt.plot(X, Y)

plt.show()

Stochastic processes:

Stochastic processes are processes that have some element of randomness in them.

Causal nondeterminism: As per the Copenhagen Doctrine, the behavior of the physical world can’t be predicted.

Argument: Irrespective of the fact whether the world is predictable or not, our lack of knowledge about the world does not allow us to make accurate predictions. Therefore we might as well treat the world as inherently unpredictable.

import random

random.choice(seq)  # returns a random element of a non-empty sequence
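For instance, a minimal stochastic simulation using random.choice: rolling a fair die many times and checking that the average approaches the expected value (the die example is illustrative, not from the notes).

```python
import random

random.seed(0)  # seeding makes a stochastic run reproducible
rolls = [random.choice([1, 2, 3, 4, 5, 6]) for _ in range(100_000)]
print(sum(rolls) / len(rolls))  # close to 3.5, the expected value
```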

Simulation models:

Simulation models are descriptive: they are about what might happen. This is in contrast to prescriptive models, which are about what to do to reach a particular goal. Optimization is prescriptive.

Simulation is used to model systems that are mathematically intractable.


Inferential statistics:

If the sample is random, it tends to reflect the nature of the population from which it is drawn.

Confidence intervals:

Confidence in our estimate depends on two things:

  1. The sample size.
  2. The variance of the sample.

 

A confidence interval provides a range that is likely to contain the unknown value, together with a confidence level that the unknown value lies in that range.

The way we talk about it is: the value lies between a and b with a particular confidence level (e.g., 95%).

As the sample size grows, confidence interval shrinks.
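A minimal sketch of a 95% confidence interval for a sample mean, using the common 1.96-standard-error rule (the sample data and the normality assumption are illustrative, not from the notes). Note how the margin depends on exactly the two factors above: sample size and sample variance.

```python
import math

def confidence_interval_95(sample):
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance (n - 1 in the denominator).
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    # Standard error shrinks as n grows, so the interval shrinks too.
    stderr = math.sqrt(variance / n)
    margin = 1.96 * stderr  # 95% level for roughly normal data
    return mean - margin, mean + margin

low, high = confidence_interval_95([4.8, 5.1, 5.0, 4.9, 5.2, 5.0])
print(low, high)  # roughly 4.887 to 5.113
```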


The coefficient of determination:

How do we determine how good a curve fit is? The coefficient of determination measures this:

R^2 = 1 - sum((y_i - p_i)^2) / sum((y_i - mu)^2)

y_i's = measured values

p_i's = predicted values

mu = mean of the measured values

If R^2 = 1, the model explains all of the variability in the data.

A higher R^2 generally means a better model, but not always, especially in the case of overfitting.
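The formula above translates directly into code (the sample data here is illustrative):

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 minus the ratio of residual
    variation to total variation around the mean."""
    mu = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mu) ** 2 for y in measured)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # → 0.98
```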


Cross-validation:

To answer the question: does our model generalize well beyond the training data? This guards against overfitting.
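A minimal sketch of the simplest form, a holdout split: set part of the data aside, fit on the rest, and evaluate the fit on the held-out points (the function name and parameters are illustrative assumptions).

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into train/test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(10)))
print(len(train), len(test))  # → 8 2
```

A model is then fit on `train` only, and its error on `test` estimates how well it generalizes.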

Everything should be made as simple as possible but not simpler.


Machine learning methods require:

  1. Representation of features.
  2. Distance metric for feature vectors.
  3. Objective function and constraints.
  4. Optimization method for learning the model.
  5. Evaluation method.

How to establish causation:

Randomized controlled studies.

 


 

Glossary:

Residual: the difference between an observed value and the predicted value.
