## Naive Bayes

Bayes' theorem gives us a way to incorporate new evidence/information into our model.

Code from sklearn:

```python
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(features_train, labels_train)
```

Bayesian Learning:

Linear in the number of variables.

Naive Bayes assumes conditional independence across attributes and doesn't capture inter-relationships among attributes.

Gaussian naive Bayes assumes the continuous values associated with each class are distributed in a Gaussian fashion.

If the probability of even one attribute given the label is zero, the whole product of probabilities becomes zero (smoothing techniques such as Laplace smoothing are used to avoid this).

Maximum likelihood:

## Machine learning

Pandas and numpy:

```python
numpy.mean(df[(df.gold > 0)]["bronze"])
```

```python
olympic_medal_counts = {'country_name': countries,
                        'gold': Series(gold),
                        'silver': Series(silver),
                        'bronze': Series(bronze)}
df = DataFrame(olympic_medal_counts)

del df['country_name']
avg_medal_count = df.apply(lambda x: numpy.mean(x))
```

```python
b = numpy.array([4, 2, 1])
a = numpy.column_stack((numpy.array(gold), numpy.array(silver), numpy.array(bronze)))
points = numpy.dot(a, b)
olympic_points_df = DataFrame({'country_name': countries, 'points': points})
```

An equivalent, more pandas-idiomatic version:

```python
df['points'] = df[['gold', 'silver', 'bronze']].dot([4, 2, 1])
olympic_points_df = df[['country_name', 'points']]
```
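The snippets above can be combined into one runnable sketch; the medal counts below are made-up sample data, not real results:

```python
from pandas import DataFrame, Series

# Hypothetical sample data for three countries
countries = ['A', 'B', 'C']
gold, silver, bronze = [2, 0, 1], [1, 3, 0], [0, 1, 4]

df = DataFrame({'country_name': countries,
                'gold': Series(gold),
                'silver': Series(silver),
                'bronze': Series(bronze)})

# Weighted points: 4 per gold, 2 per silver, 1 per bronze
df['points'] = df[['gold', 'silver', 'bronze']].dot([4, 2, 1])
olympic_points_df = df[['country_name', 'points']]
print(olympic_points_df['points'].tolist())  # [10, 7, 8]
```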

Preprocessing:

For highly skewed feature distributions such as 'capital-gain' and 'capital-loss', it is common practice to apply a logarithmic transformation so that very large and very small values do not negatively affect the performance of a learning algorithm. A logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation, however: the logarithm of 0 is undefined, so we must translate the values by a small amount above 0 to apply the logarithm successfully.

Taking the log of values has the effect of spreading out small values and pulling large values closer together.
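A minimal sketch of the transform with numpy; `log1p` computes log(1 + x), which side-steps the undefined log(0) (the array below is made-up skewed data):

```python
import numpy as np

skewed = np.array([0.0, 10.0, 100.0, 10000.0])

# np.log1p(x) == np.log(1 + x); safe even when x == 0
transformed = np.log1p(skewed)
print(transformed)
```

The four-orders-of-magnitude range collapses to roughly 0 through 9 while the ordering of values is preserved.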

Scaling:

In addition to transforming highly skewed features, it is often good practice to perform some type of scaling on numerical features. Scaling does not change the shape of each feature's distribution (such as 'capital-gain' or 'capital-loss' above); however, it ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, the data in its raw form no longer has its original meaning.

We will use sklearn.preprocessing.MinMaxScaler for this.

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this is the Box-Cox transformation, which calculates the power transformation of the data that best reduces skewness. A simpler approach that works in most cases is applying the natural logarithm.

Identifying Outliers:

Tukey’s method
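A sketch of Tukey's method: points more than 1.5 × IQR beyond the quartiles are flagged as outliers (the data below is illustrative):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey's fences

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```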

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```

If $\bar{y}$ is the mean of the observed data:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

then the variability of the data set can be measured using three sums of squares formulas:

$$SS_{tot} = \sum_i (y_i - \bar{y})^2, \quad SS_{reg} = \sum_i (f_i - \bar{y})^2, \quad SS_{res} = \sum_i (y_i - f_i)^2$$

The most general definition of the coefficient of determination is

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Accuracy = (true positives + true negatives) / total

```python
from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)
```

Accuracy is not the right metric when the data is skewed.

Precision: TP / (TP + FP). Of all the points predicted positive, how many are actually positive?

Recall: TP / (TP + FN). Of all the actually positive points, how many did we find?
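A quick sketch with sklearn on a tiny made-up label set:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Precision: of the predicted positives, how many are truly positive
p = precision_score(y_true, y_pred)  # 2 TP / (2 TP + 1 FP)
# Recall: of the actual positives, how many were found
r = recall_score(y_true, y_pred)     # 2 TP / (2 TP + 1 FN)
print(p, r)
```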

ROC CURVE:

Errors:

Underfitting: error due to bias. Oversimplified model; performs badly on both training and testing data.

Overfitting: error due to variance. Overcomplicated model; the model is too specific and performs badly on testing data.

Model complexity graph:

Training set: for training model.

Cross-validation set: for choosing the right parameters, like the degree of the polynomial. Useful to check whether the trained model is overfitting: if the trained model performs poorly on the cross-validation set, it is overfitting.

Testing set: For final testing.

K-FOLD CROSS VALIDATION:

Divide all your data into k buckets and iteratively build models, choosing one bucket as the testing set and the remaining k−1 for training.

Average the k evaluation results to get a more reliable estimate of model performance.
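The bucketing can be sketched with sklearn's `KFold` (10 toy samples, k = 5):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # each iteration: 1 bucket for testing, the other 4 for training
    fold_sizes.append((len(train_idx), len(test_idx)))
print(fold_sizes)  # five folds, each (8 train, 2 test)
```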

If you increase training points on different models:

Grid Search CV:
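Grid search exhaustively tries every parameter combination with cross-validation and keeps the best; a minimal sketch on sklearn's built-in iris data (the parameter grid here is just for illustration):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Tries all 4 kernel/C combinations, each scored with cross-validation
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
```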

Plot learning curves to identify when to stop collecting data.

Supervised learning:

If there is an order in the output data, go for a continuous model. Ex: income, age.

If there is no order, go for a discrete model. Ex: phone numbers, people.

Algorithms to minimize sum of squared errors:

1. Ordinary least squares: sklearn LinearRegression.

There can be multiple lines that minimize |error|, but only one that minimizes error^2.

Instance based learning:

Properties:

1. Remembers the training data.
2. Fast training (no learning step), but querying can be slow.
3. Simple.
4. No generalization.
5. Sensitive to noise.

1. Look up:

In k-NN all features matter equally, because all features are treated the same when we calculate the distance.
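Since each feature enters the distance formula equally, a feature with a larger numeric range dominates unless features are rescaled; a quick numpy illustration with made-up numbers:

```python
import numpy as np

# Two features: height in meters (tiny range), weight in grams (huge range)
a = np.array([1.8, 70000.0])
b = np.array([1.6, 70500.0])

dist = np.linalg.norm(a - b)  # Euclidean distance
# The 0.2 m height difference is drowned out by the 500 g weight difference
print(dist)
```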

Locally weighted regression (evolved from k-nearest neighbors):

Naive bayes:

A powerful tool for building classifiers from incoming labeled data.

Expectation maximization:

This is very similar to k-means clustering.

EM is used for soft clustering, when there is ambiguity about which cluster a data point belongs to.

Which supervised classifiers are suitable for numerical as well as categorical data?

The data you have is called 'mixed data' because it has both numerical and categorical values, and since you have class labels, it is a classification problem. One option is decision trees, which you already tried. Another is naive Bayes, where you model numeric attributes with, say, a Gaussian distribution. You can also employ a minimum-distance or KNN-based approach; however, the cost function must be able to handle both data types together. If these approaches don't work, try ensemble techniques: bagging with decision trees, or Random Forest, which combines bagging and random subspaces. With mixed data the choices are limited, so you need to be cautious and creative.

Feature scaling:

To give equal importance to all the different features, we can normalize them to the range 0 to 1 before applying the learning algorithm:

x' = (x - x_min) / (x_max - x_min)

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
rescaled_weights = scaler.fit_transform(weights)  # expects a 2-D float array
```

Feature rescaling matters for algorithms that compute distances, such as k-means clustering and RBF-kernel SVMs, but not much for decision trees or linear regression.

Feature selection:

why?

1. Knowledge discovery, interpretability, and insight: to identify which features actually matter.
2. Curse of dimensionality: the amount of data you need to train grows exponentially with the number of features.

Finding the optimal subset of features is NP-hard.

Feature selection:

Filtering(fast) and Wrapping(slow):

k-NN suffers from the curse of dimensionality because it doesn't know which features are important, so we can use decision trees as a filtering mechanism to determine the important features and then pass those on to k-NN for learning.

Wrapper methods evaluate subsets of variables, which, unlike filter approaches, allows them to detect possible interactions between variables.

Relevance:

1. A feature is strongly relevant if the Bayes optimal classifier's performance is strongly affected by the absence of this feature.
2. A feature is weakly relevant if some other feature exists that can serve the same purpose.
3. Relevance depends on how much information a feature provides.

Usefulness:

1. Depends on error/model/learner.

Composite feature:

When to use PCA?

1. When some latent features are driving the patterns in the data.
2. Dimensionality reduction, reduce noise, better visualization.
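A minimal PCA sketch with sklearn on toy 2-D data where a single latent direction (x ≈ y) drives the pattern:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: both features are driven by one latent variable t, plus small noise
rng = np.random.RandomState(0)
t = rng.randn(100)
X = np.column_stack((t, t + 0.05 * rng.randn(100)))

pca = PCA(n_components=1)           # reduce 2 features to 1
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # first component explains almost all variance
```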

Feature transformation:

Independent component analysis (ICA): the cocktail party problem.

Other feature transformations:

1. Random component analysis – Fast and it usually works.
2. Linear discriminant analysis –

The lower the cross-entropy, the better the model.

Cross-entropy is the negative logarithm of the probability that the model being evaluated assigns to the events that actually occurred.
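A small sketch of that definition for the binary case (the probabilities below are made up):

```python
import math

def cross_entropy(y_true, probs):
    """Negative log-probability of the events that actually occurred."""
    total = 0.0
    for y, p in zip(y_true, probs):
        # use p if the event happened (y == 1), else 1 - p
        total += -math.log(p if y == 1 else 1 - p)
    return total

events = [1, 1, 0]
good_model = [0.9, 0.8, 0.2]  # assigns high probability to what happened
bad_model = [0.6, 0.5, 0.6]   # hedges or bets the wrong way

print(cross_entropy(events, good_model))  # lower
print(cross_entropy(events, bad_model))   # higher
```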

## Design patterns

A design that is more automated and requires less maintenance is the standard.

Setters can be used for validation and constraining the value to be set to a particular range.

Getters can be used for returning default value, if the variable is not set yet or for lazy instantiation.

You can't instantiate an abstract class. An abstract class can have some non-abstract members.

An interface has only abstract methods.

Creational:

1. Singleton:
   1. Only one instance of that particular class can exist.
   2. Examples: the president of a country; `System` in Java.
   3. Implemented via a private constructor, or as a singleton using an enum.
   4. @Singleton annotation.
   5. Difficult to unit test – why?
2. Factory:
   1. Logic that returns a particular subclass object when asked for a class object.
3. Abstract Factory:
4. Builder:
   1. Separates object construction from its representation.
   2. Interfaces.
5. Prototype:
   1. Chess game initial setup.
   2. Copy/clone the initial setup rather than recreating it every time you need it; reduces redundant work.
   3. Copy a fully initialized instance.
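These notes are Java-flavored, but the Singleton idea ports anywhere; one common Python idiom (a sketch, not the only way) overrides `__new__`:

```python
class Singleton:
    _instance = None

    def __new__(cls):
        # Create the single instance only on the first call
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = Singleton()
b = Singleton()
print(a is b)  # both names refer to the one instance
```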

How to create objects?

Structural:

Inheritance? Interface? etc.

How are different classes related?

How are objects composed?

1. Adapter:
   1. Match interfaces of different classes; helps them communicate.
2. Composite:
3. Proxy:
   1. An object representing another object, like a credit card as a proxy for a bank account.
   2. Remote object and home object (proxy).
4. Flyweight:
   1. Reuse the same objects by resetting their values appropriately instead of creating new objects every time.
5. Facade:
   1. Event managers; group many steps into a single step.
6. Bridge:
7. Decorator:
   1. Add responsibilities to objects dynamically.
   2. Ex: adding different toppings to different pizzas, adding discounts to different orders.

Behavioral:

Interactions between different objects.

1. Template method:
2. Mediator:
   1. Instead of applications talking to each other directly, we use an enterprise service bus.
3. Chain of responsibility:
   1. Pass a request through different objects.
4. Observer:
   1. A way of notifying a change to a number of classes.
   2. This pattern is built into Java.
   3. The subject extends Observable.
   4. Whoever wants to listen implements Observer and registers with the subject.
5. Strategy:
   1. Change the implementation/strategy of an interface at a later point in time.
   2. Pass whatever implementation needs to be used as an argument.
6. Command:
   1. Encapsulate a command request as an object.
   2. java.lang.Runnable threads are implemented like this.
7. State:
8. Visitor:
   1. Add new operations to a particular class without inheritance and without modifying the class.
9. Iterator:
   1. Sequentially access the elements of a collection.
10. Interpreter:
11. Memento:
    1. Save states of something as objects so they can be restored at a future point in time if necessary.
    2. Undo/redo operations.
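The Observer mechanics described above (register, then notify) can be sketched language-agnostically in Python (names here are illustrative):

```python
class Subject:
    def __init__(self):
        self._observers = []

    def register(self, observer):
        # Whoever wants to listen registers with the subject
        self._observers.append(observer)

    def notify_all(self, event):
        # Notify every registered listener of the change
        for observer in self._observers:
            observer(event)

received = []
subject = Subject()
subject.register(received.append)  # any callable can act as a listener
subject.notify_all('state changed')
print(received)  # ['state changed']
```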

Strategy pattern:

What:

When:

Observer pattern:

when:

What:

Factory pattern :

Abstract Factory pattern:

Singleton pattern:

Builder pattern:

Prototype pattern:

Try to put state and behaviors in different classes.

## Dependency injection

Dependency injection makes a class generic by taking away its responsibility for constructing its own dependencies; other classes define and supply the input object for it.

If you want to change the input object, with dependency injection you don't need to change the original class.

A spring container contains a set of objects or beans.
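A sketch of the idea in Python (class names are made up for illustration): the service is handed its dependency instead of constructing it, so swapping the dependency needs no change to the service.

```python
class ConsoleLogger:
    def log(self, msg):
        print(f"[console] {msg}")

class ListLogger:
    def __init__(self):
        self.lines = []
    def log(self, msg):
        self.lines.append(msg)

class OrderService:
    # The logger is injected; OrderService never constructs one itself
    def __init__(self, logger):
        self.logger = logger

    def place_order(self, item):
        self.logger.log(f"order placed: {item}")

# Inject a test double without touching OrderService
test_logger = ListLogger()
OrderService(test_logger).place_order("book")
print(test_logger.lines)
```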

## Personal finance

Identify what are assets and what are liabilities.

Increase asset column.

Don’t work for money, make money work for you.

A rich person's income statement and balance sheet.

How corporations help the rich with taxes:

Corporations earn, spend, and then pay taxes on what is left.

Individuals earn, get taxed, and then spend.

Listing some assets:

1. Businesses that do not require my presence.
2. Stocks – Fortunes are made in new stock issues(new stocks are tax-free).
3. Bonds.
4. Income generating real estate.
5. Notes (IOUs).
6. Royalties from intellectual property.

Dimensions of financial literacy:

1. Accounting:
2. Investing:
3. Understanding markets:
4. The law:

## Random number generator

An ideal (truly) random number generator is impossible in software, so we usually have a pseudo-random generator: it is seeded with an initial number (often something like the current system time) and then computes each subsequent number deterministically from its internal state. For debugging stochastic programs, Python's random.seed() will generate the same sequence of random numbers again and again when seeded with the same number.
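A quick demonstration of that reproducibility:

```python
import random

random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)         # re-seed with the same number...
second = [random.random() for _ in range(3)]

print(first == second)  # ...and the sequence repeats exactly
```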