Naive Bayes

Bayes' theorem helps us incorporate new evidence/information into our model.

Code using sklearn:

from sklearn.naive_bayes import GaussianNB

# Fit a Gaussian naive Bayes classifier on the training data
clf = GaussianNB()
clf.fit(features_train, labels_train)

Bayesian Learning:

Bayesian learning.PNG


Naive Bayes

 

Linear in the number of variables.

Naive Bayes assumes conditional independence across attributes, so it doesn't capture inter-relationships among attributes.

Gaussian naive Bayes assumes that the continuous values associated with each class are distributed according to a Gaussian distribution.

If the probability of even one of the attributes given the label is zero, the whole product (and hence the posterior estimate) becomes zero.


Machine learning

Pandas and numpy:

import numpy
from pandas import DataFrame, Series

# Mean bronze-medal count over countries that won at least one gold
# (df is the medal-counts DataFrame built below)
numpy.mean(df[df.gold > 0]["bronze"])

# Build a DataFrame of Olympic medal counts
olympic_medal_counts = {'country_name': countries,
                        'gold': Series(gold),
                        'silver': Series(silver),
                        'bronze': Series(bronze)}
df = DataFrame(olympic_medal_counts)

# Average medal count per column (drop the non-numeric column first)
del df['country_name']
avg_medal_count = df.apply(lambda x: numpy.mean(x))

# Weighted points: 4 per gold, 2 per silver, 1 per bronze
b = numpy.array([4, 2, 1])
a = numpy.column_stack((numpy.array(gold), numpy.array(silver), numpy.array(bronze)))
points = numpy.dot(a, b)
olympic_points_df = DataFrame({'country_name': countries, 'points': points})

# Equivalent, using DataFrame.dot directly (assuming country_name is still in df)
df['points'] = df[['gold', 'silver', 'bronze']].dot([4, 2, 1])
olympic_points_df = df[['country_name', 'points']]


Preprocessing:

For highly-skewed feature distributions such as 'capital-gain' and 'capital-loss', it is common practice to apply a logarithmic transformation on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Using a logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation, however: the logarithm of 0 is undefined, so we must translate the values by a small amount above 0 to apply the logarithm successfully.
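A minimal sketch of this transformation, assuming the data sits in a pandas DataFrame named data (a name assumed here, not taken from these notes):

import numpy as np

skewed = ['capital-gain', 'capital-loss']
features_log_transformed = data.copy()
# log(1 + x) shifts values above zero so the logarithm is defined at x = 0
features_log_transformed[skewed] = data[skewed].apply(lambda x: np.log(x + 1))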

Scaling:

In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution (such as 'capital-gain' or 'capital-loss' above); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form will no longer have the same original meaning, as shown in the example below.

We will use sklearn.preprocessing.MinMaxScaler for this.

If the data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach which works in most cases is applying the natural logarithm.

Identifying Outliers:

Tukey’s method
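A minimal sketch of Tukey's method: a point is flagged as an outlier if it falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. The helper name is hypothetical:

import numpy as np

def tukey_outliers(values, k=1.5):
    # Quartiles and inter-quartile range of the feature
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Boolean mask of points outside the Tukey fences
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)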


 

 

from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

R2 score:

r2 score.PNG

If $\bar{y}$ is the mean of the observed data:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

then the variability of the data set can be measured using three sums-of-squares formulas:

$$SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2, \qquad SS_{\text{reg}} = \sum_i (f_i - \bar{y})^2, \qquad SS_{\text{res}} = \sum_i (y_i - f_i)^2 = \sum_i e_i^2$$

The most general definition of the coefficient of determination is

$$R^2 \equiv 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
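A sketch that computes R² from the sums of squares above and checks it against sklearn's r2_score; the example arrays are illustrative only, not from these notes:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed values (example)
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # model predictions (example)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
# Both expressions give the same value
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))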

 

Accuracy = (true positives + true negatives) / total

from sklearn.metrics import accuracy_score

accuracy_score(y_true,y_pred)

Accuracy is not the right metric when the data is skewed (class-imbalanced).


 

Precision:

precision.PNG

Recall:

Recall.PNG

 

F-beta score:

f beta.PNG
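A minimal sketch computing these metrics with sklearn, assuming binary labels y_true and predictions y_pred:

from sklearn.metrics import precision_score, recall_score, fbeta_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# beta < 1 weights precision more heavily; beta > 1 favours recall
f_score = fbeta_score(y_true, y_pred, beta=0.5)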


ROC CURVE:


Errors:

Underfitting: error due to bias. Oversimplified model; performs badly on both training and testing data.

Overfitting: error due to variance. Overly complicated model; the model is too specific to the training data and performs badly on testing data.

Model complexity graph:

Model complexity graph.PNG

Training set: for training the model.

Cross-validation set: for choosing the right parameters, such as the degree of the polynomial. It is also useful for checking whether the trained model is overfitting: if the model performs poorly on the cross-validation set, it is overfitted.

Testing set: for final testing.

K-FOLD CROSS VALIDATION:

Divide all your data into k buckets and iteratively build models, each time choosing one bucket as the testing set and the remaining buckets for training.

Use the average of these models for the final model.

K-FOLD.PNG
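A minimal sketch of k-fold cross-validation with sklearn, assuming numpy arrays X, y and any sklearn estimator clf:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 buckets, evaluate on the held-out bucket
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(scores))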

 

If you increase the number of training points for different models:

 

Grid Search CV:

GRID SEARCH CV.PNG
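A minimal sketch of grid search with sklearn, using an SVM and illustrative parameter values (not prescribed by these notes):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Try every combination of kernel and C, scoring each with cross-validation
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(SVC(), parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)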


Plot learning curves to identify when to stop collecting data.
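A sketch using sklearn's learning_curve, assuming an estimator clf and arrays X, y; once the training and cross-validation scores have converged, collecting more data is unlikely to help:

import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, valid_scores = learning_curve(
    clf, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
# Mean score per training-set size for the training and validation folds
print(train_scores.mean(axis=1))
print(valid_scores.mean(axis=1))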


Supervised learning:

If there is an order in the output data, go for a continuous model (regression). Examples: income, age.

If there is no order, go for a discrete model (classification). Examples: phone numbers, persons.


Algorithms to minimize sum of squared errors:

  1. Ordinary least squares: sklearn LinearRegression.
  2. Gradient descent.

There can be multiple lines that minimize the sum of |error|, but only one that minimizes the sum of error^2.
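A minimal sketch of option 1 above (ordinary least squares via sklearn's LinearRegression), assuming 2-D X_train/X_test and 1-D y_train/y_test arrays:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)
print(reg.coef_, reg.intercept_)
print(reg.score(X_test, y_test))  # R^2 on held-out data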


Instance based learning:

Properties:

  1. Remembers the training data.
  2. Fast, since there is no explicit learning step.
  3. Simple.
  4. No generalization.
  5. Sensitive to noise.

 

  1. Look up:

In k-NN all features matter equally, because when we calculate the distance all features are treated equally.
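A minimal sketch of k-NN with sklearn, following the features_train/labels_train naming used earlier (features_test and n_neighbors=3 are assumptions):

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(features_train, labels_train)   # "training" is just storing the data
pred = clf.predict(features_test)       # prediction looks up the nearest neighbors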

 

Locally weighted regression (evolved from k-nearest neighbors):


Naive bayes:

A powerful tool for building classifiers from labeled data.


Expectation maximization:

expectation maximization.PNG

This is very similar to k-means clustering.

EM performs soft clustering: when it is ambiguous which cluster a data point should belong to, EM gives it a probabilistic (partial) membership in each cluster rather than a hard assignment.
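A minimal sketch using sklearn's GaussianMixture, which is fitted with EM and exposes soft assignments; the array X and n_components=3 are assumptions:

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3)
gm.fit(X)
soft_assignments = gm.predict_proba(X)  # membership probability per cluster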


Which supervised classifiers are suitable for numerical as well as categorical data?

The data you have is called 'mixed data' because it has both numerical and categorical values. Since you have class labels, it is a classification problem. One option is to go with decision trees, which you already tried. Another possibility is naive Bayes, where you model numeric attributes with, for example, a Gaussian distribution. You can also employ a minimum-distance or KNN-based approach; however, the cost function must be able to handle both types of data together. If these approaches don't work, try ensemble techniques: bagging with decision trees, or a random forest, which combines bagging with random subspaces. With mixed data the choices are limited, and you need to be cautious and creative with your choices.


Feature scaling:

To give equal importance to all the different features, we can normalize them to the range 0 to 1 before applying the learning algorithm:

x' = (x - x_min) / (x_max - x_min)

from sklearn.preprocessing import MinMaxScaler

# fit_transform expects a 2-D float array with one column per feature
scaler = MinMaxScaler()
rescaled_weights = scaler.fit_transform(weights)

Feature rescaling is useful in k-means clustering and RBF SVMs, where we calculate distances, but not so much in decision trees and linear regression.


Feature selection:

why?

  1. Knowledge discovery, interpretability, and insight: to identify which features actually matter among all of them.
  2. Curse of dimensionality – the amount of data you need for training grows exponentially with the number of features you have.

Finding the optimal subset of features is NP-hard.


Feature selection:

Filtering (fast) and wrapping (slow):

k-NN suffers from the curse of dimensionality because it doesn't know which features are important. So we can use decision trees as a filtering mechanism to determine the important features and then pass them on to k-NN for learning (see the sketch below).

Wrapper methods evaluate subsets of variables, which allows them, unlike filter approaches, to detect possible interactions between variables.
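A minimal sketch of the filtering idea above: rank features with a decision tree, then train k-NN on the top-ranked ones. The cutoff of 5 features and the numpy array names are assumptions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Rank features by importance according to a decision tree
tree = DecisionTreeClassifier().fit(X_train, y_train)
top = np.argsort(tree.feature_importances_)[-5:]

# Hand only the top-ranked features to k-NN
knn = KNeighborsClassifier().fit(X_train[:, top], y_train)
print(knn.score(X_test[:, top], y_test))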


Relevance:

  1. A feature is strongly relevant if the Bayes optimal classifier's performance degrades when that feature is removed.
  2. A feature is weakly relevant if there exists some other feature (or subset of features) that can serve its purpose.
  3. Relevance depends on how much information a feature provides.

Usefulness:

  1. Depends on error/model/learner.

Composite feature:

composite feature.PNG


When to use PCA?

  1. When some latent features are driving the patterns in the data.
  2. Dimensionality reduction, reduce noise, better visualization.
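A minimal sketch with sklearn's PCA, keeping two components (an arbitrary choice) for dimensionality reduction or visualization; the array X is assumed:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Fraction of the variance captured by each retained component
print(pca.explained_variance_ratio_)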

Feature transformation:

Independent Component analysis: cocktail party problem.

PCA vs ICA.PNG
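A minimal sketch of ICA with sklearn's FastICA for a cocktail-party style problem, assuming X holds the mixed recordings (rows = samples, columns = microphones):

from sklearn.decomposition import FastICA

ica = FastICA(n_components=2)
# Columns of `sources` are the estimated independent source signals
sources = ica.fit_transform(X)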

Other feature transformations:

  1. Random component analysis (random projections) – fast, and it usually works well.
  2. Linear discriminant analysis – finds a projection that discriminates based on the class label.