Pandas and NumPy:

numpy.mean(df[df.gold > 0]['bronze'])

olympic_medal_counts = {'country_name': countries,
                        'gold': Series(gold),
                        'silver': Series(silver),
                        'bronze': Series(bronze)}

df = DataFrame(olympic_medal_counts)

del df['country_name']


avg_medal_count = df.apply(lambda x: numpy.mean(x))

b = numpy.array([4, 2, 1])

a = numpy.column_stack((numpy.array(gold), numpy.array(silver), numpy.array(bronze)))

points = numpy.dot(a, b)

olympic_points_df = DataFrame({'country_name': countries, 'points': points})

Equivalent, using DataFrame.dot directly:

df['points'] = df[['gold', 'silver', 'bronze']].dot([4, 2, 1])

olympic_points_df = df[['country_name', 'points']]

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```

R2 score:

If $\bar{y}$ is the mean of the observed data:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

then the variability of the data set can be measured using three sums of squares formulas (where $f_i$ are the predicted values):

$$SS_{tot} = \sum_i (y_i - \bar{y})^2 \qquad SS_{reg} = \sum_i (f_i - \bar{y})^2 \qquad SS_{res} = \sum_i (y_i - f_i)^2$$

The most general definition of the coefficient of determination is

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
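A minimal sketch of that formula checked against scikit-learn's `r2_score` (the toy data is illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # → True
```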

Accuracy = (true positives + true negatives) / total

```
from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)
```

Accuracy is not the right metric when the data is skewed.

Precision: of all points predicted positive, the fraction that are actually positive: TP / (TP + FP).

Recall: of all actually positive points, the fraction that are predicted positive: TP / (TP + FN).
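Both are available in scikit-learn; a minimal sketch on illustrative labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# 3 TP, 1 FP -> precision = 3/4; 3 TP, 1 FN -> recall = 3/4
print(precision_score(y_true, y_pred))  # → 0.75
print(recall_score(y_true, y_pred))     # → 0.75
```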

**ROC CURVE:** plots the true positive rate against the false positive rate as the classification threshold varies; the area under the curve (AUC) summarizes classifier quality.
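A minimal sketch of computing the ROC curve and AUC with scikit-learn (the toy scores are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# fpr/tpr at each threshold; AUC summarizes the whole curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
print(auc)  # → 0.75
```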

Errors:

Underfitting: error due to bias. Oversimplified model; performs badly on both training and testing data.

Overfitting: error due to variance. Overcomplicated model; too specific to the training data, performs badly on testing data.

**Model complexity graph**:

Training set: for training model.

Cross validation set: for choosing the right parameters, like the degree of the polynomial. Useful to check whether the trained model is overfitting: if the trained model performs poorly on the cross validation set, the model is overfitted.

Testing set: For final testing.

**K-FOLD CROSS VALIDATION:**

Divide all your data into k buckets and iteratively train models, choosing one bucket as the testing set and the remaining buckets for training.

Average the results of these k runs for the final model.
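The bucketing described above is what scikit-learn's `cross_val_score` does; a minimal sketch on the built-in iris data (estimator choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k = 5: each bucket serves once as the test set
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```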

**If you increase training points on different models**:

Grid Search CV:
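A minimal `GridSearchCV` sketch (the estimator and parameter grid are illustrative): it tries every parameter combination with cross-validation and keeps the best.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
params = {'kernel': ['linear', 'rbf'], 'C': [1, 10]}

# exhaustive search over the grid, scored by 5-fold cross-validation
grid = GridSearchCV(SVC(), params, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```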

Plot learning curves to identify when to stop collecting data.
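scikit-learn's `learning_curve` computes the scores behind such a plot; a minimal sketch (plotting omitted, estimator choice illustrative). When training and validation scores converge, collecting more data is unlikely to help.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# scores at 5 increasing training-set sizes, each cross-validated
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5))
print(train_sizes)
print(val_scores.mean(axis=1))
```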

Supervised learning:

If there is an order in the output data, go for a continuous model. Ex: income, age.

If there is no order, go for a discrete model. Ex: phone numbers, persons.

**Algorithms to minimize sum of squared errors:**

- Ordinary least squares: sklearn LinearRegression.
- Gradient descent.

There can be multiple lines that minimize Σ|error|, but only one that minimizes Σerror².
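As a sketch, ordinary least squares via scikit-learn's `LinearRegression` (the toy data, roughly y = 2x + 1, is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y = 2x + 1 with a little noise
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.9])

reg = LinearRegression()  # ordinary least squares fit
reg.fit(X, y)
print(reg.coef_[0], reg.intercept_)  # close to 2 and 1
```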

**Instance based learning:**

Properties:

- Remembers the training data.
- Fast: there is no training step.
- Simple.
- No generalization.
- Sensitive to noise.
- Prediction is a look-up in the stored data.

In k-NN all features matter equally, because when we calculate distance, all features are treated equally.
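Because every feature counts equally in the distance, a feature on a much larger scale dominates; a sketch of standardizing before k-NN (the data is illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# feature 0 lives in [0, 1], feature 1 in the thousands:
# without scaling, feature 1 dominates every distance
X = np.array([[0.1, 1000], [0.2, 3000], [0.9, 1100], [0.8, 2900]])
y = np.array([0, 1, 0, 1])

scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=1).fit(scaler.transform(X), y)
print(knn.predict(scaler.transform([[0.15, 1050]])))  # nearest to [0.1, 1000]
```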

Locally weighted regression (evolved from k-nearest neighbors):

Naive bayes:

A powerful tool for creating classifiers from labeled data.
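A minimal Naive Bayes sketch with scikit-learn's `GaussianNB` (dataset and split are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# fit class-conditional Gaussians per feature, predict via Bayes' rule
clf = GaussianNB()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```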