### Neuron:

Inputs from the dendritic tree

Outputs at axon terminal.

The effectiveness of the synapse can be changed.

- vary the number of vesicles of the transmitter.
- vary the number of receptor molecules.

With more practice, the myelin sheath gets thicker and acts as a stronger insulator, reducing the loss of electrical signal during transmission.

We need to model the myelin sheath effect in neural networks to simulate addictive and habitual human behavior. The parameter representing the myelin sheath should change each time a signal travels along that path.

One idea is to introduce this effect in multi-layer neural networks by adding, after each learned feature, a separate set of weights that captures how important that feature is for learning the output, and to train these weights in the same way as in backpropagation.

**Models of neurons**:

#### Binary threshold neurons:

(nondifferentiable)

#### Rectified linear neuron:

This activation function can be used effectively against the vanishing gradient problem, but be careful not to fall prey to the exploding gradient problem.

Vanishing gradient problem: when training neural networks with backpropagation, each weight is updated based on the gradient of the error function w.r.t. that weight. By the chain rule, that gradient involves the gradient of the activation function. If the gradient of the activation function has magnitude less than 1 and many such gradients are multiplied together, as happens for the early layers, the product becomes very small and the weight updates are no longer effective.
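The shrinking product described above can be illustrated with a small sketch. It assumes, purely for illustration, that every layer contributes the same local activation gradient and that all weights are 1; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the logistic function: y * (1 - y), at most 0.25.
    y = sigmoid(z)
    return y * (1.0 - y)

def relu_grad(z):
    # Derivative of the rectified linear unit: 1 for positive input, else 0.
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, z, depth):
    """Product of `depth` identical local gradients (weights assumed to be 1)."""
    product = 1.0
    for _ in range(depth):
        product *= grad_fn(z)
    return product

for depth in (1, 5, 10, 20):
    print(depth,
          chained_gradient(sigmoid_grad, 0.0, depth),
          chained_gradient(relu_grad, 1.0, depth))
```

Even at z = 0, where the sigmoid gradient is at its largest, the product shrinks geometrically with depth, while the rectified linear gradient stays at 1 for active units.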

#### Sigmoid Neurons:

(differentiable)

The derivative flattens out for very large and very small x values.

The tanh function has outputs in the range -1 to 1 and can be better than the sigmoid for faster training.

#### Stochastic Binary Neurons:

The graph for this neuron is the same as the sigmoid's, except that the y-axis is the probability of outputting a 1 rather than the output value itself.
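The neuron models above can be sketched as plain functions of the total input z. This is a minimal illustration, not a reference implementation; the function names and the default threshold of 0 are assumptions.

```python
import math
import random

def binary_threshold(z, threshold=0.0):
    # Outputs 1 once the total input reaches the threshold, else 0.
    return 1 if z >= threshold else 0

def relu(z):
    # Rectified linear: linear above zero, exactly zero below.
    return max(0.0, z)

def sigmoid(z):
    # Smooth, differentiable squashing of z into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def stochastic_binary(z, rng):
    # Treats sigmoid(z) as the probability of emitting a 1.
    return 1 if rng.random() < sigmoid(z) else 0

rng = random.Random(0)
samples = [stochastic_binary(2.0, rng) for _ in range(10000)]
mean_output = sum(samples) / len(samples)
print(mean_output, sigmoid(2.0))  # the two values should be close
```

Averaging many samples of the stochastic binary neuron recovers the sigmoid value, which is the sense in which the two share the same curve.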

### Different types of Machine Learning:

#### Supervised Learning:

- Learn to predict an output when given an input vector.
- Two types:
  - Regression: the output is continuous, such as the change in housing prices over the years.
  - Classification: the outputs are discrete class labels.

#### Reinforcement learning:

- Learn to select an action to maximize payoff.
- The goal in selecting each action is to maximize the expected sum of the future rewards.

#### Unsupervised learning:

- Discover a good internal representation of the input.
- Principal component analysis.
- Clustering.

### Types of Neural Network architectures:

#### Feed forward neural networks:

If there is more than one hidden layer, the network is called a deep neural network.

#### Recurrent Networks:

These are difficult to train but biologically realistic.

#### Recurrent neural networks for modeling sequences:

To model sequential data.

They are able to remember information in their hidden state for a long time.

One of the applications developed by Ilya Sutskever is predicting the next character in a sequence.

#### Symmetrically connected networks:

Recurrent nets with the same weights in both directions between two nodes.

#### Symmetrically connected networks with hidden units (Boltzmann machines):

### Perceptron:

The objective is to choose the weights and bias so that the perceptron correctly classifies the classes we care about.

#### Learning:

If the output unit is correct, don’t change the weights.

If the output unit is zero instead of one, add the input vector to the weight vector.

If the output unit is one instead of zero, subtract the input vector from the weight vector.
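The three rules above can be sketched as follows. The AND-gate data, the trick of folding the bias into the weights through a constant input of 1, and the fixed number of epochs are assumptions made for this example.

```python
def predict(weights, x):
    # Binary threshold unit; the bias is the weight on a constant input of 1.
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else 0

def train_perceptron(data, epochs=20):
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, target in data:
            output = predict(weights, x)
            if output == target:
                continue  # correct: leave the weights alone
            if output == 0:
                # Output 0 instead of 1: add the input vector.
                weights = [w + xi for w, xi in zip(weights, x)]
            else:
                # Output 1 instead of 0: subtract the input vector.
                weights = [w - xi for w, xi in zip(weights, x)]
    return weights

# AND gate; the last component of each input is the constant bias input.
and_data = [([0, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]
and_weights = train_perceptron(and_data)
print(and_weights, [predict(and_weights, x) for x, _ in and_data])
```

Because AND is linearly separable, the procedure settles on weights that classify all four cases correctly.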

#### The limitations of perceptrons:

Can only learn linear boundaries.

An XOR gate can't be learned by a perceptron.

Minsky and Papert’s group invariance theorem.

If the classes are not linearly separable, the learning procedure never converges and the iterations keep going.

With perceptrons, the hand-coded feature detectors are the key part of pattern recognition, not the learning procedure.

The long-term conclusion of the study of perceptrons is that neural networks without hidden layers are very limited, or need to be fed hand-designed features to get good results on complicated pattern recognition. Networks with hidden layers can learn the features themselves, provided we can find a way to update the weights across all layers appropriately.

## The backpropagation learning (overkill of chain rule):

In linear neurons:

Iterative method:

Not efficient but generalizable.

To modify a particular weight appropriately, we first calculate the rate of change of the error, summed across all training cases, with respect to that weight. We then multiply this quantity by a learning rate that we choose to get the change in that weight.

In the delta rule, we increment or decrement the weight vector by the input vector scaled by the residual error and the learning rate.

Convergence of the weights depends on the correlation between input dimensions. It is hard to decide on the individual weights (w_i) when the corresponding inputs (x_i) are the same or highly correlated.

Online learning: With delta rule, you don’t need to collect all the training cases and then train them. We can train the network with one training example at a time as we get them.
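A minimal sketch of the delta rule used online, one training case at a time, for a linear neuron. The toy data (targets generated by t = 2*x1 + 3*x2), the learning rate and the number of passes are assumptions chosen for illustration.

```python
def delta_rule_step(weights, x, target, learning_rate):
    # Linear neuron output.
    y = sum(w * xi for w, xi in zip(weights, x))
    # Residual error: how far the output is from the target.
    residual = target - y
    # Move the weights along the input vector, scaled by residual and rate.
    return [w + learning_rate * residual * xi for w, xi in zip(weights, x)]

# Toy cases consistent with t = 2*x1 + 3*x2 (hypothetical data).
cases = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0)]

weights = [0.0, 0.0]
for _ in range(200):
    for x, t in cases:  # online: update after every single case
        weights = delta_rule_step(weights, x, t, learning_rate=0.1)

print(weights)  # should approach [2.0, 3.0]
```

Each update only needs the current case, so the cases could just as well arrive as a stream.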

For a linear neuron, the error surface looks like a quadratic bowl, with the weights on the horizontal axes and the error on the vertical axis.

Steepest descent: it is not effective on elongated error surfaces.

Non-linear neuron: the output is a logistic function of the logit, i.e. X*W+b.

Backpropagation comes into the picture when we need to learn the weights of the hidden units.

The main objective is to find the rate of change of the error w.r.t. a particular weight (w_ij) of a hidden unit for any one training case.

This quantity can be expressed by chain rule as derivative of logit w.r.t weight multiplied by derivative of error w.r.t logit.

Derivative of error w.r.t logit is derivative of a particular output unit w.r.t logit multiplied by derivative of error w.r.t particular output unit.
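The two chain-rule statements above can be written out as follows. The symbols are assumptions introduced here: E is the error on one training case, z_j the logit of unit j, y_j its output, and y_i the activity of unit i feeding into j through w_ij. The last step of the second line assumes a logistic unit.

$$
\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \, \frac{\partial E}{\partial z_j} = y_i \, \frac{\partial E}{\partial z_j}
$$

$$
\frac{\partial E}{\partial z_j} = \frac{d y_j}{d z_j} \, \frac{\partial E}{\partial y_j} = y_j (1 - y_j) \, \frac{\partial E}{\partial y_j}
$$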

Optimisation issues: How do we use the error derivatives on individual cases to discover a good set of weights?

How often?

- Online – good when there is redundancy in data.
- Full batch
- Mini-batch

How much?

- fixed learning rate
- adapt the global learning rate.
- adapt the learning rate on each connection separately?
- Don’t use the steepest descent?

Generalization issues: How do we ensure that learned weights work for cases for which we have not trained them as well?

Ways to reduce overfitting:

- Weight decay.
- Weight sharing.
- Early stopping – train (in number of epochs) only until the error on held-out validation data starts to increase.
- Model Averaging.
- Bayesian fitting of neural nets.
- Dropout (the dropout rate is in the range 0 to 1; start with a small value and increase it if you think it is necessary).
- Generative pre-training.
- Cross-validation
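As one concrete example from this list, dropout can be sketched as a random mask over the hidden activations. The rescaling of surviving units by 1/(1 - rate), often called inverted dropout, is an assumption of this sketch rather than something stated in these notes, and the function name is hypothetical.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale survivors."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, 0.5, rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(zero_fraction)  # roughly half of the units are removed
```

The mask is applied only during training; at test time all units are kept.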

For logistic neurons, the term dy/dz=y*(1-y) appears in the error derivative used to calculate dw for learning. If y=0.000001 and the target output is 1, the error is almost as large as it can be, yet learning is very slow because the y factor in dy/dz makes the gradient tiny.
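The numbers in this example can be checked with a short sketch. It assumes a squared error E = 0.5*(t - y)^2 for the first function and, for comparison, uses the standard result that the cross-entropy cost on a logistic output has gradient y - t with respect to the logit; the function names are hypothetical.

```python
def squared_error_grad_wrt_logit(y, t):
    # dE/dz = dE/dy * dy/dz = -(t - y) * y * (1 - y)
    return -(t - y) * y * (1.0 - y)

def cross_entropy_grad_wrt_logit(y, t):
    # For a logistic output, the y * (1 - y) factor cancels: dC/dz = y - t.
    return y - t

y, t = 0.000001, 1.0
print(squared_error_grad_wrt_logit(y, t))   # tiny, despite the large error
print(cross_entropy_grad_wrt_logit(y, t))   # close to -1
```

The squared-error gradient is about a million times smaller in magnitude, which is why learning stalls in this situation.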

Softmax:

If we use the softmax function instead of the logistic function at the output, the outputs form a probability distribution over mutually exclusive alternatives: each output lies between 0 and 1, and the outputs for all classes sum to 1.
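A minimal softmax sketch in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability step and is an addition of this example, not part of the notes.

```python
import math

def softmax(logits):
    # Shift by the maximum so that exp() cannot overflow.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```

The largest logit always receives the largest probability, and the values sum to 1 up to floating-point rounding.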

Cross entropy cost function:

C = -summation(t_j*log(y_j)), where t_j is the target value and y_j the output for class j.
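The cost can be computed as in the sketch below, using the conventional leading minus sign so that the cost is non-negative. The small clamp `eps` that avoids log(0) and the example vectors are assumptions of this example.

```python
import math

def cross_entropy(targets, outputs, eps=1e-12):
    # C = -sum_j t_j * log(y_j); outputs are clamped away from zero.
    return -sum(t * math.log(max(y, eps)) for t, y in zip(targets, outputs))

targets = [0.0, 1.0, 0.0]          # one-hot target: class 1 is correct
confident = [0.1, 0.8, 0.1]        # most probability on the right class
mistaken = [0.8, 0.1, 0.1]         # most probability on a wrong class

print(cross_entropy(targets, confident))
print(cross_entropy(targets, mistaken))
```

With a one-hot target, only the probability assigned to the correct class contributes, and the cost grows as that probability falls.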

**Categorical cross entropy loss function** :

Object recognition:

- Why is object recognition difficult?
  - Objects are defined by their purpose rather than by their look or structure. We need to coordinate with modules 3 and 7 to overcome this.
  - We need to identify the object even when the viewpoint changes. A change in viewpoint causes dimension hopping when training neural networks to recognize the object: the inputs to an image-recognition network are usually pixels, and when the viewpoint changes, the information that appeared at one pixel in one training case appears at a different pixel in another. #viewpoint_invariance

- Solutions to achieve viewpoint invariance:
  - Redundant invariant features.
  - Put a box around the object.
  - Convolutional neural nets.
  - Use a hierarchy of parts that have explicit poses relative to the camera.

Mini-batch learning:

- It is better than online learning where you update the weights for each case, because of the computational efficiency in dealing with multiple training cases at once by matrix manipulations.

Initializing weights:

- To break symmetry, initialize weights to random values.
- If you start with a very big learning rate, the weights will become very large positive or very large negative. The error derivatives will then become tiny, and you might mistake the resulting plateau for a local minimum.

Use principal component analysis to decorrelate inputs. This achieves some dimensionality reduction after removing the components with small eigenvalues.

Stochastic gradient descent with mini-batches combined with momentum is the most common recipe used for learning big neural nets.

Unlike full-batch gradient descent, where you use all available data to calculate the error before each update, in stochastic gradient descent you take a small chunk of the data, calculate the error on it, update the weights, and repeat this process for the remaining chunks of the data.
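The chunk-by-chunk procedure can be sketched on a toy problem: fitting y = w*x + b to noise-free data with mini-batches. The data, batch size, learning rate, number of epochs and function names are all assumptions chosen for illustration.

```python
import random

def minibatches(data, batch_size, rng):
    # Shuffle a copy, then hand out consecutive chunks.
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def sgd_fit_line(data, batch_size=4, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            # Average gradient over this chunk only.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Noise-free points on the line y = 2x + 1 (hypothetical data).
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_fit_line(data)
print(w, b)  # should approach 2 and 1
```

Setting `batch_size` to 1 gives online learning, and setting it to `len(data)` gives full-batch gradient descent.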

If your model is not working, decrease the learning rate so that it takes smaller steps and has a good chance of reaching the bottom rather than moving to and fro.

When gradient descent gets stuck in a local optimum, we might need other, more advanced methods, such as:

- Momentum in the gradient updates.
- Higher-order derivatives.
- Randomized optimization – initialize the weights with random values.
- A penalty for complexity (overfitting) – by reducing layers, by reducing nodes, by reducing the range of weights.
- Nesterov momentum (the gradient is evaluated after the momentum step, which corrects the update and slows it down when it is close to the solution).
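Of the methods listed, classical momentum is the simplest to sketch. The example below compares plain gradient descent with momentum on an elongated quadratic bowl E = 0.5*(0.01*w0^2 + w1^2); the surface, learning rate, momentum value and function names are assumptions of this example, and Nesterov momentum is not shown.

```python
import math

def descend(grad_fn, start, learning_rate, steps, momentum=0.0):
    position = list(start)
    velocity = [0.0] * len(position)
    for _ in range(steps):
        grad = grad_fn(position)
        # Velocity keeps a decaying memory of previous gradients.
        velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grad)]
        position = [p + v for p, v in zip(position, velocity)]
    return position

def bowl_grad(w):
    # Gradient of E = 0.5 * (0.01 * w0**2 + w1**2): shallow in w0, steep in w1.
    return [0.01 * w[0], w[1]]

def distance_to_minimum(w):
    return math.sqrt(sum(wi * wi for wi in w))

start = [10.0, 1.0]
plain = descend(bowl_grad, start, learning_rate=1.0, steps=100)
with_momentum = descend(bowl_grad, start, learning_rate=1.0, steps=100, momentum=0.9)
print(distance_to_minimum(plain), distance_to_minimum(with_momentum))
```

With the same learning rate and number of steps, the momentum run ends much closer to the minimum because it builds up speed along the shallow direction.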

L1 regularization: Add modulus of weights in error function. Good for feature selection as many weights can get reduced to zero and become sparse.

L2 regularization: Add squared weights to error function. Normally better for training models.
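The two penalties and their gradients can be sketched as below. The regularization strength and the helper names are assumptions of this example; the gradient of the L1 term at exactly zero is taken to be 0 here.

```python
def l1_penalty(weights, strength):
    # Sum of absolute weight values, added to the error function.
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    # Sum of squared weight values, added to the error function.
    return strength * sum(w * w for w in weights)

def l1_grad(weights, strength):
    # Constant-size pull toward zero, regardless of the weight's magnitude.
    return [strength * ((w > 0) - (w < 0)) for w in weights]

def l2_grad(weights, strength):
    # Pull toward zero proportional to the weight itself.
    return [2.0 * strength * w for w in weights]

weights = [1.0, -2.0, 0.0]
print(l1_penalty(weights, 0.1), l1_grad(weights, 0.1))
print(l2_penalty(weights, 0.1), l2_grad(weights, 0.1))
```

The constant pull of the L1 gradient is what can drive small weights all the way to zero, while the L2 pull fades as a weight shrinks.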

All things being equal, the simpler explanation is preferred – similar to Occam’s razor

Keras :

Flavours of neural networks:

**My Thoughts:**

- Usually, we design the hierarchical structure of layers of neurons for learning a particular task, such as classification or prediction. But the brain has already evolved some hierarchical neural network structures that are very good at what they do. What if we try to emulate this evolution by applying genetic algorithms to neurons and their connections, to see whether we can come up with a beautiful, stable structure of neurons that is efficient at learning lots of tasks? What should the fitness function be to validate our network structures at any point in the simulation?
- For object recognition, instead of training a network to classify a specific set of objects, can we train a network just to separate all the different objects spatially, even though it has not yet been trained to name them or assign them a class? For every new image, it just has to be able to differentiate all the elements in the image, irrespective of whether it has encountered them in previous training sets. If you show humans an object in an image, even though they haven't seen that object before, they identify it as a unique object, whatever it may be.

**Glossary:**

- Model class (f) : The function used to map inputs to outputs like models of neurons discussed above.
- Distributed representation: Many neurons are involved in representing one concept, and one neuron is involved in representing many concepts.
- Bottleneck layers: Layers with fewer nodes than the input layer.
- Dropout: A regularization technique in which a fraction of the hidden units in a layer (commonly half) is randomly removed during training.
- Fan-in: The number of inputs to a unit.

A neural network is itself a directed graph, so a network can in principle be the output of another neural network.

Miscellaneous:

- How to apply Neural networks to time series data? With recurrent network.
- Overview of all gradient descent algorithms.