Deep Learning Book
What is Maximum Likelihood Estimation?
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters. The method of maximum likelihood corresponds to many well-known estimation methods in statistics.

For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE accomplishes this by taking the mean and variance as parameters and finding the particular parameter values that make the observed results the most probable given the model.

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems.
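The penguin example can be sketched directly, since the MLE for a normal distribution has a closed form: the estimated mean is the sample mean, and the estimated variance is the average squared deviation (dividing by n, not n - 1). The height values below are made up for illustration.

```python
# Hypothetical sample of penguin heights (cm).
heights = [61.2, 58.7, 63.1, 60.4, 59.8, 62.5, 60.9]

n = len(heights)

# MLE under a normal model:
#   mu_hat  = sample mean
#   var_hat = mean squared deviation (note the divisor is n, not n - 1)
mu_hat = sum(heights) / n
var_hat = sum((x - mu_hat) ** 2 for x in heights) / n

print(mu_hat, var_hat)
```

These are exactly the parameter values that make the observed sample most probable under a normal model.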
What initial weight and bias values should you use for feedforward networks?
It is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values.
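A minimal sketch of this initialization scheme, using only the standard library. The function name, the scale of 0.01, and the layer sizes are illustrative assumptions; small random weights break the symmetry between units, while biases start at zero here.

```python
import random

def init_layer(n_in, n_out, scale=0.01, bias_value=0.0):
    # Small random weights break symmetry between units; biases may be
    # zero or a small positive constant (bias_value). Names and scale
    # are illustrative assumptions.
    weights = [[random.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [bias_value] * n_out
    return weights, biases

W, b = init_layer(4, 3)
```

If all weights started at the same value, every unit in a layer would compute the same function and receive the same gradient, so they would never differentiate from one another.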
How should you think of a layer of a neural network?
Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value.
In modern neural networks, what is the default recommendation for an activation function?
ReLU - Rectified Linear Unit. Defined by the activation function g(z) = max{0, z}
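The definition g(z) = max{0, z} translates directly into code; a scalar version is enough to show the behavior:

```python
def relu(z):
    # ReLU: g(z) = max{0, z} -- passes positive inputs through,
    # clamps negative inputs to zero.
    return max(0.0, z)

print([relu(z) for z in [-2.0, -0.5, 0.0, 1.5]])  # → [0.0, 0.0, 0.0, 1.5]
```

In practice this is applied element-wise to the pre-activation vector of a layer.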
What determines the width and depth of a neural network?
The dimensionality of the hidden layers determines the width. The depth is the number of hidden layers, or equivalently the length of the chain of functions being composed.
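As a concrete reading of these terms, a network's architecture can be summarized by a list of layer sizes; the numbers below are a hypothetical example, not from the source.

```python
# Hypothetical architecture: 10 inputs, three hidden layers, 2 outputs.
layer_sizes = [10, 128, 64, 32, 2]

hidden_sizes = layer_sizes[1:-1]
depth = len(hidden_sizes)   # depth: number of hidden layers
widths = hidden_sizes       # width: dimensionality of each hidden layer
print(depth, widths)
```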
Why are feedforward networks called that?
These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself.
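The forward-only flow amounts to plain function composition: f(x) = f3(f2(f1(x))). The toy layer functions below are made up to illustrate the structure; nothing feeds back into an earlier stage.

```python
def f1(x):   # first layer (toy example)
    return 2 * x

def f2(h):   # second layer (toy example)
    return h + 1

def f3(h):   # output layer (toy example)
    return h * h

def f(x):
    # Information flows forward only: x -> f1 -> f2 -> f3 -> y.
    return f3(f2(f1(x)))

print(f(3))  # 2*3 = 6 -> 6+1 = 7 -> 7*7 = 49
```

When such feedback connections are added, the model becomes a recurrent network instead.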
What's the process to build a machine learning algorithm?
You specify a dataset, a cost function, an optimization procedure, and a model family.
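This recipe can be sketched end-to-end with the simplest possible choices: a tiny made-up dataset, a linear model family, mean squared error as the cost, and gradient descent as the optimization procedure. All numbers are hypothetical.

```python
# Dataset (hypothetical): points generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Model family: y_hat = w*x + b, parameterized by (w, b).
w, b = 0.0, 0.0
lr = 0.05  # learning rate for the optimization procedure

for _ in range(2000):
    # Cost function: mean squared error; below are its gradients
    # with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Optimization procedure: gradient descent step.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # should approach 2.0 and 1.0
```

Swapping in a deeper model family, a different cost, or a different optimizer changes the algorithm without changing the overall recipe.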