Deep Learning


What is the only parameter of the perceptron algorithm?

*MaxIter*: the number of passes to make over the training set (too many passes -> overfit)
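
A minimal sketch of the classic perceptron training loop, assuming labels in {-1, +1}; the function name and toy data are illustrative:

```python
import numpy as np

def perceptron_train(X, y, max_iter=10):
    """Classic perceptron: y in {-1, +1}. max_iter is the only hyperparameter."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):                      # MaxIter passes over the data
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:    # mistake: update (error-driven)
                w += y_i * x_i
                b += y_i
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y, max_iter=5)
```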

What are the problems with the MSE loss?

- It does not penalise big mistakes effectively: the derivative of the sigmoid is the problem, since for large values of |wx + b| it becomes very small, so the gradient is tiny even when the prediction is badly wrong (see the illustration below)
- We can repair it according to the chain rule: with the cross-entropy loss, the sigmoid's derivative cancels out of the gradient
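
A small numeric illustration of the problem, assuming an MSE loss of (σ(z) − y)²/2 on a single output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For L = (sigmoid(z) - y)^2 / 2, the chain rule gives
# dL/dz = (sigmoid(z) - y) * sigmoid'(z), with sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
y = 0.0                                 # true label
for z in [0.0, 2.0, 5.0, 10.0]:        # increasingly confident *wrong* predictions
    s = sigmoid(z)
    grad = (s - y) * s * (1 - s)       # dL/dz for MSE
    print(f"z={z:5.1f}  prediction={s:.4f}  gradient={grad:.6f}")
# The gradient shrinks as the mistake gets larger, so learning stalls.
```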

What is a variational autoencoder (VAE)?

- Encode data by inferring the parameters of a distribution
- Reconstruct data in terms of a distribution

How it works:
- To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z)
- The sample is then run through a differentiable generator network g(z)
- Finally, x is sampled from a distribution p_model(x; g(z))
- A problem is that the generated samples tend to be blurry
- Training requires backpropagating through the sampling step, which is handled by the *reparameterization trick* (see the sketch below)
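
A minimal sketch of the reparameterization trick, assuming the encoder outputs a mean and a log-variance (the values below are made up):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I).
    The randomness is moved into eps, so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical encoder outputs for one data point
mu = np.array([0.1, -0.3])
log_var = np.array([-1.0, -2.0])
z = reparameterize(mu, log_var, np.random.default_rng(0))  # differentiable w.r.t. mu, log_var
```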

What are the properties of convolution?

- Sparse connectivity / interactions
- Parameter sharing
- Equivariant representations

Describe sparse connections / interactions in CNN.

- Traditional NN layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit (this means that every output unit interacts with every input unit)
- CNNs, however, have sparse interactions. This is accomplished by making the kernel smaller than the input.
--E.g. when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels
--This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its efficiency (see the comparison below)
--It also means that computing the output requires fewer operations
--In a deep CNN, units in the deeper layers may indirectly interact with a larger portion of the input
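
An illustrative back-of-the-envelope comparison; the image size and kernel size are assumptions:

```python
# Parameter count for one layer on a 1000x1000 greyscale image (illustrative numbers):
inputs = 1000 * 1000
dense_params = inputs * inputs          # fully connected: every output sees every input
conv_params = 3 * 3                     # one 3x3 kernel shared across all positions
print(f"dense: {dense_params:,} weights vs conv: {conv_params} weights")
```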

What is a multi layer perceptron?

A class of feedforward artificial neural networks.

How does back-propagation work?

A neural network propagates the signal of the input data forward through its parameters towards the moment of decision, and then backpropagates information about the error through the network so that it can alter the parameters one step at a time.
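
A minimal numpy sketch of one forward and one backward pass through a single hidden layer; the architecture and the squared-error loss are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)              # one input example
t = 1.0                                 # target
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
w2, b2 = rng.standard_normal(4), 0.0

# Forward pass: propagate the signal towards the prediction
h = np.tanh(W1 @ x + b1)
y = w2 @ h + b2
loss = 0.5 * (y - t) ** 2

# Backward pass: propagate the error back through the same parameters (chain rule)
dy = y - t                              # dL/dy
dw2, db2 = dy * h, dy
dh = dy * w2
dz = dh * (1 - h ** 2)                  # through the tanh nonlinearity
dW1, db1 = np.outer(dz, x), dz
```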

What is a perceptron?

A perceptron is a single-layer neural network. A multi-layer perceptron is called a neural network.

Describe pooling in CNN

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
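
A minimal sketch of 1-D max pooling, one common choice of summary statistic; window size and stride are illustrative:

```python
import numpy as np

def max_pool_1d(x, size=2, stride=2):
    """Replace each window with its maximum -- one common summary statistic."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, stride)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(max_pool_1d(x))                   # [3. 5. 4.]
```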

What is a Gated Recurrent Unit?

A single gating unit simultaneously controls the forgetting factor and the decision to update the state unit
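
A minimal numpy sketch of one GRU step following common textbook equations; the weight names and shapes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step. The single update gate z interpolates between keeping the
    old state (the forgetting factor) and writing the new candidate state."""
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # one gate controls forget and update

rng = np.random.default_rng(0)
d, n = 3, 4                                   # input and state sizes (illustrative)
Ws = [rng.standard_normal((n, d)) if i % 2 == 0 else rng.standard_normal((n, n))
      for i in range(6)]
h = gru_step(rng.standard_normal(d), np.zeros(n), *Ws)
```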

What is a word embedding?

A vector representation of words or sentences that captures their semantics to some degree

Advantages and Disadvantages of LSTMs?

Advantage: successful in modelling long-term dependencies.
Disadvantage: they have a lot of parameters (many gates).
--But: it is possible to simplify LSTMs into Gated Recurrent Units.

What are CNNs?

A specialized kind of neural network for processing data that has a known grid-like topology. CNNs are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
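
A minimal sketch of the convolution operation (technically cross-correlation, as in most deep learning libraries); the image and kernel are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation -- what most DL libraries call 'convolution'."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])   # responds to horizontal intensity changes
print(conv2d(image, edge_kernel))
```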

Perceptron's classification abilities?

Classify x as either a positive or negative instance

What is a disadvantage of RNN?

Deep RNNs cannot cope well with *long-term dependencies*.
- Gated RNNs and Long Short-Term Memory (LSTM) networks are better able to cope with long-term dependencies.

What are the types of embedding?

- Frequency-based
- Prediction-based

What are the initialisation schemes for deep learning?

Gaussian initialization:
- We could set the initial weights to N(0, 1) (mean 0 and std 1)
-- Consider a neuron receiving inputs from N neurons (including the bias)
-- Let's assume 50% of the neurons are active
-- As a result, *large values are likely to saturate the sigmoids*

Xavier (Glorot) initialization:
- Therefore, we could instead set the initial weights to N(0, 1/sqrt(N))
-- This effectively prevents saturation of the sigmoid function

He initialization:
- Xavier initialization does not work well for ReLUs; instead set the initial weights to N(0, sqrt(2/N))

(See the side-by-side sketch below.)
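
A sketch of the three schemes side by side; the fan-in N is an arbitrary example value:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512                                               # fan-in: incoming connections

naive  = rng.normal(0.0, 1.0, size=N)                 # N(0, 1): likely saturates sigmoids
xavier = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)    # Xavier/Glorot: variance 1/N
he     = rng.normal(0.0, np.sqrt(2.0 / N), size=N)    # He: variance 2/N, suited to ReLUs
```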

What is softmax?

For each output neuron i, softmax computes exp(z_i) / Σ_j exp(z_j), where the summation is over all output neurons; this normalizes their outputs so that they are positive and sum to 1.
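
A minimal, numerically stable implementation sketch:

```python
import numpy as np

def softmax(z):
    """Softmax: exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z - z.max())              # subtracting the max avoids overflow
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # entries are positive and sum to 1
```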

What are the options for loss functions?

Two options:
- Sigmoids and cross-entropy loss
- Softmax and log-likelihood loss (*provides probabilities*)
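
An illustrative sketch of both options on a single one-hot target; the logits are made up:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])       # one-hot target
logits = np.array([0.2, 2.0, -1.0])

# Option 1: element-wise sigmoids + cross-entropy loss
p = 1.0 / (1.0 + np.exp(-logits))
bce = -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Option 2: softmax + negative log-likelihood (gives a proper probability distribution)
q = np.exp(logits - logits.max())
q /= q.sum()
nll = -np.log(q[y_true.argmax()])
```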

How do we change the parameters in backpropagation?

Using an optimization algorithm called *gradient descent*, which is useful for finding the minimum of a function. We seek to minimize the error (aka loss function / objective function)
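
A minimal sketch of gradient descent finding the minimum of a simple one-dimensional function; the function and learning rate are illustrative:

```python
import numpy as np

def gradient_descent_step(params, grads, learning_rate=0.01):
    """Move each parameter a small step against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Minimising f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3)
w = np.array([0.0])
for _ in range(100):
    grad = 2 * (w - 3)
    [w] = gradient_descent_step([w], [grad])
print(w)                                 # approaches the minimum at w = 3
```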

Explain the prediction based word embedding methods.

Word2Vec combines two neural network methods that map words onto target words:
1) Continuous bag of words (CBOW):
- Predicts the probability of a word given its context
- The order of context words does not influence the prediction
2) Skip-gram model:
- Uses the current word to predict the surrounding window of context words
CBOW is faster, while the skip-gram architecture weighs nearby context words more heavily than more distant context words.
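
A minimal sketch of how (centre, context) training pairs might be generated for a skip-gram model; the sentence and window size are illustrative:

```python
# Generate (centre, context) pairs with a window of 2 words on each side.
sentence = "the quick brown fox jumps".split()
window = 2
pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))
print(pairs[:4])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
```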

How does tensorflow work?

TensorFlow works by creating a directed acyclic graph (DAG) to represent your computation. A tensor is an n-dimensional array of data, so your data in TensorFlow are tensors: they flow through the DAG.
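
A minimal TensorFlow 2 sketch: tf.function traces the Python code into a graph whose nodes are ops and whose edges carry tensors (the computation itself is illustrative):

```python
import tensorflow as tf

@tf.function                     # traces the Python function into a computation graph
def affine(x, w, b):
    return tf.matmul(x, w) + b   # each op becomes a node; tensors flow along the edges

x = tf.constant([[1.0, 2.0]])
w = tf.constant([[3.0], [4.0]])
b = tf.constant([0.5])
print(affine(x, w, b))           # tf.Tensor([[11.5]], ...)
```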

What is TensorFlow?

Enables writing computation code in a high-level language like Python and having it executed fast.

Describe equivariant representations in CNN.

If the input changes, the output changes in the same way. For convolution, the classic example is translation: shifting the input shifts the feature map by the same amount.
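
A small numpy check of translation equivariance for 1-D convolution; the signal and kernel are illustrative:

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
k = np.array([1.0, -1.0])

def shift(a):                       # shift right by one position
    return np.concatenate(([0.0], a[:-1]))

# Convolving a shifted input equals shifting the convolved output
# (exact here because the signal is zero at the boundary):
lhs = np.convolve(shift(x), k)
rhs = shift(np.convolve(x, k))
print(np.allclose(lhs, rhs))        # True: shifting the input shifts the output
```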

What is deep learning?

is part of a broader family of machine learning methods based on artificial neural networks

DL strengths?

• Ability to integrate information from huge heterogeneous sources
• Ability to predict
• Ability to detect/recognise
• Ability to discover patterns

DL Weaknesses?

• The computer is a very powerful but rather stupid calculation machine
• AI algorithms lack "common sense"
• AI cannot put events into their context
• AI depends critically on the quality of the underlying statistics/data
• Deep Learning is opaque

Describe parameter sharing in CNN?

In a CNN, each member of the kernel is used at every position of the input (except perhaps some boundary pixels). The parameter sharing used by the convolution operation means that *rather than learning a separate set of parameters for every location*, we *learn only one set*.

What is the basic idea of LSTM?

Introduce adaptive "gates" that can suppress the propagation of info in the hidden layer of an RNN.

What is back-propagation?

It is the central mechanism by which neural networks learn. It is the messenger telling the network whether or not the net made a mistake when it made a prediction.

What does it mean for a perceptron to converge?

It means it can make an entire pass through the training data without making any more updates (i.e. it has correctly classified every training example). In this case the data is linearly separable.

What does a generative model do?

Learning data distributions using unsupervised learning, in a way that allows generating new data points.

When do two vectors have a zero dot product?

Only if they are *perpendicular*. Thus, if we think of the weights as a vector w, then the decision boundary is simply the plane perpendicular to w.

What does it mean for a perceptron to be online?

Rather than considering the *entire data set* it *only looks at one example*. It processes the example and goes on to the next one.

What is an RNN?

Recurrent neural network: a class of artificial neural network where connections between nodes form a directed graph along a temporal sequence, which allows it to exhibit temporal dynamic behaviour. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process inputs.
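
A minimal numpy sketch of a vanilla RNN step; the weight names, shapes, and tanh nonlinearity are common textbook choices, assumed here for illustration:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    """One step of a vanilla RNN: the hidden state h is the network's memory.
    The same weights are reused at every time step."""
    return np.tanh(Wx @ x + Wh @ h + b)

rng = np.random.default_rng(0)
Wx, Wh, b = rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), np.zeros(4)
h = np.zeros(4)
for x in rng.standard_normal((5, 3)):   # a sequence of 5 input vectors
    h = rnn_step(x, h, Wx, Wh, b)       # h carries information across time steps
```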

What does it mean for perceptron to be error-driven?

So long as it is doing well, it doesn't update its parameters

How does a Generative Adversarial Net (GAN) work?

The generator generates data samples from a noise distribution, z ∈ ℝ^(k×1). The discriminator separates generated data from training data (a binary classification).
Two training parts:
- Update the discriminator: the discriminator sees training data and samples from the generator. Only the parameters of the discriminator are updated.
- Update the generator: the discriminator is fed only with samples from the generator. Only the parameters of the generator are updated.
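
A minimal PyTorch sketch of the two-part training loop; the architectures, data, and hyperparameters are all illustrative assumptions:

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))   # generator
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(32, 2) + 3.0          # stand-in for training data
for step in range(100):
    # 1) Update the discriminator on real data and generated samples
    fake = G(torch.randn(32, 8)).detach()        # detach: only D is updated here
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Update the generator: feed D only generated samples, labelled as "real"
    fake = G(torch.randn(32, 8))
    loss_g = bce(D(fake), torch.ones(32, 1))     # only G's parameters are stepped
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```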

What are the frequency based embedding methods?

1) Count vectors:
- Corpus C containing D documents and N unique tokens
- Count-vector matrix M: D x N entries, with each row specifying the frequency of tokens in a specific document
2) TF-IDF:
- Common words such as "the" or "a" occur very frequently; TF-IDF penalises these words by assigning them lower weights, while prioritising relevant words
3) Co-occurrence matrix:
- Similar words tend to occur together and will have similar contexts
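
A minimal sketch of the first two methods using scikit-learn's vectorizers; the documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

counts = CountVectorizer().fit_transform(docs)   # D x N count-vector matrix M
tfidf = TfidfVectorizer().fit_transform(docs)    # downweights common words like "the"
print(counts.shape, tfidf.shape)                 # (2, number of unique tokens)
```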

What does the bias of a perceptron change?

The bias alters the position (though not the orientation) of the decision boundary.

What is the decision boundary of a perceptron?

The decision boundary is precisely where the sign of the activation, a, changes from -1 to +1. In other words, it is the set of points x that achieve zero activation. These points are neither clearly positive nor clearly negative.

