Deep Learning, New ML

What NN should be used for Seq2Seq tasks?

Encoder-Decoder (RNN)

What is end-to-end learning? Give a few of its advantages.

End-to-end learning trains a single model that takes the raw data as input and directly outputs the desired outcome, with no intermediate tasks or feature engineering. Among its advantages: there is no need to handcraft features, and it generally leads to lower bias.

What assumptions does the gradient descent method make?

It makes a strong assumption about the magnitude of the local curvature in the selection of a fixed step size, and it assumes isotropy in that the same step size should be used in all directions.

What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?

Epoch - Represents one iteration over the entire dataset (everything put into the training model). Batch - Refers to when we cannot pass the entire dataset into the neural network at once, so we divide the dataset into several batches. Iteration - If we have 10,000 images as data and a batch size of 200, then an epoch should run 50 iterations (10,000 divided by 200).
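
The arithmetic above can be checked with a short sketch. The helper name `iterations_per_epoch` is made up for illustration, and rounding up for a final, smaller batch is an assumption that the card does not discuss.

```python
import math

def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of parameter updates needed to see every sample once."""
    # ceil accounts for a final, smaller batch when the sizes do not divide evenly
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(10_000, 200))  # 50
```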

Does over-parametrizing a deep model improve or reduce test performance?

It often improves the test performance, contrary to what the bias-variance decomposition predicts.

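Transposed Convolutions.

They expand a signal from a lower-dimensional to a higher-dimensional space, for example in the decoder of an auto-encoder to go back to the original signal space.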

When approximating a deep network 'f', with a single hidden layer 'g', how does the width of 'g' grow as the depth of 'f' grows?

Exponentially.

What is the idea of pooling?

Extracting relevant features independent of their position in the input

How does transition-based dependency parsing work?

It only builds one tree, in one left-to-right sweep over the input

Why is language modeling too restrictive?

It only considers the left context and not the right context

Name one characteristic of the t-Distributed Stochastic Neighbor Embedding (t-SNE).

It optimizes the low-dimensional points y_i with SGD so as to preserve the distances to the close neighbors of each point. This makes it easier to distinguish between clusters. Often used to reduce the dimension of high-dimensional data.

What is the receiver operating characteristic (ROC) curve?

It shows the trade-off between the true positive (TP) rate and the false positive (FP) rate as the decision threshold varies. A standard summary of this curve is the Area Under the Curve (AUC).

What does a dependency structure show?

It shows which words depend on (modify or are arguments of) which other words

How does BiSkip work?

It uses sentence- and word-aligned texts, then runs a skip-gram model whose contexts are words from both languages

What is the advantage of the sigmoid function?

It's a differentiable approximation of the step (threshold) function ==> optimizable

Auto-encoders as self training

FILL IN FROM SLIDES

What is the main Idea of Gated Recurrent Units (GRU)?

Keep around memories to capture long distance dependencies

How does Looks-Linear (LL) initialization work?

LL initialization makes the network linear initially by initializing the weights in a mirrored block structure. Often, concatenated rectifiers (CReLU) are used. This method doubles the number of units, hence parameters: W = (W1, W2) because of the CReLU. Works best with tanh CNNs. https://stats.stackexchange.com/questions/339054/what-values-should-initial-weights-for-a-relu-network-be

What is POS tagging?

Label each token in a sentence with its part-of-speech (= word class)

What are exploding gradients?

Large error gradients that result in very large updates to the NN's weights

How does an attention-based encoder-decoder model work?

Learn an attention vector that indicates which part of the input should be in the focus

What is supervised learning?

Learning a model from labeled data

What is unsupervised learning?

Learning a model from unlabeled data

How are Encoder-Decoder models used in sentence embeddings?

Let the input sequence equal the output sequence, and take the final hidden vector on the input side to be the sentence representation (auto encoder)

How Does an LSTM Network Work?

Long Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies, remembering information for long periods as its default behavior. There are three steps in an LSTM network: Step 1: The network decides what to forget and what to remember. Step 2: It selectively updates cell state values. Step 3: The network decides what part of the current state makes it to the output.

What is context attention?

Look it up online.

What is the most common pooling in NLP?

Max-over-time pooling
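
As a minimal sketch of max-over-time pooling, assuming the feature maps are given as a plain list of time steps (the function name and the activation values are invented for illustration): each filter keeps only its largest response across the whole sentence.

```python
def max_over_time(feature_maps):
    """feature_maps: list of time steps, each a list with one value per filter.
    Returns one value per filter: its maximum over all time steps."""
    return [max(column) for column in zip(*feature_maps)]

# Hypothetical activations for 4 tokens and 3 convolution filters
activations = [
    [0.1, 0.7, 0.0],
    [0.9, 0.2, 0.3],
    [0.4, 0.5, 0.8],
    [0.0, 0.1, 0.2],
]
print(max_over_time(activations))  # [0.9, 0.7, 0.8]
```

The output length depends only on the number of filters, not on the sentence length.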

How should an advanced optimizer like AdaGrad or Adam be used with SGD (Mini-Batch)?

First train with Adam, fine-tune with SGD

How does an auto-encoder work?

First, the auto-encoder "encodes" the input onto a space of smaller dimension with convolutional layers (as we have done early in this course). Then, a decoder uses transposed convolutions to go back to the original signal space. However, this new image is approximated with the remaining lower dimensional information from the encoder.

Explain how a GAN is trained.

First, the generator G is fixed while the discriminator D is trained. Then, G is trained with the results from D and D is fixed. The loss of G is high if D is doing a good job at detecting fakes.

What does parameter sharing mean in the context of a CNN?

For each window, we use the same weights and bias values

When is the ROUGE score used?

For evaluating automatic summarization; it is oriented toward recall

When is the BLEU Score used?

For evaluating the quality of text that has been machine-translated from one natural language to another. (Quality is measured by comparing the machine's output with human translations.)

Why do we need new measures for higher level NLP tasks?

For higher level NLP tasks, automatic measures may correlate poorly with human evaluation

When is the F1 measure used for evaluation?

For two or more classes, one typically computes the F1-score of each class and then combines these (e.g. by averaging) into an overall score

Is square loss a good loss function for multi-class prediction?

No

What is the CBOW task?

Given a context, predict the missing word

What is the Skip-gram task?

Given a word, predict the context words
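
The CBOW and Skip-gram tasks differ only in the direction of prediction. The sketch below builds the training pairs for both from one sentence; the function names and the window size are assumptions chosen for illustration, not part of word2vec's actual code.

```python
def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: given a word, predict its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context words, center word) pairs: given a context, predict the word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cbow_pairs(sentence, window=1)[1])
# (['the', 'sat'], 'cat')
```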

What is one advantage of dilation?

Having a dilation greater than one increases the units' receptive field size without increasing the number of parameters.
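
A small worked example of this claim, assuming stacked 1D convolutions with stride 1 (the helper names are hypothetical): the parameter count per layer depends only on the kernel size, while the receptive field grows with the dilation.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span of input positions covered by one dilated 1D kernel."""
    return dilation * (kernel_size - 1) + 1

def receptive_field(kernel_size: int, dilations) -> int:
    """Receptive field of stacked stride-1 1D convolutions, one per dilation."""
    field = 1
    for d in dilations:
        field += d * (kernel_size - 1)
    return field

# Three layers with kernel size 3: same number of weights in both cases
print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15
```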

Does a standard attention layer take into account the absolute locations of the values? If not, what is a solution for this problem?

No, a standard attention layer does not take into account the position of the values. To fix this, we can provide the network with a positional encoding, which can be concatenated with the input batch.

Is it necessary to shuffle the training data when using batch gradient descent?

No, because the gradient is calculated at each epoch using the entire training data, so shuffling does not make a difference.

What are hard alignments in attention based encoder-decoder models?

"Degenerate prob. distribution" with 0/1 values

Is the autograd graph the same as the structure of the network?

No, they are related but may differ.

How does dropout help improve performance?

"In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex coadaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units."

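The dropout mechanism quoted above can be sketched as follows. This uses the "inverted" variant that rescales the surviving units during training, which is an assumption here: the quoted text does not say how the scaling is done. The function name and arguments are made up for illustration.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

hidden = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dropout(hidden, p_drop=0.5, rng=random.Random(0)))  # random mix of 0.0 and 2.0
print(dropout(hidden, training=False))                    # unchanged at test time
```

Because a different random subset of units is silenced at every step, no unit can rely on a specific other unit being present.
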
What does NP => DET A* N PP* mean?

"NP" (noun phrase) expands to determiner, zero or more adjectives, noun, and zero or more prepositional phrases

The output unit of a network can be considered as what?

(Statistical) model parametrized by the weights

What are the advantages and disadvantages of a character-based machine translation encoder-decoder?

+ Can predict OOV and rare Words + Can better deal with morphological variants - state-space may explode - long range dependencies

What is a good %-split of the data into Training, Validation and Testing splits?

- 60% Training - 20% Validation - 20% Testing
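
A minimal sketch of such a split, assuming the samples can be shuffled freely (the function name, the fixed seed, and the use of rounding are illustrative choices):

```python
import random

def split_dataset(samples, train=0.6, val=0.2, seed=0):
    """Shuffle and split into training, validation and test sets.
    The test share is whatever remains after train and val."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```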

How can the optimization part of learning be modified?

- Adaptation of learning rate - Inclusion of regularization term - Inclusion of momentum

Should we do cross-validation in deep learning?

No. The variance across cross-validation folds decreases as the sample size grows. Since we only do deep learning when we have thousands of samples, there is not much point in cross-validation.

What does sparse connectivity mean in the context of a CNN?

Not every input is connected to every output in the following layer

What are the properties of RNNs?

- Are deep MLPs - Weight sharing - Sparse connectivity - Skip connections

How can words be represented?

- As a dictionary entry - By their relation to other words (Taxonomy) - As a one-hot vector (Word vectors)

What does a language model do?

- Assigns a sequence of tokens (words, characters, ...) a probability (likelihood of observing this sequence in text) - Probability(the cat sat on the mat) > Probability(mat sat cat on the the) - Can be used to generate language

What does macro-averaging do and when should it be used?

- Averages on the class level (classes with fewer instances still have influence) - When performance on small classes is of importance

What does micro-averaging do and when should it be used?

- Averages on the level of test instances (classes with more instances have more influence) - When performance on largest classes is of importance
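
The difference between the two averaging schemes shows up on imbalanced data. The sketch below uses recall as the per-class measure, which is an assumption made to keep the example short; the data and function names are invented.

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}

def macro_recall(y_true, y_pred):
    """Average over classes: every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

def micro_recall(y_true, y_pred):
    """Average over test instances: large classes dominate."""
    hits = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical imbalanced data: 8 "A", 2 "B"; the classifier always predicts "A"
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
print(macro_recall(y_true, y_pred))  # 0.5
print(micro_recall(y_true, y_pred))  # 0.8
```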

Why are contextualized embeddings useful?

- Because of ambiguity - Embed words with consideration to their context

What are the sources for determining dependency relations?

- Bilexical affinities - Dependency distance - Intervening material - Valency of heads

Which properties do dependency relations have?

- Binary - Asymmetric - Sometimes typed

What are the components and operations of a transition-based system?

- Buffer - Stack - A set of arcs - Transitions (left-arc, right-arc, etc.)

How can we model the distributional hypothesis?

- By calculating co-occurrence counts - Context is modeled using a window over the words

What are the two auxiliary tasks for language models (word2vec)?

- CBOW (Continuous Bag of Words) - Skip-gram

What are the differences between a constituent tree and a dependency tree?

- CFG (Context-free grammar) uses non-terminal symbols, describes how sentence is structured in components/phrases - Dependency Parsing gives head/dependent relationships between words

What are the advantages of FastText?

- Can embed OOV Words - Naturally works for morphologically rich languages

Why are hidden layers needed?

- Can learn useful intermediate representations of the data - Good organization of layers makes learning much faster

What are the properties of Gated Recurrent Units (GRUs)?

- Can store memory in cell indefinitely - Can forget past memory and reset everything - Can go back to standard RNN-Mode

Instead of words what else can be embedded?

- Characters (i n s i g h t f u l) - Syllables (in sight ful) - Morphemes (insight ful)

What are two ways to specify the grammar of sentences?

- Constituent tree - Dependency tree

What are some tree structure violations?

- Cycles - Multiple parents of a node - Non-connected parts

What are the disadvantages of Leaky ReLU function?

Not zero centered

In what do we split our training set data?

- Development Set - (Proper) Training Set

What are bi-lingual mappings one can use for BiSkip?

- Dictionaries - Inter-lingual links in Wikipedia - Word alignments learned from parallel corpora

What are auto-encoders traditionally used for?

- Dimensionality reduction - Representation learning

What are the four big wins of neural MT?

- End-to-end training - Distributed representations share strength - Better exploitation of context - More fluent text generation

What changes were implemented in the DCGAN (Deep convolutional GAN)?

- FC layers were converted to convolutions - Batchnorm was used in both G and D - Pooling replaced by strided convolutions in D and strided transposed convolutions in G - among other things

What are the options of word vectors for given tasks?

- Fixed word representations - Adjust the word representations to the task

Why are sentence/document embeddings needed?

- For clustering - For retrieval - As an alternative to sentence representations learned from word-level models (e.g. CNN)

What are the advantages of the Maxout function?

- Generalizes ReLU and Leaky ReLU - Gradients don't die

What are three optimization methods?

- Gradient descent - Newton methods - Conjugate gradient

What are the advantages of ReLU function?

- Gradients don't die in +Region - Computationally efficient - Convergence is faster

What are the advantages of Leaky ReLU function?

- Gradients don't die in +Region - Gradients don't die in -Region - Computationally efficient - Convergence is faster

What methods are used to search for good hyperparameters?

- Grid Search - Random Search - Bayesian Methods

What are the training methods of Word2Vec?

- Hierarchical softmax - Negative sampling

What are the components of a Network Structure?

- Input - Weights - Summation and Bias - Activation - Output Neuron

What are the methods to evaluate word representations?

- Intrinsic Evaluation - Extrinsic Evaluation

What is the F1 Score?

- It is a measure of a test's accuracy - It is the harmonic mean of the precision and recall scores
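
As a worked example of the harmonic mean, the sketch below computes F1 from raw counts of true positives, false positives and false negatives. The counts are invented, and returning 0.0 for undefined cases is a convention assumed here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# precision = 8/10 = 0.8, recall = 8/16 = 0.5
print(f1_score(tp=8, fp=2, fn=8))  # about 0.615, below the arithmetic mean 0.65
```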

Why can't we add more and more context to POS tagging?

- It will get slower (more parameters) - Overfitting

What are the disadvantages of ReLU function?

- Kills gradients in -Region - Not zero centered

What helps against exploding gradients?

- L1 or L2 regularization on recurrent weights - Gradient clipping
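
Gradient clipping can be sketched in a few lines. This shows clipping by global norm on a flat list of gradient values; the function name and the threshold are assumptions for illustration.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale the gradient if its L2 norm exceeds max_norm; keep its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # approximately [0.6, 0.8]
print(clip_by_norm([0.3, 0.4], max_norm=1.0))   # unchanged, norm is 0.5
```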

What are the problems with the RNN Encoder-Decoder architecture?

- Long range dependencies - Context vector c has fixed length - Input size is variable

For what tasks can encoder-decoder models be used?

- Machine Translation - Lemmatization - Spelling correction - POS tagging

What are the problems with evaluating sentence embeddings?

- Many different sizes - Different models trained on different datasets - Which classifier to use on top of embeddings in extrinsic tasks?

What are the advantages of Mini-Batch learning?

- Mini-batch learning converges faster to a good solution than batch learning - Computing loss function gradient on all datapoints can be computationally expensive

What are the disadvantages of the sigmoid function?

- Not zero centered - Kills gradients

What are problems with the direct transfer of words between languages in bilinguality?

- OOV Words - Syntactic ordering

Which representations does ELMo combine?

- One on character level - Two representations obtained from the two layers in an RNN

What is the development set used for?

- Optimization of hyperparameters - Early Stopping

What are the problems of pooling?

- Output remains the same if a feature occurs once or multiple times - Order of features is not considered

What are sequence labeling (sequence tagging) tasks for a RNN?

- POS tagging - NER - Grapheme-to-Phoneme Conversion - Lemmatization - Language Modeling

How to avoid overfitting?

- Reduce number of features - Do model selection - Use regularization - Do cross-validation

Why is bilinguality better for embeddings?

- Second language may act as an additional "signal" => Make Monolingual Embeddings better - If words are projected in a common space ("shared features") => this may allow for Direct Transfer (zero-shot transfer)

What are the three types of NLP tasks?

- Sentence / Text Classification - Sequence Tagging - Seq2Seq

What are the uses of CNNs in NLP?

- Sentence classification - Semantic role labeling (SRL) - Character-based approaches

What linguistic information is captured in sentence embeddings?

- Sentence length - Word order - Whether a certain word is in the sentence - Agreement between subject and verb

How can a text be split into fragments (grams) for n-gram language models?

- Sentences - Words - Characters - etc.

What are three examples of CNN sentence classification tasks?

- Sentiment classification of e.g. movie reviews - Question classification into category e.g. Person - Classifying whether a sentence is ironic

What are RNNs used for?

- Sequence Tagging - Classification

What is the difference between Skip-thoughts and SDAE in sentence embeddings?

- Skip-thoughts requires text in context - SDAE only requires individual sentences without context - Both are unsupervised methods

What are tasks Machine Learning is used for?

- Spam filtering - Fraud detection - Face detection - Recommendation systems

What are the properties of CNNs?

- Sparse connectivity - Parameter sharing

What is the main difference between the convolutional layer and the dense layer?

- Sparse connectivity - Parameter sharing - In principle variable-sized inputs (in practice not)

What are the properties of InferSent?

- Supervised - Trains on high-quality data (Stanford Natural Language Inference Data (SNLI)) - Uses LSTM (RNN variant, bidirectional)

What are the problems with n-gram language models?

- They are inherently limited in the past window that they can take into consideration - Not commonly used anymore

How do Paragraph Vector Models work?

- They learn word vectors and paragraph vectors at the same time - Very similar to CBOW and Skip-Gram model, but with an id for each sentence/paragraph

What are the ideas of Concatenated Power Mean Embeddings?

- To generalize the average to the so-called power mean - To concatenate diverse averaged word embeddings

In what sets do we split the input data?

- Training Set - Testing Set - Validation Set

What are two naive approaches for sentence embeddings?

- Treat sentence as long word, predict surrounding sentences like in CBOW or SKIP-GRAM - Take some sort of mean (e.g. arithmetic mean of words in the sentences = centroid) (good baseline)

What is the basic idea of the RNN Encoder-Decoder architecture?

- Two RNNs - One to encode the input sentence - One to decode the sentence embedding (context vector)

How are Encoder-Decoder models constructed?

- Two RNNs stacked together - The last hidden layer in the input is taken as representation of the input

Which two measures are used to evaluate dependency parsing?

- UAS (only count correct relations) - LAS (also include the types)

What are the disadvantages of Mini-Batch learning?

- Unclear how to choose k (small --> better solutions, large --> more computationally efficient)

What are the differences between all the embedding approaches?

- Unit of representation (characters, words, phrases, ...) - Definition of context for training (CBOW, skip-gram, Glove, ...)

What's the difference between Word2Vec and DepEmbeddings?

- Word2Vec finds words that associate with other words (domain similarity) - DepEmbeddings finds words that behave like others (functional similarity)

What is the difference between the tanh function compared to the sigmoid function?

- Zero centered - Interval from (-1, 1) - Also kills gradients

What are the Toolkits for training word representations?

- word2vec - GloVe

Benefits of the ReLU?

1) Derivative non vanishing (above 0) 2) Result in sparse coding 3) Steeper slope in loss surface to speed up training.

Describe different ways to attempt to understand what a network is doing?

1) Look at the parameters 2) Look at the activations 3) Look at how the network behaves "around" an image, with occlusion sensitivity, saliency maps, Grad-CAM

Name two shortcomings of basic clustering and embeddings such as K-means and PCA that lead to wanting DNN.

1) Objects in background of images not taken into account. 2) Translations and deformations are very bad on results.

Which two pathologies are often encountered when training GANs?

1) Oscillations without convergence. Contrary to standard loss minimization, we have no guarantee here that it will actually decrease. 2) The infamous "mode collapse", when G models very well a small sub-population, concentrating on a few modes.

Name some techniques which help the training of very deep architectures.

1) Rectifiers: prevent the vanishing of the gradient 2) Dropout: force a distributed representation 3) Batch Normalization: dynamically maintain the statistics of the activations 4) smart initial conditions

What are two standard performance measures for image classification?

1) The error rate, or conversely accuracy. 2) The balanced error rate (BER)

How do denoising auto-encoders work?

1. Add some noise to the input 2. Should learn to "remove the noise"

What are the steps of the Data Science Process?

1. Data Collection 2. Data Preparation 3. Exploratory Data Analysis 4. Machine Learning 5. Visualization

What are the steps of Backward Propagation?

1. Forward Propagation 2. Backward Propagation

How do the following models rank in terms of usefulness (RNN / LSTM / GRU)?

1. LSTM (most complex units) 2. GRU (more complex units) 3. RNN (least complex units)

How does BiVCD work?

1. Merge aligned Documents, then random shuffle all words 2. Then run a Monolingual Model (e.g. CBOW, Glove, Skip-Gram) on it

What are the components of a convolutional layer?

1. Pooling 2. Non-linear activation 3. Convolution

How do you now train an NLP system with sense-disambiguated embeddings?

1. Run word2vec on your data and compute embeddings 2. For each target word, represent its context as avg. or concatenated embedding 3. Cluster the context representations, and assign each word's context to a cluster 4. Run word2vec on sense-disambiguated corpus

How does an n-gram language model work?

1. Split text into fragments (grams) 2. Get probability of a following word by having n-grams (n amount of fragments) as context
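
The two steps above can be sketched for the simplest case, a bigram model over words (n = 2, so the context is one word). The toy corpus, the sentence boundary markers `<s>` and `</s>`, and the absence of smoothing are all simplifying assumptions.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each one-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)
print(next_word_prob(model, "the", "cat"))  # 2 of the 3 words after "the" are "cat"
print(next_word_prob(model, "cat", "sat"))  # 0.5
```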

How does one get a deep auto-encoder?

1. Stack multiple hidden layers 2. Train each hidden layer independently as an auto-encoder

What is the extrinsic evaluation scheme of sentence embeddings?

1. Take your sentence embedding model 2. Embed your sentences in an extrinsic task 3. Train classifier on embedded sentences 4. Repeat with different sentence embedding model and compare performances

What is the intrinsic evaluation scheme of sentence embeddings?

1. Take your sentence embedding model 2. Embed your sentences in an intrinsic task 3. Use cosine to measure distance between pairs 4. Correlate with human judgments
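
Step 3 relies on cosine similarity between two embedding vectors. A minimal sketch, with made-up vectors standing in for real sentence embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional "embeddings"
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # close to 1: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal
```

The resulting scores for many sentence pairs would then be correlated with human similarity judgments.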

How does ELMo work?

1. The language model is pre-trained on a large corpus 2. For a new task, weights for the three representations are learned to get a task-specific representation 3. This task-specific representation is concatenated with standard static word embeddings

What are the steps of layerwise training of a deep auto-encoder?

1. Train a sparse auto-encoder on the input 2. Use the hidden layer (the features) of step 1 as input for a second auto-encoder 3. calculate a softmax classifier (only) on the last layer

RNN vs FFNN

In a feedforward neural network (e.g., a CNN), signals travel in one direction, from input to output. There are no feedback loops; the network considers only the current input and cannot memorize previous inputs. In a recurrent neural network, signals also travel through feedback loops. It considers the current input together with previously received inputs when generating the output of a layer, and can memorize past data thanks to its internal memory.

What is the difference between a Perceptron and Logistic Regression?

A Multi-Layer Perceptron (MLP) is one of the most basic neural networks that we use for classification. For a binary classification problem, we know that the output can be either 0 or 1. This is just like our simple logistic regression, where we use a logit function to generate a probability between 0 and 1. So, what's the difference between the two? Simply put, it is just the difference in the threshold function! When we restrict the logistic regression model to give us either exactly 1 or exactly 0, we get a Perceptron model

Batch vs Layer Norm

If the activations have shape N*C*H*W, batch norm normalizes each channel using statistics computed over all examples in the batch and over the spatial dimensions (H*W). Layer norm normalizes each example using statistics computed over all of its channels and spatial dimensions. Instance norm normalizes each channel of each example separately. In all three cases, a learned scale and shift (which set the output standard deviation and mean) are applied afterwards. · In batch normalization, the input values of the same neuron are normalized over all the data in the mini-batch, whereas in layer normalization, the input values of all neurons in the same layer are normalized for each data sample. · In BN, the statistics are computed across the batch and the spatial dims. In contrast, in Layer Normalization (LN), the statistics (mean and variance) are computed across all channels and spatial dims. Thus, the statistics are independent of the batch.
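
To make the difference concrete, here is a simplified sketch on a batch of plain feature vectors (shape N*C, no spatial dimensions). The learned scale and shift are left out, and the helper names are hypothetical.

```python
import math

def _normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples of the mini-batch."""
    columns = [_normalize(col) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [_normalize(row) for row in batch]

batch = [[1.0, 10.0, 100.0],
         [3.0, 30.0, 300.0]]
print(batch_norm(batch))  # each column becomes roughly [-1, +1]
print(layer_norm(batch))  # each row is normalized on its own
```

Changing the other samples in the batch changes the `batch_norm` output for a given row but not its `layer_norm` output.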

What is the assumption of co-occurrence counts?

If we collect over thousands of sentences, the vectors for "enjoy" and "like" will have similar vector representations

What are the advantages with n-gram language models?

Implementing an n-gram language model with an MLP is (also) easy

What is an activation map?

A channel produced by a convolution layer.

When should cross-entropy loss be used?

In classification problems.

When are Encoder-Decoder models usually used?

In machine translation (e.g. translate a sentence from German to English)

What is the benefit of a RNN?

In theory, infinite influence from the past

How should syntactic relations between words be represented?

In vectors

What is the advantage of a bidirectional RNN?

Infinite window size in both directions => Context knowledge from both past and future

What is the advantage of a RNN?

Infinite window size to left => remembers everything (context) from past

What are the two main components of a GAN network?

A generator, which generates data and a discriminator, which predicts whether the data is from the generator or the true data set.

What is a deep network?

A network with multiple hidden layers

What is an auto-encoder?

A neural network that is trained to attempt to copy its input to its output

What is a Multi-Layer-Perceptron (MLP)?

A neural network with (multiple) hidden layers and outputs

What is an NVP network?

A non-volume preserving network

What is the output of a Softmax function?

A probability distribution over all given classes
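
A minimal sketch of the softmax function; subtracting the maximum before exponentiating is a common numerical-stability trick and is an addition not mentioned in the card.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # three positive values, largest for the largest logit
print(sum(probs))  # sums to 1 up to rounding
```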

What is the general idea behind a convolution layer?

A representation meaningful at a certain location can / should be used everywhere. It applies the same linear transformation locally, everywhere, and preserves the signal structure.

What is a residual network?

A residual network has blocks in which the input x is split off into a branch and re-joined later with the output of the block, y = f(x) + x (skip connection). The parameters are then trained to optimize the difference between the value before the block and the one needed after (the residual). This helps to mitigate the effect of shattered gradients (where the relation between the input and the gradient with respect to the input is shattered by the depth).

What is the Masked-LM pre-training objective of BERT?

A selection of tokens are randomly masked in the input sequence. The model needs to infer the masked word from the context.

What is NP => DET A* N PP* ?

A so-called phrase structure (aka context-free grammar) rule

What is a receptive field?

A sub-area of an input map that influences a component of the output.

What is Xavier's initialization?

A weight initialization that normalizes the variance of activations and variance of gradient wrt activations. It is a compromise between controlling the variance of the activations and that of the gradients.
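
The compromise can be written as Var(w) = 2 / (fan_in + fan_out), which sits between 1 / fan_in (controlling the activations) and 1 / fan_out (controlling the gradients). The sketch below draws a weight matrix with that variance from a normal distribution; the function names and the seed are illustrative.

```python
import math
import random

def xavier_std(fan_in: int, fan_out: int) -> float:
    """Standard deviation giving Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def xavier_normal(fan_in: int, fan_out: int, seed: int = 0):
    """fan_out x fan_in weight matrix drawn from N(0, xavier_std^2)."""
    rng = random.Random(seed)
    std = xavier_std(fan_in, fan_out)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

print(xavier_std(100, 100))          # about 0.1
weights = xavier_normal(100, 50)
print(len(weights), len(weights[0]))  # 50 100
```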

What does polysemy (or homonymy) mean?

A word may have many different meanings (e.g. fly/Fly)

Explain the idea behind the Adam optimizer.

Adam, or adaptive moment estimation, combines two ideas to improve convergence: per-parameter updates, which give faster convergence, and momentum, which helps to avoid getting stuck in saddle points.
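
Both ideas are visible in a scalar sketch of the Adam update. The hyperparameter values are the commonly used defaults except for the learning rate, which is picked for this toy problem; the function name and the test function are made up.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
print(adam_minimize(lambda x: 2 * (x - 3), x0=0.0))  # ends close to 3
```

With several parameters, `m` and `v` would be kept per parameter, so each one gets its own effective step size.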

What does weight initialization aim to control?

It aims to control the gradients so that the weights evolve at the same rate across layers and no layer becomes saturated before the others.

What is the main weakness of the original seq2seq translation model? This weakness is fixed by attention mechanisms.

All the information must flow through a single state v, which must have the capacity for any situation. Additionally, there are no direct channels to transport local information to where it is useful. Attention mechanisms can do this by having such direct channels.

What problem does a conditional GAN fix?

All the models we have seen so far model a density in high dimension and provide means to sample according to it, which is useful for synthesis only. However, most of the practical applications require the ability to sample a conditional distribution. The conditional GAN consists of parameterizing both G and D by a conditioning quantity Y.

Describe an LSTM network.

An LSTM is a type of gated RNN network. They were made to mitigate the exploding and vanishing gradient problem. Each hidden layer has an LSTM cell that remembers not only the hidden state vector but also the cell state. Each LSTM cell has 3 gates: the input gate, the forget gate and the output gate.

What is the universal approximation theorem?

Any continuous function (on a compact domain) can be approximated arbitrarily well by a perceptron with a single hidden layer. For example, a combination of ReLUs can approximate any continuous function. It states that the training error can be made as low as we want by making the layer larger. However, nothing is stated about the test error.

As what can mini-batch learning be seen?

As an intermediate solution between batch and online learning

What Is a Multi-layer Perceptron (MLP)?

Like other neural networks, an MLP has an input layer, one or more hidden layers, and an output layer. It has the same structure as a single-layer perceptron, with one or more hidden layers added. A single-layer perceptron can only classify linearly separable classes with binary output (0, 1), but an MLP can classify nonlinearly separable classes. Except for the input layer, each node uses a nonlinear activation function: the node sums its weighted inputs and applies the activation function to produce its output. MLPs are trained with a supervised learning method called "backpropagation." In backpropagation, the neural network calculates the error with the help of a cost function and propagates this error backward from where it came, adjusting the weights to train the model more accurately.

What do attention mechanisms do?

Attention mechanisms aggregate features with an importance score (i.e. they focus on specific parts of the input) that: 1) depends on the features themselves, not on their position in the tensor; 2) relaxes locality constraints. This is done by dynamically modulating the weighting of different parts of a signal, which allows the representation and allocation of information channels to depend on the activations themselves. (Regions of the input are compared with the input context and the weights are calculated accordingly.) Their main use is to handle long-term dependencies in sequence-to-sequence translation.

What do auto-regression methods do?

Auto-regression methods model components of a signal serially, each one conditionally to the ones already modeled.

What is backpropagation?

Backpropagation is a recursive algorithm for determining error derivatives

What Is Bagging and Boosting?

Bagging and Boosting are ensemble techniques that train multiple models using the same learning algorithm and then combine their predictions. With Bagging, we take a dataset and split it into training data and test data. Then we randomly sample training data (with replacement) into bags and train a separate model on each bag. With Boosting, the emphasis is on the data points that the current models get wrong, in order to improve the accuracy.

Which type of learning computes the gradient based on all datapoints?

Batch learning

What is the main objective of batch processing? How does it achieve this?

Batch processing speeds up computations by cutting down the number of times parameters need to be copied to the cache. Memory transfers to the cache are very slow. Additionally, it reduces the number of required Python loops.

Why are stacked auto-encoders losing importance in NLP?

Because researchers have gained better intuition for initializing the parameters and bigger datasets are becoming available

Why do we split the input data?

Because we're interested in generalization performance

What's the relationship between dependency parsing and semantics?

By considering the attachment decisions, several meaning relevant questions about the sentence are answered

How would you get the co-occurrence count of a word?

By creating the co-occurrence matrix and reading it off
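
As a rough illustration of building co-occurrence counts and reading one off, here is a minimal sketch. The function name `cooccurrence_counts`, the toy corpus and the symmetric window of one token are assumptions made for this example, not part of the original card.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, context word) pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus, already tokenized.
corpus = [["i", "like", "deep", "learning"], ["i", "like", "nlp"]]
counts = cooccurrence_counts(corpus, window=1)

# "Reading off" one entry of the (sparse) co-occurrence matrix.
print(counts[("i", "like")])
```

Storing only the non-zero entries in a `Counter` avoids materializing the full vocabulary-by-vocabulary matrix.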

How could one get a bidirectional RNN?

By running another RNN from right to left and then concatenating the forward and backward hidden states

How are extrinsic word representations evaluated?

By the performance of a model that uses the word representations for solving a task

How are intrinsic word representations evaluated?

By using the representations directly

De-noising autoencoders.

Capture dependencies between signal components to restore a degraded input. However, the best reconstruction may sometimes be unlikely as it computes an average.

What problem does positional encoding fix when using a causal network to generate pixels?

Causal models alone have no way to make a prediction depend on position, which leads to results that are locally consistent but fragmented. Positional encoding fixes this by providing information about the position.

What rule is used in backpropagation?

Chain rule

Name a few applications of the following: Classification, Regression, Density Estimation

Classification: Image recognition, cancer detection. Regression: Stock predictions. Density Estimation: Outlier detection, data viz.

What is an example task for a character-based CNN approach?

Classifying articles into topics

What is the aim of pooling?

Compute a low dimensional signal from a higher dimensional one, grouping several activations into a single, more meaningful one. Examples are max-pooling and average pooling.
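
A minimal sketch of 1D max-pooling, to make the "group several activations into one" idea concrete. The helper `max_pool_1d` and its default of non-overlapping windows (stride equal to the kernel size) are assumptions for illustration.

```python
def max_pool_1d(signal, kernel_size, stride=None):
    """Replace each window of `kernel_size` activations by its maximum."""
    stride = stride or kernel_size  # assumed default: non-overlapping windows
    return [
        max(signal[i:i + kernel_size])
        for i in range(0, len(signal) - kernel_size + 1, stride)
    ]

pooled = max_pool_1d([1, 3, 2, 5, 4, 0], kernel_size=2)
print(pooled)  # three activations summarize the six inputs
```

Average pooling would use the mean of each window instead of `max`.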

What is the major architectural improvement of BERT?

Computing a word's context in bidirectional fashion instead of unidirectional only

What Are the Programming Elements in TensorFlow?

These examples use the TensorFlow 1.x graph API (in TensorFlow 2.x, tf.placeholder and tf.Session are only available under tf.compat.v1). Constants - Constants are parameters whose value does not change. To define a constant we use the tf.constant() command. For example: a = tf.constant(2.0, tf.float32); b = tf.constant(3.0); print(a, b). Variables - Variables allow us to add new trainable parameters to the graph. To define a variable, we use the tf.Variable() command and initialize it before running the graph in a session. An example: W = tf.Variable([.3], dtype=tf.float32); b = tf.Variable([-.3], dtype=tf.float32). Placeholders - These allow us to feed data to a TensorFlow model from outside the model. A placeholder permits a value to be assigned later. To define a placeholder, we use the tf.placeholder() command. An example: a = tf.placeholder(tf.float32); b = a * 2; then, inside a "with tf.Session() as sess:" block, result = sess.run(b, feed_dict={a: 3.0}); print(result). Sessions - A session is run to evaluate the nodes. This is called the "TensorFlow runtime." For example: a = tf.constant(2.0); b = tf.constant(4.0); c = a + b; sess = tf.Session() (launches the session); print(sess.run(c)) (evaluates the tensor c).

Describe two ways to visualize features of a CNN in an image classification task.

Input occlusion: cover a part of the input image and see which part affects the classification the most. For instance, given a trained image classification model, give the images below as input. If we see that the 3rd image is classified as a dog with 98% probability, while the 2nd image only with 65% probability, it means that the part covered in the 2nd image is more important. Activation maximization: the idea is to create an artificial input image that maximizes the target response (by gradient ascent).

What is the Wasserstein distance?

Intuitively, it is the minimum mass displacement to transform one distribution into the other. It solves the problem of the JS divergence regarding the structure of the distributions.

Why do the gradients need to be set to zero before each call to .backward()?

.backward() accumulates the gradients into the existing values instead of overwriting them. Hence, the gradients must be set to zero before each new backward pass.

What is the goal of GloVe?

It aims at reconciling the advantages of global co-occurrence counts and local context windows

What is the goal of the stochastic depth network?

It aims to reduce the depth of a network during training, but not during testing. It achieves this by bypassing entire ResBlocks during training. Essentially, you train a "shallower" network than the one you use. Helps with vanishing gradient problem and computation time.

What does Pytorch's autograd mechanism do?

It automatically constructs the DAG of operations to compute the gradient of any quantity with respect to any involved tensor. Thus, the user is only concerned with the forward pass and the dynamic nature of the graph allows the forward pass to be modulated.

What are the pros and cons of Layer normalization versus batch normalization?

It behaves similarly in training and test and processes samples individually.

How does Siamese CBOW work?

It directly averages word embeddings for sentences, so that it learns that words with little semantic impact have a low vector norm

What is the dimensionality of a Word-Vector in the "One Hot" vector representation?

It equals the size of the vocabulary
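
A small sketch of the one-hot representation, assuming a hypothetical four-word vocabulary. It shows that the vector length equals the vocabulary size, and that two different words always have a dot product of zero, so no relation between them is encoded.

```python
def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the index of `word` in `vocabulary`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

vocabulary = ["cat", "dog", "car", "truck"]  # hypothetical toy vocabulary
cat = one_hot("cat", vocabulary)
dog = one_hot("dog", vocabulary)

# Dot product between two different one-hot vectors.
similarity = sum(a * b for a, b in zip(cat, dog))
print(len(cat), similarity)
```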

Intuitively, how does a network's "irregularity" grow with respect to its width and depth?

It grows linearly with its width and exponentially with its depth.

What is achieved by data normalization?

It helps keep the activation variations constant through the layers.

What is a temporal convolutional network?

It is a 1d convolutional network that processes an input of the maximum length. It uses dilated convolutions and the model size is O(log T). The memory footprint is O(T log T). This is the simplest approach to sequence processing.

What is a Recurrent Neural Network (RNN)?

It is an MLP with an additional feedback loop (which has a time delay)

What is the Jensen-Shannon divergence?

It is a measure of the similarity between two distributions. However, it does not account much for the metric structure of the space.

What is gradient norm clipping?

It is a threshold set on the norm of the gradient to prevent it from growing excessively large. If the norm exceeds the threshold, the gradient is rescaled so that its norm equals the threshold, which preserves its direction.
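
A minimal sketch of gradient norm clipping on a plain list of gradient values: when the Euclidean norm exceeds the threshold, the whole gradient is rescaled so that its norm equals the threshold. The function name `clip_grad_norm` and the list-based representation are assumptions for illustration, not a library API.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale `grads` so that its Euclidean norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)      # small gradients are left untouched
    scale = max_norm / norm     # one common factor keeps the direction
    return [g * scale for g in grads]

clipped = clip_grad_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5
print(clipped)
```

Rescaling by a single factor differs from clipping each component independently, which would change the direction of the update.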

Is the empirical risk a biased or an unbiased estimator of the risk?

It is an unbiased estimator of the risk.

What is the "maxout" layer?

It is not technically an activation function, as it has trainable parameters, but it takes the max of several linear units. It can encode ReLU as well as other functions, even any convex function.

What is a directed acyclic graph (DAG) required for?

It is required to compute derivatives.

What is neural machine translation?

It is the approach of modeling the entire MT process via one big artificial neural network

What is replacing static word embeddings?

Contextualized embeddings

What is CBOW?

Continuous Bag of Words is a common word embedding where the embedding vectors are chosen such that a word can be linearly predicted based on the sum of the embeddings of the surrounding words.

What is the 'natural' loss for softmax?

Cross-Entropy Loss

What are adversarial networks?

It is where one network tries to spot the mistakes of another one, while this network tries to fool the "spotter" network.

What is Deep Learning?

Deep Learning involves taking large volumes of structured or unstructured data and using complex algorithms to train neural networks. It performs complex operations to extract hidden patterns and features (for instance, distinguishing the image of a cat from that of a dog).

Noise 2 noise

Denoising can be achieved without clean samples if the noise is additive and unbiased. Used for image restoration.

What does arc-standard dependency parsing mean?

Dependency parsing with arc-standard operations: Left-arc, right-arc, shift

Why does gradient descent not always lead to a good solution?

Depending on the starting point, it may find a local minimum rather than the global minimum

Explain Shattered Gradient.

Depth shatters the relation between the input and the gradient wrt the input. Residual networks limit this effect. Since linear networks avoid this problem, it has been suggested to use Looks Linear initialization, which makes the network linear initially.

Dilation, stride and padding in transposed convolutions.

Dilation: Same as for convolution Stride and padding: defined in the output map

What are the disadvantages of the Maxout function?

Doubles the number of parameters

Describe dropout.

Dropout is a regularization technique which consists of removing units at random during the forward pass of training and putting them all back during testing. This method increases independence between units, and distributes the representation. It generally improves performance. A network with dropout can be interpreted as an ensemble of 2^N models with heavy weight sharing.
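
The train/test behaviour described above can be sketched as follows. This uses the "inverted dropout" variant, which rescales the kept units during training so that nothing needs to change at test time; that rescaling, the function name `dropout` and the drop probability `p` are assumptions of this sketch rather than something stated in the card.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each unit with probability p during training; identity at test time."""
    if not training or p == 0.0:
        return list(activations)    # all units are put back for testing
    keep = 1.0 - p
    # Kept units are divided by `keep` so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0, 1.0, 1.0, 1.0]
train_out = dropout(acts, p=0.5, training=True, rng=random.Random(0))
test_out = dropout(acts, p=0.5, training=False)
print(train_out, test_out)
```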

What is the FastText approach?

Embed n-grams => Words are represented as bags of character n-grams => Representation for a word is given by average over its n-gram embeddings

How does forward propagation and backpropagation work in deep learning?

Now, this can be answered in two ways. If you are on a phone interview, you cannot perform all the calculus in writing and show the interviewer. In such cases, it is best to explain it as follows. Forward propagation: The inputs are provided with weights to the hidden layer. At each hidden layer, we calculate the output of the activation at each node, and this propagates further to the next layer until the final output layer is reached. Since we start from the inputs and move to the final output layer, we move forward, and it is called forward propagation. Backpropagation: We minimize the cost function by understanding how it changes when the weights and biases of the neural network change. This change is obtained by calculating the gradient at each hidden layer (using the chain rule). Since we start from the final cost function and go back through each hidden layer, we move backward, and thus it is called backward propagation.

How many output units does a perceptron have?

One

What is the Boltzmann Machine?

One of the most basic Deep Learning models is a Boltzmann Machine, resembling a simplified version of the Multi-Layer Perceptron. This model features a visible input layer and a hidden layer: just a two-layer neural net that makes stochastic decisions as to whether a neuron should be on or off. Nodes are connected across layers, but no two nodes of the same layer are connected (strictly speaking, this restriction describes the Restricted Boltzmann Machine).

What is mini-batch learning with k=1?

Online learning

Which type of learning approximates the gradient by computing it on one datapoint?

Online learning

What is the simplest way to generate an adversarial input?

Optimize the input to maximize the loss of the classifier. The goal is to have a modified image where the changes are indistinguishable to the human eye, but the image is misclassified.

What is Overfitting and Underfitting, and How to Combat Them?

Overfitting occurs when the model learns the details and noise in the training data to the degree that it adversely impacts the performance of the model on new information. It is more likely to occur with nonlinear models that have more flexibility when learning a target function. An example would be a model that looks at cars and trucks, but only recognizes trucks that have a specific box shape. It might not be able to recognize a flatbed truck because it saw only one particular kind of truck in training. The model performs well on training data, but not in the real world. Underfitting refers to a model that neither fits the training data well nor generalizes to new information. This usually happens when there is too little data, or incorrect data, to train the model. An underfit model has poor performance and accuracy. To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and keep a validation dataset to evaluate the model.

Describe what padding, stride and dilation do.

Padding: adds zeroes around the edges of the input. Stride: determines the step size with which the filter moves over the input. Dilation: dilates the filter by inserting rows and columns of zeroes between its values.
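
One way to see the effect of the three parameters is through the length of the output of a 1D convolution. The sketch below uses the usual formula floor((n + 2·padding − dilation·(kernel_size − 1) − 1) / stride) + 1; the helper name `conv_output_length` is an assumption for illustration.

```python
def conv_output_length(n, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution over an input of length n."""
    # A dilated filter covers dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1

print(conv_output_length(10, kernel_size=3))              # no padding
print(conv_output_length(10, kernel_size=3, padding=1))   # padding keeps the length
print(conv_output_length(10, kernel_size=3, stride=2))    # stride skips positions
print(conv_output_length(10, kernel_size=3, dilation=2))  # dilation widens the filter
```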

What are basic clusterings and embeddings (K-means, PCA) used for?

Parameterize and re-parametrize multiple times the input signal into representations that are more and more invariant and free of noise.

The weights and activation function of a network can be considered as what?

Parameters (we learn this)

What NN should be used for Sentence / Text Classification tasks?

Perceptron / MLP / CNN

What are soft alignments in attention-based encoder-decoder models?

Probability distribution over tokens that a current token aligns to
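
A minimal sketch of a soft alignment: scores between the current decoder state and each encoder state are turned into a probability distribution with a softmax. Dot-product scoring, the function names and the toy vectors are assumptions for this example; other scoring functions are also used in practice.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_alignment(query, keys):
    """Distribution over source tokens (keys) for one target token (query)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

query = [1.0, 0.0]                            # hypothetical decoder state
keys = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]   # hypothetical encoder states
weights = soft_alignment(query, keys)
print(weights)
```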

What NN should be used for Sequence Tagging tasks?

RNN

What is an approach that determines the meaning of a sentence using e.g. phrase structure (or similarly dependency structure)?

RNN (Recursive Neural Nets)

Why does the LSTM read words in reverse? (LSTM = Long Short Term Memory Networks)

Reading in reverse introduces many short term dependencies in the data and this makes the optimization problem much easier.

What is the problem of the "One Hot" Word-Vector representation?

Relations between words are not represented

What do all the embedding approaches have in common?

Represent natural language input with real-valued vectors

What is the difference between stochastic descent and mini-batch stochastic descent?

SGD consists of updating the parameters after each sample. Mini-batch SGD updates the parameters after visiting the samples in small batches, e.g. updating the weights after batches of 100 samples. This tends to speed up the training process.

What can word embeddings represent?

Semantic and syntactic relations between words in the vector space

When are penalties most useful?

Small models and scarce data.

What is Spectral Normalization and why is it implemented in GANs?

Spectral Normalization is a layer normalization that estimates the largest singular value of a weight matrix and rescales the matrix accordingly. It was proposed to control the Lipschitz constant of D by rescaling its weights, so that all the linear layers have singular values of at most 1 and, consequently, a Lipschitz constant of at most 1. If the non-linear layers are also Lipschitz with a constant of at most 1 (e.g. ReLU), this is a sufficient condition for D to have a Lipschitz constant of at most 1.

How can a sequence-to-sequence model have differently sized inputs and outputs?

Such a model can use an encoder-decoder. For example, consider a translator which translates French to English: if the sentence in French has 10 words, that does not mean that the sentence in English will also have 10 words.

What is TF-IDF?

TF-IDF, or Term Frequency-Inverse Document Frequency, indicates the importance of a word to a document in a collection of documents. It is a numerical statistic used in information retrieval. For a specific document, TF-IDF gives a score that helps identify the keywords of that document. The major use of TF-IDF in NLP is the extraction of useful information from documents by statistical means. It is typically used to classify and summarize the text in documents and to filter out stop words. TF is the ratio of the frequency of a term in a document to the total number of terms in that document, whereas IDF measures how rare the term is across all documents. The formulas for calculating TF-IDF: TF(W) = (Frequency of W in a document) / (The total number of terms in the document); IDF(W) = log_e(The total number of documents / The number of documents having the term W). TF*IDF is high when the term is frequent in the document but appears in few documents overall, and low when the term is rare in the document or appears in most documents. Search engines can use TF-IDF as one signal for ranking results according to the relevance of pages.
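
The two formulas above can be sketched directly in code. The function names, the toy documents and the pre-tokenized list representation are assumptions for illustration; real implementations often add smoothing to the IDF term.

```python
import math

def tf(term, doc):
    """Frequency of `term` in `doc` divided by the number of terms in `doc`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / number of documents containing `term`)."""
    n_containing = sum(1 for d in docs if term in d)
    # Assumes the term occurs in at least one document (no smoothing).
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy corpus of tokenized documents.
docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
score_the = tf_idf("the", docs[0], docs)   # appears in every document
score_cat = tf_idf("cat", docs[0], docs)   # appears in two of three documents
score_sat = tf_idf("sat", docs[0], docs)   # appears only in this document
print(score_the, score_cat, score_sat)
```

In this toy corpus the word that appears in every document gets a score of zero, while the word unique to one document gets the highest score.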

What does "You shall know a word by the company it keeps" mean?

That word meanings can be inferred from context

What is the difference between the Adam optimizer and the Momentum optimizer?

The Adam optimizer rescales the lr for each coordinate separately using information from the previous steps while the momentum does not do this for each coordinate separately.

What is the difference between momentum descent and the Adam optimizer?

The Adam updates each coordinate separately based on a moving average applied on each coordinate.

What is the difference between an LSTM and a GRU?

The LSTM cell has 3 gates: the input (update) gate, the forget gate and the output gate. The GRU only has 2 gates: the update gate and the reset gate. GRUs therefore have fewer parameters to train and often perform similarly to LSTMs.

What is one advantage of the Leaky-ReLU versus the regular ReLU?

The Leaky ReLU solves the vanishing gradient problem sometimes encountered by the ReLU, since its derivative is never zero, as can be the case with the regular ReLU. Leaky ReLU --> max(ax,x) with 0<a<1.

What is the Monte Carlo integrator?

The Monte Carlo integrator is constructed such that the intensity of each pixel is the expectation of the random path sampling process, i.e., the sampling noise is zero-mean. It is used in de-noising applications. It generates physically accurate renderings of virtual environments.

In an MLP, should the activation function be linear or non-linear? Why?

The activation function should be non-linear because if it is linear, the entire MLP will be an affine transformation with a peculiar parametrization.

What do bias and variance quantify in a model?

The bias quantifies how much the model fits on average and the variance quantifies how much the model changes across different datasets.

What is the feature representation in a CNN?

The convolved representation

What tends to happen if weights are ill conditioned at initialization?

The gradients go towards zero in deeper layers.

What does each gate of the LSTM cell do?

The input gate decides whether to update the cell. The forget gate decides whether to reset the state to 0. The output gate decides whether or not to use the current cell state.

What do you mean by exploding and vanishing gradients?

The key here is to make the explanation as simple as possible. As we know, the gradient descent algorithm tries to minimize the error by taking small steps towards the minimum value. These steps are used to update the weights and biases in a neural network. However, at times the steps become too large, and this results in larger updates to the weights and bias terms, so much so as to cause an overflow (or a NaN value) in the weights. This leads to an unstable algorithm and is called an exploding gradient. On the other hand, if the steps are too small, this leads to minimal changes in the weights and bias terms, even negligible changes at times. We thus might end up training a deep learning model with almost the same weights and biases each time and never reach the minimum of the error function. This is called the vanishing gradient. A point to note is that both these issues are especially evident in Recurrent Neural Networks, so be prepared for follow-up questions on RNNs!

What happens to the discriminator if the initial samples generated by G are too unrealistic?

The response of D will be too far in the tail of the sigmoid function, so it saturates and the gradient vanishes; the generator then does not train. This is solved by using a non-saturating cost for G. The loss of D remains unchanged.

What is batch normalization?

The main goal is to maintain proper statistics of the activations and derivatives. This is done by forcing the activation statistics during the forward pass by re-normalizing them. During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch. During test, it simply shifts and rescales according to the empirical moments estimated during training.
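
The difference between the training and test behaviour can be sketched for a single scalar feature. The class name `BatchNorm1D`, the exponential running average with `momentum=0.1` and the omission of the learnable scale and shift are simplifying assumptions of this sketch.

```python
import math

class BatchNorm1D:
    """Batch normalization of one scalar feature (illustrative sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum   # assumed update rate of the running statistics
        self.eps = eps             # small constant for numerical stability
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Statistics are estimated on the current batch...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and accumulated for later use at test time.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At test time the stored statistics are used, so the result
            # does not depend on the other samples in the batch.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

bn = BatchNorm1D()
train_out = bn([1.0, 2.0, 3.0, 4.0], training=True)
test_out = bn([10.0], training=False)
print(train_out, test_out)
```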

What is precision and recall?

The metrics used to test an NLP model are precision, recall, and F1. We also use accuracy for evaluating the model's performance: the ratio of correct predictions to the total number of predictions. Precision is the ratio of true positive instances to the total number of positively predicted instances. Recall is the ratio of true positive instances to the total number of actual positive instances.

What is the Next Sentence Prediction pre-training objective of BERT?

The model is fed a sequence of two sentences, where the second sentences either is the follow-up sentence to the first or a different random sentence. The model learns a binary classifier deciding between the two cases.

ELMo / ULMFiT vs BERT

BERT and its derivatives are probably used as benchmarks because BERT is newer than the other models mentioned and shows state-of-the-art performance on many NLP tasks. Thus, when researchers publish new models, they normally want to compare them to the current leading models (i.e. BERT). I don't know if there has been a study on the strengths of BERT compared to the other methods, but looking at their differences might give some insight. Truly bidirectional: BERT is deeply bidirectional due to its novel masked language modeling technique. ELMo, on the other hand, uses a concatenation of right-to-left and left-to-right LSTMs, and ULMFiT uses a unidirectional LSTM. Having bidirectional context should, in theory, generate more accurate word representations. Model input: BERT tokenizes words into sub-words (using WordPiece), and those are then given as input to the model. ELMo uses character-based input and ULMFiT is word-based. It's been claimed that character-level language models don't perform as well as word-based ones, but word-based models have the issue of out-of-vocabulary words. BERT's sub-word approach enjoys the best of both worlds. Transformer vs. LSTM: At its heart, BERT uses transformers, whereas ELMo and ULMFiT both use LSTMs. Besides the fact that these two approaches work differently, it should also be noted that using transformers enables the parallelization of training, which is an important factor when working with large amounts of data. The list goes on with things such as the corpus the model was trained on, the tasks used to train it, and more. So while it is true that BERT shows SOTA performance across a variety of NLP tasks, there are times when other models perform better. Therefore, when you're working on a problem, it is a good idea to test a few of them and see for yourself which one suits your needs better.

What's the difference between an RNN that is used for sequence tagging and one that is used for classification?

The sequence tagging RNN has multiple output units; the classification one has only one

What does the Skip Gram model do?

The skip gram model is an algorithm which optimizes word embeddings so that a word can be predicted by a single word in its context. It is the opposite of CBOW, where a context is used to predict a word. Here, a word is essentially used to predict a context.

What are the problems with more advanced optimizers like AdaGrad or Adam in comparison to SGD (Mini-Batch)?

The solution found is usually not as good

What is the stride?

The step size for moving over the sentence (in NLP usually 1)

How are GPUs used in PyTorch?

The use of the GPUs in PyTorch is done by creating or copying tensors into their memory. Operations on tensors in a device's memory are done by the said device.

What is the main use of an attention mechanism?

Their main use was to provide deep learning networks with memory-like modules, and now it is to provide long-term dependency for sequence to sequence translation.

How is a non-projective dependency tree different from a projective one?

There are crossing edges in it

What Are the Different Layers on CNN?

There are four layers in CNN: Convolutional Layer - the layer that performs a convolutional operation, creating several smaller picture windows to go over the data. ReLU Layer - it brings non-linearity to the network and converts all the negative pixels to zero. The output is a rectified feature map. Pooling Layer - pooling is a down-sampling operation that reduces the dimensionality of the feature map. Fully Connected Layer - this layer recognizes and classifies the objects in the image.

What do Sequential Denoising Autoencoders (SDAE) do?

They are autoencoders which corrupt (add noise to) the input a little and learn to reconstruct the original from it

What do L1 and L2 penalty terms do?

They are both convex penalties and help avoid over-fitting of complex models.

What is the difference between a more advanced optimizer like Adam or AdaGrad and Stochastic Gradient Descent (SGD / mini-batch learning)?

They are usually faster but not as good as SGD

Why are ensemble methods superior to individual models?

They average out biases, reduce variance, and are less likely to overfit. There's a common line in machine learning which is: "ensemble and get 2%."

How do Support Vector Machines work?

They maximize the distance between samples and the decision boundary. The support vectors are the samples closest to this boundary.

What do ELMo and BERT do?

They use language models to get contextualized word representations

In autoregressive models, why are the best results achieved with cross-entropy?

This is due in large part to the ability of categorical distributions and cross-entropy to deal with exotic posteriors, in particular multi-modal. (A multimodal posterior is when conflicting interpretations of x exist)

What is the vanishing gradient problem for language models/sequence labeling?

Time steps far away are not taken into consideration (not specific to RNNs)

What is the idea of Paragraph Vector Models?

To assign a vector to a paragraph (one sentence or several) so that we can predict words in a text

Why is input data split into training and testing data in Machine Learning?

To check generalization performance

What is the goal of Siamese CBOW?

To embed each word so that the averaged word embeddings of "similar" sentences are close

What is the goal of linear discriminant analysis?

To find a linear combination of features that separate two or more classes of objects or events. This can then be used as a linear classifier or to reduce dimensionality.

What is the goal of Machine Learning?

To generalize well to unseen data

What is the idea of a CNN?

To identify indicative local predictors in a large structure, and combine them to produce a fixed size vector representation of the structure, capturing these local aspects that are most informative for the prediction task at hand

Why do we need a bias unit?

To increase the capacity of the statistical model

What are Linguistic Probing Tasks used for?

To interpret sentence embeddings

Is an auto-encoder supervised or unsupervised?

Traditionally unsupervised

How can you get a different vector representation for each sense of a word?

Train word vectors on sense-disambiguated corpora

The input into a network can be considered as what?

Training/Test Data

True/False. Constructing deep generative architectures requires layers to increase the signal dimension.

True, unlike what we have done with the feed-forward networks in the earlier portions of the course. This can be done in the forward pass with transposed convolution layers whose forward operation corresponds to a convolution layer's backward pass.

True/False. Convolutions preserve the signal's structure?

True.

True/False. Loss functions (e.g. nn.MSELoss()) do not accept targets with requires_grad = True

True.

True/False. For dropout, the model behaves differently during train and test.

True, as with batch normalization.

True/False. During inference, batch normalization performs a component-wise affine transformation.

True. During inference, batch normalization shifts and rescales independently each component of the input x according to statistics estimated during training.

What is the Homoscedasticity assumption?

Two populations are Gaussian and have the same covariance matrix.

Are word embeddings trained on "supervised" or "un-supervised" data?

Un-supervised. This is a big benefit as it allows to train on potentially huge corpora of data.

What are the advantages with more advanced optimizers like AdaGrad or Adam in comparison to SGD (Mini-Batch)?

Usually faster than SGD

What is the advantage of text flow as opposed to images?

Usually only one dimension

What is the problem with co-occurrence counts?

Vectors become very large with real data and need dimensionality reduction

What are the advantages of mini-batch SGD?

Visiting samples in mini-batches and updating parameters every time helps avoid local minima while reducing the computation times

Variational Autoencoders

WATCH VIDEO

When going deeper in a network, what problems must be avoided concerning the gradient?

We must control the amplitude of the gradient. Ensure that: 1) the gradient does not vanish. 2) gradient amplitude is homogeneous so that all parts of the network train at the same rate. 3) The gradient does not vary too unpredictably.

When are two sets linearly separable?

When a weight vector of a perceptron can classify each instance in the sets correctly

When is an auto-encoder undercomplete?

When the hidden dimensionality is smaller than the input dimensionality

Why is gating used in RNN?

When unfolding through time, the models get deep and can run into the vanishing gradient problem. To solve this, "pass-through" gates allow the recurrent state to avoid going repeatedly through squashing non-linearities.

How does Google's Multilingual NMT System indicate the target language?

With an artificial token at the beginning of the input sentence

What is the goal of word embedding?

Word embedding represents words of similar classes/meanings close to each other. The idea is that embedding each word into a vector of tens to hundreds of features creates much denser vectors than, say, one-hot encoding, whose vectors would be orders of magnitude larger in dimension. Essentially, word embedding is necessary to make the vectors work well in a deep learning application.

What is the problem with the representation of syntactic relations between words?

Word order matters

What is the most popular toolkit for training word representations?

Word2Vec

What is the cross-lingual objective of bilingual embeddings?

Words that are translations of each other should be close in the projected space

What is the mono-lingual objective of bilingual embeddings?

Words that occur in monolingually similar contexts should be close to each other in vector space

Are activation functions hyperparameters?

Yes

Is the backward pass more computationally expensive than the forward pass?

Yes, by a factor of about 2.

Is it possible to convert a constituency tree into a dependency tree or the other way around?

Yes, with the use of heuristic techniques or machine learning, but they are not equivalent

Can Autoregressive models be generative?

Yes.

Is the output of an auto-encoder the same size as the input?

Yes. However, in an undercomplete auto-encoder the output is a reconstruction from a lower-dimensional, compressed code, so it carries less detail than the input.
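
The sizes involved can be sketched with a tiny linear auto-encoder. The encoder and decoder matrices below are hand-picked assumptions (averaging and duplicating pairs of values), not learned weights; a real auto-encoder would train them to minimize reconstruction error.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical weights: 4-D input -> 2-D code -> 4-D output.
encoder = [
    [0.5, 0.5, 0.0, 0.0],  # code[0] = mean of the first two inputs
    [0.0, 0.0, 0.5, 0.5],  # code[1] = mean of the last two inputs
]
decoder = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]

x = [1.0, 3.0, 2.0, 6.0]
code = matvec(encoder, x)               # undercomplete: fewer dims than x
reconstruction = matvec(decoder, code)  # same size as x, but less detail
```

The reconstruction has the same length as the input, yet the differences within each pair of values are lost because they did not fit through the 2-dimensional code.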

How can the capacity of a linear predictor be increased?

You can increase the dimension D of the data by adding features, so that the data becomes linearly separable. A simple example of this is the XOR example in slides 3.3. This is known as feature design.
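
A minimal sketch of the XOR case: appending the product feature x1 * x2 lifts the data from 2-D to 3-D, where a single linear predictor classifies all four points. The helper names and the hand-picked weights are assumptions for illustration, and the slides may use a different feature.

```python
def lift(x1, x2):
    """Map a 2-D point to 3-D by appending the product feature x1 * x2."""
    return (x1, x2, x1 * x2)


def linear_predict(w, b, x):
    """Sign of the affine score w . x + b, returned as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


# XOR labels: +1 when exactly one input is 1.
xor_data = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}

# Hand-picked weights for the lifted space (not learned).
w, b = (1.0, 1.0, -2.0), -0.5

predictions = {p: linear_predict(w, b, lift(*p)) for p in xor_data}
```

The scores are -0.5, 0.5, 0.5 and -0.5 for the four corners, so the linear predictor reproduces XOR exactly in the lifted space.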

Compare batch, mini-batch and stochastic gradient descent.

Batch gradient descent estimates the gradient from the entire dataset, mini-batch gradient descent from a small sample of datapoints, and stochastic gradient descent (SGD) from a single datapoint at each update step. The tradeoff is between how precise the gradient estimate is and how large a batch we can keep in memory. Moreover, using mini-batches rather than the entire batch has a regularizing effect, because it adds random noise at each update.
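
The three variants differ only in the batch size passed to the same loop, which the following sketch makes explicit. The one-parameter model y = w * x, the noiseless toy data, the learning rate, and the function names are all assumptions made for this example.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, epochs=50, lr=0.5, seed=0):
    """Gradient descent with a configurable batch size.

    Returns the fitted weight and the number of update steps taken.
    """
    rng = random.Random(seed)
    w, steps = 0.0, 0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)  # new sample order every epoch
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= lr * gradient(w, batch)
            steps += 1
    return w, steps


# Ten noiseless points on the line y = 3x, with x in 0.1 .. 1.0.
data = [(k / 10, 3.0 * k / 10) for k in range(1, 11)]

w_batch, steps_batch = train(data, batch_size=len(data))  # batch GD
w_mini, steps_mini = train(data, batch_size=5)            # mini-batch GD
w_sgd, steps_sgd = train(data, batch_size=1)              # SGD
```

With 10 datapoints and 50 epochs, batch gradient descent makes 50 updates, mini-batches of 5 make 100, and SGD makes 500. Because this toy data has no noise, all three recover a weight close to 3, so the sketch shows the difference in update count rather than the regularizing effect.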

What is the correct lemmatization of the word: gespielt?

spielen

What is the formula for recall?

tp / (tp + fn)

What is the formula for precision?

tp / (tp + fp)
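
The precision and recall formulas can be sketched directly from counts of true positives (tp), false positives (fp), and false negatives (fn). The helper `confusion_counts` and the example labels are assumptions for illustration, and the functions assume non-zero denominators.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)  # assumes tp + fp > 0


def recall(tp, fn):
    """Fraction of true positives that were actually found."""
    return tp / (tp + fn)  # assumes tp + fn > 0


def confusion_counts(y_true, y_pred):
    """Count tp, fp and fn for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
tp, fp, fn = confusion_counts(y_true, y_pred)  # 2, 1, 2

p = precision(tp, fp)  # 2 / 3
r = recall(tp, fn)     # 2 / 4
```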

What are some advantages in using a CNN (convolutional neural network) rather than a DNN (dense neural network) in an image classification task?

While both models can capture the relationship between close pixels, CNNs have the following properties. Translation invariance: the same filter is applied at every position, so the exact location of a pattern is irrelevant to the filter. Less likely to overfit: the typical number of parameters in a CNN is much smaller than that of a DNN. Better understanding of the model: we can look at the filters' weights and visualize what the network "learned". Hierarchical nature: it learns complex patterns by describing them in terms of simpler ones.
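
The parameter-count argument can be made concrete with a little arithmetic. The sizes below (a 32x32 RGB input, 16 output channels, 3x3 kernels) are hypothetical values picked for this sketch, and both layers are assumed to have one bias per output unit or channel.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2-D convolution with square kernels."""
    weights_per_filter = in_channels * kernel_size * kernel_size
    return out_channels * weights_per_filter + out_channels


height, width, in_channels, out_channels = 32, 32, 3, 16

# Dense layer mapping the flattened image to a same-sized 16-channel map.
dense_count = dense_params(height * width * in_channels,
                           height * width * out_channels)

# Convolution producing the same 16-channel map with shared 3x3 filters.
conv_count = conv2d_params(in_channels, out_channels, kernel_size=3)
```

For these sizes the dense layer needs 50,348,032 parameters and the convolution needs 448, because the convolution reuses the same small filters at every image position.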

What is the concatenated ReLU?

x --> (max(0, x), max(0, −x)). It doubles the number of activations, but keeps the norm of the signal intact during both the forward and backward passes.
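
A minimal sketch of the forward pass, checking the norm claim on one example vector. The function names and the sample signal are assumptions for illustration; the positive parts are concatenated first, followed by the negative parts.

```python
import math


def crelu(xs):
    """Concatenated ReLU: all max(0, x) values, then all max(0, -x) values."""
    return [max(0.0, x) for x in xs] + [max(0.0, -x) for x in xs]


def l2_norm(xs):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in xs))


signal = [3.0, -4.0, 0.0, 1.5]
activated = crelu(signal)  # twice as many activations as inputs

norm_before = l2_norm(signal)
norm_after = l2_norm(activated)
```

Each input contributes its absolute value to exactly one of the two halves, so the squared entries sum to the same total and the norm is unchanged. A plain ReLU would instead discard the -4.0 entry and shrink the norm.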

Brief summary of techniques that allow deep networks.

• rectifiers to prevent the gradient from vanishing during the backward pass
• dropout to force a distributed representation
• batch normalization to dynamically maintain the statistics of activations
• identity pass-through to keep a structured gradient and distribute representation
• smart initialization to put the gradient in a good regime

