Deep Learning with Python by Francois Chollet, Deep Learning, New ML

How does dropout help improve performance?

"In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex coadaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units."

The output unit of a network can be considered as what?

(Statistical) model parameterized by the weights

How can the optimization part of learning be modified?

- Adaptation of the learning rate - Inclusion of a regularization term - Inclusion of momentum

What does a language model do?

- Assigns a sequence of tokens (words, characters, ...) a probability (likelihood of observing this sequence in text) - Probability(the cat sat on the mat) > Probability(mat sat cat on the the) - Can be used to generate language

What are the differences between a constituent tree and a dependency tree?

- A constituent tree (built from a context-free grammar, CFG) uses non-terminal symbols and describes how the sentence is structured into components/phrases - A dependency tree (from dependency parsing) gives head/dependent relationships between words

What are the properties of Gated Recurrent Units (GRUs)?

- Can store memory in cell indefinitely - Can forget past memory and reset everything - Can go back to standard RNN-Mode

Instead of words what else can be embedded?

- Characters (i n s i g h t f u l) - Syllables (in sight ful) - Morphemes (insight ful)

What are some tree structure violations?

- Cycles - Multiple parents of a node - Non-connected parts

What are the four big wins of neural MT?

- End-to-end training - Distributed representations share strength - Better exploitation of context - More fluent text generation

What changes were implemented in the DCGAN (Deep convolutional GAN)?

- FC layers were converted to convolutions - Batchnorm was used in both G and D - Pooling replaced by strided convolutions in D and strided transposed convolutions in G - among other things

Why are sentence/document embeddings needed?

- For clustering - For retrieval - As an alternative to sentence representations learned from word-level models (e.g. CNN)

What are the advantages of the Maxout function?

- Generalizes ReLU and Leaky ReLU - Gradients don't die

What are three optimization methods?

- Gradient descent - Newton methods - Conjugate gradient

What are the advantages of ReLU function?

- Gradients don't die in +Region - Computationally efficient - Convergence is faster

What are the advantages of Leaky ReLU function?

- Gradients don't die in +Region - Gradients don't die in -Region - Computationally efficient - Convergence is faster

What methods are used to search for good hyperparameters?

- Grid Search - Random Search - Bayesian Methods

What are three technical forces that are driving advances in machine learning?

- Hardware - Datasets and benchmarks - Algorithmic advances

What are the training methods of Word2Vec?

- Hierarchical softmax - Negative sampling

What three things do we need to do machine learning?

- Input data points - Examples of the expected output - A way to measure if the algorithm is doing a good job

What is the F1 Score?

- It is a measure of a test's accuracy - It is the harmonic mean of the precision and recall scores
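
As a minimal sketch of the harmonic-mean definition, the F1 score can be computed from raw counts. The function name `f1_score` and the example counts are illustrative assumptions, not part of the original card.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
score = f1_score(tp=8, fp=2, fn=4)
print(round(score, 4))
```

The harmonic mean is dominated by the smaller of the two values, so a model cannot reach a high F1 by maximizing precision or recall alone.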

Why can't we add more and more context to POS tagging?

- It will get slower (more parameters) - Overfitting

What are the advantages of Mini-Batch learning?

- Mini-batch learning converges faster to a good solution than batch learning - Computing loss function gradient on all datapoints can be computationally expensive

What are the disadvantages of the sigmoid function?

- Not zero centered - Kills gradients

What are problems with the direct transfer of words between languages in bilinguality?

- OOV Words - Syntactic ordering

Which representations does ELMo combine?

- One on character level - Two representations obtained from the two layers in an RNN

What are the uses of CNNs in NLP?

- Sentence classification - Semantic role labeling (SRL) - Character-based approaches

What linguistic information is captured in sentence embeddings?

- Sentence length - Word order - Whether a certain word is in the sentence - Agreement between subject and verb

What is the difference between Skip-thoughts and SDAE in sentence embeddings?

- Skip-thoughts requires text in context - SDAE only requires individual sentences without context - Both are unsupervised methods

What are tasks Machine Learning is used for?

- Spam filtering - Fraud detection - Face detection - Recommendation systems

What are the properties of CNNs?

- Sparse connectivity - Parameter sharing

What are the properties of InferSent?

- Supervised - Trains on high-quality data (Stanford Natural Language Inference Data (SNLI)) - Uses LSTM (RNN variant, bidirectional)

To make learning easier for your network, your data should:

- Take "small" values: typically most values should be in the 0-1 range. - Be homogenous, i.e. all features should take values roughly in the same range.

By which two key parameters are convolutions defined?

- The size of the patches that are extracted from the inputs - The depth of the output feature map, i.e. the number of filters computed by the convolution

How do Paragraph Vector Models work?

- They learn word vectors and paragraph vectors at the same time - Very similar to CBOW and Skip-Gram model, but with an id for each sentence/paragraph

What are the ideas of Concatenated Power Mean Embeddings?

- To generalize the average to the so-called power mean - To concatenate diverse averaged word embeddings

What are two naive approaches for sentence embeddings?

- Treat sentence as long word, predict surrounding sentences like in CBOW or SKIP-GRAM - Take some sort of mean (e.g. arithmetic mean of words in the sentences = centroid) (good baseline)

Which two measures are used to evaluate dependency parsing?

- UAS (only count correct relations) - LAS (also include the types)

What do you need to do first to find the border between underfitting and overfitting?

- build a model that overfits - monitor the training loss and validation loss - when you see that the performance of the model on the validation data starts degrading, you have achieved overfitting

How to deal with nonstationary problems?

- constantly retrain your model on data from the recent past - gather data at a timescale where the problem is stationary

What four steps does training loop consist of?

- draw a batch of training samples x and corresponding targets y - run the network on x (this is called "forward pass"), obtain predictions y_pred - compute the "loss" of the network on the batch, a measure of the mismatch between y_pred and y - update all weights of the network in a way that slightly reduces the loss on this batch
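
The four steps above can be sketched on a toy problem. This is only an illustration under stated assumptions: the data (y = 2x + 1), the linear model, the learning rate, and the batch size are all made up for the example, and plain mini-batch gradient descent stands in for whatever optimizer a real network would use.

```python
import random

random.seed(0)

# Toy dataset (assumption): targets follow y = 2x + 1 exactly.
data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(100)]

w, b = 0.0, 0.0          # the "weights of the network"
learning_rate = 0.1
batch_size = 10

def mse(w, b, samples):
    """Mean squared error of the linear model on the given samples."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)

initial_loss = mse(w, b, data)

for step in range(2000):
    # 1. Draw a batch of training samples x and corresponding targets y.
    batch = random.sample(data, batch_size)
    # 2. Forward pass: run the model on x to obtain predictions y_pred.
    y_pred = [w * x + b for x, _ in batch]
    # 3. Compute the loss on the batch (mismatch between y_pred and y).
    loss = sum((p - y) ** 2 for p, (_, y) in zip(y_pred, batch)) / batch_size
    # 4. Update the weights in the direction that slightly reduces the loss.
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(y_pred, batch)) / batch_size
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(y_pred, batch)) / batch_size
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b, data)
print(round(w, 3), round(b, 3))
```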

What two options for encoding the labels do you have when dealing with multiclass classification?

- encoding the labels via "categorical encoding" (also known as "one-hot encoding") and using categorical_crossentropy as your loss function. - encoding the labels as integers and using the sparse_categorical_crossentropy loss function.

What are some ways to fight overfitting in neural networks?

- reduce the network's size (number of layers and units per layer) - Adding weight regularization - Adding dropout

What evaluation techniques do exist?

- simple hold-out validation - k-fold validation - iterated K-fold validation with shuffling

What can you do with TensorBoard?

- visually monitor your metrics during training - visualize your model architectures - visualize histograms of activations and gradients - exploring embeddings in 3D

Which two pathologies are often encountered when training GANs?

1) Oscillations without convergence. Contrary to standard loss minimization, we have no guarantee here that it will actually decrease. 2) The infamous "mode collapse", when G models very well a small sub-population, concentrating on a few modes.

Name some techniques which help the training of very deep architectures.

1) Rectifiers: prevent the vanishing of the gradient 2) Dropout: force a distributed representation 3) Batch Normalization: dynamically maintain the statistics of the activations 4) smart initial conditions

How to denoise auto-encoders?

1. Add some noise to the input 2. Should learn to "remove the noise"

What are the steps of the Data Science Process?

1. Data Collection 2. Data Preparation 3. Exploratory Data Analysis 4. Machine Learning 5. Visualization

What are the rankings of the following models in regards of usefulness (RNN / LSTM / GRU)?

1. LSTM (most complex units) 2. GRU (more complex units) 3. RNN (least complex units)

How does BiVCD work?

1. Merge aligned Documents, then random shuffle all words 2. Then run a Monolingual Model (e.g. CBOW, Glove, Skip-Gram) on it

How do you now train an NLP system with sense-disambiguated embeddings?

1. Run word2vec on your data and compute embeddings 2. For each target word, represent its context as avg. or concatenated embedding 3. Cluster the context representations, and assign each word's context to a cluster 4. Run word2vec on sense-disambiguated corpus

How does one get a deep auto-encoder?

1. Stack multiple hidden layers 2. Train each hidden layer independently as an auto-encoder

What is the intrinsic evaluation scheme of sentence embeddings?

1. Take your sentence embedding model 2. Embed your sentences in an intrinsic task 3. Use cosine to measure distance between pairs 4. Correlate with human judgments

What is the difference between a Perceptron and Logistic Regression?

A perceptron is one of the most basic neural models used for classification. For a binary classification problem, the output can be either 0 or 1. This is just like simple logistic regression, where we use a logistic (sigmoid) function to generate a probability between 0 and 1. So what is the difference between the two? Simply put, it is the difference in the threshold function: when we restrict the logistic regression model to output either exactly 1 or exactly 0, we get a perceptron model.

What are the two main components of a GAN network?

A generator, which generates data and a discriminator, which predicts whether the data is from the generator or the true data set.

What is a residual network?

A residual network has a branch where x splits off and re-joins the output of the block later, y = f(x) + x (skip connection). The parameters are then trained to model the difference between the value before the block and the one needed after it (the residual). This helps to mitigate the shattered gradient effect (where the relation between the input and the gradient with respect to the input is shattered by the depth).

What is NP => DET A** N PP** ?

A so-called phrase structure (aka context-free grammar) rule

What is the universal approximation theorem?

Any continuous function can be approximated by a perceptron with a single hidden layer; i.e., a combination of ReLUs can approximate any continuous function. It states that the training error can be made as low as we want by making the layer larger. However, nothing is stated about the test error.

As what can mini-batch learning be seen?

As an intermediate solution between batch and online learning

What Is a Multi-layer Perceptron(MLP)?

Like other neural networks, an MLP has an input layer, one or more hidden layers, and an output layer. It has the same structure as a single-layer perceptron, plus the hidden layers. A single-layer perceptron can classify only linearly separable classes with binary output (0, 1), but an MLP can classify nonlinearly separable classes. Except for the input layer, each node uses a nonlinear activation function: it computes a weighted sum of the outputs of the previous layer and passes it through the activation to produce its output. An MLP is trained with a supervised learning method called "backpropagation." In backpropagation, the neural network calculates the error with the help of a cost function and propagates this error backward from where it came, adjusting the weights to make the model more accurate.

What do auto-regression methods do?

Auto-regression methods model components of a signal serially, each one conditionally to the ones already modeled.

What Is Bagging and Boosting?

Bagging and Boosting are ensemble techniques to train multiple models using the same learning algorithm and then taking a call. With Bagging, we take a dataset and split it into training data and test data. Then we randomly select data to place into the bags and train the model separately. With Boosting, the emphasis is on selecting data points which give wrong output to improve the accuracy.

Which type of learning computes the gradient based on all datapoints?

Batch learning

What is the main objective of batch processing? How does it achieve this?

Batch processing speeds up computations by cutting down the amount of times parameters need to be copied to the cache. Memory transfers to the cache are very slow. Additionally, it reduces the amount of required Python loops.

classification task where each input sample should be categorized into two exclusive categories

Binary classification

What's the relationship between dependency parsing and semantics?

By considering the attachment decisions, several meaning relevant questions about the sentence are answered

How would you get the co-occurrence count of a word?

By creating the co-occurrence matrix and reading it off

What is the main weakness of the original seq2seq model for translation? This weakness is fixed by attention mechanisms.

All the information must flow through a single state v, which must have the capacity for any situation. Additionally, there are no direct channels to transport local information to where it is useful. Attention mechanisms fix this by providing such direct channels.

What problem does a conditional GAN fix?

All the models we have seen so far model a density in high dimension and provide means to sample according to it, which is useful for synthesis only. However, most of the practical applications require the ability to sample a conditional distribution. The conditional GAN consists of parameterizing both G and D by a conditioning quantity Y.

De-noising autoencoders.

Capture dependencies between signal components to restore a degraded input. However, the best reconstruction may sometimes be unlikely as it computes an average.

What problem does positional encoding fix when using a causal network to generate pixels?

Causal models alone have no way to make a prediction position-dependent, which leads to results that are locally consistent but fragmented. Positional encoding fixes this by providing information about the position.

What Are the Programming Elements in Tensorflow?

The examples below use the TensorFlow 1.x graph API; in TensorFlow 2.x, tf.placeholder and tf.Session are available under tf.compat.v1.

Constants - Constants are parameters whose value does not change. To define a constant, we use the tf.constant() command. For example:

    import tensorflow as tf

    a = tf.constant(2.0, tf.float32)
    b = tf.constant(3.0)
    print(a, b)

Variables - Variables allow us to add new trainable parameters to the graph. To define a variable, we use the tf.Variable() command and initialize it before running the graph in a session. An example:

    W = tf.Variable([.3], dtype=tf.float32)
    b = tf.Variable([-.3], dtype=tf.float32)

Placeholders - These allow us to feed data to a TensorFlow model from outside the model. A placeholder permits a value to be assigned later. To define a placeholder, we use the tf.placeholder() command. An example:

    a = tf.placeholder(tf.float32)
    b = a * 2
    with tf.Session() as sess:
        result = sess.run(b, feed_dict={a: 3.0})
        print(result)

Sessions - A session is run to evaluate the nodes. This is called the "TensorFlow runtime." For example:

    a = tf.constant(2.0)
    b = tf.constant(4.0)
    c = a + b
    # Launch the session
    sess = tf.Session()
    # Evaluate the tensor c
    print(sess.run(c))

What is the FastText approach?

Embed n-grams => Words are represented as bags of character n-grams => Representation for a word is given by average over its n-gram embeddings

What is end-to-end learning? Give a few of its advantages.

End-to-end learning is usually a model which gets the raw data and outputs directly the desired outcome, with no intermediate tasks or feature engineering. It has several advantages, among which: there is no need to handcraft features, and it generally leads to lower bias.

What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?

Epoch - Represents one iteration over the entire dataset (everything put into the training model). Batch - Refers to when we cannot pass the entire dataset into the neural network at once, so we divide the dataset into several batches. Iteration - One weight update on a single batch. If we have 10,000 images as data and a batch size of 200, then an epoch should run 50 iterations (10,000 divided by 200).
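
The arithmetic for the 10,000-image example can be checked with a short sketch. The variable names and the choice of 3 epochs are illustrative assumptions; rounding up covers the case where the dataset size is not a multiple of the batch size.

```python
import math

num_samples = 10_000   # images in the dataset
batch_size = 200       # samples processed per iteration
num_epochs = 3         # hypothetical number of passes over the data

# One epoch visits every sample once, one batch at a time.
iterations_per_epoch = math.ceil(num_samples / batch_size)
total_iterations = iterations_per_epoch * num_epochs

print(iterations_per_epoch, total_iterations)
```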

What's information leak?

Every time you are tuning a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data is leaking into your model

How should an advanced optimizer like AdaGrad or Adam be used with SGD (Mini-Batch)?

First train with Adam, fine-tune with SGD

Explain how a GAN is trained.

First, the generator G is fixed while the discriminator D is trained. Then, G is trained with the results from D and D is fixed. The loss of G is high if D is doing a good job at detecting fakes.

What does parameter sharing mean in the context of a CNN?

For each window, we use the same weights and bias values

When is the Rouge Score used?

For evaluating automatic summarization by orienting on the Recall-Score

When is the BLEU Score used?

For evaluating the quality of text which has been machine-translated from one natural language to another. (Quality is comparison between output of machine and human)

Why do we need new measures for higher level NLP tasks?

For higher level NLP tasks, automatic measures may correlate poorly with human evaluation

What are examples of self-supervised learning?

For instance, "autoencoders" are a well-known instance of self-supervised learning, where the generated targets are the inputs themselves, unmodified

When is the F1 measure used for evaluation?

For two or more classes, one typically computes the F1- score of each class and then combines this (eg. averaging) in an overall score

What is one advantage of dilation?

Having a dilation greater than one increases the units' receptive field size without increasing the number of parameters.
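
A small sketch can make this concrete. It assumes a stack of stride-1 1D convolutions with the same kernel size; the helper name `receptive_field` and the dilation schedules are illustrative choices. Both stacks below have four layers with kernel size 3, so they have the same number of weights, yet the dilated stack sees a much wider input window.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

plain = receptive_field(3, [1, 1, 1, 1])     # no dilation
dilated = receptive_field(3, [1, 2, 4, 8])   # dilation doubles per layer
print(plain, dilated)
```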

Batch vs Layer Norm

If our activations have shape N*C*H*W, batch norm normalizes each channel using statistics computed over all examples in the batch and the spatial dimensions (N, H, W). Layer norm normalizes each example using statistics (mean and variance) computed over all of its channels and spatial dimensions (C, H, W), so the statistics are independent of the batch. Instance norm normalizes each channel of each example separately (over H, W). In all three forms, a learned scale and shift are applied at the end. Put differently: in batch normalization, the input values of the same neuron are normalized over all the data in the mini-batch, whereas in layer normalization, the input values of all neurons in the same layer are normalized for each data sample.
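
The difference in normalization axes can be sketched on a small 2D batch (rows are samples, columns are neurons). This is a simplified illustration: the learned scale and shift are omitted, and the helper names and the example batch are assumptions made for the sketch.

```python
from statistics import mean, pvariance

def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = (pvariance(values) + eps) ** 0.5
    return [(v - mu) / sigma for v in values]

def batch_norm(batch):
    # Statistics per neuron (column), computed across the samples.
    columns = [normalize(col) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    # Statistics per sample (row), computed across the neurons.
    return [normalize(row) for row in batch]

batch = [[1.0, 10.0, 100.0],
         [2.0, 20.0, 200.0],
         [3.0, 30.0, 300.0]]

bn = batch_norm(batch)   # every column now has mean ~0
ln = layer_norm(batch)   # every row now has mean ~0
```

Because layer norm only looks at one row at a time, its result for a sample does not change when the other samples in the batch change.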

What's the problem of redundancy in your data?

If some data points in your data appear twice, then shuffling the data and splitting it into a training set and a test set will result in redundancy between the training and test set. In effect, you would be testing on part of your training data.

What Keras class is used for data augmentation?

ImageDataGenerator

What are the advantages with n-gram language models?

Implementing an n-gram language model with an MLP is (also) easy

What is the advantage of a bidirectional RNN?

Infinite window size in both directions => Context knowledge from both past and future

What is the advantage of a RNN?

Infinite window size to left => remembers everything (context) from past

Describe two ways to visualize features of a CNN in an image classification task.

Input occlusion: cover a part of the input image and see which part affects the classification the most. For instance, given a trained image classification model, feed it several copies of an image with a different region covered in each. If one occluded copy is still classified as a dog with 98% probability while another is classified as a dog with only 65% probability, the part covered in the second copy is more important. Activation maximization: the idea is to create an artificial input image that maximizes the target response (gradient ascent).

What is the Wasserstein distance?

Intuitively, it is the minimum mass displacement to transform one distribution into the other. It solves the problem of the JS divergence regarding the structure of the distributions.

What is a temporal convolutional network?

It is a 1d convolutional network that processes an input of the maximum length. It uses dilated convolutions and the model size is O(log T). The memory footprint is O(T log T). This is the simplest approach to sequence processing.

What is the "maxout" layer?

It is not technically an activation function, as it has trainable parameters, but it takes the max of several linear units. It can encode ReLU as well as other functions, even any convex function.

What is a directed acyclic graph (DAG) required for?

It is required to compute derivatives.

What is neural machine translation?

It is the approach of modeling the entire MT process via one big artificial neural network

What are adversarial networks?

It is where one network tries to spot the mistakes of another one, while this network tries to fool the "spotter" network.

classification task where each input sample should be categorized into more than two categories

Multi-class classification

Does a standard attention layer take into account the absolute locations of the values? If not, what is a solution for this problem?

No, a standard attention layer does not take into account the position of the values. To fix this, we can provide the network with a positional encoding, which can be concatenated with the input batch.

Should we do cross-validation in deep learning?

No. The variance across folds decreases as the sample size grows. Since we usually do deep learning only when we have thousands of samples, there is not much point in cross-validation.

What does sparse connectivity mean in the context of a CNN?

Not every input is connected to every output in the following layer

What assumptions does the gradient descent method make?

It makes a strong assumption about the magnitude of the local curvature in the selection of a fixed step size, and it assumes isotropy in that the same step size should be used in all directions.

Transposed Convolutions.

They expand a low-dimensional (lower-resolution) input into a higher-dimensional (higher-resolution) output.

When approximating a deep network 'f', with a single hidden layer 'g', how does the width of 'g' grow as the depth of 'f' grows?

Exponentially.

What is the idea of pooling?

Extracting relevant features independent of their position in the input

How does transition-based dependency parsing work?

It only builds one tree, in one left-to-right sweep over the input

Why is language modeling too restrictive?

It only considers the left context and not the right context

Name one characteristic of the t-Distributed Stochastic Neighbor Embedding (t-SNE).

It optimizes the low-dimensional points y_i with SGD so that each point keeps its distances to its close neighbors. This makes it easier to distinguish between clusters. It is often used to reduce high-dimensional data.

What is the receiver operating characteristic (ROC) curve?

It shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR) as the decision threshold varies. A standard measure of this curve is the Area Under the Curve (AUC).

What does a dependency structure show?

It shows which words depend on (modify or are arguments of) which other words

How does BiSkip work?

It uses sentence and word aligned texts, then runs a skip-gram model whose contexts are words from both languages

It consists in applying K-fold validation multiple times, shuffling the data every time before splitting it K-ways (name evaluation technique)

Iterated K-fold validation with shuffling

Split your data into N partitions of equal size. For each partition i, train a model on the remaining N-1 partitions, and evaluate it on partition i. Your final score would then be the averages of the N scores obtained (name evaluation technique)

K-fold validation
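
The procedure described in this card can be sketched as follows. The helper names and the `train_and_evaluate` callback are hypothetical; letting the last fold absorb any remainder is an assumption of this sketch, since the card assumes partitions of equal size.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_size = num_samples // k
    for i in range(k):
        start = i * fold_size
        # Assumption: the last fold absorbs any leftover samples.
        stop = num_samples if i == k - 1 else start + fold_size
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, num_samples))
        yield train, val

def k_fold_score(num_samples, k, train_and_evaluate):
    """Average the validation scores obtained on the k folds.

    `train_and_evaluate(train, val)` is a hypothetical callback that trains
    a fresh model on `train` and returns its score on `val`.
    """
    scores = [train_and_evaluate(train, val)
              for train, val in k_fold_indices(num_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
```

For the iterated variant with shuffling, the same routine would be run several times on a freshly shuffled index list and the resulting scores averaged.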

What is unsupervised learning?

Learning a model from unlabeled data

How are Encoder-Decoder models used in sentence embeddings?

Let the input sequence equal the output sequence, and take the final hidden vector on the input side to be the sentence representation (auto encoder)

What is Overfitting and Underfitting, and How to Combat Them?

Overfitting occurs when the model learns the details and noise in the training data to the degree that it adversely impacts the execution of the model on new information. It is more likely to occur with nonlinear models that have more flexibility when learning a target function. An example would be if a model is looking at cars and trucks, but only recognizes trucks that have a specific box shape. It might not be able to notice a flatbed truck because there's only a particular kind of truck it saw in training. The model performs well on training data, but not in the real world. Underfitting alludes to a model that is neither well-trained on data nor can generalize to new information. This usually happens when there is less and incorrect data to train a model. Underfitting has both poor performance and accuracy. To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and by having a validation dataset to evaluate the model.

What's padding?

Padding consists in adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit center convolution windows around every input tile.
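
As a rough illustration, zero-padding and its effect on the output size can be sketched in plain Python. The helper names are assumptions, and the size formula is the usual one for a convolution with a given kernel size, padding, and stride.

```python
def pad_2d(feature_map, pad):
    """Zero-pad a 2D feature map (list of rows) by `pad` on every side."""
    width = len(feature_map[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in feature_map:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded += [[0] * width for _ in range(pad)]
    return padded

def conv_output_size(input_size, kernel_size, pad=0, stride=1):
    """Output size along one spatial dimension."""
    return (input_size + 2 * pad - kernel_size) // stride + 1

padded = pad_2d([[1, 2], [3, 4]], pad=1)

# A 3x3 window on a 5x5 input: 3x3 output without padding,
# 5x5 output (same as the input) with one row/column of padding per side.
without_padding = conv_output_size(5, 3)
with_padding = conv_output_size(5, 3, pad=1)
```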

What is context attention?

Look it up online.

What are basic clusterings and embeddings (K-means, PCA) used for?

Parameterize and re-parametrize multiple times the input signal into representations that are more and more invariant and free of noise.

What is the most common pooling in NLP?

Max-over-time pooling

common metrics for class-imbalanced problems

Precision-Recall

What are soft alignments in attention-based encoder-decoder models?

Probability distribution over tokens that a current token aligns to

classification task where each input sample can be assigned multiple labels; the number of labels per observation is usually variable

Multi-label classification

Are strided convolutions often used?

No

Can you use in your workflow some quantity computed on your test data?

No

Is square loss a good loss function for multi-class prediction?

No

What is the CBOW task?

Given a context, predict the missing word

What is the Skip-gram task?

Given a word, predict the context words
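
The CBOW and Skip-gram tasks differ only in which side of the (context, word) pair is the input. A minimal sketch of how training examples could be built for both, assuming a symmetric context window; the function names and the window size are illustrative choices:

```python
def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word) pairs."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Skip-gram: one (center word, context word) pair per context word."""
    return [(target, c)
            for context, target in cbow_examples(tokens, window)
            for c in context]

tokens = "the cat sat on the mat".split()
cbow = cbow_examples(tokens, window=1)
skip = skipgram_examples(tokens, window=1)
```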

How many output units does a perceptron have?

One

What is the simplest way to generate an adversarial input?

Optimize the input to maximize the loss of the classifier. The goal is to have a modified image where the changes are indistinguishable to the human eye, but the image is misclassified.

Is the autograd graph the same as the structure of the network?

No, they are related but may differ.

Why are contextualized embeddings useful?

- Because of ambiguity - Embed words with consideration to their context

What are the sources for determining dependency relations?

- Bilexical affinities - Dependency distance - Intervening material - Valency of heads

Which properties do dependency relations have?

- Binary - Asymmetric - Sometimes typed

What are the components and operations of a transition-based system?

- Buffer - Stack - A set of arcs - Transitions (left-arc, right-arc, etc.)

What are the advantages of FastText?

- Can embed OOV Words - Naturally works for morphologically rich languages

Why are hidden layers needed?

- Can learn useful intermediate representations of the data - Good organization of layers makes learning much faster

What three key choices you have to make while settling on a first basic network architecture?

- Choice of the last-layer activation - Choice of loss function - Choice of optimization configuration: optimizer, learning rate

What are the disadvantages of Leaky ReLU function?

Not zero centered

What are the components of a Network Structure?

- Input - Weights - Summation and Bias - Activation - Output Neuron

What are the problems with the RNN Encoder-Decoder architecture?

- Long range dependencies - Context vector c has fixed length - Input size is variable

What is the development set used for?

- Optimization of hyperparameters - Early Stopping

What are the problems of pooling?

- Output remains the same if a feature occurs once or multiple times - Order of features is not considered

What are sequence labeling (sequence tagging) tasks for a RNN?

- POS tagging - NER - Grapheme-to-Phoneme Conversion - Lemmatization - Language Modeling

How to avoid overfitting?

- Reduce number of features - Do model selection - Use regularization - Do cross-validation

What are the three types of NLP tasks?

- Sentence / Text Classification - Sequence Tagging - Seq2Seq

What is the main difference between the convolutional layer and the dense layer?

- Sparse connectivity - Parameter sharing - In principle variable-sized inputs (in practice not)

In what sets do we split the input data?

- Training Set - Testing Set - Validation Set

How are Encoder-Decoder models constructed?

- Two RNNs stacked together - The last hidden layer in the input is taken as representation of the input

What are the disadvantages of Mini-Batch learning?

- Unclear of how to choose k (small --> better solutions, large --> more computationally efficient)

What are the differences between all the embedding approaches?

- Unit of representation (characters, words, phrases, ...) - Definition of context for training (CBOW, skip-gram, Glove, ...)

What's the difference between Word2Vec and DepEmbeddings?

- Word2Vec finds words that associate with other words (domain similarity) - DepEmbeddings finds words that behave like others (functional similarity)

What is the difference between the tanh function compared to the sigmoid function?

- Zero centered - Interval from (-1, 1) - Also kills gradients

topology and parameters for single-label categorical classification

- ends with Dense layer - number of units at the end is equal to the number of classes - softmax activation - loss: categorical_crossentropy if targets are one-hot encoded; sparse_categorical_crossentropy if they are integers

What are the Toolkits for training word representations?

- word2vec - GloVe

What's fine-tuning?

- unfreeze a few of the top layers of a frozen model base - jointly train both the newly added part of the model and these top layers

dropout rate is usually set between

0.1 and 0.5
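
To show what the dropout rate controls, here is a minimal sketch of "inverted" dropout on a list of activations. The function name, the example activations, and the use of the inverted variant (scaling the surviving units at training time) are assumptions of this sketch, not something the card specifies.

```python
import random

def dropout(activations, rate, training=True):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    # Survivors are scaled by 1 / keep_prob so the expected value is unchanged.
    return [a / keep_prob if random.random() < keep_prob else 0.0
            for a in activations]

random.seed(0)
hidden = [0.5, 1.0, -0.3, 2.0]
dropped = dropout(hidden, rate=0.5)                    # some units zeroed
unchanged = dropout(hidden, rate=0.5, training=False)  # no-op at test time
```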

How does an n-gram language model work?

1. Split text into fragments (grams) 2. Get probability of a following word by having n-grams (n amount of fragments) as context
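
The two steps above can be sketched with simple counts. This is a maximum-likelihood illustration without smoothing; it assumes the usual convention that an n-gram model conditions on the previous n-1 tokens, and the function names and toy sentence are made up for the example.

```python
from collections import Counter, defaultdict

def train_ngram_model(tokens, n=2):
    """Count which word follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_word_probability(counts, context, word):
    """Relative frequency of `word` after `context` (0.0 if context unseen)."""
    following = counts.get(tuple(context))
    if not following:
        return 0.0
    return following[word] / sum(following.values())

tokens = "the cat sat on the mat".split()
model = train_ngram_model(tokens, n=2)
p_cat = next_word_probability(model, ["the"], "cat")
```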

How does ELMo work?

1. The language model is pre-trained on a large corpus 2. For a new task, weights for the three representations are learned to get a task-specific representation 3. This task-specific representation is concatenated with standard static word embeddings

what is the best NN type for processing volumetric data?

3D convnets

What is an activation map?

A channel produced by a convolution layer.

When should cross-entropy loss be used?

In classification problems.

When are Encoder-Decoder models usually used?

In machine translation (e.g. translating a sentence from German to English)

What's reinforcement learning?

In reinforcement learning, an "agent" receives information about its environment and learns to pick actions that will maximize some reward.

What is the benefit of a RNN?

In theory, infinite influence from the past

What is an auto-encoder?

A neural network that is trained to attempt to copy its input to its output

What is an NVP network?

A non-volume preserving network

What is the output of a Softmax function?

A probability distribution over all given classes
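
A short sketch of the softmax computation in plain Python. Subtracting the maximum logit is a common numerical-stability trick and does not change the result; the example logits are made up.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    m = max(logits)                           # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```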

What is a receptive field?

A sub-area of an input map that influences a component of the output.

What does polysemy (or homonymy) mean?

A word may have many different meanings (e.g. fly/Fly)

What does weight initialization aim to control?

It aims to control the scale of activations and gradients so that weights evolve at the same rate across layers and no layer becomes saturated before the others.

How could one get a bidirectional RNN?

By running another RNN from right to left and then concatenating the forward and backward hidden states

How are extrinsic word representations evaluated?

By the performance of a model that uses the word representations for solving a task

What rule is used in backpropagation?

Chain rule

Name a few applications of the following: Classification, Regression, Density Estimation

Classification: Image recognition, cancer detection. Regression: Stock predictions. Density Estimation: Outlier detection, data viz.

What is the aim of pooling?

To compute a low-dimensional signal from a higher-dimensional one, grouping several activations into a single, more meaningful one. Examples are max pooling and average pooling.
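
A minimal one-dimensional sketch of both pooling variants, assuming non-overlapping windows (stride equal to the window size). The function names are illustrative.

```python
def max_pool_1d(signal, size):
    """Keep the largest activation in each non-overlapping window."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

def avg_pool_1d(signal, size):
    """Average the activations in each non-overlapping window."""
    return [sum(signal[i:i + size]) / size
            for i in range(0, len(signal) - size + 1, size)]

signal = [1, 3, 2, 5, 4, 0]
pooled_max = max_pool_1d(signal, 2)
pooled_avg = avg_pool_1d(signal, 2)
print(pooled_max, pooled_avg)
```

Six activations are reduced to three in both cases; only the way each window is summarized differs.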

What is the major architectural improvement of BERT?

Computing a word's context in bidirectional fashion instead of unidirectional only

What happens if you don't normalize your data?

It can trigger large gradient updates which will prevent your network from converging

What is the dimensionality of a Word-Vector in the "One Hot" vector representation?

It equals the size of the vocabulary

Intuitively, how does a network's "irregularity" grow with respect to its width and depth?

It grows linearly with its width and exponentially with its depth.

Noise 2 noise

Denoising can be achieved without clean samples if the noise is additive and unbiased. Used for image restoration.

Explain Shattered Gradient.

Depth shatters the relation between the input and the gradient wrt the input. Residual networks limit this effect. Since linear networks avoid this problem, it has been suggested to use Looks Linear initialization, which makes the network linear initially.

What NN should be used for Sentence / Text Classification tasks?

Perceptron / MLP / CNN

What NN should be used for Sequence Tagging tasks?

RNN

Why does the LSTM read words in reverse? (LSTM = Long Short Term Memory Networks)

Reading in reverse introduces many short term dependencies in the data and this makes the optimization problem much easier.

What tends to happen if weights are ill conditioned at initialization?

The gradients go towards zero in deeper layers.

What does each gate of the LSTM cell do?

The input gate decides whether to update the cell. The forget gate decides whether to reset the state to 0. The output gate decides whether or not to use the current cell state.

What is precision and recall?

Precision, recall, and F1 are metrics used to evaluate an NLP model, alongside accuracy. Accuracy is the ratio of correct predictions to the total number of predictions. Precision is the ratio of true positive instances to the total number of positively predicted instances. Recall is the ratio of true positive instances to the total number of actual positive instances.
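
These definitions can be sketched for binary labels (1 = positive, 0 = negative). The label lists are invented for illustration, and F1 is taken as the harmonic mean of precision and recall.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3.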

How are GPUs used in PyTorch?

The use of the GPUs in PyTorch is done by creating or copying tensors into their memory. Operations on tensors in a device's memory are done by the said device.

How is a non-projective dependency tree different from a projective one?

There are crossing edges in it

In autoregressive models, why are the best results achieved with cross-entropy?

This is due in large part to the ability of categorical distributions and cross-entropy to deal with exotic posteriors, in particular multi-modal. (A multimodal posterior is when conflicting interpretations of x exist)

What is the vanishing gradient problem for language models/sequence labeling?

Time steps far away are not taken into consideration (not specific to RNNs)

The input into a network can be considered as what?

Training/Test Data

When is an auto-encoder undercomplete?

When the hidden dimensionality is smaller than the input dimensionality

what's RNN?

a neural network that reuses quantities computed during the previous iteration of the loop

At what kind of problem do neural networks especially excel?

perceptual problems

what's underfitting?

there is still progress to be made; the network hasn't yet modeled all relevant patterns in the training data

You evaluate your model on the ... data

validation

how is vanishing gradient problem solved in RNNs?

with LSTM or GRU cells

how is vanishing gradient problem solved in very deep networks?

with residual connections

should you use the same dropout mask at every timestep when training RNNs?

yes

What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that indicates how important a word is to a document within a collection of documents. It is widely used in information retrieval and NLP to identify the keywords of a document, to weight terms for text classification and summarization, and to down-weight stop words. TF is the ratio of the frequency of a term in a document to the total number of terms in that document, whereas IDF measures how rare the term is across the collection. The formulas: TF(W) = (Frequency of W in a document) / (Total number of terms in the document); IDF(W) = log_e(Total number of documents / Number of documents containing the term W). TF*IDF is high when a term is frequent in a document but rare across the collection, and low when the term is rare in the document or appears in most documents. Search engines have used TF-IDF-style weighting to rank results by relevance to a query.
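
A small sketch of the TF and IDF formulas given in the answer, on a made-up three-document corpus. It assumes each queried term occurs in at least one document, since the plain formula is undefined otherwise; libraries typically add smoothing.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency with a natural logarithm (no smoothing)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(tf_idf("the", docs[0], docs))  # appears in every document
print(tf_idf("mat", docs[0], docs))  # appears in one document only
```

"the" occurs in every document, so its IDF and therefore its TF-IDF is zero, while "mat" is specific to the first document and scores highest there.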

What is the difference between momentum descent and the Adam optimizer?

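Plain momentum applies the same learning rate to every coordinate. Adam rescales the update of each coordinate separately, based on moving averages of the gradient and of its square computed for that coordinate.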

What is the difference between an LSTM and a GRU?

The LSTM cell has 3 gates: the input gate, the forget gate and the output gate. The GRU only has 2 gates: the update gate and the reset gate. GRUs therefore have fewer parameters to train and perform similarly to LSTMs.

What is one advantage of the Leaky-ReLU versus the regular ReLU?

The Leaky ReLU mitigates the vanishing gradient ("dying ReLU") problem sometimes encountered with the ReLU: its derivative is never zero, whereas the derivative of the regular ReLU is zero for negative inputs. Leaky ReLU --> max(ax, x) with 0 < a < 1.

What do bias and variance quantify in a model?

The bias quantifies how well the model fits the data on average, and the variance quantifies how much the model changes across different datasets.

What do you mean by exploding and vanishing gradients?

Gradient descent tries to minimize the error by taking small steps towards the minimum. These steps are used to update the weights and biases of a neural network. Sometimes the steps become too large, which results in very large updates to the weights and bias terms, so much so that the weights overflow (or become NaN). This makes training unstable and is called an exploding gradient. Conversely, when the steps are too small, the weights and bias terms change very little, sometimes negligibly. The model then keeps almost the same weights and biases and never reaches the minimum of the error function. This is called the vanishing gradient. Both issues are especially evident in Recurrent Neural Networks.

What happens to the discriminator if the initial samples generated by G are too unrealistic?

D separates real and generated samples too easily: its response saturates, the loss ends up far in the tail of the sigmoid function, and the gradient passed to G vanishes, so G will not train. This is solved by implementing a non-saturating cost for G. The loss of D remains unchanged.

What is batch normalization?

The main goal is to maintain proper statistics of the activations and derivatives. This is done by forcing the activation statistics during the forward pass by re-normalizing them. During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch. During test, it simply shifts and rescales according to the empirical moments estimated during training.
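
A minimal sketch of the train/test difference for a single feature, in plain Python. The learnable scale `gamma` and shift `beta`, the `eps` constant, and the variable names are illustrative assumptions. Real implementations keep moving averages of the batch statistics, while here the statistics of one batch stand in for them.

```python
import math

def batch_statistics(values):
    """Mean and (biased) variance of one feature over a batch."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def normalize(values, mean, variance, gamma=1.0, beta=0.0, eps=1e-5):
    """Shift and rescale with the given statistics (an affine map per value)."""
    scale = gamma / math.sqrt(variance + eps)
    return [scale * (v - mean) + beta for v in values]

# Training: statistics are estimated on the current batch.
batch = [1.0, 2.0, 3.0, 4.0]
batch_mean, batch_var = batch_statistics(batch)
train_out = normalize(batch, batch_mean, batch_var)

# Test: reuse statistics estimated during training (stand-in for moving averages).
running_mean, running_var = batch_mean, batch_var
test_out = normalize([2.5], running_mean, running_var)
print(train_out, test_out)
```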

What is the difference between a more advanced optimizer like Adam or AdaGrad and Stochastic Gradient Descent (SGD / mini-batch learning)?

They usually converge faster, but the solution found is usually not as good as with SGD

Why are ensemble methods superior to individual models?

They average out biases, reduce variance, and are less likely to overfit. There's a common line in machine learning which is: "ensemble and get 2%."

what allows convnets to efficiently learn increasingly complex and abstract visual concepts?

They can learn spatial hierarchies of patterns: a first convolution layer will learn small local patterns such as edges, but a second convolution layer will learn larger patterns made of the features of the first layers. And so on.

How do Support Vector Machines work?

They maximize the distance between samples and the decision boundary. The support vectors are the samples closest to this boundary.

What do ELMo and BERT do?

They use language models to get contextualized word representations

True/False. Constructing deep generative architectures requires layers to increase the signal dimension.

True, unlike what we have done with the feed-forward networks in the earlier portions of the course. This can be done in the forward pass with transposed convolution layers whose forward operation corresponds to a convolution layer's backward pass.

True/False. Convolutions preserve the signal's structure?

True.

True/False. For dropout, the model behaves differently during train and test.

True, as with batch normalization.

True/False. During inference, batch normalization performs a component-wise affine transformation.

True. During inference, batch normalization shifts and rescales independently each component of the input x according to statistics estimated during training.

What is the problem with co-occurrence counts?

Vectors become very large with real data and need dimensionality reduction

What are the advantages of mini-batch SGD?

Visiting samples in mini-batches and updating parameters every time helps avoid local minima while reducing the computation times

When going deeper in a network, what problems must be avoided concerning the gradient?

We must control the amplitude of the gradient. Ensure that: 1) the gradient does not vanish. 2) gradient amplitude is homogeneous so that all parts of the network train at the same rate. 3) The gradient does not vary too unpredictably.

Why is gating used in RNN?

When unfolding through time, the models get deep and can run into the vanishing gradient problem. To solve this, "pass-through" gates allow the recurrent state to avoid going repeatedly through squashing non-linearities.

How does Google's Multilingual NMT System indicate the target language?

With an artificial token at the beginning of the input sentence

What is the most popular toolkit for training word representations?

Word2Vec

what's keras callback?

an object that is passed to the model in the call to fit and that is called at various points during training

Your model shouldn't have had access to ... information about the test set

any

how to make the contribution of the different losses more balanced in multi-output tasks?

assign different weights to different losses

Compare batch, mini-batch and stochastic gradient descent.

Batch gradient descent computes the gradient on the entire dataset, mini-batch gradient descent on a small sample of datapoints, and SGD updates the parameters using one datapoint at each step. The tradeoff is between how precise the gradient estimate is and what batch size we can keep in memory. Moreover, using mini-batches rather than the entire dataset has a regularizing effect by adding random noise to each update.
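
The three variants differ only in the batch size, which the following sketch makes explicit. It fits a single weight `w` in `y_hat = w * x` on noiseless toy data. The learning rate, epoch count and data are arbitrary choices for illustration.

```python
import random

def gradient(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.01, epochs=200, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # new sample order each epoch
        for i in range(0, len(data), batch_size):
            w -= lr * gradient(w, data[i:i + batch_size])
    return w

data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w_batch = train(data, batch_size=len(data))    # batch: one update per epoch
w_mini = train(data, batch_size=2)             # mini-batch: two updates per epoch
w_sgd = train(data, batch_size=1)              # SGD: one update per datapoint
print(w_batch, w_mini, w_sgd)
```

On this noiseless problem all three recover a weight close to 2; on real data the smaller batch sizes give noisier but cheaper updates.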

the module that maps a point in the low-dimensional latent space to an image is called ... in the case of VAEs

decoder

what's necessary for classifiers to succeed in ensembling?

diversity; they should be biased in different ways

what is the best NN type for processing sound data?

either 1D convnets (preferred) or RNNs

what is the best NN type for processing text data?

either 1D convnets (preferred) or RNNs

what is the best NN type for processing video data?

either 3D convnets (if you need to capture motion effects) or a combination of frame-level convnet for feature extraction followed by either a RNN or a 1D convnet to process the resulting sequences

the process of using your own knowledge about the data and about the machine learning algorithm at hand to make the algorithm work better by applying hard-coded (non-learned) transformations to the data before it goes into the model

feature engineering

in which applications are dense layers most commonly used?

for categorical data and the final classification or regression stage of most networks

How to notice if hold-out validation suffers from the problem of test/validation sets not being statistically representative?

if different random shuffling rounds of the data before splitting end up yielding very different model performance measures

From what flaw does simple hold-out validation suffer?

if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand

Why does dropout work?

introducing noise in the output values of a layer can break up happenstance patterns that are not significant

what's bi-directional recurrent neural network?

it consists of two regular RNNs, each processing input sequence in one direction (chronologically and antichronologically), then merging their representations

what's the main effect of batch normalization?

it helps with gradient propagation - much like residual connections - and thus it allows for deeper networks

What happens if a network has lots of parameters?

it learns a perfect dictionary-like mapping between training samples and their targets

what's generator network in GANs?

it takes as input a random vector (a random point in the latent space) and decodes it into a synthetic image

what's discriminator network in GANs?

it takes as input an image (real or synthetic), and must predict whether the image come from the training set or was generated by the generator network

computationally tractable operation that maps any two points in your initial space to the distance between these points in your target representation space, completely by-passing the explicit computation of the new representation

kernel function

convolution layers learn ... patterns

local

Takes the predictions of the network and the true target (what you wanted the network to output), and computes a distance score, capturing how well the network has done on this specific example.

loss function

how to preserve style during neural style transfer?

maintain similar correlations within activations for both low-level layers and high-level layers

how to preserve content during neural style transfer?

maintain similar high-level layer activations between the target content image and the generated image

What does usually perform better: maxpooling, average pooling or strided convolutions?

maxpooling

Do we need to minimize or maximize loss?

minimize

How to fit a model using batch generator?

model.fit_generator(...) (deprecated in recent Keras versions, where model.fit(...) accepts generators directly)

are cycles allowed in NN topologies?

no, they should be Directed Acyclic Graphs

should intermediate layers return only their output at the last timestep when stacking recurrent layers?

no, they should return their full sequence of outputs

is the optimization minimum in GANs fixed?

no, unlike in other NN topologies

What's usually the best choice of last-layer activation and loss function when dealing with regression to arbitrary values?

none, mse

what do we want to minimize during neural style transfer?

norm(style(reference_image) - style(generated_image)) + norm(content(original_image) - content(generated_image))

Before we fed this data into our network, we had to ... each feature independently so that each feature would have a standard deviation of 1 and a mean of 0

normalize

one-hot hashing trick

one may hash words into vectors of fixed size (typically done with built-in hash)
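
A sketch of the hashing trick at word level. The answer mentions the built-in hash; this sketch swaps in `hashlib.md5` instead, because Python's built-in `hash` for strings is randomized between interpreter runs. The dimensionality of 1000 and the function name are arbitrary.

```python
import hashlib

def hashed_one_hot(word, dimensionality=1000):
    """One-hot encode a word at an index derived from a hash of the word."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    index = int(digest, 16) % dimensionality   # no explicit vocabulary needed
    vector = [0] * dimensionality
    vector[index] = 1
    return vector

vec = hashed_one_hot("cat")
print(vec.index(1))
```

No word index has to be stored, but two different words can collide on the same position when the dimensionality is small relative to the vocabulary.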

how are word embedding vectors different from one-hot encoded words?

one-hot encoded vectors are very sparse while word embedding vectors are low-dimensional floating point vectors

the process of adjusting a model to get the best performance possible on the training data

optimization

involves building a large number of specialized decision trees then ensembling their outputs

random forest

When do you need to use Iterated K-fold validation with shuffling?

situations in which you have relatively little data available and you need to evaluate your model as precisely as possible

All inputs and targets in a neural network must be ... of ... point data (this step is called data ...)

tensors, floating, vectorization

What's L1 regularization?

the cost added is proportional to the absolute value of the weight coefficients

What's L2 regularization?

the cost added is proportional to the square of the value of the weight coefficients
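
Both penalties can be written in a couple of lines. This is a sketch in plain Python, where the regularization `factor` and the example weights are arbitrary; the penalty is what gets added to the loss.

```python
def l1_penalty(weights, factor):
    """Cost proportional to the absolute values of the weights."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """Cost proportional to the squared values of the weights."""
    return factor * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0]
l1 = l1_penalty(weights, 0.01)   # 0.01 * (0.5 + 2.0 + 0.0)
l2 = l2_penalty(weights, 0.01)   # 0.01 * (0.25 + 4.0 + 0.0)
print(l1, l2)
```

The squared term makes L2 punish the large weight (-2.0) much more strongly than the small one, while L1 charges every weight in proportion to its magnitude.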

what's overfitting?

the model starting to model patterns that are specific to the training data but that are misleading or irrelevant when it comes to new data

What happens if you are expecting missing values in the test data but the network was trained on data without any missing values?

the network will not have learned to ignore missing values

multi-modal inputs

they merge data coming from different input sources, processing each type of data using different kinds of neural layers

why are NNs still extremely easy to fool?

they only perform local generalization not extreme generalization

what's the purpose of 1x1 convolution (pointwise convolution)?

it computes features that mix together information from the channels of the input tensor, but does not mix information across space at all

the second best solution to prevent a model from learning about misleading or irrelevant patterns found in the training data

to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it is allowed to store (regularization)

when are 1D convnets extremely useful when dealing with sequences?

to preprocess data before RNN, because RNNs are extremely expensive for processing very long sequences

You train your model on the ... data

training

what makes convnets very data-efficient when processing images (they need less training samples)?

the patterns they learn are translation invariant

When is K-fold validation useful?

when the performance of your model shows significant variance based on your train-test split
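
A minimal sketch of how K-fold validation partitions the data, using sample indices only. The function name is hypothetical, and the last fold absorbs any remainder. In practice one would train a fresh model on each `train` split, evaluate it on the matching `val` split, and average the scores.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        end = start + fold_size if fold < k - 1 else n_samples
        val = list(range(start, end))
        train = list(range(0, start)) + list(range(end, n_samples))
        yield train, val

splits = list(k_fold_splits(10, 3))
for train, val in splits:
    print(len(train), val)
```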

What are some advantages in using a CNN (convolutional neural network) rather than a DNN (dense neural network) in an image classification task?

While both models can capture the relationship between close pixels, CNNs have the following properties: - Translation invariance: the exact location of a pattern is irrelevant for the filter. - Less likely to overfit: the typical number of parameters in a CNN is much smaller than that of a DNN. - Better understanding of the model: we can look at the filters' weights and visualize what the network "learned". - Hierarchical nature: it learns complex patterns by describing them in terms of simpler ones.

What is the concatenated ReLU?

x --> (max(0,x),max(0,−x)), it doubles the amount of activations, but maintains the norm of the signal intact during both the forward and backward passes.
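
A quick numerical check of the forward-pass claim, in plain Python. The input vector is arbitrary; the point is that the output has twice as many components but the same Euclidean norm.

```python
import math

def crelu(xs):
    """Concatenated ReLU: positive parts followed by negative parts."""
    return [max(0.0, x) for x in xs] + [max(0.0, -x) for x in xs]

def norm(xs):
    return math.sqrt(sum(x * x for x in xs))

x = [3.0, -4.0]
y = crelu(x)
print(y, norm(x), norm(y))
```

Each input component lands in exactly one of the two halves, which is why no magnitude is lost.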

How to avoid temporal leak?

you should always make sure that all data in your test set is posterior to the data in the training set

What can you do if you have no missing values in your training set but expect some in your test data?

you should artificially generate training samples with missing entries: simply copy some training samples several times and drop some of the features that you expect are susceptible to go missing in the test data

autoencoders

a type of network that aims to encode an input to a low-dimensional latent space then decode it back

Set apart some fraction of your data as your test set. Train on remaining data, evaluate on the test set (name evaluation technique)

Simple hold-out validation

What is Spectral Normalization and why is it implemented in GANs?

Spectral Normalization is a weight normalization that estimates the largest singular value of a weight matrix and rescales the matrix accordingly. It was proposed to control the Lipschitz constant of D by rescaling its weights so that all the linear layers have singular values of at most 1, and consequently a Lipschitz constant of at most 1. If the non-linear layers are also Lipschitz with a constant of at most 1 (e.g. ReLU), this is a sufficient condition for D as a whole.

How can a sequence-2-sequence model have differently sized inputs and outputs?

Such a model can use an encoder-decoder. For example, take a translator that translates French to English: if the sentence in French has 10 words, that does not mean that the sentence in English will also have 10 words.

Why do we use pooling?

- to be able to learn hierarchical structure of images - to reduce number of parameters

Is the backward pass more computationally expensive than the forward pass?

Yes, by a factor of about 2.

what is the best NN type for processing image data?

2D convnets

What is a Multi-Layer-Perceptron (MLP)?

A neural network with (multiple) hidden layers and outputs

What is the general idea behind a convolution layer?

A representation meaningful at a certain location can / should be used everywhere. It applies the same linear transformation locally, everywhere, and preserves the signal structure.

Explain the idea behind the Adam optimizer.

Adam, or adaptive moment estimation, combines two ideas to improve convergence: per-parameter learning rates, which give faster convergence, and momentum, which helps to avoid getting stuck in saddle points.
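
A sketch of the Adam update for a single scalar parameter, following the commonly published update rule with bias correction. The hyperparameter defaults are the usual ones, while the toy objective f(w) = (w - 3)^2 and the learning rate of 0.05 are arbitrary choices for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are moving averages of grad and grad**2."""
    m = beta1 * m + (1 - beta1) * grad            # momentum term
    v = beta2 * v + (1 - beta2) * grad * grad     # per-parameter scale
    m_hat = m / (1 - beta1 ** t)                  # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.05)
print(w)
```

Dividing by the square root of `v_hat` is what makes the step size per-parameter: the very first step has magnitude close to `lr` regardless of how large the gradient is.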

vanishing gradient problem

as one keeps adding layers to a network, the network eventually becomes untrainable

What is replacing static word embeddings?

Contextualized embeddings

to perform well on never-seen before data

generalize

the module that maps a point in the low-dimensional latent space to an image is called ... in the case of GANs

generator

What's data augmentation?

Data augmentation takes the approach of generating more training data from existing training samples, by "augmenting" the samples via a number of random transformations that yield believable-looking images.

How does an auto-encoder work?

First, the auto-encoder "encodes" the input onto a space of smaller dimension with convolutional layers (as we have done early in this course). Then, a decoder uses transposed convolutions to go back to the original signal space. However, this new image is approximated with the remaining lower dimensional information from the encoder.

What is the main Idea of Gated Recurrent Units (GRU)?

Keep around memories to capture long distance dependencies

How does Looks linear initialization work?

LLI makes the network linear initially by initializing weights in a mirrored block structure. Often, concatenated rectifiers (CReLU) are used. This method doubles the number of units, and hence of parameters: W = (W1, W2) because of the CReLU. Works best with tanh CNNs. https://stats.stackexchange.com/questions/339054/what-values-should-initial-weights-for-a-relu-network-be

common metric for ranking problems or multi-label classification

Mean Average Precision

What can word embeddings represent?

Semantic and syntactic relations between words in the vector space

When are penalties most useful?

Small models and scarce data.

What your model should ideally have predicted, according to an external source of data

Target

What are the problems with more advanced optimizers like AdaGrad or Adam in comparison to SGD (Mini-Batch)?

The solution found is usually not as good

What are the advantages with more advanced optimizers like AdaGrad or Adam in comparison to SGD (Mini-Batch)?

Usually faster than SGD

What is the advantage of text flow as opposed to images?

Usually only one dimension

Are activation functions hyperparameters?

Yes

How can the capacity of a linear predictor be increased?

You can increase the dimension D of the data in order to make it linearly separable. A simple example of this is the xor example in slides 3.3. Known as feature design.

a GAN is made of two parts

a generator network and a discriminator network

batch renormalization

a recent improvement over regular batch normalization

ensembling

consists in pooling together the predictions of a set of different models, in order to produce better predictions

Configuration of model

hyperparameters

when should we prefer RNNs over 1D convnets?

if data ordering is strongly meaningful

representational bottlenecks

if one layer happens to be too small, then the model will be constrained by how much information can be crammed into the activations of this layer (partially solved by residual connections)

in what cases should RNNs be used preferentially over 1D convnets?

in the case of sequences where patterns of interest are not invariant by temporal translation (for instance, timeseries data where the recent past is more important than the distant past)

How to save a model after training?

model.save(...)

is ensembling the same network, trained several times independently from different random initializations, useful?

no

are RNNs faster than 1D convnets when dealing with sequences?

no, convnets are faster

should the discriminator be trainable when training the generator through the combined GAN model?

no, it should be frozen

does a generator see images from the training set directly?

no, the information it has about the data comes from the discriminator

what is the drawback of GANs?

the latent space that the generated images come from may not have as much structure and continuity as that of a VAE

what's optimized in multi-output tasks?

the resulting loss values get summed into a global loss, which is what gets minimized during training

in word embeddings we expect the geometric distance between any two word vectors to relate to

the semantic distance of the associated words

what do VAEs excel at?

they are great for learning latent spaces that are well-structured, where specific directions encode a meaningful axis of variation in the data

what do GANs excel at?

they generate images that can potentially be highly realistic

how are bi-directional recurrent networks better than regular ones?

they have higher performance

What is a good %-split of the data into Training, Validation and Testing splits?

- 60% Training - 20% Validation - 20% Testing

How to develop a model that overfits?

- Add layers - Make your layers bigger - Train for more epochs

What are the properties of RNNs?

- Are deep MLPs - Weight sharing - Sparse connectivity - Skip connections

In what do we split our training set data?

- Development Set - (Proper) Training Set

What are bi-lingual mappings one can use for BiSkip?

- Dictionaries - Inter-lingual links in Wikipedia - Word alignments learned from parallel corpora

What are auto-encoders traditionally used for?

- Dimensionality reduction - Representation learning

Why is bilinguality better for embeddings?

- Second language may act as an additional "signal" => Make Monolingual Embeddings better - If words are projected in a common space ("shared features") => this may allow for Direct Transfer (zero-shot transfer)

How can a text be split into fragments (grams) for n-gram language models?

- Sentences - Words - Characters - etc.

What are three examples of CNN sentence classification tasks?

- Sentiment classification of e.g. movie reviews - Question classification into category e.g. Person - Classifying whether a sentence is ironic

What is the basic idea of the RNN Encoder-Decoder architecture?

- Two RNNs - One to encode the input sentence - One to decode the sentence embedding (context vector)

how a variational autoencoder works?

- an encoder module turns the input samples into two parameters in a latent space of representations, z_mean and z_log_variance - we randomly sample a point z from the latent normal distribution that is assumed to generate the input image (z = z_mean + exp(0.5 * z_log_variance) * epsilon, where epsilon is random noise drawn from a standard normal distribution) - a decoder module maps this point in the latent space back to the original input image
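
The sampling step can be sketched in plain Python. This assumes `z_log_variance` is the logarithm of the variance, so the standard deviation is `exp(0.5 * z_log_variance)`; some older example code omits the 0.5 factor. The function name and the use of `random.Random` are illustrative. The encoder and decoder networks are not shown.

```python
import math
import random

def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mean + std * epsilon, with epsilon ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(z_mean, z_log_var)]

rng = random.Random(0)
z_mean = [1.0]
z_log_var = [math.log(4.0)]          # variance 4, standard deviation 2
samples = [sample_latent(z_mean, z_log_var, rng)[0] for _ in range(20000)]

sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(sample_mean, sample_std)
```

Writing the sample as a deterministic function of the encoder outputs plus external noise is what lets gradients flow back into `z_mean` and `z_log_var` during training.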

What is the Masked-LM pre-training objective of BERT?

A selection of tokens are randomly masked in the input sequence. The model needs to infer the masked word from the context.

What is Xavier's initialization?

A weight initialization that normalizes the variance of activations and variance of gradient wrt activations. It is a compromise between controlling the variance of the activations and that of the gradients.

What is the 'natural' loss for softmax?

Cross-Entropy Loss

What NN should be used for Seq2Seq tasks?

Encoder-Decoder (RNN)

What is the Next Sentence Prediction pre-training objective of BERT?

The model is fed a sequence of two sentences, where the second sentences either is the follow-up sentence to the first or a different random sentence. The model learns a binary classifier deciding between the two cases.

What is the main use of an attention mechanism?

Attention mechanisms were first used to provide deep learning networks with memory-like modules; they are now mainly used to capture long-term dependencies in sequence-to-sequence tasks such as translation.

What are the different layers in a CNN?

There are four types of layers in a CNN: Convolutional Layer - performs the convolution operation, sliding several small filter windows over the data. ReLU Layer - brings non-linearity to the network and sets all negative values to zero. The output is a rectified feature map. Pooling Layer - a down-sampling operation that reduces the dimensionality of the feature map. Fully Connected Layer - recognizes and classifies the objects in the image.

What do Sequential Denoising Autoencoders (SDAE) do?

They are autoencoders that are trained to reconstruct the original input from a slightly corrupted (noised) version of it

What do L1 and L2 penalty terms do?

They are both convex penalties and help avoid over-fitting of complex models.

What is the Homoscedasticity assumption?

Two populations are Gaussian and have the same covariance matrix.

What's stride?

Using stride n means that the width and height of the feature map get downsampled by a factor n (besides any changes induced by border effects)

task where the target is a set of continuous values, e.g. a continuous vector

Vector regression

Can Autoregressive models be generative?

Yes.

neural style transfer

consists in applying the "style" of a reference image to a target image, while conserving the "content" of the target image

what's wide and deep category of models?

consists in jointly training a deep neural network with a large linear model

what is the best NN type for processing timeseries data?

either RNNs (preferred) or 1D convnets

... consists in using the representations learned by a previous network to extract interesting features from new samples

feature extraction

what's Xception?

it takes the idea of separating the learning of channel-wise and space-wise features to its logical extreme, and replaces Inception modules with depthwise separable convolutions, consisting of a depthwise convolution followed by a pointwise convolution - effectively an extreme form of an Inception module where spatial features and channel-wise features are fully separated

where does embedding layer reside in keras?

keras.layers.Embedding

how to merge branches via keras functional API?

keras.layers.add, keras.layers.concatenate, etc.

which keras class is used to tokenize sentences?

keras.preprocessing.text.Tokenizer

depthwise separable convolution

performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution

What are nonstationary problems?

problems in which data change over time

what are the best choices for hyperparameter optimization?

random search and Hyperopt/Hyperas libraries

What's usually the safest choice of optimizer and learning rate?

rmsprop and its default learning rate

What do we sometimes call axis 0?

sample or batch axis

At what kind of problem does gradient boosting excel?

shallow learning problems

What's usually the best choice of last-layer activation and loss function when dealing with binary classification?

sigmoid, binary_crossentropy

What's usually the best choice of last-layer activation and loss function when dealing with multi-class, multi-label classification?

sigmoid, binary_crossentropy

What's usually the best choice of last-layer activation and loss function when dealing with regression to values between 0 and 1?

sigmoid, mse or binary_crossentropy

What's temporal leak?

situation when you train your model on data from the future

What's usually the best choice of last-layer activation and loss function when dealing with multi-class, single-label classification?

softmax, categorical_crossentropy

What is the correct lemmatization of the word: gespielt?

spielen

What is the formula for recall?

tp / (tp + fn)

What is the formula for precision?

tp / (tp + fp)
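
A short sketch applying the recall and precision formulas to binary labels. The helper name and the example labels are assumptions made for illustration; 1 marks the positive class.

```python
def precision_recall(y_true, y_pred):
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Here tp = 3, fp = 1 and fn = 1, so both precision and recall come out to 0.75.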

How to make sample from images in directory?

train_datagen = ImageDataGenerator(rescale=1./255)
train_datagen.flow_from_directory(...)

one-hot encoding

a vector of size N in which all coordinates are 0 except for the i-th entry, which is 1
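
A minimal sketch of this definition, assuming zero-based indices; the function name is hypothetical.

```python
def one_hot(i, n):
    # Vector of size n with a single 1 at position i (zero-based).
    if not 0 <= i < n:
        raise ValueError("index out of range")
    vec = [0] * n
    vec[i] = 1
    return vec

encoded = one_hot(2, 5)
```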

what's layer weights sharing?

when you call a layer instance twice, instead of instantiating a new layer for each call, you reuse the same weights with every call

Brief summary of techniques that allow deep networks.

• rectifiers to prevent the gradient from vanishing during the backward pass • dropout to force a distributed representation • batch normalization to dynamically maintain the statistics of activations • identity pass-through to keep a structured gradient and distribute representation • smart initialization to put the gradient in a good regime.

Why do the gradients need to be set to zero before each call to .backward()?

Because .backward() accumulates gradients into the existing .grad values instead of overwriting them. Hence, they must be reset to zero before each new backward pass.

What is the goal of GloVe?

It aims at reconciling the advantages of global co-occurrence counts and local context windows

What is the goal of the stochastic depth network?

It aims to reduce the depth of a network during training, but not during testing. It achieves this by bypassing entire ResBlocks during training. Essentially, you train a "shallower" network than the one you use. Helps with vanishing gradient problem and computation time.

What does Pytorch's autograd mechanism do?

It automatically constructs the DAG of operations to compute the gradient of any quantity with respect to any involved tensor. Thus, the user is only concerned with the forward pass and the dynamic nature of the graph allows the forward pass to be modulated.

How does Siamese CBOW work?

It directly averages word embeddings for sentences, so that it learns that words with little semantic impact have a low vector norm

What is achieved by data normalization?

It helps keep the variance of the activations constant through the layers.

What is the Jensen-Shannon divergence?

It is a measure of the similarity between two distributions. However, it does not account much for the metric structure of the space.
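
A small sketch for discrete distributions, assuming the usual definition as the average KL divergence of each distribution to their mixture. The helper names and example distributions are illustrative.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Average KL divergence of p and q to their mixture m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two pairs of disjoint distributions: adjacent bins vs. distant bins.
near = js([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
far = js([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Both pairs give the same value (log 2), however far apart the occupied bins are, which illustrates why the divergence says little about the metric structure of the space.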

What is gradient norm clipping?

It is a threshold set on the norm of the gradient to prevent it from growing excessively large. If the norm exceeds the threshold, the gradient is rescaled so that its norm equals the threshold.
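
A minimal sketch of clipping by norm, assuming the common variant in which the whole gradient vector is rescaled when its Euclidean norm exceeds the threshold. The function name and values are hypothetical.

```python
import math

def clip_by_norm(grad, max_norm):
    # Euclidean norm of the full gradient vector.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    # Rescale so the direction is kept and the norm equals max_norm.
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)
```

Rescaling keeps the direction of the update, unlike clipping each component independently.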

Is the empirical risk a biased or an unbiased estimator of the risk?

It is an unbiased estimator of the risk.

Does over-parametrizing a deep model improve or reduce test performance?

It often improves the test performance, contrary to what the bias-variance decomposition predicts.

What is the advantage of the sigmoid function?

It's a differentiable approximation ==> optimizable

specific instance of a class annotation in a classification problem

Label

Is it necessary to shuffle the training data when using batch gradient descent?

No, because the gradient is calculated at each epoch using the entire training data, so shuffling does not make a difference.

What is the Boltzmann Machine?

A Boltzmann Machine is one of the most basic deep learning models, resembling a simplified version of the multi-layer perceptron. The model features a visible input layer and a hidden layer: just a two-layer neural net that makes stochastic decisions as to whether a neuron should be on or off. In the restricted variant (RBM), nodes are connected across layers, but no two nodes of the same layer are connected.

What is mini-batch learning with k=1?

Online learning

Which type of learning approximates the gradient by computing it on one datapoint?

Online learning

What's convolutional base?

Part of convolutional network that can be reused in the next learning. It consists of convolutional and pooling layers and doesn't comprise dense layers.

a measure of the distance between your model's prediction and the target

Prediction error, or loss value

what goes out of your model

Prediction, or output

What is an approach that determines the meaning of a sentence using e.g. phrase structure (or similarly dependency structure)?

RNN (Recursive Neural Nets)

What do all the embedding approaches have in common?

Represent natural language input with real-valued vectors

one data point that goes into your model

Sample, or input

task where the target is a continuous scalar value

Scalar regression

What are the components of a convolutional layer?

1. Convolution 2. Non-linear activation 3. Pooling

What is the extrinsic evaluation scheme of sentence embeddings?

1. Take your sentence embedding model 2. Embed your sentences in an extrinsic task 3. Train classifier on embedded sentences 4. Repeat with different sentence embedding model and compare performances

What are the steps of layerwise training of a deep auto-encoder?

1. Train a sparse auto-encoder on the input 2. Use the hidden layer (the features) of step 1 as input for a second auto-encoder 3. calculate a softmax classifier (only) on the last layer

what are feature maps?

3D tensors that convolutions operate over

RNN vs FFNN

In a feedforward neural network, signals travel in one direction, from input to output. There are no feedback loops; the network considers only the current input and cannot memorize previous inputs (e.g., a CNN). In a recurrent neural network, signals also travel through feedback loops, creating a looped network. It considers the current input together with previously received inputs when generating the output of a layer, and it can memorize past data thanks to its internal memory.

topology and parameters for multi-label categorical classification

- ends with Dense layer - number of units at the end is equal to the number of classes - sigmoid activation - loss: binary_crossentropy - targets are k-hot encoded

topology and parameters for regression towards a vector of continuous values

- ends with Dense layer - number of units at the end is the number of values you want to predict (usually one) - no activation - loss: mean_squared_error, mean_absolute_error and so on

what are two ways to use word embeddings?

- learn word embeddings jointly with the main task you care about - load pre-trained word embeddings into your model

By which three attributes is a tensor defined?

- the number of axes it has - its shape - its data type

callbacks can be used for:

- model checkpointing - early stopping - adjusting the value of certain parameters - logging the training and validation metrics during training, or visualizing representations

Benefits of the ReLU?

1) Derivative non vanishing (above 0) 2) Result in sparse coding 3) Steeper slope in loss surface to speed up training.

What is the feature representation in a CNN?

The convolved representation

What method generates batches of randomly transformed images (it will loop infinitely)?

.flow()

In general with neural networks it is safe to input missing values as ...

0

What are hard alignments in attention based encoder-decoder models?

"Degenerate prob. distribution" with 0/1 values

What does NP => DET A* N PP* mean?

"NP" (noun phrase) expands to determiner, zero or more adjectives, noun, and zero or more prepositional phrases

What are the advantages and disadvantages of a character-based machine translation encoder-decoder?

+ Can predict OOV and rare Words + Can better deal with morphological variants - state-space may explode - long range dependencies

What are some techniques you should try while regularizing your model and tuning hyperparameters?

- Add dropout - Try different architectures, add or remove layers - Add L1 / L2 regularization - Try different hyperparameters (such as the number of units per layer, the learning rate of the optimizer) to find the optimal configuration - Optionally iterate on feature engineering: add new features, remove features that do not seem to be informative

Steps for fine-tuning:

- Add your custom network on top of an already trained base network - Freeze the base network - Train the part you added - Unfreeze some layers in the base network - Jointly train both these layers and the part you added

How can words be represented?

- As a dictionary entry - By their relation to other words (Taxonomy) - As a one-hot vector (Word vectors)

What does macro-averaging do and when should it be used?

- Averages on the class level (classes with fewer instances still have influence) - When performance on small classes is of importance

What does micro-averaging do and when should it be used?

- Averages on the level of test instances (classes with more instances have more influence) - When performance on largest classes is of importance
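
The contrast between macro- and micro-averaging can be sketched with precision on an imbalanced two-class example. The per-class `(tp, fp)` layout and the counts are assumptions chosen for illustration, and every class is assumed to have at least one prediction.

```python
def macro_micro_precision(counts):
    # counts maps each class to a (tp, fp) pair (hypothetical layout).
    per_class = [tp / (tp + fp) for tp, fp in counts.values()]
    # Macro: average the per-class scores, so every class weighs the same.
    macro = sum(per_class) / len(per_class)
    # Micro: pool the counts first, so every instance weighs the same.
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

counts = {"large": (90, 10), "small": (1, 9)}
macro, micro = macro_micro_precision(counts)
```

The poorly handled small class pulls the macro average down to 0.5, while the micro average stays near the large class's score.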

What simple but important algorithmic improvements were made in early 2010s?

- Better "activation functions", such as "rectified linear units". - Better "weight initialization" schemes. It started with layer-wise pre-training, which was quickly abandoned. - Better optimization schemes, such as RMSprop and Adam.

How can we model the distributional hypothesis?

- By calculating co-occurrence counts - Context is modeled using a window over the words
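
A minimal sketch of window-based counting, assuming a symmetric window and a toy sentence; the function name is hypothetical.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    # Count how often each ordered pair (word, neighbor) occurs
    # within `window` positions of each other.
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "i enjoy flying and i like flying".split()
counts = cooccurrence_counts(tokens, window=1)
```

In this toy corpus "enjoy" and "like" each co-occur with "i" and "flying", so their count vectors look alike.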

What are the two auxiliary tasks for language models (word2vec)?

- CBOW (Continuous Bag of Words) - Skip-gram

What are two ways to specify the grammar of sentences?

- Constituent tree - Dependency tree

the universal workflow of machine learning

- Define the problem and assemble a dataset - Pick a measure of success - Decide on an evaluation protocol - Prepare your data - Develop a model that does better than a baseline - Scale up: develop a model that overfits - Regularize your model and tune your hyperparameters

The typical Keras workflow:

- Define your training data: input tensors and target tensors - Define a network of layers (a "model") that will map your inputs to your targets - Configure the learning process by picking a loss function, an optimizer, and some metrics to monitor - Iterate on your training data

What are the options of word vectors for given tasks?

- Fixed word representations - Adjust the word representations to the task

What are the methods to evaluate word representations?

- Intrinsic Evaluation - Extrinsic Evaluation

What are the disadvantages of ReLU function?

- Kills gradients in the negative region - Not zero-centered

What helps against exploding gradients?

- L1 or L2 regularization on recurrent weights - Gradient clipping

For what tasks can encoder-decoder models be used?

- Machine Translation - Lemmatization - Spelling correction - POS tagging

What are the problems with evaluating sentence embeddings?

- Many different sizes - Different models trained on different datasets - Which classifier to use on top of embeddings in extrinsic tasks?

What are RNNs used for?

- Sequence Tagging - Classification

What are the problems with n-gram language models?

- They are inherently limited in the past window that they can take into consideration - Not commonly used anymore

Why do you still need feature engineering (even having deep learning algorithms)?

- to solve problems more elegantly while using less resources - to solve a problem with much less data

Describe different ways to attempt to understand what a network is doing?

1) Look at the parameters. 2) Look at the activations. 3) Look at how the network behaves "around" the image, with occlusion sensitivity, saliency maps, Grad-CAM.

Name two shortcomings of basic clustering and embedding methods such as K-means and PCA that motivate DNNs.

1) Objects in the background of images are not taken into account. 2) Translations and deformations badly degrade the results.

What are two standard performance measures for image classification?

1) The error rate, or conversely accuracy. 2) The balanced error rate (BER)

What are the steps of Backward Propagation?

1. Forward Propagation 2. Backward Propagation

What is a deep network?

A network with multiple hidden layers

all targets for a dataset, typically collected by humans

Ground-truth, or annotations

What is the assumption of co-occurrence counts?

If we collect counts over thousands of sentences, the words "enjoy" and "like" will have similar vector representations

Describe an LSTM network.

An LSTM is a type of gated RNN network. They were made to mitigate the exploding and vanishing gradient problem. Each hidden layer has an LSTM cell that remembers not only the hidden state vector but also the cell state. Each LSTM cell has 3 gates: the input gate, the forget gate and the output gate.
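
One time step of such a cell can be sketched for a single unit with scalar input and state. The parameter names (`wf`, `uf`, `bf`, ...) are hypothetical, and the gate equations are the standard LSTM ones rather than anything specified on this card.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # One time step of a single-unit LSTM (scalar input and state).
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # new cell state
    h = o * math.tanh(c)     # new hidden state
    return h, c

# Hypothetical parameters: forget gate saturated open, input gate closed.
names = ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")
params = {name: 0.0 for name in names}
params["bf"] = 50.0
params["bi"] = -50.0
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
```

With the forget gate open and the input gate closed, the cell state is carried over unchanged, which is how the cell keeps information across many steps.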

What do attention mechanisms do?

Attention mechanisms aggregate features with an importance score (i.e., they focus on specific parts of the input) that: 1) depends on the features themselves, not their position in the tensor; 2) relaxes locality constraints. This is done by dynamically modulating the weighting of different parts of a signal, which allows the representation and the allocation of information channels to depend on the activations themselves (regions of the input are compared with the input context and the weights are calculated accordingly). Their main use is to provide long-term dependencies for sequence-to-sequence translation.

What is backpropagation?

Backpropagation is a recursive algorithm for determining error derivatives

Why are stacked auto-encoders losing importance in NLP?

Because researchers have gained better intuition for initializing the parameters and bigger datasets are becoming available

Why do we split the input data?

Because we're interested in generalization performance

How are intrinsic word representations evaluated?

By using the representations directly

What is an example task for a character-based CNN approach?

Classifying articles into topics

What is CBOW?

Continuous Bag of Words is a common word embedding where the embedding vectors are chosen such that a word can be linearly predicted based on the sum of the embeddings of the surrounding words.

what is the best NN type for processing vector data?

Dense layers

What is dropout?

Dropout, applied to a layer, consists in randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training
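
A minimal sketch of dropout on a list of activations. Rescaling the kept values by 1 / (1 - rate), often called inverted dropout, is an assumption added here so that nothing needs to change at test time; the function name is hypothetical.

```python
import random

def dropout(values, rate, training, rng=random):
    # At test time (or with rate 0) the layer output passes through unchanged.
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Zero each feature with probability `rate`; rescale the survivors so
    # the expected value of each output stays the same (assumption).
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
train_out = dropout([1.0] * 1000, rate=0.5, training=True, rng=rng)
test_out = dropout([1.0, 2.0], rate=0.5, training=False)
```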

Auto-encoders as self training

FILL IN FROM SLIDES

How does forward propagation and backpropagation work in deep learning?

Now, this can be answered in two ways. If you are on a phone interview, you cannot perform all the calculus in writing and show the interviewer. In such cases, it is best to explain it as follows. Forward propagation: the inputs, together with the weights, are passed to the hidden layer. At each hidden layer, we calculate the output of the activation at each node, and this propagates further to the next layer until the final output layer is reached. Since we start from the inputs and move to the final output layer, we move forward, and it is called forward propagation. Backpropagation: we minimize the cost function by understanding how it changes when the weights and biases of the neural network change. This change is obtained by calculating the gradient at each hidden layer (using the chain rule). Since we start from the final cost function and go back through each hidden layer, we move backward, and thus it is called backward propagation.
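
The two passes can be sketched on a tiny network with one sigmoid hidden unit, a linear output and a squared-error loss. The architecture, parameter values and function names are assumptions chosen to keep the chain rule visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x, y):
    # Forward pass: input -> hidden activation -> output -> loss.
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    return loss, (x, h, y_hat)

def backward(params, cache, y):
    # Backward pass: apply the chain rule from the loss back to each parameter.
    w1, b1, w2, b2 = params
    x, h, y_hat = cache
    d_y_hat = 2.0 * (y_hat - y)      # dLoss/dy_hat
    d_w2 = d_y_hat * h
    d_b2 = d_y_hat
    d_h = d_y_hat * w2               # gradient flowing into the hidden unit
    d_z = d_h * h * (1.0 - h)        # through the sigmoid
    d_w1 = d_z * x
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]

# Hypothetical parameters and a single training example.
params = [0.5, -0.1, 1.5, 0.2]
loss, cache = forward(params, x=1.0, y=1.0)
grads = backward(params, cache, y=1.0)
```

A quick sanity check for code like this is to compare each analytic gradient with a finite-difference estimate of the loss.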

Describe what padding, stride and dilation do?

Padding: adds zeros around the edges of the input. Stride: determines the step size with which the filter moves. Dilation: dilates the filter by inserting rows and columns of zeros between its values.
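
The effect of the three settings on the output size along one spatial dimension can be sketched with the usual convolution arithmetic. This assumes zero padding applied symmetrically around the input; the function name is hypothetical.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    # Number of positions the filter can take along one dimension.
    return (n + 2 * padding - effective_kernel) // stride + 1

plain = conv_output_size(28, kernel=3)
padded = conv_output_size(28, kernel=3, padding=1)
strided = conv_output_size(28, kernel=3, stride=2)
dilated = conv_output_size(28, kernel=3, dilation=2)
```

For a 28-wide input and a 3-wide filter this gives 26 with no options, 28 with padding 1, 13 with stride 2 and 24 with dilation 2.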

The weights and activation function of a network can be considered as what?

Parameters (we learn this)

What is the problem of the "One Hot" Word-Vector representation?

Relations between words are not represented

What is the difference between stochastic gradient descent and mini-batch stochastic gradient descent?

SGD updates the parameters after each sample. Mini-batch SGD updates the parameters after visiting the samples in small batches, e.g. it updates the weights after each batch of 100 samples. This tends to speed up the training process.
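
In code the two differ only in the batch size. The sketch below fits a one-parameter model y = w * x to toy data generated with w = 2; the data, learning rate and function name are assumptions made for illustration.

```python
import random

def train(xs, ys, batch_size, epochs=50, lr=0.5, seed=0):
    # Fit y = w * x by minimizing the mean squared error over each batch.
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-averaged squared error with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # one parameter update per batch
    return w

xs = [i / 10 for i in range(1, 11)]
ys = [2.0 * x for x in xs]
w_sgd = train(xs, ys, batch_size=1)   # one update per sample
w_mini = train(xs, ys, batch_size=5)  # one update per batch of 5 samples
```

Both runs recover w close to 2; the mini-batch run makes five times fewer updates per epoch, each computed from several samples.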

How to make data representative when picking evaluation protocol?

Shuffle data randomly

What does "You shall know a word by the company it keeps" mean?

That word meanings can be inferred from context

What is the difference between the Adam optimizer and the Momentum optimizer?

The Adam optimizer rescales the lr for each coordinate separately using information from the previous steps while the momentum does not do this for each coordinate separately.

How should syntactic relations between words be represented?

In vectors

What are the pros and cons of Layer normalization versus batch normalization?

It behaves similarly in training and test and processes samples individually.

What is a Recurrent Neural Network (RNN)?

It is an MLP with an additional feedback loop (which has a time delay)

What is POS tagging?

Label each token in a sentence with its part-of-speech (= word class)

What are exploding gradients?

Large error gradients result in large updates to the NNs weights

How does an attention-based encoder-decoder model work?

Learn an attention vector that indicates which part of the input should be in the focus

What is supervised learning?

Learning a model from labeled data

How does an LSTM network work?

Long Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies, remembering information for long periods as its default behavior. There are three steps in an LSTM network: Step 1: The network decides what to forget and what to remember. Step 2: It selectively updates cell state values. Step 3: The network decides what part of the current state makes it to the output.

Are word embeddings trained on "supervised" or "un-supervised" data?

Un-supervised. This is a big benefit as it allows to train on potentially huge corpora of data.

When are two sets linearly separable?

When a weight vector of a perceptron can classify each instance in the sets correctly

What is the goal of word embedding?

Word embedding represents words of similar classes/meanings close to each other. The idea is that each word is embedded into a vector of tens to hundreds of features, which gives much denser vectors than, say, one-hot encoding each word, whose vectors would be orders of magnitude larger in dimension. Essentially, word embedding is necessary to make the vectors function well in a deep learning application.

What is the problem with the representation of syntactic relations between words?

Word order matters

What is the cross-lingual objective of bilingual embeddings?

Words that are translations of each other should be close in the projected space

What is the mono-lingual objective of bilingual embeddings?

Words that occur in monolingually similar contexts should be close to each other in vector space

Is it possible to convert a constituent tree into a dependency tree or the other way around?

Yes, with the use of heuristic techniques or machine learning, but they are not equivalent

Is the output of an auto-encoder the same size as the input?

Yes. However, the complexity of the output is reduced as the auto-encoder reduced the dimension of the data of the input. It is compressed.

Why shouldn't you use bottleneck layers (where the number of neurons in a hidden layer is less than the number of classes)?

You can lose some information during training

What's class activation map technique?

This general category of techniques consists in producing heatmaps of "class activation" over input images

True/False. Loss functions (e.g. nn.MSELoss()) do not accept targets with requires_grad = True

True.

common metrics for balanced classification problems

accuracy and ROC-AUC

what's translation invariant (regarding patterns that convnets learn)?

after learning a certain pattern, a convnet is able to recognize it anywhere

How the weights of the neural network are assigned initially?

at random

why are GANs so difficult to train?

because training a GAN is a dynamic process rather than a simple descent process with a fixed loss landscape

which networks suffer from vanishing gradient problem?

both very deep networks and RNNs over very long sequences

batch normalization

capable of adaptively normalizing data even as its mean and variance change over time during training

residual connection

consists simply in reinjecting previous representations into the downstream flow of data, by adding a past output tensor to a later output tensor

what's the key idea of image generation?

develop a low-dimensional "latent space" of representations, where any point can be mapped to a realistic-looking image

Why do we need validation set instead of simply using training and test sets?

developing a model always involves tuning its hyperparameters, and tuning them against the test set would leak information about the test data into the model

how well the trained model would perform on data it has never seen before

generalization

the best solution to prevent a model from learning about misleading or irrelevant patterns found in the training data

get more training data

concept vector

given a latent space of representations, or an embedding space, certain directions in the space may encode interesting axes of variation in the original data (for example, a smile vector)

inception modules

the input is processed by several parallel convolutional branches whose outputs then get merged back into a single tensor

In which keras module do pretrained models reside?

keras.applications

What's overfitting?

performance of the model on never-seen-before data started stalling (or even worsening) compared to its performance on the training data

At train time, ... units are dropped out

randomly selected

Once your model is ready for prime time, you test it one final time on the ... data

test

Three common evaluation protocols:

- hold-out validation set - K-fold cross-validation - iterated K-fold validation
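
K-fold cross-validation can be sketched as an index split. The data is assumed to be already shuffled, the last fold absorbs any remainder, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    # Split sample indices into k (train, validation) pairs.
    fold_size = n_samples // k
    folds = []
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if fold == k - 1 else start + fold_size
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, val))
    return folds

folds = k_fold_indices(10, 3)
```

Each sample lands in exactly one validation fold; the final score is the average of the k validation scores. Iterated K-fold repeats the procedure several times, reshuffling the data before each split.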

set of possible labels to choose from in a classification problem

Classes

what's output feature map?

The convolution operation extracts patches from its input feature map, and applies the same transformation to all of these patches, producing an output feature map.

the fraction of the features that are being zeroed-out during dropout

dropout rate

At test time, ... units are dropped out

no

What are the disadvantages of the Maxout function?

Doubles the number of parameters

Describe dropout.

Dropout is a regularization technique which consists of removing units at random during the forward pass of training and putting them all back during testing. This method increases independence between units, and distributes the representation. It generally improves performance. A network with dropout can be interpreted as an ensemble of 2^N models with heavy weight sharing.

What is the idea of Paragraph Vector Models?

To assign a vector to a paragraph (one sentence or several) so that we can predict words in a text

Why is input data split into training and testing data in Machine Learning?

To check generalization performance

What is the goal of Siamese CBOW?

To embed each word so that the averaged word embeddings of "similar" sentences are close

What is the goal of linear discriminant analysis?

To find a linear combination of features that separate two or more classes of objects or events. This can then be used as a linear classifier or to reduce dimensionality.

What is the goal of Machine Learning?

To generalize well to unseen data

Why do we need a bias unit?

To increase the capacity of the statistical model

What is Deep Learning?

Deep Learning involves taking large volumes of structured or unstructured data and using complex algorithms to train neural networks. It performs complex operations to extract hidden patterns and features (for instance, distinguishing the image of a cat from that of a dog).

What does arc-standard dependency parsing mean?

Dependency parsing with arc-standard operations: Left-arc, right-arc, shift

Why does gradient descent not always lead to a good solution?

Depending on the starting point, it may find a local minimum rather than the global minimum

Dilation, stride and padding in transposed convolutions.

Dilation: Same as for convolution Stride and padding: defined in the output map

What is the Monte Carlo integrator?

The Monte Carlo integrator is constructed such that the intensity of each pixel is the expectation of the random path sampling process, i.e., the sampling noise is zero-mean. It is used in de-noising applications. It generates physically accurate renderings of virtual environments.

In an MLP, should the activation function be linear or non-linear? Why?

The activation function should be non-linear because if it is linear, the entire MLP will be an affine transformation with a peculiar parametrization.

ELMo / ULMFiT vs BERT

The reason you're seeing BERT and its derivatives as benchmarks is probably that it is newer than the other models mentioned and shows state-of-the-art performance on many NLP tasks. Thus, when researchers publish new models they normally want to compare them to the current leading models out there (i.e. BERT). I don't know if there has been a study on the strengths of BERT compared to the other methods, but looking at their differences might give some insight. Truly bidirectional: BERT is deeply bidirectional due to its novel masked language modeling technique. ELMo, on the other hand, uses a concatenation of right-to-left and left-to-right LSTMs, and ULMFiT uses a unidirectional LSTM. Having bidirectional context should, in theory, generate more accurate word representations. Model input: BERT tokenizes words into sub-words (using WordPiece), and those are then given as input to the model. ELMo uses character-based input and ULMFiT is word-based. It's been claimed that character-level language models don't perform as well as word-based ones, but word-based models have the issue of out-of-vocabulary words. BERT's sub-word approach enjoys the best of both worlds. Transformer vs. LSTM: at its heart BERT uses transformers, whereas ELMo and ULMFiT both use LSTMs. Besides the fact that these two approaches work differently, it should also be noted that using transformers enables the parallelization of training, which is an important factor when working with large amounts of data. The list goes on with things such as the corpus the model was trained on, the tasks used to train it, and more. So while it is true that BERT shows SOTA performance across a variety of NLP tasks, there are times when other models perform better. Therefore, when you're working on a problem it is a good idea to test a few of them and see for yourself which one suits your needs better.

What's the difference between an RNN that is used for sequence tagging and one that is used for classification?

The sequence tagging RNN has multiple output units (one per token), whereas the classification RNN has only one

What does the Skip Gram model do?

The skip gram model is an algorithm which optimizes word embeddings so that a word can be predicted by a single word in its context. It is the opposite of CBOW, where a context is used to predict a word. Here, a word is essentially used to predict a context.
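
The difference in training examples can be sketched by generating them from a token list. The window size, helper names and toy sentence are illustrative assumptions.

```python
def context_window(tokens, i, window):
    # Words within `window` positions of index i, excluding the word itself.
    lo = max(0, i - window)
    return tokens[lo:i] + tokens[i + 1:i + 1 + window]

def skipgram_pairs(tokens, window=1):
    # Skip-gram: each (center, context word) pair is one training example.
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

def cbow_examples(tokens, window=1):
    # CBOW: the whole context is used to predict the center word.
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]

tokens = ["the", "cat", "sat"]
pairs = skipgram_pairs(tokens)
examples = cbow_examples(tokens)
```

For "the cat sat", skip-gram yields ("cat", "the") and ("cat", "sat") as separate examples, while CBOW yields the single example (["the", "sat"], "cat").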

What is the stride?

The steps size for moving over the sentence (in NLP usually 1)

What is the idea of a CNN?

To identify indicative local predictors in a large structure, and combine them to produce a fixed size vector representation of the structure, capturing these local aspects that are most informative for the prediction task at hand

What are linguistic probing tasks used for?

To interpret sentence embeddings

Is an auto-encoder supervised or unsupervised?

Traditionally unsupervised

How can you get a different vector representation for each sense of a word?

Train word vectors on sense-disambiguated corpora

