Deep Learning with Python by Francois Chollet
What's information leak?
Every time you are tuning a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data is leaking into your model
It consists in applying K-fold validation multiple times, shuffling the data every time before splitting it K ways (name evaluation technique)
Iterated K-fold validation with shuffling
Split your data into N partitions of equal size. For each partition i, train a model on the remaining N-1 partitions, and evaluate it on partition i. Your final score is then the average of the N scores obtained (name evaluation technique)
K-fold validation
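A NumPy sketch of this partitioning logic (here k plays the role of N above; the data and the hypothetical get_model() helper are placeholders, so the training lines are left as comments):

```python
import numpy as np

k = 4
data = np.random.random((400, 10))      # placeholder samples
targets = np.random.random((400,))      # placeholder targets
num_val = len(data) // k

scores = []
for i in range(k):
    # partition i is held out for evaluation
    val_data = data[i * num_val:(i + 1) * num_val]
    val_targets = targets[i * num_val:(i + 1) * num_val]
    # train on the remaining k-1 partitions
    train_data = np.concatenate([data[:i * num_val], data[(i + 1) * num_val:]])
    train_targets = np.concatenate([targets[:i * num_val], targets[(i + 1) * num_val:]])
    # model = get_model()                                  # hypothetical model builder
    # model.fit(train_data, train_targets)
    # scores.append(model.evaluate(val_data, val_targets)) # score on partition i
# final_score = np.mean(scores)                            # average of the k scores
```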
specific instance of a class annotation in a classification problem
Label
common metric for ranking problems or multi-label classification
Mean Average Precision
What is an example of self-supervised learning?
For instance, "autoencoders" are a well-known instance of self-supervised learning, where the generated targets are the inputs themselves, unmodified
classification task where each input sample should be categorized into more than two categories
Multi-class classification
classification task where each input sample can be assigned multiple labels; the number of labels per observation is usually variable
Multi-label classification
Are strided convolutions often used?
No
Can you use in your workflow some quantity computed on your test data?
No
all targets for a dataset, typically collected by humans
Ground-truth, or annotations
What are some techniques you should try while regularizing your model and tuning hyperparameters?
- Add dropout - Try different architectures, add or remove layers - Add L1 / L2 regularization - Try different hyperparameters (such as the number of units per layer, the learning rate of the optimizer) to find the optimal configuration - Optionally iterate on feature engineering: add new features, remove features that do not seem to be informative
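A minimal sketch combining two of these techniques (dropout and L2 weight regularization) in a small Keras classifier; the architecture and coefficients are assumptions:

```python
from keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu',
                       kernel_regularizer=regularizers.l2(0.001),  # L2 penalty on the weights
                       input_shape=(10000,)))
model.add(layers.Dropout(0.5))                                     # drop 50% of outputs during training
model.add(layers.Dense(16, activation='relu',
                       kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
```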
How to develop a model that overfits?
- Add layers - Make your layers bigger - Train for more epochs
Steps for fine-tuning:
- Add your custom network on top of an already trained base network - Freeze the base network - Train the part you added - Unfreeze some layers in the base network - Jointly train both these layers and the part you added
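A sketch of those steps with the Keras API, assuming VGG16 as the base network and a binary classification head; layer names and hyperparameters are illustrative:

```python
from keras import models, layers, optimizers
from keras.applications import VGG16

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))

model = models.Sequential()
model.add(conv_base)                                  # already-trained base network
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))       # custom network added on top
model.add(layers.Dense(1, activation='sigmoid'))

conv_base.trainable = False                           # freeze the base, train the added part first
# ... compile and fit the model here ...

# Unfreeze a few of the top layers of the base, then jointly train them with the added part
conv_base.trainable = True
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    layer.trainable = set_trainable
model.compile(optimizer=optimizers.RMSprop(lr=1e-5), loss='binary_crossentropy', metrics=['acc'])
```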
What simple but important algorithmic improvements were made in early 2010s?
- Better "activation functions", such as "rectified linear units". - Better "weight initialization" schemes. It started with layer-wise pre-training, which was quickly abandoned. - Better optimization schemes, such as RMSprop and Adam.
What three key choices do you have to make while settling on a first basic network architecture?
- Choice of the last-layer activation - Choice of loss function - Choice of optimization configuration: optimizer, learning rate
the universal workflow of machine learning
- Define the problem and assemble a dataset - Pick a measure of success - Decide on an evaluation protocol - Prepare your data - Develop a model that does better than a baseline - Scale up: develop a model that overfits - Regularize your model and tune your hyperparameters
The typical Keras workflow:
- Define your training data: input tensors and target tensors - Define a network of layers (a "model") that will map your inputs to your targets - Configure the learning process by picking a loss function, an optimizer, and some metrics to monitor - Iterate on your training data
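A minimal end-to-end sketch of that workflow; the data shapes, architecture, and hyperparameters are placeholders:

```python
import numpy as np
from keras import models, layers

# 1. Define training data: input tensors and target tensors (placeholder data here)
x_train = np.random.random((1000, 20))
y_train = np.random.randint(0, 2, size=(1000, 1))

# 2. Define a network of layers (a model) mapping inputs to targets
model = models.Sequential([
    layers.Dense(32, activation='relu', input_shape=(20,)),
    layers.Dense(1, activation='sigmoid'),
])

# 3. Configure the learning process: loss, optimizer, metrics
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# 4. Iterate on the training data
model.fit(x_train, y_train, epochs=10, batch_size=128)
```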
What are three technical forces that are driving advances in machine learning?
- Hardware - Datasets and benchmarks - Algorithmic advances
What three things do we need to do machine learning?
- Input data points - Examples of the expected output - A way to measure if the algorithm is doing a good job
To make learning easier for your network, your data should:
- Take "small" values: typically most values should be in the 0-1 range. - Be homogenous, i.e. all features should take values roughly in the same range.
By which two key parameters are convolutions defined?
- The size of the patches that are extracted from the inputs - The depth of the output feature map, i.e. the number of filters computed by the convolution
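For illustration, how these two parameters appear in a Keras Conv2D layer (the particular values are assumptions):

```python
from keras import layers

conv = layers.Conv2D(32,                  # depth of the output feature map (number of filters)
                     (3, 3),              # size of the patches extracted from the inputs
                     activation='relu',
                     input_shape=(28, 28, 1))
```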
How does a variational autoencoder work?
- an encoder module turns the input samples into two parameters in a latent space of representations - we randomly sample a point z from the latent normal distribution that is assumed to generate the input image (z = z_mean + exp(z_log_variance) * epsilon) - a decoder module will map this point in the latent space back to the original input image
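A NumPy sketch of the sampling step alone (not a full VAE); the variable names follow the formula above and the latent dimension and placeholder values are assumptions:

```python
import numpy as np

latent_dim = 2
z_mean = np.zeros(latent_dim)             # would be produced by the encoder
z_log_variance = np.zeros(latent_dim)     # would be produced by the encoder

epsilon = np.random.normal(size=latent_dim)        # random normal noise
z = z_mean + np.exp(z_log_variance) * epsilon      # point passed to the decoder
```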
What do you need to do first to find the border between underfitting and overfitting?
- build a model that overfits - monitor the training loss and validation loss - when you see that the performance of the model on the validation data starts degrading, you have achieved overfitting
How to deal with nonstationary problems?
- constantly retrain your model on data from the recent past - gather data at a timescale where the problem is stationary
What four steps does the training loop consist of?
- draw a batch of training samples x and corresponding targets y - run the network on x (this is called "forward pass"), obtain predictions y_pred - compute the "loss" of the network on the batch, a measure of the mismatch between y_pred and y - update all weights of the network in a way that slightly reduces the loss on this batch
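A minimal sketch of those four steps as an explicit loop body, written here with TensorFlow's GradientTape (an assumption; Keras' fit() normally performs this loop for you):

```python
import tensorflow as tf

def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)              # forward pass on the batch x
        loss = loss_fn(y, y_pred)                     # mismatch between y_pred and y
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))  # slightly reduce the loss
    return loss
```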
What two options for encoding the labels do you have when dealing with multi-class (single-label) classification?
- encoding the labels via "categorical encoding" (also known as "one-hot encoding") and using categorical_crossentropy as your loss function. - encoding the labels as integers and using the sparse_categorical_crossentropy loss function.
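A sketch of the two options, assuming integer class labels stored in labels:

```python
import numpy as np
from keras.utils import to_categorical

labels = np.array([3, 1, 0, 2])                  # placeholder integer labels

# Option 1: one-hot ("categorical") encoding + categorical_crossentropy
one_hot_labels = to_categorical(labels)
# model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# Option 2: keep the integer labels + sparse_categorical_crossentropy
# model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```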
topology and parameters for multi-label categorical classification
- ends with Dense layer - number of units at the end is equal to the number of classes - sigmoid activation - loss: binary_crossentropy - targets are multi-hot encoded (a vector of 0s and 1s with a 1 for every label that applies)
topology and parameters for single-label categorical classification
- ends with Dense layer - number of units at the end is equal to the number of classes - softmax activation - loss: categorical_crossentropy if targets are one-hot encoded; sparse_categorical_crossentropy if they are integers
topology and parameters for regression towards a vector of continuous values
- ends with Dense layer - number of units at the end is the number of values you want to predict (usually one) - no activation - loss: mean_squared_error, mean_absolute_error and so on
Three common evaluation protocols:
- hold-out validation set - K-fold cross-validation - iterated K-fold validation
what are two ways to use word embeddings?
- learn word embeddings jointly with the main task you care about - load pre-trained word embeddings into your model
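A sketch of both options built around an Embedding layer; the vocabulary size, embedding dimension, sequence length, and the embedding_matrix are assumptions:

```python
from keras import models, layers

model = models.Sequential()
model.add(layers.Embedding(10000, 100, input_length=20))   # 10,000 tokens -> 100-d vectors
model.add(layers.Flatten())
model.add(layers.Dense(1, activation='sigmoid'))

# Option 1: learn the embeddings jointly with the main task -- simply train the model.
# Option 2: load pretrained embeddings (a hypothetical (10000, 100) embedding_matrix) and freeze them:
# model.layers[0].set_weights([embedding_matrix])
# model.layers[0].trainable = False
```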
What are some ways to fight overfitting in neural networks?
- Reduce the network's size (number of layers and units per layer) - Add weight regularization - Add dropout
What evaluation techniques do exist?
- simple hold-out validation - k-fold validation - iterated K-fold validation with shuffling
By which three attributes is a tensor defined?
- the number of axes it has - its shape - its data type
Why do we use pooling?
- to be able to learn the hierarchical structure of images - to reduce the number of parameters
Why do you still need feature engineering (even having deep learning algorithms)?
- to solve problems more elegantly while using less resources - to solve a problem with much less data
What can you do with TensorBoard?
- visually monitor your metrics during training - visualize your model architectures - visualize histograms of activations and gradients - explore embeddings in 3D
callbacks can be used for:
- model checkpointing - early stopping - adjusting the value of certain parameters during training - logging the training and validation metrics, or visualizing representations
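A sketch of the first two uses (the file path and patience value are assumptions):

```python
from keras import callbacks

callbacks_list = [
    callbacks.EarlyStopping(monitor='val_loss', patience=3),         # early stopping
    callbacks.ModelCheckpoint(filepath='my_model.h5',                # model checkpointing
                              monitor='val_loss',
                              save_best_only=True),
]
# model.fit(x, y, epochs=30, validation_data=(x_val, y_val), callbacks=callbacks_list)
```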
What's fine-tuning?
- unfreeze a few of the top layers of a frozen model base - jointly train both the newly added part of the model and these top layers
What method generates batches of randomly transformed images (it will loop infinitely)?
.flow()
In general with neural networks it is safe to input missing values as ...
0
dropout rate is usually set between
0.1 and 0.5
what is the best NN type for processing image data?
2D convnets
what is the best NN type for processing volumetric data?
3D convnets
what are feature maps?
3D tensors that convolutions operate over
What's the problem of redundancy in your data?
If some data points in your data appear twice, then shuffling the data and splitting it into a training set and a test set will result in redundancy between the training and test set. In effect, you would be testing on part of your training data.
What Keras class is used for data augmentation?
ImageDataGenerator
What's reinforcement learning?
In reinforcement learning, an "agent" receives information about its environment and learns to pick actions that will maximize some reward.
classification task where each input sample should be categorized into two exclusive categories
Binary classification
set of possible labels to choose from in a classification problem
Classes
What happens if you don't normalize your data?
It can trigger large gradient updates which will prevent your network from converging
What's data augmentation?
Data augmentation takes the approach of generating more training data from existing training samples, by "augmenting" the samples via a number of random transformations that yield believable-looking images.
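A sketch of such random transformations via ImageDataGenerator; the particular ranges are illustrative:

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=40,          # random rotations up to 40 degrees
    width_shift_range=0.2,      # random horizontal shifts
    height_shift_range=0.2,     # random vertical shifts
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
```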
what is the best NN type for processing vector data?
Dense layers
What is dropout?
Dropout, applied to a layer, consists in randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training
What's padding?
Padding consists in adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit center convolution windows around every input tile.
What's convolutional base?
The part of a convolutional network that can be reused on a new task (for feature extraction). It consists of the convolutional and pooling layers and doesn't include the dense layers.
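A sketch of instantiating such a convolutional base for reuse, using VGG16 from keras.applications as an assumed example:

```python
from keras.applications import VGG16

conv_base = VGG16(weights='imagenet',        # weights learned on ImageNet
                  include_top=False,         # leave out the densely connected classifier
                  input_shape=(150, 150, 3))
```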
common metrics for class-imbalanced problems
Precision-Recall
a measure of the distance between your model's prediction and the target
Prediction error, or loss value
what goes out of your model
Prediction, or output
one data point that goes into your model
Sample, or input
task where the target is a continuous scalar value
Scalar regression
How to make data representative when picking evaluation protocol?
Shuffle data randomly
Set apart some fraction of your data as your test set. Train on remaining data, evaluate on the test set (name evaluation technique)
Simple hold-out validation
What your model should ideally have predicted, according to an external source of data
Target
what's output feature map?
The convolution operation extracts patches from its input feature map, and applies the same transformation to all of these patches, producing an output feature map.
what allows convnets to efficiently learn increasingly complex and abstract visual concepts?
They can learn spatial hierarchies of patterns: a first convolution layer will learn small local patterns such as edges, but a second convolution layer will learn larger patterns made of the features of the first layers. And so on.
What's class activation map technique?
This general category of techniques consists in producing heatmaps of "class activation" over input images
What's stride?
Using stride n means that the width and height of the feature map get downsampled by a factor n (besides any changes induced by border effects)
task where the target is a set of continuous values, e.g. a continuous vector
Vector regression
Why shouldn't you use bottleneck layers (where the number of units in a hidden layer is less than the number of classes)?
You can lose some information during training
a GAN is made of two parts
a generator network and a discriminator network
what's RNN?
a neural network that reuses quantities computed during the previous iteration of the loop
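A toy NumPy sketch of this idea: the output at each step depends on the current input and on state carried over from the previous step (all shapes are assumptions):

```python
import numpy as np

timesteps, input_features, output_features = 10, 4, 8
inputs = np.random.random((timesteps, input_features))
state_t = np.zeros((output_features,))        # quantity reused from the previous iteration

W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

outputs = []
for input_t in inputs:
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    outputs.append(output_t)
    state_t = output_t                        # this step's output becomes the next step's state
```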
batch renormalization
a recent improvement over regular batch normalization
autoencoders
a type of network that aims to encode an input to a low-dimensional latent space then decode it back
common metrics for balanced classification problems
accuracy and ROC-AUC
what's translation invariant (regarding patterns that convnets learn)?
after learning a certain pattern, a convnet is able to recognize it anywhere
what's keras callback?
an object that is passed to the model in the call to fit and that is called at various points during training
Your model shouldn't have had access to ... information about the test set
any
vanishing gradient problem
as one keeps adding layers to a network, the network eventually becomes untrainable
how to make the contribution of the different losses more balanced in multi-output tasks?
assign different weights to different losses
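A sketch of such loss weighting on a small two-output functional model; the architecture, output names, and weight values are assumptions:

```python
from keras import layers, models

inputs = layers.Input(shape=(64,))
x = layers.Dense(32, activation='relu')(inputs)
age = layers.Dense(1, name='age')(x)                                # regression output
gender = layers.Dense(1, activation='sigmoid', name='gender')(x)    # classification output

model = models.Model(inputs, [age, gender])
model.compile(optimizer='rmsprop',
              loss={'age': 'mse', 'gender': 'binary_crossentropy'},
              loss_weights={'age': 0.25, 'gender': 10.})             # rebalance the two losses
```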
How are the weights of the neural network assigned initially?
at random
why are GANs so difficult to train?
because training a GAN is a dynamic process rather than a simple descent process with a fixed loss landscape
which networks suffer from vanishing gradient problem?
both very deep networks and RNNs over very long sequences
batch normalization
capable of adaptively normalizing data even as its mean and variance change over time during training
neural style transfer
consists in applying the "style" of a reference image to a target image, while conserving the "content" of the target image
what's wide and deep category of models?
consists in jointly training a deep neural network with a large linear model
ensembling
consists in pooling together the predictions of a set of different models, in order to produce better predictions
residual connection
consists simply in reinjecting previous representations into the downstream flow of data, by adding a past output tensor to a later output tensor
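A sketch with the Keras functional API; the shapes and layer sizes are assumptions:

```python
from keras import layers

x = layers.Input(shape=(32, 32, 128))
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.add([y, x])     # reinject the earlier output tensor into the downstream flow
```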
module that maps points from the low-dimensional latent space to realistic images is called ... in the case of VAEs
decoder
what's the key idea of image generation?
develop a low-dimensional "latent space" of representations, where any point can be mapped to a realistic-looking image
Why do we need validation set instead of simply using training and test sets?
developing a model always involves tuning its hyperparameters
what's necessary for classifiers to succeed in ensembling?
diversity; they should be biased in different ways
the fraction of the features that are being zeroed-out during dropout
dropout rate
what is the best NN type for processing sound data?
either 1D convnets (preferred) or RNNs
what is the best NN type for processing text data?
either 1D convnets (preferred) or RNNs
what is the best NN type for processing video data?
either 3D convnets (if you need to capture motion effects) or a combination of frame-level convnet for feature extraction followed by either a RNN or a 1D convnet to process the resulting sequences
what is the best NN type for processing timeseries data?
either RNNs (preferred) or 1D convnets
the process of using your own knowledge about the data and about the machine learning algorithm at hand to make the algorithm work better by applying hard-coded (non-learned) transformations to the data before it goes into the model
feature engineering
... consists in using the representations learned by a previous network to extract interesting features from new samples
feature extraction
in which applications are dense layers most commonly used?
for categorical data and the final classification or regression stage of most networks
how well the trained model would perform on data it has never seen before
generalization
to perform well on never-seen before data
generalize
module that maps points from the low-dimensional latent space to realistic images is called ... in the case of GANs
generator
the best solution to prevent a model from learning about misleading or irrelevant patterns found in the training data
get more training data
concept vector
given a latent space of representations, or an embedding space, certain directions in the space may encode interesting axes of variation in the original data (for example, a smile vector)
inception modules
the input is processed by several parallel convolutional branches whose outputs then get merged back into a single tensor
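A sketch of such a module with the functional API; the branch sizes are assumptions:

```python
from keras import layers

x = layers.Input(shape=(32, 32, 128))

branch_a = layers.Conv2D(64, 1, activation='relu', padding='same')(x)

branch_b = layers.Conv2D(64, 1, activation='relu', padding='same')(x)
branch_b = layers.Conv2D(64, 3, activation='relu', padding='same')(branch_b)

branch_c = layers.AveragePooling2D(3, strides=1, padding='same')(x)
branch_c = layers.Conv2D(64, 1, activation='relu', padding='same')(branch_c)

output = layers.concatenate([branch_a, branch_b, branch_c], axis=-1)   # merge back into one tensor
```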
Configuration of a model
hyperparameters
when should we prefer RNNs over 1D convnets?
if data ordering is strongly meaningful
How to notice if hold-out validation suffers from the problem of test/validation sets not being statistically representative?
if different random shuffling rounds of the data before splitting end up yielding very different model performance measures
From what flaw does simple hold-out validation suffer?
if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand
representational bottlenecks
if one layer happens to be too small, then the model will be constrained by how much information can be crammed into the activations of this layer (partially solved by residual connections)
In what cases should RNNs be used preferentially over 1D convnets?
in the case of sequences where patterns of interest are not invariant by temporal translation (for instance, timeseries data where the recent past is more important than the distant past)
Why does dropout work?
introducing noise in the output values of a layer can break up happenstance patterns that are not significant
what's a bi-directional recurrent neural network?
it consists of two regular RNNs, each processing the input sequence in one direction (chronologically and antichronologically), then merging their representations
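A sketch of a bidirectional recurrent layer in Keras; the vocabulary size and layer sizes are assumptions:

```python
from keras import models, layers

model = models.Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))   # one LSTM per direction, representations merged
model.add(layers.Dense(1, activation='sigmoid'))
```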
what's the main effect of batch normalization?
it helps with gradient propagation - much like residual connections - and thus it allows for deeper networks
What happens if a network has lots of parameters?
it can quickly learn a perfect dictionary-like mapping between training samples and their targets, a mapping without any generalization power
what's generator network in GANs?
it takes as input a random vector (a random point in the latent space) and decodes it into a synthetic image
what's discriminator network in GANs?
it takes as input an image (real or synthetic), and must predict whether the image came from the training set or was generated by the generator network
what's Xception?
it takes the idea of separating the learning of channel-wise and space-wise features to its logical extreme, and replaces Inception modules with depthwise separable convolutions, consisting in a depthwise convolution followed by a pointwise convolution - effectively an extreme form of an Inception module where spatial features and channel-wise features are fully separated
In which keras module do pretrained models reside?
keras.applications
where does embedding layer reside in keras?
keras.layers.Embedding
how to merge branches via the keras functional API?
keras.layers.add, keras.layers.concatenate, etc.
which keras class is used to tokenize sentences?
keras.preprocessing.text.Tokenizer
computationally tractable operation that maps any two points in your initial space to the distance between these points in your target representation space, completely by-passing the explicit computation of the new representation
kernel function
convolution layers learn ... patterns
local
Takes the predictions of the network and the true target (what you wanted the network to output), and computes a distance score, capturing how well the network has done on this specific example.
loss function
how to preserve style during neural style transfer?
maintain similar correlations within activations for both low-level layers and high-level layers
how to preserve content during neural style transfer?
maintain similar high-level layer activations between the target content image and the generated image
What does usually perform better: maxpooling, average pooling or strided convolutions?
maxpooling
Do we need to minimize or maximize loss?
minimize
How to fit a model using batch generator?
model.fit_generator(...)
How to save a model after training?
model.save(...)
At test time, ... units are dropped out
no
is ensembling the same network, trained several times independently from different random initializations, useful?
no
are RNNs faster than 1D convnets when dealing with sequences?
no, convnets are faster
should discriminator be trainable?
no, it should be frozen
does a generator see images from the training set directly?
no, the information it has about the data comes from the discriminator
are cycles allowed in NN topologies?
no, they should be Directed Acyclic Graphs
should intermediate layers return only their output at the last timestep when stacking recurrent layers?
no, they should return their full sequence of outputs
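A sketch of stacking recurrent layers this way; the layer sizes and input shape are assumptions:

```python
from keras import models, layers

model = models.Sequential()
model.add(layers.GRU(32, return_sequences=True, input_shape=(None, 16)))  # full sequence out
model.add(layers.GRU(32, return_sequences=True))                          # full sequence out
model.add(layers.GRU(32))                          # last layer returns only its final output
model.add(layers.Dense(1))
```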
is the optimization minimum in GANs fixed?
no, unlike in other NN topologies
What's usually the best choice of last-layer activation and loss function when dealing with regression to arbitrary values?
none, mse
what do we want to minimize during neural style transfer?
norm(style(reference_image) - style(generated_image)) + norm(content(original_image) - content(generated_image))
Before we fed this data into our network, we had to ... each feature independently so that each feature would have a standard deviation of 1 and a mean of 0
normalize
one-hot hashing trick
one may hash words into vectors of fixed size (typically done with built-in hash)
how are word embedding vectors different from one-hot encoded words?
one-hot encoded vectors are very sparse while word embedding vectors are low-dimensional floating point vectors
the process of adjusting a model to get the best performance possible on the training data
optimization
At what kind of problem do neural networks especially excel?
perceptual problems
What's overfitting?
performance of the model on never-seen-before data starts stalling (or even worsening) compared to its performance on the training data
depthwise separable convolution
performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution
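A sketch using Keras' SeparableConv2D layer, which implements exactly this depthwise + pointwise scheme; the sizes are assumptions:

```python
from keras import models, layers

model = models.Sequential()
model.add(layers.SeparableConv2D(32, 3, activation='relu', input_shape=(64, 64, 3)))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))
```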
What are nonstationary problems?
problems in which data change over time
involves building a large number of specialized decision trees then ensembling their outputs
random forest
what are the best choices for hyperparameter optimization?
random search and Hyperopt/Hyperas libraries
At train time, ... units are dropped out
randomly selected
What's usually the safest choice of optimizer and learning rate?
rmsprop and its default learning rate
What do we sometimes call axis 0?
the samples axis, or batch axis
At what kind of problem does gradient boosting excel?
shallow learning problems
What's usually the best choice of last-layer activation and loss function when dealing with binary classification?
sigmoid, binary_crossentropy
What's usually the best choice of last-layer activation and loss function when dealing with multi-class, multi-label classification?
sigmoid, binary_crossentropy
What's usually the best choice of last-layer activation and loss function when dealing with regression to values between 0 and 1?
sigmoid, mse or binary_crossentropy
What's temporal leak?
situation when you train your model on data from the future
When do you need to use Iterated K-fold validation with shuffling?
situations in which you have relatively little data available and you need to evaluate your model as precisely as possible
What's usually the best choice of last-layer activation and loss function when dealing with multi-class, single-label classification?
softmax, categorical_crossentropy
All inputs and targets in a neural network must be ... of ... point data (this step is called data ...)
tensors, floating, vectorization
Once your model is ready for prime time, you test it one final time on the ... data
test
What's L1 regularization?
the cost added is proportional to the absolute value of the weights coefficients
What's L2 regularization?
the cost added is proportional to the square of the value of the weights coefficients
what is the drawback of GANs?
the latent space that generated images come from may not have as much structure and continuity
what's overfitting?
the model starting to model patterns that are specific to the training data but that are misleading or irrelevant when it comes to new data
What happens if you are expecting missing values in the test data but the network was trained on data without any missing values?
the network will not have learned to ignore missing values
what's optimized in multi-output tasks?
the resulting loss values get summed into a global loss, which is what gets minimized during training
in word embeddings we expect the geometric distance between any two word vectors to relate to
the semantic distance of the associated words
what's underfitting?
there is still progress to be made; the network hasn't yet modeled all relevant patterns in the training data
what do VAEs excel at?
they are great for learning latent spaces that are well-structured, where specific directions encode a meaningful axis of variation in the data
what do GANs excel at?
they generate images that can potentially be highly realistic
how are bi-directional recurrent networks better than regular ones?
they have higher performance
multi-modal inputs
they merge data coming from different input sources, processing each type of data using different kinds of neural layers
why are NNs still extremely easy to fool?
they only perform local generalization not extreme generalization
what's the purpose of 1x1 convolution (pointwise convolution)?
they compute features that mix together information from the channels of the input tensor, but do not mix information across space at all
the second best solution to prevent a model from learning about misleading or irrelevant patterns found in the training data
to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it is allowed to store (regularization)
when are 1D convnets extremely useful when dealing with sequences?
to preprocess data before an RNN, because RNNs are extremely expensive for processing very long sequences
How to generate batches of images from a directory?
train_datagen = ImageDataGenerator(rescale=1./255)
train_datagen.flow_from_directory(...)
You train your model on the ... data
training
what makes convnets very data-efficient when processing images (they need fewer training samples)?
the patterns they learn are translation-invariant
You evaluate your model on the ... data
validation
one-hot encoding
a vector of size N in which all coordinates are 0 except for the i-th entry, which is 1
When K-fold validation is useful?
when the performance of your model shows significant variance based on your train-test split
what's layer weights sharing?
when you call a layer instance twice, instead of instantiating a new layer for each call, you reuse the same weights with every call
how is vanishing gradient problem solved in RNNs?
with LSTM or GRU cells
how is vanishing gradient problem solved in very deep networks?
with residual connections
should you use the same dropout mask at every timestep when training RNNs?
yes
How to avoid temporal leak?
you should always make sure that all data in your test set is posterior to the data in the training set
What can you do if you have no missing values in your training set but expect some in your test data?
you should artificially generate training samples with missing entries: simply copy some training samples several times and drop some of the features that you expect are susceptible to go missing in the test data