Deep Learning with Python by Francois Chollet

What's information leak?

Every time you are tuning a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data is leaking into your model

It consists in applying K-fold validation multiple times, shuffling the data every time before splitting it K-ways (name evaluation technique)

Iterated K-fold validation with shuffling

Split your data into K partitions of equal size. For each partition i, train a model on the remaining K-1 partitions, and evaluate it on partition i. Your final score would then be the average of the K scores obtained (name evaluation technique)

K-fold validation
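
The K-fold procedure described above can be sketched in plain Python. This is a minimal illustration, not the book's code: `k_fold_splits`, `k_fold_score` and the `train_and_evaluate` callback are hypothetical names, and the callback is assumed to train a fresh model and return one validation score.

```python
def k_fold_splits(num_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder so every sample is validated once.
        stop = num_samples if fold == k - 1 else start + fold_size
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, num_samples))
        yield train, val


def k_fold_score(samples, k, train_and_evaluate):
    """Average the validation scores obtained over the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(samples), k):
        train = [samples[i] for i in train_idx]
        val = [samples[i] for i in val_idx]
        # Hypothetical callback: trains a new model, returns a score.
        scores.append(train_and_evaluate(train, val))
    return sum(scores) / len(scores)
```

Each sample lands in exactly one validation fold, so no model is ever scored on data it was trained on.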

specific instance of a class annotation in a classification problem

Label

common metric for ranking problems or multi-label classification

Mean Average Precision

What are examples of self-supervised learning?

For instance, "autoencoders" are a well-known instance of self-supervised learning, where the generated targets are the inputs themselves, unmodified

classification task where each input sample should be categorized into more than two categories

Multi-class classification

classification task where each input sample can be assigned multiple labels; the number of labels per observation is usually variable

Multi-label classification

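Are strided convolutions often used?

No, they are rarely used in practice

Can you use in your workflow some quantity computed on your test data?

No, never, not even for something as simple as data normalization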

all targets for a dataset, typically collected by humans

Ground-truth, or annotations

What are some techniques you should try while regularizing your model and tuning hyperparameters?

- Add dropout - Try different architectures, add or remove layers - Add L1 / L2 regularization - Try different hyperparameters (such as the number of units per layer, the learning rate of the optimizer) to find the optimal configuration - Optionally iterate on feature engineering: add new features, remove features that do not seem to be informative

How to develop a model that overfits?

- Add layers - Make your layers bigger - Train for more epochs

Steps for fine-tuning:

- Add your custom network on top of an already trained base network - Freeze the base network - Train the part you added - Unfreeze some layers in the base network - Jointly train both these layers and the part you added

What simple but important algorithmic improvements were made in early 2010s?

- Better "activation functions", such as "rectified linear units". - Better "weight initialization" schemes. It started with layer-wise pre-training, which was quickly abandoned. - Better optimization schemes, such as RMSprop and Adam.

What three key choices do you have to make while settling on a first basic network architecture?

- Choice of the last-layer activation - Choice of loss function - Choice of optimization configuration: optimizer, learning rate

the universal workflow of machine learning

- Define the problem and assemble a dataset - Pick a measure of success - Decide on an evaluation protocol - Prepare your data - Develop a model that does better than a baseline - Scale up: develop a model that overfits - Regularize your model and tune your hyperparameters

The typical Keras workflow:

- Define your training data: input tensors and target tensors - Define a network of layers (a "model") that will map your inputs to your targets - Configure the learning process by picking a loss function, an optimizer, and some metrics to monitor - Iterate on your training data

What are three technical forces that are driving advances in machine learning?

- Hardware - Datasets and benchmarks - Algorithmic advances

What three things do we need to do machine learning?

- Input data points - Examples of the expected output - A way to measure if the algorithm is doing a good job

To make learning easier for your network, your data should:

- Take "small" values: typically most values should be in the 0-1 range. - Be homogeneous, i.e. all features should take values roughly in the same range.
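
One common way to get small, homogeneous values is feature-wise normalization: subtract each feature's mean and divide by its standard deviation. The sketch below is an illustration with a hypothetical helper name, `normalize_features`. It assumes rows of numeric features and no constant feature (which would give a standard deviation of zero). The statistics are computed on the training rows only and then reused for any other data.

```python
import math


def normalize_features(train_rows, other_rows):
    """Scale each feature to mean 0 and standard deviation 1.

    The mean and standard deviation come from the training rows only,
    and are then applied unchanged to the other rows.
    """
    columns = list(zip(*train_rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
        for col, m in zip(columns, means)
    ]

    def scale(rows):
        return [
            [(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows
        ]

    return scale(train_rows), scale(other_rows)
```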

By which two key parameters are convolutions defined?

- The size of the patches that are extracted from the inputs - The depth of the output feature map, i.e. the number of filters computed by the convolution

how does a variational autoencoder work?

- an encoder module turns the input samples into two parameters in a latent space of representations, z_mean and z_log_variance - we randomly sample a point z from the latent normal distribution that is assumed to generate the input image (z = z_mean + exp(0.5 * z_log_variance) * epsilon, where epsilon is a random tensor of small values) - a decoder module maps this point in the latent space back to the original input image
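
The sampling step can be sketched without any deep learning library. This is an illustration only: `sample_latent` is a hypothetical helper, and it treats `z_log_variance` as the logarithm of the variance, so the standard deviation is `exp(0.5 * z_log_variance)`.

```python
import math
import random


def sample_latent(z_mean, z_log_variance, rng=random):
    """Draw one latent point z, element by element."""
    z = []
    for mean, log_var in zip(z_mean, z_log_variance):
        epsilon = rng.gauss(0.0, 1.0)    # random noise from N(0, 1)
        std = math.exp(0.5 * log_var)    # log-variance -> standard deviation
        z.append(mean + std * epsilon)
    return z
```

Because the randomness is isolated in `epsilon`, `z` stays a simple function of the two encoder outputs, which is what lets gradients flow back to the encoder in a real implementation.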

What do you need to do first to find the border between underfitting and overfitting?

- build a model that overfits - monitor the training loss and validation loss - when you see that the performance of the model on the validation data starts degrading, you have achieved overfitting

How to deal with nonstationary problems?

- constantly retrain your model on data from the recent past - gather data at a timescale where the problem is stationary

What four steps does the training loop consist of?

- draw a batch of training samples x and corresponding targets y - run the network on x (this is called "forward pass"), obtain predictions y_pred - compute the "loss" of the network on the batch, a measure of the mismatch between y_pred and y - update all weights of the network in a way that slightly reduces the loss on this batch
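
The four steps can be made concrete with a toy model. The sketch below fits y = w * x + b with mini-batch gradient descent on a mean squared error loss. The function name, the model and the default settings are assumptions chosen for illustration.

```python
import random


def train_linear_model(data, steps=2000, batch_size=4, learning_rate=0.1, seed=0):
    """Fit y = w * x + b on a list of (x, y) pairs."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    loss = None
    for _ in range(steps):
        # 1. Draw a batch of training samples x and corresponding targets y.
        batch = rng.sample(data, batch_size)
        # 2. Forward pass: run the model on x to obtain predictions y_pred.
        y_pred = [w * x + b for x, _ in batch]
        # 3. Compute the loss, here the mean squared error on the batch.
        errors = [p - y for p, (_, y) in zip(y_pred, batch)]
        loss = sum(e * e for e in errors) / len(batch)
        # 4. Update the weights in the direction that reduces the loss.
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = 2 * sum(errors) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss
```

On noise-free data generated from y = 2x + 1, the returned `w` and `b` should end up close to 2 and 1.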

What two options for encoding the labels do you have when dealing with multi-class classification?

- encoding the labels via "categorical encoding" (also known as "one-hot encoding") and using categorical_crossentropy as your loss function. - encoding the labels as integers and using the sparse_categorical_crossentropy loss function.

topology and parameters for multi-label categorical classification

- ends with Dense layer - number of units at the end is equal to the number of classes - sigmoid activation - loss: binary_crossentropy - targets are k-hot encoded

topology and parameters for single-label categorical classification

- ends with Dense layer - number of units at the end is equal to the number of classes - softmax activation - loss: categorical_crossentropy if targets are one-hot encoded; sparse_categorical_crossentropy if they are integers

topology and parameters for regression towards a vector of continuous values

- ends with Dense layer - number of units at the end is the number of values you want to predict (usually one) - no activation - loss: mean_squared_error, mean_absolute_error and so on

Three common evaluation protocols:

- hold-out validation set - K-fold cross-validation - iterated K-fold validation

what are two ways to use word embeddings?

- learn word embeddings jointly with the main task you care about - load pre-trained word embeddings into your model

What are some ways to fight overfitting in neural networks?

- reduce the network's size (number of layers and units per layer) - Adding weight regularization - Adding dropout

What evaluation techniques do exist?

- simple hold-out validation - k-fold validation - iterated K-fold validation with shuffling

By which three attributes is a tensor defined?

- the number of axes it has - its shape - its data type

Why do we use pooling?

- to be able to learn hierarchical structure of images - to reduce number of parameters

Why do you still need feature engineering (even having deep learning algorithms)?

- to solve problems more elegantly while using fewer resources - to solve a problem with much less data

what can you do with TensorBoard?

- visually monitor your metrics during training - visualize your model architectures - visualize histograms of activations and gradients - explore embeddings in 3D

callbacks can be used for:

- model checkpointing - early stopping - dynamically adjusting the value of certain parameters during training - logging the training and validation metrics during training, or visualizing the learned representations

What's fine-tuning?

- unfreeze a few of the top layers of a frozen model base - jointly train both the newly added part of the model and these top layers

What method generates batches of randomly transformed images (it will loop infinitely)?

the .flow() method of ImageDataGenerator

In general with neural networks it is safe to input missing values as ...

0 (as long as 0 is not already a meaningful value)

dropout rate is usually set between

0.1 and 0.5

what is the best NN type for processing image data?

2D convnets

what is the best NN type for processing volumetric data?

3D convnets

what are feature maps?

3D tensors that convolutions operate over

What's the problem of redundancy in your data?

If some data points in your data appear twice, then shuffling the data and splitting it into a training set and a test set will result in redundancy between the training and test set. In effect, you would be testing on part of your training data.

What Keras class is used for data augmentation?

ImageDataGenerator

What's reinforcement learning?

In reinforcement learning, an "agent" receives information about its environment and learns to pick actions that will maximize some reward.

classification task where each input sample should be categorized into two exclusive categories

Binary classification

set of possible labels to choose from in a classification problem

Classes

What happens if you don't normalize your data?

It can trigger large gradient updates which will prevent your network from converging

What's data augmentation?

Data augmentation takes the approach of generating more training data from existing training samples, by "augmenting" the samples via a number of random transformations that yield believable-looking images.

what is the best NN type for processing vector data?

Dense layers

What is dropout?

Dropout, applied to a layer, consists in randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training
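
A minimal sketch of dropout on a plain list of features. It uses the "inverted" variant, an assumption made here for simplicity: surviving values are scaled up by 1 / (1 - rate) at training time, so nothing needs to change at test time. The name `dropout` and its signature are hypothetical, and `rate` is assumed to be in the range 0 <= rate < 1.

```python
import random


def dropout(features, rate, training, rng=random):
    """Zero out roughly a fraction `rate` of the features during training."""
    if not training:
        # At test time no units are dropped.
        return list(features)
    keep = 1.0 - rate
    # Each feature is dropped independently with probability `rate`;
    # survivors are scaled up so the expected sum stays the same.
    return [x / keep if rng.random() >= rate else 0.0 for x in features]
```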

What's padding?

Padding consists in adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit centered convolution windows around every input tile.

What's convolutional base?

The part of a convolutional network that can be reused for a new task. It consists of the convolutional and pooling layers and does not include the densely connected classifier on top.

common metrics for class-imbalanced problems

precision and recall

a measure of the distance between your model's prediction and the target

Prediction error, or loss value

what goes out of your model

Prediction, or output

one data point that goes into your model

Sample, or input

task where the target is a continuous scalar value

Scalar regression

How to make data representative when picking evaluation protocol?

Shuffle data randomly

Set apart some fraction of your data as your test set. Train on remaining data, evaluate on the test set (name evaluation technique)

Simple hold-out validation
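
A minimal sketch of a hold-out split, shuffling first so that both parts are representative. `hold_out_split` and its parameters are hypothetical names. The shuffle should be skipped for time-ordered data, where it would cause a temporal leak.

```python
import random


def hold_out_split(samples, num_validation, seed=0):
    """Shuffle the samples, then set aside the first `num_validation`."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    validation = shuffled[:num_validation]
    training = shuffled[num_validation:]
    return training, validation
```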

What your model should ideally have predicted, according to an external source of data

Target

what's output feature map?

The convolution operation extracts patches from its input feature map, and applies the same transformation to all of these patches, producing an output feature map.

what allows convnets to efficiently learn increasingly complex and abstract visual concepts?

They can learn spatial hierarchies of patterns: a first convolution layer will learn small local patterns such as edges, but a second convolution layer will learn larger patterns made of the features of the first layers. And so on.

What's class activation map technique?

This general category of techniques consists in producing heatmaps of "class activation" over input images

What's stride?

Using stride n means that the width and height of the feature map get downsampled by a factor n (besides any changes induced by border effects)
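
The effect of window size, stride and padding on the output width or height can be checked with a small helper. `conv_output_size` is a hypothetical name, and the formulas follow the usual "valid" and "same" padding conventions.

```python
import math


def conv_output_size(input_size, window_size, stride=1, padding="valid"):
    """Output size along one spatial dimension of a convolution or pooling."""
    if padding == "valid":
        # Only windows that fit entirely inside the input are used.
        return (input_size - window_size) // stride + 1
    if padding == "same":
        # The input is padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")
```

For example, a 3x3 window over a 28x28 input without padding gives 26x26, and a 2x2 window with stride 2 over a 26x26 input gives 13x13.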

task where the target is a set of continuous values, e.g. a continuous vector

Vector regression

Why shouldn't you use bottleneck layers (hidden layers with fewer units than the number of classes)?

You can lose information relevant to the classification, and later layers can never recover it

a GAN is made of two parts

a generator network and a discriminator network

what's RNN?

a neural network that reuses quantities computed during the previous iteration of the loop
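
A toy version of that loop, with a single scalar state instead of a vector. The name `simple_rnn`, the tanh activation and the scalar weights are assumptions made for illustration.

```python
import math


def simple_rnn(inputs, w_input, w_state, bias):
    """Process a sequence step by step, carrying a state between steps."""
    state = 0.0          # initial state: all zeros
    outputs = []
    for x in inputs:
        # The new output depends on the current input and the previous state.
        state = math.tanh(w_input * x + w_state * state + bias)
        outputs.append(state)
    return outputs
```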

batch renormalization

a recent improvement over regular batch normalization


autoencoder

a type of network that aims to encode an input to a low-dimensional latent space then decode it back

common metrics for balanced classification problems

accuracy and ROC-AUC

what's translation invariance (regarding patterns that convnets learn)?

after learning a certain pattern, a convnet is able to recognize it anywhere

what's keras callback?

an object that is passed to the model in the call to fit and that is called at various points during training

Your model shouldn't have had access to ... information about the test set

any

vanishing gradient problem

as one keeps adding layers to a network, the network eventually becomes untrainable

how to make the contribution of the different losses more balanced in multi-output tasks?

assign different weights to different losses

How are the weights of the neural network assigned initially?

at random

why are GANs so difficult to train?

because training a GAN is a dynamic process rather than a simple descent process with a fixed loss landscape

which networks suffer from vanishing gradient problem?

both very deep networks and RNNs over very long sequences

batch normalization

capable of adaptively normalizing data even as its mean and variance change over time during training

neural style transfer

consists in applying the "style" of a reference image to a target image, while conserving the "content" of the target image

what's wide and deep category of models?

consists in jointly training a deep neural network with a large linear model


model ensembling

consists in pooling together the predictions of a set of different models, in order to produce better predictions

residual connection

consists simply in reinjecting previous representations into the downstream flow of data, by adding a past output tensor to a later output tensor
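
A minimal sketch of that addition, using plain lists in place of tensors. `residual_block` is a hypothetical name, and `layer` stands for any transformation that keeps the shape unchanged. When shapes differ, the earlier tensor would first need to be reshaped, which is not shown here.

```python
def residual_block(x, layer):
    """Return layer(x) + x, element by element."""
    y = layer(x)
    if len(y) != len(x):
        raise ValueError("the layer must preserve the shape of its input")
    # The earlier representation x is added back to the later output y.
    return [xi + yi for xi, yi in zip(x, y)]
```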

module that develops a low-dimensional latent space is called ... in the case of VAEs

decoder

what's the key idea of image generation?

develop a low-dimensional "latent space" of representations, where any point can be mapped to a realistic-looking image

Why do we need validation set instead of simply using training and test sets?

because developing a model always involves tuning its hyperparameters, and that tuning needs a performance signal that does not come from the test set

what's necessary for classifiers to succeed in ensembling?

diversity; they should be biased in different ways

the fraction of the features that are being zeroed-out during dropout

dropout rate

what is the best NN type for processing sound data?

either 1D convnets (preferred) or RNNs

what is the best NN type for processing text data?

either 1D convnets (preferred) or RNNs

what is the best NN type for processing video data?

either 3D convnets (if you need to capture motion effects) or a combination of frame-level convnet for feature extraction followed by either an RNN or a 1D convnet to process the resulting sequences

what is the best NN type for processing timeseries data?

either RNNs (preferred) or 1D convnets

the process of using your own knowledge about the data and about the machine learning algorithm at hand to make the algorithm work better by applying hard-coded (non-learned) transformations to the data before it goes into the model

feature engineering

... consists in using the representations learned by a previous network to extract interesting features from new samples

feature extraction

in which applications are dense layers most commonly used?

for categorical data and the final classification or regression stage of most networks

how well the trained model would perform on data it has never seen before

generalization

to perform well on never-before-seen data

generalize

module that develops a low-dimensional latent space is called ... in the case of GANs

generator

the best solution to prevent a model from learning about misleading or irrelevant patterns found in the training data

get more training data

concept vector

given a latent space of representations, or an embedding space, certain directions in the space may encode interesting axes of variation in the original data (for example, a smile vector)

inception modules

the input is processed by several parallel convolutional branches whose outputs then get merged back into a single tensor

Configuration of model

hyperparameters

when should we prefer RNNs over 1D convnets?

if data ordering is strongly meaningful

How to notice if hold-out validation suffers from the problem of test/validation sets not being statistically representative?

if different random shuffling rounds of the data before splitting end up yielding very different model performance measures

From what flaw does simple hold-out validation suffer?

if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand

representational bottlenecks

if one layer happens to be too small, then the model will be constrained by how much information can be crammed into the activations of this layer (partially solved by residual connections)

in what cases should RNNs be used preferentially over 1D convnets?

in the case of sequences where patterns of interest are not invariant by temporal translation (for instance, timeseries data where the recent past is more important than the distant past)

Why does dropout work?

introducing noise in the output values of a layer can break up happenstance patterns that are not significant

what's bi-directional recurrent neural network?

it consists of two regular RNNs, each processing input sequence in one direction (chronologically and antichronologically), then merging their representations

what's the main effect of batch normalization?

it helps with gradient propagation - much like residual connections - and thus it allows for deeper networks

What happens if a network has lots of parameters?

it learns a perfect dictionary-like mapping between training samples and their targets

what's generator network in GANs?

it takes as input a random vector (a random point in the latent space) and decodes it into a synthetic image

what's discriminator network in GANs?

it takes as input an image (real or synthetic), and must predict whether the image comes from the training set or was generated by the generator network

what's Xception?

it takes the idea of separating the learning of channel-wise and space-wise features to its logical extreme, and replaces Inception modules with depthwise separable convolutions, consisting in a depthwise convolution followed by a pointwise convolution - effectively an extreme form of an Inception module where spatial features and channel-wise features are fully separated

In which keras module do pretrained models reside?

keras.applications

where does embedding layer reside in keras?

keras.layers (the Embedding layer)

how to merge branches via keras functional API?

keras.layers.add, keras.layers.concatenate, etc.

which keras class is used to tokenize sentences?

keras.preprocessing.text.Tokenizer

computationally tractable operation that maps any two points in your initial space to the distance between these points in your target representation space, completely by-passing the explicit computation of the new representation

kernel function

convolution layers learn ... patterns

local

Takes the predictions of the network and the true target (what you wanted the network to output), and computes a distance score, capturing how well the network has done on this specific example.

loss function

how to preserve style during neural style transfer?

maintain similar correlations within activations for both low-level layers and high-level layers

how to preserve content during neural style transfer?

maintain similar high-level layer activations between the target content image and the generated image

What does usually perform better: maxpooling, average pooling or strided convolutions?

max pooling

Do we need to minimize or maximize loss?

minimize

How to fit a model using batch generator?

model.fit_generator(...); in recent Keras versions, model.fit accepts generators directly

How to save a model after training?

model.save(filepath)

At test time, ... units are dropped out

no

is ensembling the same network trained several times independently from different random initializations useful?

no, it is largely not worth doing

are RNNs faster than 1D convnets when dealing with sequences?

no, convnets are faster

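should the discriminator be trainable while the generator is being trained through the combined GAN model?

no, it should be frozen there; it is only updated in its own separate training step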

does a generator see images from the training set directly?

no, the information it has about the data comes from the discriminator

are cycles allowed in NN topologies?

no, they should be Directed Acyclic Graphs

should intermediate layers return only their output at the last timestep when stacking recurrent layers?

no, they should return their full sequence of outputs

is the optimization minimum in GANs fixed?

no, unlike in other NN topologies

What's usually the best choice of last-layer activation and loss function when dealing with regression to arbitrary values?

none, mse

what do we want to minimize during neural style transfer?

norm(style(reference_image) - style(generated_image)) + norm(content(original_image) - content(generated_image))

Before we fed this data into our network, we had to ... each feature independently so that each feature would have a standard deviation of 1 and a mean of 0

normalize

one-hot hashing trick

one may hash words into vectors of fixed size (typically done with built-in hash)
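
A minimal sketch of the hashing trick. It swaps Python's built-in `hash` for `hashlib.md5`, because the built-in hash of a string changes between interpreter runs, while a digest gives the same index every time. The name `hashed_one_hot` and the default size of 1000 are assumptions.

```python
import hashlib


def hashed_one_hot(text, dimensionality=1000):
    """Encode a sentence as a fixed-size vector without a word index."""
    vector = [0.0] * dimensionality
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Map the word to a position between 0 and dimensionality - 1.
        index = int(digest, 16) % dimensionality
        vector[index] = 1.0
    return vector
```

Two different words can land on the same index (a hash collision). This becomes less likely as the vector size grows well beyond the number of distinct words.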

how are word embedding vectors different from one-hot encoded words?

one-hot encoded vectors are very sparse while word embedding vectors are low-dimensional floating point vectors

the process of adjusting a model to get the best performance possible on the training data

optimization

At what kind of problem do neural networks especially excel?

perceptual problems

What's overfitting?

performance of the model on never-seen-before data started stalling (or even worsening) compared to its performance on the training data

depthwise separable convolution

performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution
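
One way to see why this factorization is lighter is to count weights. The sketch below ignores biases and assumes one depthwise filter per input channel. Both function names are hypothetical.

```python
def regular_conv_weights(window, in_channels, out_channels):
    """Weights of a regular 2D convolution (biases ignored)."""
    return window * window * in_channels * out_channels


def separable_conv_weights(window, in_channels, out_channels):
    """Weights of a depthwise separable 2D convolution (biases ignored)."""
    depthwise = window * window * in_channels   # one spatial filter per channel
    pointwise = in_channels * out_channels      # 1x1 convolution mixing channels
    return depthwise + pointwise
```

With a 3x3 window, 64 input channels and 128 output channels, this gives 73,728 weights for the regular convolution against 8,768 for the separable one.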

What are nonstationary problems?

problems in which data change over time

involves building a large number of specialized decision trees then ensembling their outputs

random forest

what are the best choices for hyperparameter optimization?

random search and Hyperopt/Hyperas libraries

At train time, ... units are dropped out

randomly selected

What's usually the safest choice of optimizer and learning rate?

rmsprop and its default learning rate

How do we sometimes call axis 0?

sample or batch axis

At what kind of problem does gradient boosting excel?

shallow learning problems

What's usually the best choice of last-layer activation and loss function when dealing with binary classification?

sigmoid, binary_crossentropy

What's usually the best choice of last-layer activation and loss function when dealing with multi-class, multi-label classification?

sigmoid, binary_crossentropy

What's usually the best choice of last-layer activation and loss function when dealing with regression to values between 0 and 1?

sigmoid, mse or binary_crossentropy

What's temporal leak?

situation when you train your model on data from the future

When do you need to use Iterated K-fold validation with shuffling?

situations in which you have relatively little data available and you need to evaluate your model as precisely as possible
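
A minimal sketch of iterated K-fold validation with shuffling, in plain Python. `iterated_k_fold` and the `train_and_evaluate` callback are hypothetical names. The callback is assumed to train a fresh model on the first argument and return a score on the second.

```python
import random


def iterated_k_fold(samples, k, iterations, train_and_evaluate, seed=0):
    """Run K-fold validation several times, reshuffling before each run."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = list(samples)
        rng.shuffle(shuffled)              # new random order for this run
        fold_size = len(shuffled) // k
        for fold in range(k):
            start = fold * fold_size
            stop = len(shuffled) if fold == k - 1 else start + fold_size
            validation = shuffled[start:stop]
            training = shuffled[:start] + shuffled[stop:]
            scores.append(train_and_evaluate(training, validation))
    # Final score: the average over all iterations * k models.
    return sum(scores) / len(scores), len(scores)
```

The second return value shows the cost: `iterations * k` models are trained and evaluated, which can be expensive.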

What's usually the best choice of last-layer activation and loss function when dealing with multi-class, single-label classification?

softmax, categorical_crossentropy

All inputs and targets in a neural network must be ... of ... point data (this step is called data ...)

tensors, floating, vectorization

Once your model is ready for prime time, you test it one final time on the ... data

test

What's L1 regularization?

the cost added is proportional to the absolute value of the weight coefficients

What's L2 regularization?

the cost added is proportional to the square of the value of the weight coefficients
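
Both penalties can be written in a few lines. The sketch below treats the weights as a flat list, and `factor` is the regularization strength. All names are hypothetical.

```python
def l1_penalty(weights, factor):
    """Cost proportional to the absolute value of the weights."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """Cost proportional to the square of the weights (weight decay)."""
    return factor * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1_factor=0.0, l2_factor=0.0):
    """Total loss: the data loss plus the penalties on the weights."""
    return (
        data_loss
        + l1_penalty(weights, l1_factor)
        + l2_penalty(weights, l2_factor)
    )
```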

what is the drawback of GANs?

the latent space generated images come from may not have as much structure and continuity

what's overfitting?

the model starting to model patterns that are specific to the training data but that are misleading or irrelevant when it comes to new data

What happens if you are expecting missing values in the test data but the network was trained on data without any missing values?

the network will not have learned to ignore missing values

what's optimized in multi-output tasks?

the resulting loss values get summed into a global loss, which is what gets minimized during training

in word embeddings we expect the geometric distance between any two word vectors to relate to

the semantic distance of the associated words

what's underfitting?

there is still progress to be made; the network hasn't yet modeled all relevant patterns in the training data

what do VAEs excel at?

they are great for learning latent spaces that are well-structured, where specific directions encode a meaningful axis of variation in the data

what do GANs excel at?

they generate images that can potentially be highly realistic

how are bi-directional recurrent networks better than regular ones?

they can offer higher performance on certain tasks, because they capture patterns that a unidirectional RNN may overlook

multi-modal inputs

they merge data coming from different input sources, processing each type of data using different kinds of neural layers

why are NNs still extremely easy to fool?

they only perform local generalization not extreme generalization

what's the purpose of a 1x1 convolution (pointwise convolution)?

it computes features that mix together information from the channels of the input tensor, but does not mix information across space at all

the second best solution to prevent a model from learning about misleading or irrelevant patterns found in the training data

to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it is allowed to store (regularization)

when are 1D convnets extremely useful when dealing with sequences?

as a preprocessing step before an RNN, because RNNs are extremely expensive for processing very long sequences

How to generate batches of samples from images in a directory?

train_datagen = ImageDataGenerator(rescale=1./255) train_datagen.flow_from_directory(...)

You train your model on the ... data

training

what makes convnets very data-efficient when processing images (they need less training samples)?

the translation invariance of the patterns they learn

You evaluate your model on the ... data

validation

one-hot encoding

a vector of size N in which all coordinates are 0 except for the i-th entry, which is 1
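
A minimal sketch of one-hot encoding for integer class labels. `one_hot` and `one_hot_labels` are hypothetical helpers, and the indices are zero-based.

```python
def one_hot(index, num_classes):
    """Vector of size num_classes with a single 1 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


def one_hot_labels(labels, num_classes):
    """Encode a list of integer labels, one vector per label."""
    return [one_hot(label, num_classes) for label in labels]
```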

When is K-fold validation useful?

when the performance of your model shows significant variance based on your train-test split

what's layer weights sharing?

when you call a layer instance twice, instead of instantiating a new layer for each call, you are reusing the same weights with every call

how is vanishing gradient problem solved in RNNs?

with LSTM or GRU cells

how is vanishing gradient problem solved in very deep networks?

with residual connections

should you use the same dropout mask at every timestep when training RNNs?

yes

How to avoid temporal leak?

you should always make sure that all data in your test set is posterior to the data in the training set
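
A minimal sketch of a split that respects time order. `temporal_split` is a hypothetical helper, and each sample is assumed to carry its timestamp as its first element.

```python
def temporal_split(samples, num_test, timestamp=lambda sample: sample[0]):
    """Put the `num_test` most recent samples in the test set."""
    ordered = sorted(samples, key=timestamp)   # oldest first, no shuffling
    cut = len(ordered) - num_test
    return ordered[:cut], ordered[cut:]
```

Every test sample is then at least as recent as every training sample, so the model is never trained on data from the future.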

What can you do if you have no missing values in your training set but expect some in your test data?

you should artificially generate training samples with missing entries: simply copy some training samples several times and drop some of the features that you expect are susceptible to go missing in the test data

