Hands on Machine Learning with Scikit-Learn, Keras & TensorFlow

How can you drop values entirely?

.drop()

How can you drop empty values?

.dropna()

How can you fill empty values?

.fillna()

Neurons produce what?

Action potentials: short electrical impulses that travel along the axon and make the synapses release chemical signals called neurotransmitters.

If you want to build a deep seq-to-seq RNN, which RNN layers should have return_sequences=True?

All layers.

In supervised learning, what do you feed the system?

Training data that includes the desired solutions (the answers), called labels or targets.

What is an estimator?

Any object that can estimate some parameters based on a dataset.

What is the loss function in K-Means?

The inertia: the mean squared distance between each instance and its closest centroid.

What are the most important types of unsupervised learning?

Clustering; Anomaly and Novelty Detection; Visualization and Dimensionality Reduction; Association Rule Learning

Can you name four common unsupervised tasks?

Clustering, Visualization, Dimensionality Reduction, Association Rule Learning

What does an embedding layer do?

Converts word IDs into embeddings

What are two clustering algorithms that look for regions of high density?

DBSCAN and Mean-Shift.

What are generally the best regularization techniques in Neural Networks?

Early Stopping and Dropout

What is the width of the street in SVM regression controlled by?

Epsilon

What is a peephole connection?

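An extra connection in an LSTM cell that lets the gate controllers look at the long-term state as well: the previous long-term state c(t-1) is added as an input to the forget and input gates, and the current long-term state c(t) as an input to the output gate.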

What can we use OneClassSVM to do?

Find outliers

What is one of the reasons that batch learning can be impractical?

For large datasets, it can be impractical to train the system frequently.

What does GRU stand for?

Gated Recurrent Unit

In a decision tree, what does a node's gini attribute state?

How pure the node is... that is, it is pure if all training instances it applies to belong to the same class. This would be gini=0.
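
As a rough illustration of the answer above, the Gini impurity of a node can be computed from its per-class instance counts as 1 minus the sum of the squared class ratios. The function name and the example counts below are made up for this sketch.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class ratios."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# A pure node (all instances in one class) has impurity 0.
pure = gini_impurity([50, 0, 0])
# A node that mixes classes has an impurity above 0.
mixed = gini_impurity([0, 49, 5])
```

Here `pure` is 0 and `mixed` is about 0.168.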

What is another S shaped activation function that is often used in MLPs?

Hyperbolic Tangent Functions

What are the best libraries for optimizing hyperparameters of MLPs?

Hyperopt, Hyperas, Keras Tuner, Scikit-Optimize, Spearmint, Hyperband, Sklearn-Deap

Why does SELU help to solve exploding/vanishing gradients?

If all hidden layers use the SELU activation function, the network will self-normalize, meaning the layer will tend to preserve a mean of 0 and a standard deviation of 1 during training.

How is RL different from regular supervised or unsupervised learning?

In supervised and unsupervised learning, the goal is to find patterns to make predictions. In RL the goal is to find a good policy.

What does IPCA stand for?

Incremental Principal Component Analysis

What does LOF stand for?

Local Outlier Factor

What is Machine Learning (ML)?

Machine learning is the science (and art) of programming computers so they can learn from data.

What is the most common pooling layer?

Max Pooling Layers

What is the most common strategy that model-based learning algorithms use to succeed?

Minimizing a cost function measuring how bad the system is at making predictions on training data, plus a penalty for model complexity if the model is regularized.

Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?

No, because the cost function is convex.

How many output neurons are used in regression DNNs?

One per value to predict (a single neuron for univariate regression).

How many output neurons are in a classification based MLP?

One for every class

How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer?

One neuron with a logistic activation function.

When predicting housing prices, how many neurons do you need in the output layer, and which activation function should you use?

One neuron with no activation function.

How can you split the dataset?

One option is to use sklearn.model_selection.train_test_split(data,test_size=0.2)

What is a greedy algorithm?

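One that (for instance, in decision trees) searches for the best split at the current node only, reducing the Gini impurity there without checking whether that split leads to the lowest possible impurity several levels down.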

Manifold Learning is essentially a non-linear version of what?

PCA

What are the main approaches of Dimensionality Reduction?

Projection and Manifold Learning

What are the two most common supervised tasks?

Regression and Classification

What activation is better than ELU?

SELU--which is a scaled variant of ELU.

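Because GD works best when the training instances are independent, what do we need to do to a Time Series dataset before training on it?

Shuffle the data (the windows cut from the series, not the time steps inside each window).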

What does SAMME stand for?

Stagewise Additive Modeling Using a Multiclass Exponential Loss Function

How does Mean Shift work for clustering?

Start by placing a circle centered on each instance, then for each circle it computes the mean of all the instances located within it, and shifts the circle so that it is centered on the mean. It iterates until all the circles stop moving.

When you are going through kernels for SVMs, what is usually the best strategy?

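Start with the linear kernel, since it is so much faster, especially if the training set is very large or has plenty of features. If the training set is not too large, graduate up to other kernels such as the Gaussian RBF kernel.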

How is the activation function different in RNNs from DNNs?

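In an RNN the activations flow forward through the layers but are also fed back: each recurrent neuron receives its own output from the previous time step as an input.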

What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

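The algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum. It will then slow down, come back, and overshoot again, oscillating many times before converging, so overall it will take much longer to converge than with a smaller momentum value.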

What are Telodendria?

The branches that split off of an axon in biological neurons

How does K-Means clustering work at a high level?

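The centroids are initially placed randomly. Then each instance is labeled with its closest centroid, each centroid is moved to the mean of the instances assigned to it, and these two steps repeat until the centroids stop moving. The algorithm always converges, but possibly to a local optimum rather than the best solution.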
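
The two alternating steps can be sketched in plain Python. This is only a minimal illustration, not Scikit-Learn's implementation: the function name is made up, and the initial centroids are passed in explicitly (instead of being drawn at random) so that the run is repeatable.

```python
import math


def k_means(points, centroids, max_iters=100):
    """Alternate between labeling instances and moving centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # Assignment step: label each instance with the index of its closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the instances assigned to it.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:  # centroids stopped moving: converged
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the first three points end up in cluster 0 and the last three in cluster 1.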

What is the quadratic programming problem?

The fact that hard margin and soft margin problems are convex quadratic optimization problems with linear constraints.

What is the problem that makes CNNs necessary?

The fact that if you had purely fully connected layers, there would be far too many parameters.

What is the unsupervised process of association rule learning?

The process of digging into large amounts of data to discover interesting relations between attributes (for example, that shoppers who buy one item also tend to buy another).

What is irreducible error?

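The part of the generalization error that is due to the noisiness of the data itself. The only way to reduce it is to clean up the data (e.g., fix broken sensors or remove outliers).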

What is instance based learning?

This type of system learns the examples by heart, and then generalizes to new cases by using a similarity metric to compare them.

What is the purpose of a Deep and Wide Neural Network?

To learn deep patterns, as well as simple ones.

Why do Regularized Linear Models exist?

To reduce overfitting

Layers close to the output layer are called what?

Upper layers

What does WOR and WR stand for?

Without Replacement and With Replacement

How does RMSProp work?

Works like AdaGrad but only accumulates the gradients from the most recent iterations (using exponential decay). optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

What is the optimized version of Gradient Boosting called?

XGBoost

When chaining transformations, how do we apply a particular preprocessing transformation to the whole dataset?

apply() method

A single recurrent neuron, or a layer of recurrent neurons is _________ and can learn _________ patterns (about _____ steps long).

basic, short, ten

If your MC Model contains BatchNormalization, how should you use MC Dropout?

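Do not force training mode on the whole model. Instead, replace the Dropout layers with a subclass that stays active at inference time, so the other layers (such as BatchNormalization) keep behaving normally:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)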

When a classifier is trained, where are the target classes stored?

classes_

How do you shuffle a dataset with a buffer that holds a specified number of instances, and then group the shuffled instances into batches?

dataset = dataset.shuffle(buffer_size=5,seed=42).batch(7)

When reading a compressed file, what must you do?

dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],compression_type="GZIP") Specify the compression type.

How can we build a simple correlation matrix?

dataset.corr() Replace 'dataset' with the name of your own DataFrame.

How do you put in validation data (if you have carved a part of your dataset for it) in an MLP?

history = model.fit(X_train, y_train, epochs=30,validation_data=(X_valid,y_valid))

What is the easiest way to save and load a model?

import joblib
# Save
joblib.dump(my_model, "my_model.pkl")
# Load
my_model_loaded = joblib.load("my_model.pkl")

What are the most important supervised learning algorithms?

k-Nearest Neighbors Linear Regression Logistic Regression Support Vector Machines (SVMs) Decision Trees and Random Forests Neural Networks

How do you compute Leaky ReLU?

keras.layers.LeakyReLU(alpha=0.2)

How do we have multiple inputs for an MLP?

model = keras.Model(inputs=[input_A, input_B], outputs=[output])
If we want, we can send each input through a separate path by defining the hidden layers and specifying in parentheses which layer each one should take as input. In the fit method, we pass one training array per input:
history = model.fit((X_train_A, X_train_B), y_train, epochs=20, validation_data=((X_valid_A, X_valid_B), y_valid))

How do we turn on Out-of-Bag evaluation in an ensemble method?

oob_score=True

How do you implement gradient clipping?

optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

Which common library is equipped to use evolutionary hyperparameters?

sklearn-deap

How can you reshape an instance X[0] to be 28 x 28?

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)

Any layer that supports masking must have _________________ attribute equal to True.

supports_masking

What are the Deployment and Optimization API's?

tf.distribute tf.saved_model tf.autograph tf.graph_util tf.lite tf.quantization tf.tpu tf.xla

What are the high-level deep learning APIs?

tf.keras tf.estimator

What is a simple Text Vectorization tool?

tf.keras.layers.experimental.preprocessing.TextVectorization()

What are the special data structure API's?

tf.lookup tf.nest tf.ragged tf.sets tf.sparse tf.strings

What are the Mathematics (including Linear Algebra and Signal Processing) API's?

tf.math tf.linalg tf.signal tf.random tf.bitwise

What is the decision function for SVM?

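The decision function is $\mathbf{w}^\top \mathbf{x} + b$, where w = the feature weights vector, x = the feature vector of the new instance, and b = the bias term. The classifier predicts the positive class if the result is greater than or equal to 0, and the negative class otherwise.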

What is the perceptron learning equation?

$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \, (y_j - \hat{y}_j) \, x_i$
w_{i,j} = the connection weight between the ith input neuron and the jth output neuron
x_i = the ith input value of the current training instance
\hat{y}_j = the output of the jth output neuron for the current training instance
y_j = the target output of the jth output neuron for the current training instance
\eta = the learning rate
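
To make the rule concrete, here is a minimal sketch of a single threshold unit trained with it. The function names, the step function's threshold at 0, and the logical AND dataset are assumptions chosen for illustration; a learning rate of 1 keeps the arithmetic exact.

```python
def train_perceptron(X, y, learning_rate=1.0, epochs=20):
    """Train a single threshold unit with the perceptron learning rule."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            y_hat = 1 if z >= 0 else 0  # Heaviside step function
            error = target - y_hat
            # w_i <- w_i + learning_rate * (y - y_hat) * x_i
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Output of the trained unit for one instance."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0


# Logical AND is linearly separable, so the rule converges on it.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

Weights only change when the prediction is wrong, so once every instance is classified correctly the updates stop.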

How do you write a TFRecord?

with tf.io.TFRecordWriter("my_dat.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"This is the second record")

How might you get the shape of the data?

x.shape

How do you apply a MC Dropout in a model without batch normalization?

y_probas = np.stack([model(X_test_scaled, training=True) for sample in range(100)])
y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)

How do you print the OOB score?

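bag_clf.oob_score_ (available after fitting with oob_score=True). In this case, it is a bagging classifier, but the same attribute exists on other ensembles that support out-of-bag evaluation, such as random forests.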

What does BIRCH stand for?

Balanced Iterative Reducing and Clustering using Hierarchies

Which Gradient Descent algorithm will actually converge?

Batch Gradient Descent

Why do people use Encoder-Decoder RNNs rather than plain seq-to-seq RNNs for automatic translations?

Because typically, translating between languages one word at a time is not effective.

Adding more layers helps the model to perform what?

Better abstraction

Generalization error can be expressed as the sum of what three different types of errors?

Bias Variance Irreducible Error

What is boosting, and how does it work?

Boosting combines several weak learners into a strong learner by training the predictors sequentially, each one trying to correct its predecessor.

How do Isolation Forests find outliers?

Builds a random forest with random thresholds, and the anomalies tend to get isolated from the pack.

How does BIRCH work?

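It is designed for very large datasets. It gives results similar to K-Means and can be faster, as long as the number of features is not too large (fewer than about 20). During training it builds a tree structure holding just enough information to assign each new instance to a cluster, without storing every instance.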

What is the regularization hyperparameter in an SVM that controls how soft the margin is (how many instances are allowed on the street)?

C. The lower C is, the more regularized the model becomes.

What does a Decision Tree Regression do?

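It predicts the average target value of the training instances in the region (leaf) that the new instance falls into. The splits are chosen to minimize the MSE rather than the impurity.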
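
A minimal sketch of that idea on a single feature: try each candidate threshold, keep the one with the lowest weighted MSE, and predict the mean target on each side. The helper names and the toy data are made up, and a real tree would repeat this search recursively over every feature.

```python
def mse(values):
    """Mean squared error of predicting the mean for every value."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Find the threshold on one feature that minimizes the weighted MSE of the two sides."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        cost = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
threshold = best_split(xs, ys)
# The prediction for a region is the mean target of its training instances.
left_targets = [y for x, y in zip(xs, ys) if x <= threshold]
left_prediction = sum(left_targets) / len(left_targets)
```

The best threshold here falls between the two groups of points, and the left region predicts roughly 1.0.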

How can you get a feel for the distribution of the data?

Call the .hist() method to return a histogram of the data.

How would you create a bagging classifier with decision trees?

# Bagging Classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

How do we compute the importance of variables using Random Forests?

# Compute Random Forest Feature Importance
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

What is the package for linear regression?

import sklearn.linear_model
# Select a Linear Model
model = sklearn.linear_model.LinearRegression()
# Train the Model
model.fit(X, y)

How can you set the decision threshold to be precision above 90%?

# Threshold for 90% Precision (precisions and thresholds come from sklearn.metrics.precision_recall_curve)
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

How do you build an example voting classifier?

# Voting Classifiers
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='hard')
voting_clf.fit(X_train, y_train)

What does the code look like for a Deep and Wide Neural Network?

# Wide and Deep Model
input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])

What are three possible solutions to poor generalization?

- Get more data
- Simplify the model (reducing the number of parameters or features to force generalization)
- Reduce noise in the training data

What are the parts of a biological neuron?

Cell Body Dendrites Axon Telodendria Synaptic Terminals

What in TensorFlow is similar to a Pipeline in scikit-learn?

Chaining Transformations

Rather than using gini impurity for decision trees, you can also use what?

Change the criterion hyperparameter to "entropy". A node's entropy is zero when it contains instances of only one class. Most of the time both measures lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default, while entropy tends to produce slightly more balanced trees.

What is Checkpointing, and why is it useful?

Checkpointing saves the model at regular intervals during training (e.g., at the end of each epoch) to ensure that if training runs for a long time, the progress isn't lost if there is a problem.

What are the most common callbacks in keras models?

Checkpoints and Early Stopping

What is the most immediate way to choose the right number of dimensions?

Choose the number of dimensions that accounts for a sufficiently large portion of the variance. Usually 95% or more.
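
As a small illustration, the number of dimensions can be read off the cumulative sum of the explained variance ratios. The function name and the ratios below are hypothetical; in practice the ratios would come from a fitted PCA.

```python
def dimensions_for_variance(explained_variance_ratios, target=0.95):
    """Return the smallest number of leading components whose ratios sum to at least `target`."""
    cumulative = 0.0
    for d, ratio in enumerate(explained_variance_ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return d
    return len(explained_variance_ratios)


# Hypothetical explained variance ratios, sorted from largest to smallest.
ratios = [0.60, 0.30, 0.08, 0.02]
d = dimensions_for_variance(ratios)
```

With these ratios the first two components explain only 90% of the variance, so three dimensions are needed. Scikit-Learn's PCA can also do this search itself when n_components is set to a float between 0 and 1, such as 0.95.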

Unlike Logistic Regression, SVM classifiers do not output what?

Class probabilities

Can you think of a few applications of a Sequence-to-vector RNN?

Classifying music samples by music genre, analyzing the sentiment of a book review, predicting what word an aphasic patient is thinking of based on readings from brain implants, predicting the probability that a user will want to watch a movie based on their watch history.

What is best loss for classification based MLPs?

Cross-Entropy (Categorical for Multi-Class and Binary for Two Classes)

What are the most common uses of clustering?

Customer Segmentation; Data Analysis (separating data into groups that can be analyzed more easily); Dimensionality Reduction; Anomaly Detection; Semi-Supervised Learning; Search Engines; Image Segmentation

What are some of the main applications of clustering algorithms?

Data Analysis, Customer Segmentation, Recommender Systems, Search Engines, Image Segmentation, Semi-Supervised Learning, Dimensionality Reduction, Anomaly Detection, and Novelty Detection

What is a bias error?

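The part of the generalization error that is due to wrong assumptions, such as assuming the data is linear when it is actually quadratic. A high-bias model is likely to underfit the training data.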

What kind of layer should SELU be used for?

Dense Layers

What is variance error?

Due to excessive sensitivity to variations in the data.

How do you choose the value of the regularization hyperparameters?

Holdout Validation. You holdout part of the training data to evaluate the hyperparameter values.

In a decision tree, what does a node's samples attribute state?

How many training instances it applies to

In a decision tree, what does a node's value attribute state?

How many training instances of each class this node applies to.

What does the coef0 hyperparameter in a polynomial kernel SVC specify?

How much the model is influenced by high-degree versus low-degree polynomials

How does Nesterov Accelerated Gradient optimizer work?

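It measures the gradient of the cost function not at the local position but slightly ahead, in the direction of the momentum, which usually makes it faster than regular momentum optimization. optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)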

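What happens if the learning rate is too high?

The algorithm might jump across the valley and past the minimum, possibly ending up higher than it was before, which can make it diverge.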

How does the AdaGrad optimizer work?

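It scales down the gradient vector along the steepest dimensions, which amounts to an adaptive learning rate that points the updates more directly toward the global optimum. It doesn't work as well with neural nets, because the learning rate gets scaled down so much that training often stops too early.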

What are the activation functions best for classification?

Logistic for Binary Predictions, or Softmax for Multi-Class Predictions

What does LSTM stand for?

Long Short Term Memory

What is the easiest way to find the appropriate number of clusters in k-means?

Plot the inertia as a function of the number of clusters k and look for the inflection point (the "elbow"), where the inertia stops dropping quickly.

What is the dual problem?

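Given a constrained optimization problem (the primal problem), it is possible to express a different but closely related problem, called its dual. The solution to the dual typically gives a lower bound on the solution of the primal, but for the SVM problem the two have the same solution, and the dual form makes the kernel trick possible.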

What does tf.RaggedTensor allow you to do?

Represent static lists of lists of tensors where every tensor has the same shape and data type (the lists themselves may have different lengths).

What are the different types of Regularized Linear Models?

Ridge Regression Lasso Regression Elastic Net

What activation functions would you use LeCun initialization for?

SELU

What is online learning better suited for?

Systems that receive data as a continuous flow (such as the stock market).

How do you calculate the True Positive Rate?

TPR = (True Positives) / (True Positive + False Negatives)
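
The formula can be checked on a handful of labels. This is a small sketch with made-up labels and a made-up function name; Scikit-Learn's recall_score computes the same quantity.

```python
def true_positive_rate(y_true, y_pred, positive=1):
    """TPR (recall) = TP / (TP + FN) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
tpr = true_positive_rate(y_true, y_pred)
```

Four instances are actually positive and three of them are detected, so the TPR is 0.75. The false positive at index 4 does not affect it.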

RNN's can simultaneously do what?

Take inputs in and spit outputs out.

What is a common form of data augmentation in image processing?

Taking the same image and making duplicate instances of it, but rotating the image.

With supervised regression tasks, what are the labels called?

Targets

What does TPR calculate?

The proportion of actual positive instances that the classifier correctly detects (also called recall or sensitivity).

When using predict_proba in logistic regression, what will be returned?

The probability that it belongs to a particular class.

What is reinforcement learning?

The process of an agent learning from rewards and punishment.

What is min_weight_fraction_leaf?

The same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances.

What is a common struggle for CNNs with reference to object detection, and how do you solve this?

The same object may be detected more than once, leading to erroneous (duplicate) bounding boxes. Solve this with non-max suppression:
1. Add an "objectness" output to your CNN that uses a sigmoid activation function to estimate the probability that an object is present in the image, and train it using binary cross-entropy loss. Get rid of all bounding boxes that don't reach a certain objectness score threshold.
2. Find the bounding box with the highest objectness score and get rid of all the other bounding boxes that overlap a lot with it (e.g., with an IoU greater than 60%).
3. Repeat step 2 until there are no more bounding boxes to get rid of.
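
The suppression procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions: boxes are (x1, y1, x2, y2) tuples, the function names and the 0.5 score threshold are made up, and the 0.6 IoU threshold is the 60% figure from the answer.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.6):
    """Return the indices of the boxes that survive non-max suppression."""
    # Drop boxes whose objectness score is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        # Keep the highest-scoring remaining box...
        best = candidates.pop(0)
        kept.append(best)
        # ...and drop every remaining box that overlaps it too much.
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
```

In this toy example box 1 is removed because it overlaps box 0 with an IoU of 0.81, box 3 is removed for its low score, and boxes 0 and 2 are kept.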

What does Non-Representative Data Mean?

Training data that does not reflect the cases we want to generalize to. To generalize well, the sample cases must be representative of what we are trying to measure.

What does the ROC curve measure?

The true positive rate against the false positive rate.

What is feature extraction?

The unsupervised process of merging features that are highly correlated with one another into a single feature, to reduce the number of dimensions.

What are the main innovations in Xception?

The use of depthwise separable convolutional layers, which look at spatial patterns and depthwise patterns separately.

What symbol is used to represent the weight for each given variable in a linear regression model?

Theta

What are Random Forests?

They are essentially a bagging technique on many decision trees.

Why don't perceptrons output a class probability?

They are hard predictors

What are pooling layers?

They are layers that subsample (shrink) their input to reduce the computational load, the memory usage, and the number of parameters. Like a convolutional layer, each neuron looks at a small rectangular receptive field, but it has no weights: it only aggregates the inputs in that field, for example by keeping their max or their mean.
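
A minimal sketch of max pooling over a 2D feature map, assuming no padding, a 2 x 2 window, and a stride of 2. The function name is made up, and a real layer such as Keras's MaxPool2D also handles batches and channels.

```python
def max_pool_2d(feature_map, pool_size=2, stride=2):
    """Max pooling with no padding over a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - pool_size + 1, stride):
        pooled_row = []
        for c in range(0, cols - pool_size + 1, stride):
            window = [
                feature_map[r + dr][c + dc]
                for dr in range(pool_size)
                for dc in range(pool_size)
            ]
            pooled_row.append(max(window))  # swap max for a mean to get average pooling
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2d(feature_map)
```

The 4 x 4 input shrinks to 2 x 2, keeping only the largest value of each window.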

What are decision trees?

They are systems that ask a series of questions to eventually reach a conclusion.

How does backpropagation work?

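For each mini-batch, the algorithm makes a forward pass to compute the predictions and measure the error, then a backward pass that uses the chain rule to determine how much each connection contributed to the error, and finally a gradient descent step that adjusts the weights and biases to reduce the error. One pass over the whole training set is called an epoch.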

Can custom Keras components contain arbitrary Python code, or must they be convertible to TF Functions?

They should be convertible to TF Functions. If you need to use arbitrary Python code, wrap it in tf.py_function() or set dynamic=True when creating the custom layer or model.

How do Random Forest Regressors work?

They train multiple decision trees and then find the average response.

How does Affinity Propagation work?

This allows instances to vote for instances that are similar to them to be their representatives. It is very computationally complex.

What is model based learning?

This builds a model around what you want to predict, and uses utility and / or cost functions.

What is gradient clipping?

This entails clipping gradients during backpropagation so they never exceed a certain threshold.

What is dilation_rate?

This dilates the convolutional filter by intentionally putting "holes" (zeros) between its weights, which gives the layer a larger receptive field at no extra computational cost. For example, a filter [1,2,3] with a dilation rate of 4 becomes [1,0,0,0,2,0,0,0,3].
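
The effect of the dilation rate on a one-dimensional filter can be sketched as follows. The function name is made up for illustration, and deep learning libraries typically skip the corresponding inputs instead of materializing the zeros.

```python
def dilate_filter(kernel, dilation_rate):
    """Insert (dilation_rate - 1) zeros between consecutive filter weights."""
    dilated = []
    for i, weight in enumerate(kernel):
        if i > 0:
            dilated.extend([0] * (dilation_rate - 1))
        dilated.append(weight)
    return dilated


dilated = dilate_filter([1, 2, 3], dilation_rate=4)
```

A dilation rate of 1 leaves the filter unchanged.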

What is instance segmentation?

This involves assigning all pixels that are part of the same individual object to the same segment, so that each chair in an image gets its own segment rather than all the chairs sharing one.

What is semantic segmentation in clustering?

This involves assigning all pixels that are part of the same object type to the same segment, so that all the chairs in an image end up in one "chair" segment.

What is active learning?

This involves having a human label the instances that the system is uncertain about

What is a Dropout layer?

At every training step, each neuron in the layer has a probability p of being temporarily dropped: it is ignored during that step, but it may be active again during the next one.
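
One training step of dropout can be sketched on a list of activations. This assumes the common "inverted" variant, in which the surviving values are divided by the keep probability during training so that nothing needs rescaling afterwards. The function name and the seed are made up.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout for one training step: zero each value with probability `rate`."""
    keep_probability = 1.0 - rate
    return [
        a / keep_probability if rng.random() >= rate else 0.0
        for a in activations
    ]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dropped = dropout(activations, rate=0.5, rng=rng)
```

Each output is either 0 (the neuron was dropped for this step) or the original activation scaled up by 1 / (1 - rate). Calling the function again draws a different random mask.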

What is the No Free Lunch Theorem?

A theorem stating that if you make no assumptions about the data, there is no reason to prefer one model over another: there is no way of knowing beforehand which model will work best.

Each individual training example is called what?

A training instance or sample

Pixel intensity for each color channel is represented how?

As a byte from 0 to 255. We scale this to between 0 and 1.

What is the process of automatically computing gradients in neural networks called?

Automatic Differentiation, or Autodiff

What does BPTT stand for?

Back Propagation Through Time

________________________ as a function has essentially become a given. You are assumed to have done it after every layer.

BatchNormalization

What is one of the difficulties in assessing the abilities of a model utilizing dropout?

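Because dropout is only active during training, comparing the training loss with the validation loss (where the dropout layer is turned off) can be misleading: a model may be overfitting the training set and yet show similar training and validation losses. Evaluate the training loss with dropout turned off, for example after training.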

What is feature scaling?

Because models don't operate well when features vary greatly in scale (e.g., sq ft = 0-1000 but bedrooms = 0-5), scaling rescales the values to a common range.

Why might we choose non-linear SVM over linear?

Because not all datasets are linearly separable.

How do we create our own metric?

By instantiating it as an object (for example, from a subclass of keras.metrics.Metric), calling it on each batch to update it, and calling my_metric.result() at any time during the iterations to see the current value of the metric.

How does AdaBoost work?

By training a base classifier such as a Decision Tree, and then looking at its mistakes and increasing the relative weight of misclassified training instances, then training a second classifier using the updated weights.
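
The reweighting step can be sketched for a single round. This is a simplified illustration with a made-up function name: it assumes the weighted error rate is strictly between 0 and 1, computes the predictor weight as learning_rate * log((1 - r) / r), and normalizes the instance weights afterwards.

```python
import math


def adaboost_reweight(weights, misclassified, learning_rate=1.0):
    """One AdaBoost round: compute the predictor weight and boost misclassified instances."""
    # Weighted error rate of the current predictor (assumed to be strictly between 0 and 1).
    error_rate = sum(w for w, wrong in zip(weights, misclassified) if wrong) / sum(weights)
    # Predictor weight: larger when the predictor is more accurate.
    alpha = learning_rate * math.log((1.0 - error_rate) / error_rate)
    # Boost the weights of the misclassified instances, then normalize.
    boosted = [w * math.exp(alpha) if wrong else w for w, wrong in zip(weights, misclassified)]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]


weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]
alpha, new_weights = adaboost_reweight(weights, misclassified)
```

After this round the misclassified instance carries half of the total weight, so the next classifier is pushed to get it right.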

How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

By using a logistic or softmax activation with gradient descent.

How can regression be used for classification?

By utilizing logistic regression to predict the probability that an instance belongs to a given category.

What type of model is typically used for object detection?

CNN

What is one of the features of lasso regression?

It tends to eliminate the weights of the least important features by setting them to zero, so it automatically performs feature selection and outputs a sparse model.

How does PCA work?

It finds the closest hyperplane to the data, and projects the data onto it.

When a CNN learns about a feature in one part of the image, can it recognize it elsewhere in the image?

Yes. This is part of the appeal of CNNs.

Can you have multiple output layers in an MLP?

Yes. You will do the exact same thing that you do to apply multiple inputs, but instead will switch the strategy for outputs. You will also have to pass a list of losses to the compile method.

What does it mean that you are measuring the gradient?

You are measuring the partial derivative of the cost function by observing how much the cost function changes for each tiny change in theta, and pursuing the fastest path to convergence.

How do you calculate the silhouette coefficient?

$s = (b - a) / \max(a, b)$, where a = the mean distance to the other instances in the same cluster, and b = the mean nearest-cluster distance (the mean distance to the instances of the next closest cluster).
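
The formula can be worked through for one instance. In this sketch the function name and the four toy points are made up, and returning 0 for an instance that is alone in its cluster is an assumed convention.

```python
import math


def silhouette_coefficient(index, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) of one instance."""
    own = labels[index]
    # a: mean distance to the other instances in the same cluster.
    same = [
        math.dist(points[index], p)
        for i, (p, label) in enumerate(zip(points, labels))
        if label == own and i != index
    ]
    if not same:
        return 0.0  # assumed convention for a cluster that holds a single instance
    a = sum(same) / len(same)
    # b: mean distance to the instances of the nearest other cluster.
    b = min(
        sum(math.dist(points[index], p) for p, label in zip(points, labels) if label == other)
        / labels.count(other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_coefficient(0, points, labels)
```

For the first point a is 1 and b is about 10.02, so the coefficient is about 0.90, close to the maximum of +1 that indicates an instance well inside its own cluster. The silhouette score of a clustering is the mean of this coefficient over all instances.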

What can OOB be used for?

Testing in Ensemble Learning Methods, since these instances have not been trained on

What is min_samples_leaf?

The minimum number of samples a leaf node must have in a decision tree

How do you use a train-dev set?

The train-dev set is a part of the training set that is held out. The model is trained on the rest of the training set and evaluated on both the train-dev set and the validation set.

What are the most important Association Rule Learning algorithms?

Apriori Eclat

Can you list all the hyperparameters you can tweak in a basic MLP?

1. # of hidden layers. 2. # of neurons in each layer. 3. Activation function used in each hidden layer and in the output layer. * Generally ReLU is a good default for hidden layers. For the output you will want to use logistic activation for binary classification, softmax for multiclass, and no activation for regression.

How do you fix underfitting?

1. A more powerful model with more parameters. 2. Feed better features. 3. Reduce the constraints on the model.

What are a few tasks where GANs can shine?

1. Advanced Image Processing 2. Colorization 3. Image Editing (replacing objects with realistic background) 4. Turning a simple sketch into a photorealistic image. 5. Predicting the Next Frames in a Video 6. Augmenting Datasets 7. Generating text, audio, and time series.

Name three ways you can produce a sparse model (with most weight equal to zero).

1. Apply l sub 1 regularization during training. 2. Train normally and then zero out small weights. 3. Use the TensorFlow Model Optimization Toolkit.

What are the main tasks that autoencoders are used for?

1. Feature Extraction. 2. Unsupervised Pretraining 3. Dimensionality Reduction 4. Generative Models 5. Anomaly Detection (Autoencoders are usually bad at reconstructing outliers)

What are solutions for overfitting the data?

1. Use a model with fewer parameters, such as a linear model. 2. Gather more training data. 3. Reduce the noise in the training data.

When there are multiple hidden layers, this is called what?

A Deep Neural Net (DNN)

What is a Deep and Wide Neural Network?

A Deep and Wide Neural Network concatenates the inputs with the output of the last hidden layer, so that it learns the deep patterns through the hidden layers but also keeps the simple patterns available directly from the inputs.

What are the time steps called in RNNs?

A Frame

What is the fastest model for object detection?

A Fully Convolutional Neural Network. Then you will compute the mAP (Mean Average Precision).

All the transformations of the data can be made to be transformed in order by utilizing what?

A Pipeline

What is a Stateful RNN?

A Stateful RNN preserves the final state after processing a training batch and uses it as the initial state for the next training batch to learn long term patterns.

What is One-vs-One?

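A multiclass strategy that trains a binary classifier for every pair of classes (N × (N − 1) / 2 classifiers for N classes). Each one decides whether the prediction falls into one specific class or into another one of the classes, and the class that wins the most duels is predicted.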

What is One-vs-the-Rest?

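A multiclass strategy that trains one binary classifier per class. Each one decides whether the prediction falls into its specific class or into the rest of the classes, and the class whose classifier outputs the highest score is predicted.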

Why is it generally preferable to use a Logistic Regression classifier rather than a classical perceptron?

A classical perceptron only converges if the dataset is linearly separable, and won't be able to estimate class probabilities.

If the model performs well on both the training set and the train-dev set, but not on the validation set, then what is occurring?

A data mismatch between the training data and the validation + test data. Improve the training data to make it look more like the validation + test data.

What is a GAN?

A generative adversarial network, which is a neural network architecture composed of two parts, the generator and the discriminator, which have opposing objectives. The generator tries to produce images that look real, and the discriminator tries to discern whether an image is real or generated.

What is a labeled training set?

A labeled training set is a training set that contains the desired solution for each instance.

What is a Linear SVM?

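A linear decision boundary is drawn to separate the classes while staying as far away from the closest training instances as possible. The margin between the classes is referred to as the street.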

When would you want to add a local response normalization layer?

A local response normalization layer makes the neurons that most strongly activate inhibit neurons at the same location, but in neighboring feature maps, which encourages different feature maps to specialize and pushes them apart, forcing them to explore a wider range of features. It is typically used in the lower layers to have a larger pool of low-level features that the upper layers can build upon.

What is the output of a logistic regression called?

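The model outputs an estimated probability. The score computed before the logistic function is applied is called the logit.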

What is cross entropy?

A loss function that describes the distance between the actual class assignment, and what the algorithm assumes that it was.
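
For a single instance with one true class, the cross entropy reduces to minus the log of the probability that the model assigned to that class. A minimal sketch, with a made-up function name and made-up probabilities:

```python
import math


def cross_entropy(true_class, predicted_probabilities):
    """Cross-entropy loss for one instance: minus the log of the probability of the true class."""
    return -math.log(predicted_probabilities[true_class])


confident_and_right = cross_entropy(0, [0.9, 0.05, 0.05])
confident_and_wrong = cross_entropy(0, [0.05, 0.9, 0.05])
```

The loss is small (about 0.105) when the model gives the true class a high probability and much larger (about 3.0) when it gives it a low one. The training loss is the average of this value over all instances.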

What is a manifold?

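A d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. For example, the Swiss roll is a 2D manifold: a 2D shape that is bent and twisted in 3D space.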

What is a memory cell?

A part of a neural net that preserves some state across time steps.

What is DESCR?

A part of the dataset that describes the dataset

What is MNIST?

A popular dataset containing 70,000 small images of handwritten digits, each labeled with the digit it represents.

What is logistic regression?

A regression that estimates the probability that an instance belongs to a certain class by outputting a value between 0 and 1.

What is a pipeline?

A sequence of data processing components.

What is a GRU?

A simplified version of the LSTM

Suppose you have a daily univariate time series, and you want to forecast the next seven days. Which RNN architecture should you use?

A stack of RNNs with return_sequences=True at every layer except the top. You can then have seven neurons in the output.

What is a softmax regression?

A type of regression better suited for outputting multiple class probabilities.

What is a type of generative autoencoder?

A variational autoencoder

What is an Axon?

A very long extension similar to a dendrite in biological neurons

What does momentum optimization refer to?

A way of building friction into the learning rate. You might set: optimizer = keras.optimizers.SGD(lr=0.001,momentum=0.9) Zero is high friction, while 1 is no friction.
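
To make the friction idea concrete, here is a minimal sketch of one common form of the momentum update, applied to the toy cost f(theta) = theta^2. The function name momentum_step and the toy cost are illustrative assumptions, not part of Keras.

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    """One momentum update: beta=0 is high friction, beta near 1 is low friction."""
    m = beta * m - lr * grad   # the momentum vector accumulates past gradients
    theta = theta + m          # the parameter moves by the momentum vector
    return theta, m

# Toy problem (assumption): minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, m = 1.0, 0.0
for _ in range(1000):
    theta, m = momentum_step(theta, m, grad=2 * theta)

print(theta)  # very close to the minimum at 0
```

With beta=0 the update reduces to plain gradient descent, since no past gradient is carried over.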

What is Mercer's Theorem?

According to this theorem, if a function K(a, b) respects a few mathematical conditions called Mercer's conditions, then there exists a function φ that maps a and b into another space (possibly with much higher dimensions) such that K(a, b) = φ(a)ᵀ φ(b), the dot product of the mapped vectors.

What are the two most common forms of Boosting?

AdaBoost Gradient Boosting

What does AdaBoost stand for?

Adaptive Boosting

What are the honorable mentions for clustering algorithms?

Agglomerative Clustering BIRCH Mean Shift Affinity Propagation Spectral Clustering

What do model-based learning algorithms search for?

An optimal value for the model parameters such that the model will generalize well to new instances.

If you want to build a deep seq-to-vector RNN, which RNN layers should have return_sequences=True?

All layers except the top layer, which should have return_sequences=False

What is hard margin classification in an SVM?

All of the instances must be "off the street".

If you want to use SELU as the activation function but also want to use dropout, what sort of dropout layer should you use?

Alpha Dropout

What is batch learning?

Also known as offline learning--this type of machine learning is trained on a data set, and then put into implementation without any further learning.

What is a target key?

An array containing labels

What is a Data Key?

An arrayed part of a dataset containing one row per instance and one column per feature.

What is the tradeoff between variance error and bias error?

An increase in complexity within a model will reduce bias error, but it will increase variance error.

What is the point of using a replay buffer?

Because an agent can get locked into a specific location for a while, it will optimize its strategy for that location and forget strategies learned elsewhere. A replay buffer samples from past experiences to keep these fresh.

How does DBSCAN define clusters?

As continuous regions of high density

How does LOF work for anomaly detection?

It compares the density of instances around a given instance to the density around its neighbors. An anomaly is often more isolated than its k nearest neighbors.

If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

Training complexity is O(n * m log(m)). So when the training set size is multiplied by 10, the training time is multiplied by K = (n * 10m * log(10m)) / (n * m * log(m)) = 10 * log(10m) / log(m). With m = 1,000,000, K is about 11.7, so roughly 11.7 hours.
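
The arithmetic can be checked with a short sketch. It assumes only the O(n * m log(m)) cost model and a fixed number of features n; the helper name training_time_ratio is hypothetical.

```python
import math

def training_time_ratio(m_old: int, m_new: int) -> float:
    """Ratio of O(n * m * log(m)) costs when only the instance count m changes."""
    return (m_new * math.log(m_new)) / (m_old * math.log(m_old))

# One hour for 1 million instances, scaled up to 10 million instances.
ratio = training_time_ratio(1_000_000, 10_000_000)
print(f"Estimated training time: {ratio:.1f} hours")
```

The base of the logarithm does not matter here because it cancels out in the ratio.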

When you don't know what values to use for the hyperparameters in the grid search, what should you try?

Consecutive powers of ten.

What are Cell Bodies?

The cell body contains the nucleus and most of the cell's complex components in biological neurons.

If your model performs great on the training data but generalizes poorly to new instances, what is happening?

It is overfitting the data

Which neural network architecture could you use to classify videos?

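One option is to take, say, one frame per second, run every frame through the same CNN, and feed the resulting sequence to a sequence-to-vector RNN that outputs the class probabilities.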

What is one effective way of setting similarity features in the data?

Create a landmark at every instance.

How can you evaluate data mismatch?

Hold out part of the training data as a train-dev set. If the model performs well on the train-dev set but poorly on the validation set (which resembles the data it will see once deployed), the gap in performance comes from data mismatch.

What does tf.TensorArray allow you to do?

Create lists of tensors. All tensors they contain must have the same shape and data type

What are the typical features of a dataset?

DESCR, Data Key, and Target Key

What does tf.SparseTensor allow you to do?

Efficiently represent tensors containing mostly zeros.

When should you use masking layers and automatic mask propagation?

For more complex models such as when you need to mix Conv1D layers with recurrent layers.

In biology, neurons that trigger another neuron tend to do what?

Form a strong connection

In transfer learning, in order to give time for the upper layers to adjust, what do you usually do?

Freeze the lower layers (the ones that were transferred) for the first few epochs, so the new upper layers have time to learn reasonable weights.

What are the main differences between TensorFlow and NumPy?

Function names are not always the same, and they don't behave in exactly the same way. Also, NumPy arrays are mutable while TF Tensors are not.

By default, how many outputs does a recurrent layer output?

One. The last output in a series.

What is the most common step function in ANNs?

Heaviside Step Function

What type of learning algorithm relies on a similarity measure to make predictions?

Instance-based learning systems. These memorize the training data by heart, then make a prediction about a new instance by looking at its similarity to the memorized instances.

How can you create a tensor from a NumPy array?

Store the NumPy array in a variable such as a, then call tf.constant(a).

What does prefetching allow you to do?

It allows the CPU and GPU to work in parallel, where while one batch is training, another is being fetched.

What does the kernel trick accomplish?

It allows you to get the results of high degree polynomial features without having to add features.

What is Ordinal Encoding?

It assumes that the categories are numbered in order (best to worst, etc.)

What is the basic premise of how LSTM works?

It chooses what to forget and what to remember.

How does standardization work?

It subtracts the mean from all values, such that anything below the mean is a negative value, and anything above the mean is a positive value--then it divides by the standard deviation
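
As a minimal sketch of those two steps on a single feature, using only the standard library. The helper name standardize is hypothetical, and the choice of the population standard deviation is an assumption made for this example.

```python
import statistics

def standardize(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

scaled = standardize([2.0, 4.0, 6.0, 8.0])
print(scaled)  # values below the mean are negative, values above it are positive
```

After this transformation the feature has a mean of 0 and a standard deviation of 1.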

What does keras.layers.Discretization do?

It takes a continuous variable and chops it into discrete bins such as low, middle, and high.

What does the n_neighbors parameter in LLE do?

It tells you how many neighbors to find similarity amongst

Can you name four of the main challenges in ML?

Lack of Data, Poor Data Quality, Nonrepresentative Data, Uninformative Features, Excessively Simple Models, Excessively Complex Models

Why won't ANNs die out this time in history?

Large Amounts of Data Better Compute Power Improved Algorithms Diminished Fear of Local Optima Problems Lots of Funding and Progress

What is the typical form of normalization in RNNs and how does it work?

Layer Normalization. It works by normalizing across features, rather than across the batch.

What is manifold learning?

Dimensionality reduction that works by modeling the manifold on which the training instances lie.

What is one of the most important parameters for online learning?

Learning Rate

What does Lasso in lasso regression stand for?

Least Absolute Shrinkage and Selection Operator Regression

How does Gradient Boosting work?

Like AdaBoost, Gradient Boosting sequentially adds predictors to an ensemble, each one correcting its predecessor, but tries instead to fit the new predictor to the residual errors made by the previous predictor.

If the model is performing well on the training set but not on the train-dev set, what is occurring?

Likely overfitting the training set

What does LLE stand for?

Locally Linear Embedding

What are examples of algorithms that can only handle two classes?

Logistic Regression and Support Vector Machines

What activation function is typically used by MLPs?

Logistic Sigmoid Functions

Layers close to the input layer are called what?

Lower layers

How would you define Machine Learning?

ML is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.

Can you name four types of problems where Machine Learning shines?

ML is perfect for complex problems for which we have no algorithmic solution To replace long lists of hand-tuned rules To build systems that adapt to fluctuating environments To help humans learn

If performing transfer learning, what will you want to do with the original model?

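Make a clone, or else training the new model will also change the original model. clone_model() copies the architecture but not the weights, so copy them as well: model_a_clone = keras.models.clone_model(model_a) then model_a_clone.set_weights(model_a.get_weights())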

What are the benefits of splitting a large dataset into multiple files?

Makes it easier to shuffle, and allows you to handle large datasets that can't fit in memory. Also allows you to download from multiple servers simultaneously since the file can be spread amongst several servers.

What does mAP stand for?

Mean Average Precision

What are the best ways to perform scaling?

Min-Max Scaling (Normalization), Standardization

Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

No. Neither Stochastic nor Mini-batch Gradient Descent is guaranteed to make progress at every iteration. Instead, you can save checkpoints, stop when the validation error has not improved for a while, and use the model that performed best.

Are decision trees computationally complex?

No

If an autoencoder perfectly reconstructs the inputs, is it necessarily a good autoencoder?

Not necessarily. It could be overcomplete, and therefore be copying the inputs straight over to the outputs.

Can the optimal policy change if you modify the discount factor?

Of course. If you value far down the line, you are likely to act differently in the immediate term.

What are generally good defaults for an MLP?

On the left is each hyperparameter, and the right is the default value for it. Kernel Initializer: He Initialization Activation Function: ELU Normalization: None if shallow; BatchNormalization if Deep Regularization: Early Stopping Optimizer: Momentum Optimization (or RMSProp or Nadam) Learning Rate Schedule: 1Cycle

How do you get around the destructive elements of using Convolutional Layers in Neural Networks?

One way of getting around this is to take a pre-trained CNN and turn it into a FCN. The CNN applies a stride of 32 to the input image, and then they add a single upsampling layer that multiplies the resolution by 32 to account for the stride. So the stride defines how much the input will be stretched, not the size of the filter steps. You can initialize it to perform something close to linear interpolation with Conv2DTranspose.

What is the most popular dimensionality reduction algorithm?

PCA

What are common unsupervised techniques for anomaly detection on the ML side?

PCA Fast MCD (Minimum Covariance Determinant) Isolation Forest LOF One-Class SVM

Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?

PCA can be used to reduce the dimensionality of most non-linear datasets. However, if there is no useless information, dimensionality reduction will cause information loss that harms the model.

Why might manifold learning be superior to PCA in some respects?

PCA is going to project the data onto linear surfaces--but the shape of the data might not be planar. It might be the case that the data has a bizarre shape, like the shape of a swiss roll.

What is a common form of regularization in Decision Trees that isn't represented in the hyperparameters?

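Pruning: train the tree without restrictions, then delete unnecessary nodes. A node whose children are all leaves is considered unnecessary if a chi-squared test shows that the purity improvement it provides is not statistically significant.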

How would you define Reinforcement Learning?

RL is an area of ML where agents take actions in an environment to maximize rewards.

What are afterthought Dimensionality Reduction techniques?

Random Projections, Multidimensional Scaling (MDS), Isomap, t-Distributed Stochastic Neighbor Embedding (t-SNE), Linear Discriminant Analysis (LDA)

What activation functions would you use He initialization for?

ReLU and variants

What is the main (default) activation function used by DNN layers?

ReLU. This is because it is the fastest.

True Positive Rate is another name for what?

Recall

If we want to lever our model, we should focus on what?

Recall. If we want to derisk the model, focus on precision.

What does the ROC curve stand for?

Receiver Operating Characteristic

What does ReLU stand for?

Rectified Linear Unit

Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Should you increase the regularization hyperparameter (alpha) or reduce it?

Reduce alpha

If an MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

Reduce the number of hidden layers, and the number of neurons per hidden layer.

What is it called when you constrain the model to make it simpler and reduce the risk of overfitting?

Regularization

What is the process of avoiding overfitting called?

Regularization

What type of ML algorithm would you use to allow a robot to walk in various unknown terrains?

Reinforcement Learning

What algorithm does AdaBoost utilize?

SAMME or SAMME.R

Which linear regression training algorithm can you use if you have a training set with millions of features?

Stochastic GD or Mini-batch GD, and perhaps Batch GD if the training set fits in memory. The Normal Equation and SVD approaches won't work well because their computational complexity grows quickly (more than quadratically) with the number of features.

What are examples of multiclass algorithms?

SGD, Random Forest, Naive Bayes

How can we turn SGD into a Stochastic SVM?

SGDClassifier(loss="hinge",alpha=1/(m*C))

What are the pros and cons of using a stateful RNN versus a stateless RNN?

Stateless RNNs can only capture patterns whose length is less than or equal to the size of the windows the RNN is trained on. Stateful RNNs can capture longer-term patterns. However, stateful RNNs are harder to implement and don't necessarily perform better, because consecutive batches are not independent and identically distributed (IID), and Gradient Descent is not fond of non-IID data.

How does Mini-Batch Gradient Descent differ from other types of GD?

Stochastic trains on each instance, batch trains on the entire set, but Mini-Batch trains on random subsets of the dataset.

How do you measure the performance of an RL agent?

Sum up the rewards it gets. You can also run multiple times and look at the average rewards it gets per round.

What is it called when you use a smaller stride and then use a skip connection to feed the output of a lower layer into one of the upper layers to increase the resolution?

Super Resolution

What is the algorithm that Decision Trees use?

The CART algorithm, which produces only binary trees: every non-leaf node has exactly two children. Other algorithms, such as ID3, can produce trees whose nodes have more than two children.

What is the most important layer in the Transformer architecture, and what is its purpose?

The Multi-Head Attention Layer. It allows the model to identify which words are most aligned with each other, and then improve each word's representation using these contextual clues.

What is the simplest form of ANN?

The Perceptron

How do you read the ROC Curve?

The ROC Curve goes from bottom left corner to the top right corner, and the area under the curve tells you how good your model actually is.

The size of the matrix of inputs at each layer in a CNN is called what?

The Receptive Field

What will you want to perform before using a ridge regression?

Scale the features, for example with StandardScaler. If you use polynomial features, expand them before applying the scaler.

The distance between each step in a CNN is called what?

The Stride

How does SVM regression work?

The complete opposite of SVM classification. Rather than trying to keep things off the street, it attempts to keep as many instances on the street as possible.

What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?

The depth of a well-balanced binary tree containing m leaves is equal to log base 2 of m, rounded up. So when m = 1,000,000 this should be about 20 levels.
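
A quick check of that estimate, assuming a well-balanced tree with one leaf per training instance (the helper name balanced_tree_depth is hypothetical):

```python
import math

def balanced_tree_depth(n_leaves: int) -> int:
    """Depth of a well-balanced binary tree with n_leaves leaves: ceil(log2(n_leaves))."""
    return math.ceil(math.log2(n_leaves))

depth = balanced_tree_depth(1_000_000)
print(depth)  # 20
```

In practice an unrestricted tree is rarely perfectly balanced, so the real depth may be somewhat larger.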

What is the easiest way to create a custom layer?

The easiest way to create a layer type is to create a function and then wrap it in a keras.layers.Lambda ex: keras.layers.Lambda(lambda x: tf.exp(x))

Describe two techniques to select the right number of clusters when using K-Means.

The elbow rule plots the inertia (the mean squared distance from each instance to the nearest centroid) as a function of the number of clusters, and finds the point in the curve where the inertia stops dropping fast ("the elbow"). Another approach plots the silhouette score (the mean silhouette coefficient over all instances) as a function of the number of clusters.

How do filters in a CNN work?

A filter is a small matrix of weights that slides over the input. At each position, the inputs in the receptive field are multiplied by the filter's weights and summed. Different weights make the layer respond to different features (such as edges), and the weights are learned during training.
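
As a rough illustration of a filter sliding over an image, here is a pure-Python sketch with stride 1 and no padding. The tiny image and the hand-picked edge filter are assumptions for the example; in a real CNN the filter weights are learned.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        output.append(row)
    return output

# A 3x4 image that is dark (0) on the left and bright (1) on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_kernel = [[-1, 1]]  # responds where brightness increases from left to right

feature_map = convolve2d_valid(image, edge_kernel)
print(feature_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The feature map is nonzero only at the column where the vertical edge sits, which is what detecting a feature means here.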

If we are not using a CNN on images, what do we want to do to the pixels in a basic Fully Connected Neural Network?

The first layer should be a keras.layers.Flatten(input_shape=[28,28]) to convert the 28 x 28 array to a single array.

The error rate on new cases is called what?

The generalization error or out-of-sample error

What does precision measure?

Of all the instances the model predicted as belonging to a certain class, the fraction that actually did: TP / (TP + FP).

In an MLP, what is the only layer that does not have a bias neuron?

The output layer

What is standard deviation?

The square root of the variance, which is the average of the squared deviations from the mean.

What is the boundary in SVM referred to as?

The street

How long must the input length of an RNN be?

There is no defined length. The input can be as long as you wish at every prediction.

What are Dendrites?

These are the branching extensions that communicate to the cell body in biological neurons

What symbol is used to represent the constant in a regression model?

Theta sub zero

What does it mean to pretrain on auxiliary data?

This involves training a network on another task which will help it to do the task you truly want to do. For instance, if you wanted it to recognize specific faces, but you only had a few photos, it might make sense to make a neural network that recognizes whether two faces are the same face, as this will give it a good grounding in recognizing facial features. You can then train it for your desired task.

How does a Voting Classifier work?

A hard voting classifier aggregates the predictions of several classifiers and predicts the class that gets the most votes.
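
A minimal sketch of majority voting over the predictions of several classifiers for one instance. The class labels are made up for illustration, and hard_vote is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class predicted most often; ties go to the class seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Predictions from three hypothetical classifiers for the same instance.
winner = hard_vote(["cat", "dog", "cat"])
print(winner)  # cat
```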

What kind of models is Gradient Clipping usually used in?

This is mostly used in recurrent neural networks, where Batch Normalization is tricky to use. For other types of networks, Batch Normalization is usually enough.

What is the TFRecord Format?

This is the preferred format for storing large amounts of data and reading it efficiently.

What is a Dying ReLU?

This is where some neurons stop outputting anything other than 0.

What is semi-supervised learning?

This is where the system trains on partially labeled data. Deep Belief Networks (DBNs) are examples.

What is the fundamental idea behind Support Vector Machines (SVM)?

To fit the widest possible "street" between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. With soft margin, it looks for a compromise between separating all instances and having widest possible street.

What are the samples that a system uses to learn called?

Training Set

What is a simple way of vectorizing text?

Transform the word into the log of the words frequency.

Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?

Trick question. It depends on the dataset.

Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

Two logistic regression since these are not exclusive classes.

What is a Dense Layer?

When all neurons in a layer are connected to all neurons from the previous layer.

When does k-means perform poorly?

When the data is non-spherical in shape.

If your pipeline is the bottleneck, what can you do to fix it?

You can fix by making sure it reads and preprocesses the data in multiple threads in parallel, and prefetches a few batches.

By turning bootstrap_features=True in a bagging classifier, what can be accomplished?

You can have it sample a random set of features, rather than a random set of instances.

How can you evaluate the performance of an autoencoder?

You can measure the reconstruction loss (e.g, compute the MSE, or the mean square of the outputs minus the inputs)

Suppose you want to train a classifier, and you have plenty of unlabeled training data but only a few thousand labeled instances. How can autoencoders help? How would you proceed?

Train an autoencoder on the full dataset (labeled plus unlabeled), reuse its lower half (the encoder) as the lower layers of the classifier, then train the classifier using the labeled data.

What is Semantic Segmentation?

A form of classification that seeks to classify each pixel by what class / object it belongs to.

What is a kernel function?

A function capable of computing the dot product based only on the original vectors a and b, without having to compute (or even know about) the transformation.

What does GMM stand for?

Gaussian Mixture Model

What is the algorithmic process in neural nets (very similar to gradient descent) that corrects weights and biases?

Backpropagation

Name a few common techniques you can use to encode text.

Bag-of-words, n-grams (taking a sequence of words), word embeddings.

Instead of running fit and then transform, what should you run?

.fit_transform()

Some sklearn packages can predict with what method?

.predict()

What is the decision threshold in an SVC?

0

What is the typical architecture of Regression MLPs?

1. An input neuron for every feature. 2. 1 - 5 Hidden Layers with ReLU or SELU Activation and 10 - 100 Neurons Per Layer 3. One Output Neuron per Prediction Dimension with No Activation Function or ReLU/Softplus (if positive outputs) or logistic/tanh (if bounded outputs) Loss Function: RMSE, MAE / Huber Loss (if outliers)

What are three challenges in Decision Trees?

1. Angle of the data thresholds. 2. Sensitivity to small variations in the data. 3. That unless you set the random state, you will get a different prediction every time. This is where random forests can be useful for averaging all of these predictions.

What are the advantages of a CNN over a fully connected DNN for image classification?

1. Fewer parameters, which makes it faster to train, reduces the risk of overfitting, and requires less training data. 2. When the kernel has learned a feature, it can detect it anywhere in the image. 3. Unlike a DNN, which has no prior knowledge of pixel organization, a CNN's architecture embeds this knowledge: lower layers detect features in small areas and higher layers combine them, which works well with most natural images.

How do you use SELU for vanishing/exploding gradients?

1. Input features must be standardized (mean 0 and standard deviation 1) 2. Every hidden layer's weights must be initialized with LeCun normal initialization: kernel_initializer="lecun_normal" 3. Network must be sequential. There can be no skip connections such as in Wide & Deep nets. 4. All layers must be dense.

What are the two conflicting goals in SVM?

1. Making the weights as small as possible to widen the margin 2. Making the slack variables as small as possible to reduce margin violations. This tradeoff is moderated by C.

What are the main difficulties when training GANs?

1. Mode collapse, where the generator produces outputs with very little diversity. 2. Training instability. 3. Sensitivity to hyperparameters.

Can you think of three possible application of RL not mentioned in the book?

1. Music personalization 2. Marketing 3. Product Delivery

If your GPU runs out of memory while training a CNN, what are the five things you could try to solve the problem?

1. Reduce the mini-batch size. 2. Reduce dimensionality using a larger stride in one or more layers. 3. Remove one or more layers. 4. Use 16-bit floats instead of 32-bit floats. 5. Distribute the CNN across multiple devices.

Can you name six other data structures available in TensorFlow, beyond regular tensors?

1. Sparse Tensors 2. Tensor Arrays 3. Ragged Tensors 4. Queues 5. String Tensors 6. Sets Last two represented as regular tensors.

What are the two main problems faced by RNNs?

1. Unstable gradients, which can be relieved using things like recurrent dropout and layer normalization. 2. A limited short-term memory, which can be extended using LSTM and GRU cells.

What are the regularization hyperparameters in decision trees?

1. max_depth 2. min_samples_split 3. min_samples_leaf 4. min_weight_fraction_leaf 5. max_leaf_nodes 6. max_features

When were the first ANNs created?

1943

If you have an input layer of 20 neurons, a hidden layer with 10, and an output of 1, and all the layers are fully connected, how many parameters are in the model?

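(20 x 10 + 10) + (10 x 1 + 1) = 221: 210 connection weights plus 11 bias terms (one per hidden neuron and one for the output neuron).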
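
Counting layer by layer, under the assumption that every neuron in a dense layer has one weight per input plus one bias term, can be sketched like this (dense_param_count is a hypothetical helper):

```python
def dense_param_count(layer_sizes):
    """Total weights and biases in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # connection weights plus one bias per neuron
    return total

total = dense_param_count([20, 10, 1])
print(total)  # 221 = (20*10 + 10) + (10*1 + 1)
```

Without bias terms the count would be 210, the number of connection weights alone.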

What is a good breakdown of training data and test data?

80% Training Data and 20% Test Data

What is a Stateless RNN?

A Stateless RNN starts fresh at each training iteration.

Each parameter is known as what?

A degree-of-freedom

What does projection refer to?

A dimensionality reduction technique, where you squash higher dimensional data into lower dimensions, such as taking 3D data and making it 2D.

What is the discount factor?

A discount factor is a measure of how much the model values victory in the future versus victory now.

What is the main technical difficulty of semantic segmentation?

A lot of the spatial information gets lost in a CNN as the signal flows through each layer.

Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

A max pooling layer has no parameters, whereas a convolutional layer has quite a few.

What is a hyperparameter?

A parameter of the learning algorithm rather than the model.

What are the undercomplete and overcomplete autoencoders?

An undercomplete autoencoder is one whose codings layer is smaller than the input and output layers. If larger then it is overcomplete.

What does ANN stand for?

Artificial Neural Network

How do RNNs work at a basic level?

At each time step (also called a frame) the recurrent neuron receives the inputs as well as its own output from the previous time step. Each neuron has two sets of weights: one for the inputs and one for the outputs of the previous time step.

How are Batch Gradient Descent and Stochastic Gradient Descent different?

BGD computes the gradient over the entire training set at every step, whereas SGD computes the gradient from a single randomly picked instance at each step.

How does BPTT work?

BPTT first propagates forward through the unrolled network, and the output is evaluated using a cost function. Then the gradients of that cost function are propagated backward through the unrolled network, and the model parameters are updated using the gradients computed during BPTT. The gradients flow backward through all the outputs used by the cost function.

What is the difference between backpropagation and reverse-mode autodiff?

Back propagation refers to the whole process of computing gradients and applying them, where autodiff is just an effective method of computing gradients.

ReLU suffers from something known as ______________________.

Dying ReLUs

What is the broadest overview of a CNN architecture?

Typically, one or more convolutional layers are followed by a pooling layer, and this pattern repeats. We generally double the number of filters after each pooling layer. After several such blocks, the output is fed to a flattening layer, and from that point forward it is a typical fully connected architecture. It is common to use dropout layers in the dense part to regularize.

What is Early Stopping?

Early stopping halts training when the validation performance stops improving, which avoids overfitting. If you set restore_best_weights=True, the model rolls back to the weights from the best epoch.

Generally which is preferred--Lasso regression or Elastic Net?

Elastic Net, because Lasso can behave erratically when the number of features is greater than the number of training instances, or when several features are strongly correlated.

What is IPCA used for?

Essentially batch dimensionality reduction, for datasets that won't fit in memory.

What does ELU stand for?

Exponential Linear Unit

How does exponential scheduling work for learning rate decay?

Exponential scheduling: the learning rate drops by a factor of 10 every s steps. def exponential_decay_fn(epoch): return 0.01 * 0.1**(epoch / 20)
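
A self-contained sketch of a schedule that drops by a factor of 10 every s epochs. The initial rate lr0=0.01 and s=20 are example values, and the wrapper name exponential_decay is an assumption made for this illustration.

```python
def exponential_decay(lr0: float, s: float):
    """Build a schedule where the learning rate drops by a factor of 10 every s epochs."""
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

for epoch in (0, 20, 40):
    print(epoch, exponential_decay_fn(epoch))  # roughly 0.01, 0.001, 0.0001
```

A function with this signature (epoch in, learning rate out) is the kind of callable that can typically be handed to a learning rate scheduler callback.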

How do we calculate F Score?

F = 2 * (precision*recall)/(precision + recall)
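
To see the formula at work, here is a small sketch that derives precision, recall, and the F score from confusion-matrix counts. The counts (80 true positives, 20 false positives, 40 false negatives) are made-up example values.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and their harmonic mean (the F1 score)."""
    precision = tp / (tp + fp)  # how many predicted positives were correct
    recall = tp / (tp + fn)     # how many actual positives were found
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the F score is a harmonic mean, it stays low unless both precision and recall are high.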

What are Decision Trees well suited for?

Finding complex non-linear relationships

What does the window.batch function do?

Flattens the dataset into tensors with a dimension size equal to the window_length.

When you split the training set into small subsets, these are known as what?

Folds

What is one strategy for getting around the computational complexity of similarity features?

Gaussian RBF Kernel

What does Gaussian RBF stand for?

Gaussian Radial Basis Function

Should you preprocess before or during training?

Generally it is a better option to do it before to speed up training.

Why would you want to use Elastic Net instead of Lasso?

Generally less erratic than Lasso. If you want less erratic Lasso, use l1_ratio close to 1.

Does Dropout apply to predictions?

Generally no--unless you make it apply.

If you call repeat() on a shuffled dataset, it will automatically do what?

Generate a new order at every iteration.

What are ways you might want to clean up "poor data"?

Get rid of outliers and instances where a few features are missing.

How do we generally solve the Vanishing/Exploding Gradients Problem?

Glorot, He, or LeCun Initialization

Why would you want to use 1D convolutional layers in an RNN?

Good for parallelization. Also, since not recurrent, it suffers less from exploding or vanishing gradients.

What is Gradient Descent?

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a cost function using gradient descent, one takes steps proportional to the negative of the gradient (partial derivative or tangent) of the function at the current point.

What does hard clustering vs soft clustering do in K-Means?

Hard clustering assigns each instance to a single cluster, whereas soft clustering gives each instance a score per cluster, such as its distance to each centroid. Soft clustering can also be used as a dimensionality reduction technique.

When generating new text, the closer the temperature is to zero in the RNN, the more it will favor what?

High probability characters

How does DBSCAN define a core?

An instance is a core instance if it has at least min_samples instances (including itself) within its epsilon-neighborhood.
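
The core-instance test can be sketched in a few lines, assuming Euclidean distance and counting the instance itself as one of its own neighbors. The points and the helper name is_core_instance are illustrative.

```python
import math

def is_core_instance(index, points, eps, min_samples):
    """True if at least min_samples points (itself included) lie within eps."""
    neighbors = sum(1 for p in points if math.dist(points[index], p) <= eps)
    return neighbors >= min_samples

points = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(is_core_instance(0, points, eps=1.5, min_samples=3))  # True: dense region
print(is_core_instance(3, points, eps=1.5, min_samples=3))  # False: isolated point
```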

Say you've trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease C?

Increase C. If it underfits, there might be too much regularization; to reduce it, increase gamma, or C, or both.

Say you've trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)?

Increase gamma. If it underfits, there might be too much regularization; to reduce it, increase gamma, or C, or both.

Why might you compress a record?

If it will need to be called over a network connection

What is the basic operation of a Heaviside Step Function?

It outputs 0 if the weighted sum of the inputs (plus the bias term) is negative, and 1 if it is zero or positive. The variant that outputs -1, 0, or 1 is the sign function, which is sometimes used instead.
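
A minimal sketch of the two step functions and a single threshold unit that uses one of them. It assumes the common convention that the Heaviside function returns 1 at exactly zero; the weights and bias are example values.

```python
def heaviside(z: float) -> int:
    """Heaviside step: 0 for negative input, 1 otherwise (convention assumed here)."""
    return 1 if z >= 0 else 0

def sign(z: float) -> int:
    """Sign function: -1, 0, or +1."""
    return (z > 0) - (z < 0)

def threshold_unit(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the Heaviside step."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return heaviside(z)

print(threshold_unit([1, 0], weights=[0.5, 0.5], bias=-0.4))  # 1
print(threshold_unit([0, 0], weights=[0.5, 0.5], bias=-0.4))  # 0
```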

When might you use Mean Absolute Error (MAE) instead of Root Mean Squared Error (RMSE)?

If there are a lot of outliers.

How might you perform unsupervised pretraining?

If you don't have a lot of labeled data, you can build an autoencoder, then keep the encoder layers, and perform the same freeze on the lower layers that you would perform in a transfer learning task once you take the encoder and implement it as the lower layers of your new MLP.

What are the risks of an overcomplete autoencoder?

It may just copy the inputs to the outputs, without learning any useful features.

Can you think of a few applications of a Vector-to-Sequence RNN?

Image captioning, creating a music playlist based on an embedding of the current artist, generating a melody based on a set of parameters, locating pedestrians in a picture.

How does cross entropy work?

Imagine there are three classes: A, B, and C. The instance really is B, so the target probability for B is 100%, but the algorithm predicts that the probability of it being B is 70%. The cross-entropy loss for this instance is -log(0.7), roughly 0.36; it shrinks toward zero as the predicted probability of the correct class approaches 100%. Training attempts to minimize this loss.
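
A minimal sketch of the cross-entropy computation for the three-class example, assuming a one-hot target for class B and hypothetical predicted probabilities of 0.2, 0.7, and 0.1.

```python
import math

def cross_entropy(target_probs, predicted_probs):
    """Cross entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

target = [0.0, 1.0, 0.0]        # classes A, B, C; the instance is B
predicted = [0.2, 0.7, 0.1]     # hypothetical model output
loss = cross_entropy(target, predicted)
print(round(loss, 4))           # 0.3567
```

With a one-hot target, only the predicted probability of the correct class contributes, so the loss reduces to -log(0.7).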

How do you change the name of a model?

In the Sequential function, put in name='Name of Model'

In lieu of splitting off a portion of the data for validation, what can you do in an MLP?

In the model.fit() pass validation_split=0.1, which will take the last ten percent of the training data, and will turn it into validation data.

What are the main challenges of Machine Learning?

Insufficient Quantity of Data Nonrepresentative Training Data Poor Quality Data Irrelevant Features Overfitting the Training Data Underfitting the Training Data

What are the main innovations in ResNet?

Introduction of skip connections, which make it possible to go well beyond 100 layers.

What are some of the approaches to manifold learning?

Isomap Locally Linear Embedding Laplacian Eigenmaps Semidefinite Embedding

In a ridge regression, what happens the more that you elevate alpha?

It brings the weights of the variables closer and closer to zero.

What are the benefits of dropout layers?

It can boost resiliency within the network and often improves the model's predictive performance, because the network learns to generalize better when it cannot rely on any single neuron.

What is one of the problems with Similarity Features?

It can be very computationally expensive.

How can feature importance in Random Forests be useful?

It can help you perform feature selection.

Why do Random Forests Make it easier to predict which variables are most important?

It can look at how much the tree nodes that use that feature reduce impurity.

Why does AdaBoost not scale well?

It can't be parallelized, because each predictor can only be trained after the previous predictor has been trained and evaluated.

How does an SVM make its decision about a boundary line?

It chooses the line that provides the max distance between the support vectors (the two closest points of opposing classes).

Why would you tie weights in a stacked autoencoder?

It cuts the parameters in half, and helps it to converge faster.

Does MC dropout slow down training and inference?

It does slow down training, but it is not the dropout itself that slows down inference much; rather, it is the fact that you must make the prediction multiple times (usually at least 10 times).

How did GoogLeNet efficiently process images?

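It used inception modules, in which several convolutional layers with different kernel sizes process the same input in parallel and their outputs are fed to a concatenation layer. Layers with 1 x 1 kernels reduce the number of feature maps, which keeps the parameter count low.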

What is early stopping?

It is a form of regularization that works by stopping training as soon as the validation error stops improving (reaches a minimum).

What is a confusion matrix?

It is a matrix that looks at the number of times that one class is confused for another class.

What is Elastic Net?

It is a mix of both ridge and lasso regression. It is equivalent to ridge when r=0 and lasso when r=1.
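
A small sketch of the Elastic Net regularization term, assuming the common formulation in which the mix ratio r weights the l1 penalty and (1 - r)/2 weights the squared l2 penalty. The weights and alpha are hypothetical.

```python
def elastic_net_penalty(weights, alpha, r):
    """Regularization term: r * alpha * l1 + (1 - r) / 2 * alpha * l2 squared."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return r * alpha * l1 + (1 - r) / 2 * alpha * l2_squared

weights = [1.0, -2.0]                                      # hypothetical model weights
ridge_like = elastic_net_penalty(weights, alpha=0.1, r=0)  # pure ridge penalty
lasso_like = elastic_net_penalty(weights, alpha=0.1, r=1)  # pure lasso penalty
mixed = elastic_net_penalty(weights, alpha=0.1, r=0.5)
print(ridge_like, lasso_like, mixed)
```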

What is a generative model?

It is a model capable of randomly generating outputs that resemble the training instances.

What is a grid search / cv search?

It is a model selection tool that tries different combinations of hyperparameters, evaluating each with cross-validation, to see what works best.

What is Monte Carlo Dropout?

It is a prediction technique that involves stacking multiple predictions on the same instance by a Dropout Model, as the dropout will force different neurons to be activated at each iteration, leading to a broader array of predictions. This can also lead to capturing information like the standard deviation.

What is the silhouette coefficient?

It is a score between -1 and +1 indicating how good a fit a given instance is for the cluster it has been assigned to: -1 indicates that it was probably assigned to the wrong cluster, and +1 indicates that it is a very good fit for its cluster. A value near 0 means it sits close to a cluster boundary.
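
A minimal sketch of the silhouette coefficient for a single instance, assuming the standard definition (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the nearest other cluster. The one-dimensional points are made up.

```python
def mean_distance(x, others):
    """Mean absolute distance from x to a list of 1D points."""
    return sum(abs(x - o) for o in others) / len(others)

def silhouette(x, same_cluster_others, nearest_cluster):
    """Silhouette coefficient of x: (b - a) / max(a, b)."""
    a = mean_distance(x, same_cluster_others)   # cohesion within its own cluster
    b = mean_distance(x, nearest_cluster)       # separation from the next cluster
    return (b - a) / max(a, b)

well_placed = silhouette(2.0, [1.0, 3.0], [10.0, 11.0])
misplaced = silhouette(6.0, [1.0, 3.0], [5.0, 7.0])
print(well_placed, misplaced)
```

The first instance sits in the middle of its own cluster and scores close to +1; the second is much closer to the other cluster and scores negative.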

What is an attention mechanism and how does it help?

It is a technique initially used in Encoder-Decoder models to give the decoder more direct access to the input sequence. The main benefit is that the model can successfully process longer input sequences. In addition, the alignment scores make the model easier to debug, since you can find what the model was focusing on when it made a mistake.

What is a decision function?

It is a function that computes a score for each instance; the classifier assigns the instance to the positive class if the score is above a chosen threshold.

What is soft margin classification in an SVM?

It is a version of SVM that allows instances on the street as a regularization technique to avoid overfitting.

What is an OOV Bucket?

It is an Out-of-Vocabulary bucket, to which words that are not included in our dictionary are assigned, as they are "out of the model's vocabulary".

What is bagging?

It is an ensemble method where you train the same algorithm on different random subsets of the training data (sampled with replacement), such that each one becomes its own independent model.

What is a drawback of LLE?

It is computationally complex, and therefore, doesn't scale well to larger datasets.

What is a drawback of Mean Shift for clustering?

It is computationally expensive and doesn't scale well to large datasets.

How does Agglomerative Clustering work?

It is essentially Hierarchical Clustering where you continue to merge the nearest nodes into a super node.

Is scaling required for Gradient Descent?

It is highly advised.

Can you name the main innovations in AlexNet, compared to LeNet-5?

It is much larger and deeper, and it stacks convolutional layers directly on top of each other instead of stacking a pooling layer on top of each convolutional layer.

What does the learning trajectory for SGD look like?

It is much more erratic, since it computes the gradient on a single instance at each step, causing the loss to jump all over the place in pursuit of the minimum.

What is the credit assignment problem? When does it occur?

It is the fact that when an RL agent receives a reward, it has no direct way of knowing which of its previous actions contributed to this reward. It typically occurs when there is a large delay between an action and the resulting reward.

What is a downside of max pooling layers?

It is very destructive. When you only keep the max input from the matrix, you are losing a lot of the information carried by the other pixels.

What are ensemble learning methods, and why are they useful?

It is when you combine the predictions of different models in order to aggregate the opinions of all the models. Much like asking 1000 people a question often yields a reliable answer, the same is true of combining models.

What is an off-policy RL algorithm?

It learns the value of the optimal policy while the agent follows a different policy. Q-Learning is an example. On policy algorithms learn the value of the policy that the agent actually executes, including both exploration and exploitation.

How does the CART algorithm work?

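It splits the training set into two subsets using a single feature (k) and a threshold (t), choosing the pair that produces the purest subsets, that is, the lowest gini impurity weighted by subset size. It then splits each subset recursively in the same way.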
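
A rough sketch of the quantity CART minimizes when it picks a split, assuming the usual cost: the gini impurity of each subset weighted by its share of the instances. The class counts are hypothetical.

```python
def gini(class_counts):
    """Gini impurity of a node given the number of instances in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def split_cost(left_counts, right_counts):
    """Size-weighted gini impurity of the two subsets produced by a split."""
    m_left, m_right = sum(left_counts), sum(right_counts)
    m = m_left + m_right
    return m_left / m * gini(left_counts) + m_right / m * gini(right_counts)

pure_node = gini([50, 0, 0])       # all instances belong to one class
mixed_node = gini([0, 49, 5])
cost = split_cost([50, 0, 0], [0, 50, 50])
print(pure_node, mixed_node, cost)
```

A full implementation would evaluate this cost for every candidate feature and threshold and keep the lowest one.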

What is the risk of an excessively undercomplete autoencoder?

It may fail to reconstruct the inputs.

How does a lasso regression work?

It regularizes the cost function using the l1 norm of the weight vector instead of half the square of the l2 norm. It tends to set the weights of the least important features to zero.

What is one of the drawbacks of PCA against IPCA?

It requires that the dataset be able to fit in memory

How does ridge regression work?

It seeks to make model weights as small as possible. This regularization is only applied during training.

With RNNs, how should you set return_sequences?

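For a sequence-to-vector model, it should be set to True for every recurrent layer except the last. For a sequence-to-sequence model, it should be True for all recurrent layers.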

What is a good rule of thumb for the number of neurons at each layer of an MLP?

It should get less and less--though practice shows that technically speaking, all of the hidden layers could have the same number of neurons. Some practitioners say that the total model should have 2/3 the number of nodes in the input layer.

What is data shape?

It shows you the shape of the array in terms of dimensions. For instance, each MNIST image is 28 x 28, meaning that it is an array containing 28 arrays of 28 values (28 rows of 28 pixels each).

How does WaveNet work?

It stacks convolutional layers and doubles the dilation rate, such that at the first layer it sees two time steps, then four, then eight, and so on.
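
A small sketch of how the receptive field grows when the dilation rate doubles at each layer, assuming 1D convolutions with a kernel size of 2. The helper name is hypothetical.

```python
def receptive_fields(dilation_rates, kernel_size=2):
    """Number of input time steps seen by each successive dilated conv layer."""
    fields = []
    field = 1
    for rate in dilation_rates:
        # Each layer widens the receptive field by rate * (kernel_size - 1).
        field += rate * (kernel_size - 1)
        fields.append(field)
    return fields

fields = receptive_fields([1, 2, 4, 8])
print(fields)   # [2, 4, 8, 16]
```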

What is one of the main challenges of online learning?

If there is bad data in the system, its performance will gradually decline. It can be helpful to implement anomaly detection to avoid this.

What is One-Hot-Encoding?

It treats every category as its own binary variable, coded 1 if the category is present and 0 if it is not.

What does SVM Classification try to accomplish?

It tries to make the street as wide as possible while avoiding margin violations.

How is SAMME.R different than SAMME?

It uses soft predictions rather than hard predictions. This generally performs better.

What is the main limitation for decision trees with respect to the angle of predictions?

Decision trees make splits perpendicular to a feature axis, so all decision boundaries are axis-aligned. When the ideal boundary runs at a 45 degree angle, the tree can only approximate it with a staircase of splits, which makes it sensitive to rotation of the training set. This can often be mitigated using PCA.

What are the characteristics of online learning with high learning rates?

It will adapt quickly to new data but will forget old data.

What are the characteristics of online learning with low learning rates?

It will adapt slowly, but will be less sensitive to noise in the data.

What will happen if you elevate the degree in polynomial regression too high?

It will overfit.

Why wouldn't you make all Keras models dynamic?

It will slow down training and inference, and you will not have the possibility to export the computation graph, which will limit your model's portability.

In Gradient Descent, what happens if the variables are very different in scale?

It will take a long time to converge. This can be solved by using the StandardScaler().

What happens if the learning rate of gradient descent is too low?

It will take too long to converge to the local minimum

How do l1 and l2 regularization work in neural networks?

It works by adding the l1 or l2 regularization term to the cost function.

How does LLE work?

It works by measuring how each training instance linearly relates to its closest neighbors, and then looks for a low-dimensional representation of the training set where these local relationships are best preserved.

Why not use your own protobuf definition?

It's more complicated.

What is the simplest clustering algorithm?

K-Means

What are the most important clustering algorithms?

K-Means DBSCAN Hierarchical Clustering Analysis

What are the most popular MLP Packages?

Keras, TensorFlow (Google), and PyTorch (Facebook)

How do GMMs work?

Learn more about

What are white box models?

Models that make decisions based on things we understand--such as Decision Trees.

What are black box models?

Models where it is very difficult to explain why the model made the decision, even if the decision was correct. Neural Networks and Random Forest are examples of this.

What does MC Dropout stand for?

Monte Carlo Dropout

What ended up quelling the fears scientists had about the problems related to Perceptrons?

Multi-Layer Perceptrons. They dismissed Perceptrons not realizing that the MLP would fix almost all of those fears.

What sort of system might you use to remove noise from images?

Multi-Output Classification

What is a fully convolutional network?

Nets made of only convolutional layers. Useful for object detection and semantic segmentation.

What was the original activation function / logic of ANNs?

Neuron C might be activated if Neuron A was activated, or if Neurons A and B were activated.

Do Decision Trees require feature scaling?

No

Do we use a Flatten function in a CNN?

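Not at the input, since convolutional layers take the 2D image directly. A Flatten layer is still typically used before the dense layers at the top of the network.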

Does the Normal Equation regression operate well with a large n?

No

Does the normal equation offer out of core support?

No

Can an SVM classifier output a probability when it classifies an instance?

Not directly. But by setting probability=True, it will calibrate the probabilities using Logistic Regression on the SVM's scores, which adds the predict_proba() and predict_log_proba() methods to the SVM.

Is ReLU a good activation function for RNNs?

No, because ReLU is nonsaturating and the same weights are applied at every time step, so the outputs can grow at every step until they explode, which leads to very unstable gradients.

Does dropout slow down inference (i.e., making predictions on new instances)?

No, since it is only turned on during training.

Is TF a drop-in replacement for NumPy?

No.

Can a stateful RNN have overlapping windows?

No. Therefore the shift should be equal to n_steps. (shift=n_steps)

Are hyperparameters altered by the ML algorithm?

No. They stay constant throughout the entire training process.

What activation function is used for the output neuron of a regression DNN?

None

What activation functions would you use Glorot initialization for?

None, tanh, logistic, softmax

What is replacement in bagging and pasting?

Replacement refers to whether the same training instance can be sampled more than once for a given predictor. With replacement (bagging), the same instance may be drawn more than once (like counting fish caught in a pond, but catching the same fish multiple times because you catch and release). Without replacement (pasting), each instance can be drawn at most once per predictor.
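
A minimal sketch of the two sampling schemes using only the standard library, assuming a toy training set of ten instance indices. The seed and sample sizes are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)           # fixed seed so the sketch is repeatable
indices = list(range(10))         # stand-in for ten training instances

# Bagging: sample WITH replacement, so an index may appear more than once.
bagging_sample = rng.choices(indices, k=len(indices))

# Pasting: sample WITHOUT replacement, so every index appears at most once.
pasting_sample = rng.sample(indices, k=7)

# Instances never drawn for the bagging predictor are its out-of-bag instances.
out_of_bag = sorted(set(indices) - set(bagging_sample))
print(bagging_sample, pasting_sample, out_of_bag)
```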

What are the most important Anomaly/Novelty Detection algorithms?

One-Class SVM Isolation Forest

Is it possible to use Batch Normalization in RNNs?

Only between recurrent layers.

Name a few common techniques you can use to encode categorical features.

Ordinal (Worst to best), or One-Hot Encoding.

What are the different ways that you can encode categorical variables?

Ordinal Encoding One-Hot-Encoding

When samples are not sampled at all during training of an ensemble learning method, what are these samples called?

Out of Bag (OOB)

How can you deal with variable-length input sequences?

Pad sequences so that all sequences in a batch have the same length, and use masking to ensure the RNN ignores the padding token.

What do we call it when we add zeros around an image in order to allow the CNN to scan the entire image?

Padding

If certain classes are overrepresented in the MLP data, what can you do?

Pass model.fit() a class_weight argument, which lets you give more weight to underrepresented classes and less weight to overrepresented ones when computing the loss.

What is pasting?

Pasting is just like bagging, with the exception that it is used without replacement.

In order for a linear regression to read polynomial features, what must we do?

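Preprocess X by adding powers of each feature as new features, for example with PolynomialFeatures. When there are multiple features, this also adds the combinations (products) of features up to the given degree, which lets the model capture relationships between features.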

How do we calculate the precision from the confusion matrix?

Precision = (True Positives) / (True Positives + False Positives)
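
A short worked example of the formula, with recall alongside for comparison. The confusion matrix counts are hypothetical.

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that are correct."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Share of actual positives that were detected."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical binary confusion matrix counts.
tn, fp, fn, tp = 50, 10, 5, 35
print(precision(tp, fp), recall(tp, fn))
```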

Can you think of a few applications for a sequence-to-sequence RNN?

Predicting the weather, or any other time series, machine translation, video captioning, speech to text, music generation, identifying the chords of a song.

What does PCA stand for?

Principal Component Analysis

What are the most important Visualization and Dimensionality Reduction Algorithms?

Principal Component Analysis (PCA) Kernel PCA Locally Linear Embedding (LLE) t-Distributed Stochastic Neighbor Embedding (t-SNE)

Why would you want to use the Data API?

Preprocessing large datasets can be a complex engineering challenge. The Data API makes it fairly simple.

How can you alleviate the credit assignment problem?

Provide the agent with shorter term rewards when possible.

What are typically the best errors for Regression MLPs?

RMSE MAE Huber Loss

How can you convert a dense layer into a convolutional layer?

Replace the lowest dense layer with a convolutional layer with a kernel size equal to the layer's input size, with one filter per neuron in the dense layer, and using "valid" padding. The other dense layers can be replaced by convolutional layers using 1 x 1 filters.

What does SVM stand for?

Support Vector Machine

In which cases would you want to use each of the following activation functions: SELU, leaky ReLU, tanh, logistic, and softmax?

SELU is a good default. Leaky ReLU good for quick training. ReLU is simple, and therefore often used. Also useful for outputting precisely zero. Hyperbolic Tangent can be useful in output layer if you need to output a number between -1 and 1. Logistic activation useful in the output layer to estimate probability. Softmax useful in output layer for mutually exclusive class probabilities.

What type of models does early stopping not work as well on?

SGD and MBGD as the curve isn't smooth, and it's more difficult to determine when you've reached the minimum.

How does one implement the Gaussian RBF Kernel?

SVC(kernel="rbf",gamma=5,C=0.001)

What class performs SVM regression?

SVR

What is generally a good activation function for an RNN?

Saturating activation functions such as Hyperbolic Tangent

What are the types of clustering in image segmentation?

Semantic Segmentation Instance Segmentation CNN Color Segmentation

How do you implement a dynamic Keras model?

Set dynamic=True when creating it. Alternatively, set run_eagerly=True when calling the model's compile() method.

How can we choose to make the model ignore all padding when that padding is set to zero?

Set mask_zero=True in the embedding layer

To return sequence to sequence, what must we do?

Set return_sequences=True at every layer, and then keras.layers.TimeDistributed(keras.layers.Dense(10))

What is one way of regularizing a decision tree?

Setting the max depth.

Do you get the same result with tf.range(10) and tf.constant(np.arange(10))?

Similar, but the former returns a tensor of 32-bit integers, while the latter returns a tensor of 64-bit integers.

What are similarity features?

Similarity features are features that we add to the dataset that specify how similar each instance is to a particular landmark.

How do you tie weights in a stacked autoencoder?

Simply make the decoder weights equal to the transpose of the encoder weights to reduce the number of parameters in the model by half.

How can you make SGD more likely to converge to the global minima?

Slow the learning rate over time.

What are synaptic terminals?

Small structures on the tip of a branch in biological neurons

The bagging classifier automatically uses what kind of voting system?

Soft

When you are using a voting classifier, it is often better to specify what kind of voting method?

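Soft, provided all the classifiers can estimate class probabilities. Hard voting predicts the class that gets the most votes from the individual classifiers. Soft voting averages the predicted class probabilities across the classifiers and predicts the class with the highest average, which gives more weight to highly confident votes.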
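
A small sketch contrasting the two voting schemes, assuming each classifier returns a list of class probabilities. The three probability vectors are made up so that the two schemes disagree.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def hard_vote(probas):
    """Each classifier votes for its top class; the majority class wins."""
    votes = [argmax(p) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities, then pick the highest average."""
    averages = [sum(column) / len(probas) for column in zip(*probas)]
    return argmax(averages)

# Hypothetical outputs of three classifiers for classes 0 and 1.
probas = [[0.45, 0.55], [0.45, 0.55], [0.90, 0.10]]
print(hard_vote(probas), soft_vote(probas))
```

Two lukewarm votes for class 1 win the hard vote, while the single confident vote for class 0 tips the averaged probabilities the other way.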

What is a transformer?

Some estimators (such as an imputer) can also transform a dataset; these are called transformers. The transformation is performed by the transform() method.

How do you create a polynomial kernel in an SVC?

Specify the kernel parameter as 'poly'. SVC(kernel="poly",degree=3,coef0=1,C=5)

What are the supervision categories?

Supervised Unsupervised Semi-Supervised Reinforcement Learning

How would you describe TensorFlow in a short sentence? What are its main features?

TensorFlow is an open-source library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Offers support for distributed computing, graph analysis, and deep learning libraries.

Why would you go through the hassle of converting all your data to the Example protobuf format?

TensorFlow provides some convenient operations to parse it.

What is it called when we vectorize text using the term frequency log?

Term-Frequency x Inverse-Document Frequency

When using the predict method in logistic regression, what will be returned?

The class that it believes the item belongs to. The decision boundary is 50% probability.

What is stacking?

The general idea of stacking is that you train an ensemble of predictors, but rather than using a fixed rule such as voting to aggregate their predictions, you train a model (called a blender or meta learner) on their outputs. Imagine feeding a final model the outputs of several different models, and asking it to make its decision based on what it heard from all of them.

The amount of regularization is controlled by what?

The hyperparameters

What are the main innovations in SENet?

The idea of using an SE block (a two-layer dense network) after every inception module in an inception network or residual unit in a ResNet to recalibrate the relative importance of feature maps.

What are the main innovations in GoogLeNet?

The introduction of inception modules which make it possible to have much deeper net than previous CNN architectures, with fewer parameters.

What might we use to keep instances that meet a particular criteria in a TF dataset?

The dataset's filter() method, typically with a lambda function that returns True for the instances to keep.

When chaining transformations, how do we apply a particular preprocessing transformation to a particular instance?

The dataset's map() method, typically with a lambda function that is applied to each instance.

Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

The learning rate might be too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem. Solve by reducing the learning rate. If the training error is not going up, then your model is overfitting, and you should stop training.

What basics do you enter in when you are compiling an MLP?

The loss function, the optimizer, and any extra metrics you want it to measure.

What is max_features?

The maximum number of features that are evaluated for splitting at each node in a decision tree

What is max_leaf_nodes?

The maximum number of leaf nodes in a decision tree

What is min_samples_split?

The minimum number of samples a decision tree node must have before splitting

What tool can you use to implement beam search?

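TensorFlow Addons, which includes a beam search decoder. The parameter k, called the beam width, sets how many candidate sentences are kept at each step.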

How do you calculate mAP?

Compute the maximum precision you can get with at least 0% recall, then at least 10%, 20%, and so on up to 100%, and take the mean of these maximum precisions. This is the Average Precision (AP). When there are more than two classes, compute the AP for each class, then take the mean of the APs to get the mean Average Precision (mAP).
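
A rough sketch of the computation, assuming the 11-point scheme (recall levels 0.0, 0.1, ..., 1.0) and that the precision/recall points of each class are already available. The function names and the curve values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Mean of the best precision achievable at each of 11 minimum recall levels."""
    total = 0.0
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11

def mean_average_precision(curves):
    """Mean of the per-class average precisions."""
    return sum(average_precision(r, p) for r, p in curves) / len(curves)

# Hypothetical (recalls, precisions) curves for two classes.
class_a = ([0.0, 0.5, 1.0], [1.0, 0.8, 0.5])
class_b = ([0.0, 1.0], [1.0, 1.0])
ap_a = average_precision(*class_a)
m_ap = mean_average_precision([class_a, class_b])
print(ap_a, m_ap)
```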

What is the Vanishing/Exploding Gradients Problem?

This occurs when deep neural networks suffer from unstable gradients--with different layers learning at widely different speeds, leading to either very large gradients or very small gradients.

What happens in data mismatch?

This occurs when the data that the model was trained on is not the same as the data that it is deployed on.

What is the ExtraTreesClassifier?

This operates by setting random thresholds in each node of a decision tree, rather than the ideal threshold, which can actually lead to lower variance (and unsurprisingly, shorter training time).

What is 1Cycle Scheduling?

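It starts by increasing the learning rate linearly up to a maximum about halfway through training, then decreases it linearly back down, finishing the last few epochs by dropping the rate by several orders of magnitude.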
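
A simplified sketch of a 1cycle-style schedule, assuming a linear ramp up to a maximum rate at the halfway point followed by a linear ramp back down. The final phase, where the rate is dropped even further, is omitted, and the function name and rates are hypothetical.

```python
def one_cycle_lr(step, total_steps, start_lr, max_lr):
    """Linear warm-up to max_lr at the midpoint, then linear decay to start_lr."""
    half = total_steps / 2
    if step <= half:
        return start_lr + (max_lr - start_lr) * step / half
    return max_lr - (max_lr - start_lr) * (step - half) / half

schedule = [one_cycle_lr(s, 100, 0.001, 0.01) for s in (0, 25, 50, 75, 100)]
print(schedule)
```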

How do Adam and Nadam optimizers work?

Adam combines the ideas of momentum optimization and RMSProp. Nadam is Adam plus the Nesterov trick.

How does online learning work?

This type of system works incrementally by feeding data instances sequentially in mini-batches.

What is piecewise constant scheduling?

This uses one learning rate for several epochs, then another for several epochs.
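
A minimal sketch of piecewise constant scheduling as a plain function of the epoch number. The boundaries and learning rates are hypothetical defaults, not recommended values.

```python
def piecewise_constant_lr(epoch, boundaries=(5, 15), values=(0.01, 0.005, 0.001)):
    """Return values[i] for the first boundary the epoch is below, else the last value."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]

rates = [piecewise_constant_lr(e) for e in (0, 4, 5, 14, 15, 30)]
print(rates)   # [0.01, 0.01, 0.005, 0.005, 0.001, 0.001]
```

A function like this could be handed to a learning rate scheduler callback that calls it at the start of each epoch.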

How are image tensors typically represented?

Three Dimensional (Height, Width, Color Channels) Occasionally four dimensional (Batch Size, Height, Width, Color Channels)

How many dimensions must the inputs of an RNN layer have? What does each dimension represent? What about its outputs?

Three. First is the size of the batch. Second is the number of time steps. Third holds the inputs at each time step. (For instance, if you want to process a batch containing 5 time series of 10 time steps each, with 2 values per time step, the shape will be [5, 10, 2].) The same is true of its outputs, but the last dimension is equal to the number of neurons.

ANNs were based on what?

Threshold Logic Unit (TLU) sometimes called a Linear Threshold Unit (LTU). The TLU computes a weighted sum of its inputs, then applies a step function to output results.

What does TPR stand for?

True Positive Rate

What are the main difficulties when training RNNs and how can you handle them?

Unstable gradients and limited short-term memory. To solve the gradients, use a slower learning rate and a saturating activation function like hyperbolic tangent with gradient clipping. To solve the memory, use LSTM or GRU Layers.

Because CNN's require a large amount of RAM, what can be a good idea to relieve this slightly?

Use 16-bit floats instead of 32-bit.

How do you solve Dying ReLU?

Use ReLU variants such as Leaky ReLU. You can also use randomized leaky ReLU, parametric leaky ReLU, and Exponential Linear Unit (ELU). Leaky variants generally outperform plain ReLU, and ELU often outperforms the ReLU variants, though it is slower to compute.

How do you decide on a decision threshold?

Use cross_val_predict() with method="decision_function" to get the scores, then plot precision and recall against the threshold. Set the threshold right before there is a drop in one value or the other, depending on what you are focused on.

What is Performance Scheduling?

Use the ReduceLROnPlateau callback. This will (as you might guess) reduce the learning rate when the monitored metric reaches a plateau.

What do you want to do with the data for MLP regression models (especially if the values are very different)?

Use the StandardScaler. scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_valid = scaler.transform(X_valid) X_test = scaler.transform(X_test)

How can we use PCA for anomaly detection?

Use the reconstruction error. Typically the reconstruction error is much higher for anomalies than for normal instances.

What is beam search and why would you use it?

Used to improve the performance of a trained Encoder-Decoder model. The algorithm keeps track of a short list of the k most promising output sentences, and at each decoder step it tries to extend them by one word. Allows the machine to explore several promising sentences simultaneously. Lends itself well to parallelization.

When would you need to use sampled softmax?

Used when training a classification model with many classes (e.g., thousands). It computes the cross entropy loss based on the logit predicted by the model for the correct class, and the predicted logits for a sample of incorrect classes. This speeds up training considerably.

When would you need to create a dynamic Keras model?

Useful for debugging, as it will not compile any custom component to a TF Function, so you can use any Python debugger to debug your code. Can also be useful if you want to include arbitrary Python code.

Because DBSCAN has no predict method, how do we make predictions?

Using K-Nearest Neighbors knn.fit(dbscan.components_,dbscan.labels_[dbscan.core_sample_indices_])

During training, how can you tell that your input pipeline is the bottleneck?

Use TensorBoard to visualize profiling data: if the GPU is not fully utilized, the input pipeline is likely the bottleneck.

How does Min-Max Scaling work?

Usually it takes the values and rescales the values between 0 and 1. It doesn't have to be 0 to 1 though.
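
A minimal sketch of min-max scaling for a single feature, assuming the values are not all identical (otherwise the denominator would be zero). The function name and the feature_range parameter are hypothetical.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so the minimum and maximum map onto feature_range."""
    low, high = min(values), max(values)
    new_low, new_high = feature_range
    scale = (new_high - new_low) / (high - low)
    return [new_low + (v - low) * scale for v in values]

unit_range = min_max_scale([10, 20, 30])
symmetric = min_max_scale([10, 20, 30], feature_range=(-1.0, 1.0))
print(unit_range, symmetric)
```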

How can you deal with variable-length output sequences?

Usually, if the length is known, you will configure the loss function to ignore tokens that come after the end of the sentence. If it is not known, you train the model so that it outputs an end-of-sequence token at the end of each sentence.

How might you increase the memory of an LSTM?

Utilize a 1D Convolutional Layer with a stride greater than one to break up the data. No pooling is required.

RNNs, especially Simple RNNs, are especially subject to what?

Vanishing and Exploding Gradients

PCA seeks to choose the plane that preserves the greatest what?

Variance in the data

Are SVM's sensitive to scaling?

Very. Make sure that you use the StandardScaler before training.

What are the most popular ensemble learning methods?

Voting Classifiers Bagging and Pasting Random Patches and Random Subspaces Random Forests Boosting Stacking

How do we make a classification based MLP spit out class names rather than the number of the class?

We create a list of class names containing a string for every class, then use the predicted class index to look up the corresponding name in the list, e.g. class_names[predicted_index].

How do we usually measure how well an RNN is performing?

We compare it against a baseline such as naive forecasting, which simply predicts the last value in each sequence.

How do we add polynomial features to an SVM?

We add the PolynomialFeatures transformer from sklearn.preprocessing to the pipeline, ahead of the scaler and the SVM.

When does overfitting occur?

When the model is too complex relative to the amount and noisiness of the training data and starts to fit the noise in the data

When does underfitting occur?

When the model is too simple to learn the underlying structure

Why might we want to minimize the weights in an SVM?

The borders of the margin are where the decision function equals +1 or -1. The slope of the decision function is the norm of the weight vector w, so the smaller the weights, the farther from the decision boundary the function reaches +1 or -1, and the wider the margin (its width is 2/||w||). It is like asking whether you climbed one mile in elevation over a mile-long distance, or over two: the gentler the slope, the longer the distance. So if we are trying to get a large margin between the classes, we want to minimize ||w|| as much as we can.

Can decision trees be used for classification and regression?

Yes

Do decision trees use greedy algorithms?

Yes

Does batch normalization help with vanishing / exploding gradients?

Yes

Can you specify a kernel in an SVR?

Yes. For instance, one might choose to use a polynomial kernel

Does dropout slow down training?

Yes. In general by a factor of two.

Does the Normal Equation regression operate well with a large m?

Yes. It is quite fast.

When should you create a custom layer versus a custom model?

You should distinguish internal components of your model (layers, using keras.layers.Layer) from the model itself (keras.models.Model).

What are some use cases that require writing your own custom training loop?

You should only do it if you really need to, as it is quite advanced.

How do you concatenate layer outputs?

You specify keras.layers.Concatenate() and then follow in parentheses with the object instantiated layers that you want in a list. Ex. keras.layers.Concatenate()([layer1,layer2])

When using TFRecords, when would you want to activate compression?

You will want to use compression if the TFR files need to be downloaded by the training script, as compression makes them smaller and reduces download time.

If you are looking to use a GPU for machine learning, what package will you want to download?

tensorflow-gpu

How can you get a description of the numerical values of a dataset?

Call the DataFrame's describe() method for summary statistics (count, mean, std, min, quartiles, max), or call a specific method such as mean() or max() for a single statistic.

How might we code to create word embeddings?

embedding = keras.layers.Embedding(input_dim=len(vocab)+num_oov_buckets,output_dim=embedding_dim) embedding(cat_indices) In a full model: regular_inputs = keras.layers.Input(shape=[8]) categories = keras.layers.Input(shape=[],dtype=tf.string) cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories) cat_embed = keras.layers.Embedding(input_dim=6,output_dim=2)(cat_indices) encoded_inputs = keras.layers.concatenate([regular_inputs,cat_embed]) outputs = keras.layers.Dense(1)(encoded_inputs) model = keras.models.Model(inputs=[regular_inputs,categories],outputs=[outputs])

How would you load multiple TFRecord Filepaths?

filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)
To load several files, add more filepaths to the list.

The window() method returns a nested dataset (a dataset of windows), but the RNN expects tensors. What method do you use to flatten it?

flat_map(). Ex.: dataset = dataset.flat_map(lambda window: window.batch(window_length))

How might you freeze the layers in a transfer learning system?

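for layer in model_b_on_a.layers[:-1]:
    layer.trainable = False
model_b_on_a.compile(...)
Train for a few epochs, then set layer.trainable = True on the reused layers and compile the model again. You must always compile the model after freezing or unfreezing layers.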

What is a simple way to fill empty values with specific strategies?

from sklearn.impute import SimpleImputer imputer = SimpleImputer(strategy="median")
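
A short usage sketch, assuming a tiny made-up array with missing values (the data is hypothetical). After fitting, the learned medians are available in statistics_.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical) with one missing value in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan],
              [3.0, 8.0]])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)  # learn the medians, then fill the gaps

print(imputer.statistics_)  # per-column medians of the non-missing values
print(X_filled)
```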

How do you compute LLE?

from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

What is the easiest way to compute precision and recall?

from sklearn.metrics import precision_score, recall_score
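
A brief usage sketch with made-up labels (y_true and y_pred are hypothetical). Both functions take the true labels first and the predictions second.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical binary labels and predictions.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(precision, recall)
```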

What is the equation for a fully connected layer?

h sub W,b of X = phi(XW + b)
X = the matrix of input features
W = the connection weights, except for the ones from the bias neuron
b = the connection weights between the bias neuron and the artificial neurons
phi = the activation function; when the artificial neurons are TLUs, it is a step function

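How would you compute He initialization?

Set kernel_initializer="he_normal" or kernel_initializer="he_uniform" on the layer. For He initialization with a uniform distribution based on fan_avg rather than fan_in, use the VarianceScaling initializer:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg", distribution="uniform")
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)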

What are the Autodiff APIs?

tf.GradientTape tf.gradients()

What is the input shape of an RNN that outputs a single point?

input_shape=[None,1]

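What exact method is the Keras analog of Scikit-Learn's Pipeline for MLPs?

keras.layers.PreprocessingStage([]) (described in the book as a planned API at the time of writing; check whether your Keras version provides it)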

What function should be used to implement a feedforward neural network?

keras.models.Sequential([])

Where is the distance in k-means stored?

kmeans.inertia_, which holds the sum of the squared distances between each instance and its closest centroid.
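
A small check of what inertia_ measures, assuming two synthetic blobs (the data and the hyperparameter values are illustrative). The manual sum of squared distances to the closest centroid should match the attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Squared distance from every instance to every centroid; keep the closest.
sq_dists = ((X[:, None, :] - kmeans.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = sq_dists.min(axis=1).sum()

print(kmeans.inertia_, manual_inertia)
```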

How do you compute SELU?

layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

How do you implement the custom scheduler that you have created?

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn) history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])
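
The exponential_decay_fn passed to the callback is not defined on the card. One possible definition, as a sketch: it divides the learning rate by 10 every s epochs. The values lr0=0.01 and s=20 are illustrative assumptions.

```python
def exponential_decay(lr0, s):
    """Build a schedule that returns lr0 * 0.1 ** (epoch / s)."""
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

# Assumed values: start at 0.01 and divide by 10 every 20 epochs.
exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

for epoch in (0, 10, 20):
    print(epoch, exponential_decay_fn(epoch))
```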

How would you implement an average pooling layer that looks at a 2 x 2 matrix?

avg_pool = keras.layers.AvgPool2D(pool_size=2)

How would you implement a max pooling layer that looks at a 2 x 2 matrix?

max_pool = keras.layers.MaxPool2D(pool_size=2)

How might you implement a standardization layer that subtracts the mean, divides by the standard deviation, and adds a smoothing term to avoid division by zero?

means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    [...] # other layers
])
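
The arithmetic inside the Lambda layer can be checked with NumPy alone. This sketch assumes a toy X_train with one constant column, and uses 1e-7 as a stand-in for keras.backend.epsilon().

```python
import numpy as np

# Toy training data (hypothetical); the last column is constant, so its std is 0.
X_train = np.array([[1.0, 10.0, 5.0],
                    [2.0, 20.0, 5.0],
                    [3.0, 30.0, 5.0]])

means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = 1e-7  # assumed smoothing term, standing in for keras.backend.epsilon()

# Same computation as the Lambda layer: no NaN or inf for the constant column.
X_scaled = (X_train - means) / (stds + eps)
print(X_scaled)
```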

How do you load a keras model?

model = keras.models.load_model("my_keras_model.h5")

How would you load a model that had custom features, such as a custom Huber loss?

model = keras.models.load_model("my_model_with_a_custom_loss.h5", custom_objects={"huber_fn": huber_fn})
Whenever a model contains custom objects like huber_fn, you need to pass a dictionary that maps their names to the objects.

After the MLP is created, what must you do before the model is fitted?

model.compile()

How do you call an individual layer specifically?

model.layers[number of layer] or model.layers[number of layer].name or model.layers[number of layer].get_weights()

How do you save a keras model?

model.save("my_keras_model.h5")

How do you print out all of the model's layers, including each layer's name, output shape, and number of parameters?

model.summary()

How might you take the lower layers of a model and transfer them over to a new model?

model_a = keras.models.load_model("my_model_a.h5")
model_b_on_a = keras.models.Sequential(model_a.layers[:-1])
model_b_on_a.add(keras.layers.Dense(1, activation="sigmoid"))

How do you define the number of components you want the data to be reduced to in PCA?

Set the n_components hyperparameter to the number of components you want. Ex.: PCA(n_components=2)

How do you code to build the window in an RNN Dataset?

n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
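
To see what the windows contain, here is a pure-Python stand-in (not tf.data). The helper make_windows is hypothetical, and the split of each window into inputs and targets is an assumed follow-up step based on the "target = input shifted 1 character ahead" comment.

```python
def make_windows(sequence, n_steps):
    """Split a sequence into (inputs, targets) pairs.

    Each window has n_steps + 1 items and consecutive windows are shifted by
    one item; incomplete windows at the end are dropped.
    """
    window_length = n_steps + 1
    pairs = []
    for start in range(len(sequence) - window_length + 1):
        window = sequence[start:start + window_length]
        pairs.append((window[:-1], window[1:]))  # target = input shifted by one
    return pairs

pairs = make_windows(list(range(6)), n_steps=3)
for inputs, targets in pairs:
    print(inputs, targets)
```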

How might you create a pipeline with discretization?

normalization = keras.layers.Normalization()
discretization = keras.layers.Discretization([...])
pipeline = keras.layers.PreprocessingStage([normalization, discretization])
pipeline.adapt(data_sample)
Here data_sample is a sample of the training data.

In general, it is best to increase the __________ rather than the _____________.

number of layers, number of neurons per layer

How would you compress a TFRecord?

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    [...]

What is the best way to perform PCA with at least 95% of the variance in the model accounted for by the dimensionality reduction?

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
You can then set n_components=d and run PCA again.
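
A runnable sketch of the same steps, assuming synthetic data whose features have very different variances (the data and seed are made up). np.argmax returns the index of the first True value, so d is the smallest number of components whose cumulative explained variance reaches 95%.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): 200 instances, 10 features with different scales.
rng = np.random.default_rng(0)
scales = np.array([10, 8, 6, 4, 2, 1, 0.5, 0.3, 0.2, 0.1])
X_train = rng.standard_normal((200, 10)) * scales

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)

# Index of the first component count that preserves at least 95% of the variance.
d = int(np.argmax(cumsum >= 0.95)) + 1
print(d, cumsum[d - 1])
```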

What is the easiest way to perform PCA with at least 95% of the variance in the model accounted for by the dimensionality reduction?

pca = PCA(n_components=0.95) X_reduced = pca.fit_transform(X_train)

How do you find the proportion of the dataset's variance that lies along each principal component?

pca.explained_variance_ratio_

When it comes to TensorFlow, which API makes it possible to write a single preprocessing function that can be run in batch mode on your full training set?

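TF Transform (tf.Transform). The preprocessing function is run in batch mode on the full training set before training, then exported to a TF Function and incorporated into the trained model, so it can preprocess new instances on the fly in production.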

What are some of the miscellaneous APIs?

tf.compat tf.config

What are the I/O and Preprocessing APIs?

tf.data tf.feature_column tf.audio tf.io tf.queue

What are the low-level deep learning APIs?

tf.nn tf.losses tf.metrics tf.optimizers tf.train tf.initializers

What are the APIs for visualization with TensorBoard?

tf.summary

How do you change the dataset to make a binary classifier that takes MNIST and tries to predict if something is a five or not?

y_train_5 = (y_train == 5) # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)

Does it make any sense to chain two different dimensionality reduction algorithms?

It can absolutely make sense. Often, people will apply PCA to get rid of a large number of useless dimensions, then apply a much slower dimensionality reduction algorithm, such as LLE.

Can an SVM classifier output a confidence score when it classifies an instance?

It can compute the distance from the decision boundary, which can be used as a confidence score.
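
In Scikit-Learn this score is exposed through decision_function(). A minimal sketch with a toy linearly separable dataset (the points are made up): the sign of the score gives the predicted side of the boundary, and a larger magnitude means the instance lies farther from it.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [4.0, 4.0], [5.0, 5.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

svm_clf = SVC(kernel="linear", C=1.0).fit(X, y)

# One instance near the boundary, one far away on the positive side.
X_new = np.array([[2.4, 2.4], [6.0, 6.0]])
scores = svm_clf.decision_function(X_new)
print(scores, svm_clf.predict(X_new))
```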

Why would you implement label propagation, and how?

It can greatly extend the number of labeled instances. One approach is to use a clustering algorithm like K-Means, find the most common label for each cluster, and assign that label to the unlabeled instances in the cluster.

What is an online learning system?

It can learn incrementally, as opposed to batch learning. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.

What type of algorithm would you use to segment your customers into multiple groups?

If you don't know how to define groups, use a clustering algorithm. If you do know how to define groups, use a classification algorithm.

What are the main drawbacks of reducing a dataset's dimensionality?

1. Information is lost. 2. Can be computationally intensive. 3. Adds some complexity to your pipeline. 4. Transformed features are often hard to interpret.

Name three advantages of the SELU activation function over ReLU.

1. It can take on negative values, so the average output of the neurons in any given layer is typically closer to zero than with ReLU, which helps alleviate the vanishing gradients problem. 2. It always has a nonzero derivative, which avoids the dying units issue that can affect ReLU units. 3. When the conditions are right, it ensures the model is self-normalized, which solves the exploding/vanishing gradients problem.

Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?

1. Plot the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) as a function of the number of clusters then choose the number of clusters that minimizes the BIC or AIC. 2. Use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
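
A sketch of the first technique with Scikit-Learn's GaussianMixture, assuming three synthetic blobs (the data, the range of k, and the hyperparameter values are illustrative). Instead of plotting, it collects the BIC for each k and picks the minimum.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([rng.normal(c, 1.0, (100, 2)) for c in centers])

# Fit one model per candidate number of clusters and record its BIC.
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)  # number of clusters with the lowest BIC
print(best_k, bics)
```

Replacing gm.bic(X) with gm.aic(X) gives the AIC variant.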

Name three popular activation functions for MLP.

1. Step Function 2. Logistic (Sigmoid) 3. Hyperbolic Tangent (tanh) 4. Rectified Linear Unit (ReLU) 5. ELU

What are the main motivations for reducing a dataset's dimensionality?

1. To speed up a subsequent training algorithm. 2. To visualize the data and gain insights on the most important features. 3. To save space (compression)

When tackling MNIST, how many neurons do you need in the output layer, and which activation function should you use?

10 neurons with a softmax activation function.

What is the difference between a model parameter and a learning algorithm's hyperparameter?

A model has one or more model parameters that determine what it will predict given a new instance. A learning algorithm attempts to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model.

Why would you want to use Ridge Regression instead of plain Linear Regression?

A model with some regularization generally performs better than one without any, so Ridge Regression is generally preferable to plain Linear Regression.

Who wrote Hands on Machine Learning?

Aurélien Géron

Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance?

Because it is underfitting the training set, it has high bias.

Suppose the features in your training set have very different scales. Which algorithms might suffer from this and how?

Because the cost function will have the shape of an elongated bowl, Gradient Descent will take a long time to converge. Also, regularized models may converge to a suboptimal solution if the features are not scaled: regularization penalizes large weights, so features with smaller values tend to be ignored compared to features with larger values.

In what cases would you use Kernel PCA?

Best for non-linear datasets.

In what cases would you use Randomized PCA?

Best for when you want to significantly reduce the dimensionality, and it fits in memory. Much faster than regular PCA.

If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

No. Decision Trees don't care whether the inputs are scaled or centered, so scaling the inputs won't change anything.

Is a node's Gini impurity generally lower or greater than its parent's?

Generally lower. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities.
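
A small worked example in plain Python, assuming made-up class counts for a parent node and its two children. The function names are hypothetical; the formulas are the Gini impurity and the size-weighted sum that a split is scored by.

```python
def gini(class_counts):
    """Gini impurity of a node: 1 - sum(p_k^2) over the class ratios p_k."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def weighted_children_gini(left_counts, right_counts):
    """Cost of a split: the children's impurities weighted by their sizes."""
    m_left, m_right = sum(left_counts), sum(right_counts)
    m = m_left + m_right
    return m_left / m * gini(left_counts) + m_right / m * gini(right_counts)

# Hypothetical node with 50 instances of each class, split into two children.
parent = gini([50, 50])
split = weighted_children_gini([40, 10], [10, 40])
print(parent, split)  # the weighted impurity of the children is lower
```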

How can you make Stochastic and Mini-batch Gradient Descent converge?

Gradually reduce the learning rate

Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?

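If the optimization problem is convex and the learning rate is not too high, then yes: they will all approach the global optimum and produce fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge. They will keep bouncing around the optimum.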

Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening and how do you solve it?

If the validation error is much higher than the training error, you are overfitting. Reduce the polynomial degree, or regularize the model, for example by adding an l2 penalty (Ridge) or an l1 penalty (Lasso) to the cost function. You can also increase the size of the training set.

What is the difference between anomaly detection and novelty detection?

In anomaly detection, the algorithm is trained on a dataset that may contain outliers, and the goal is to identify these outliers in the training set as well as among new instances (e.g., Isolation Forest). In novelty detection, the training set is assumed to be clean, and the goal is to detect novelties strictly among new instances (e.g., One-Class SVM).

What is a Gaussian mixture? What tasks can you use it for?

It is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. The assumption is that the data is grouped into a finite number of clusters with ellipsoidal shapes. Useful for density estimation, clustering, and anomaly detection.

What is backpropagation and how does it work?

It is a technique to train DNNs by computing the gradients of the cost function with regard to every model parameter (all the weights and biases) then performing GD using these gradients.

Once a dataset's dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?

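It is almost impossible to reverse the operation perfectly, because some information gets lost during dimensionality reduction. Some algorithms, such as PCA, have a simple reverse transformation that can reconstruct a dataset relatively similar to the original, while others, such as t-SNE, do not.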

Is it okay to initialize the bias terms to 0?

It is okay.

What is the train-dev set? When do you need it?

It is a part of the training set that is held out. It is used when there is a risk of mismatch between the training data and the data used in the validation and test datasets.

In what cases would you use vanilla PCA?

It is the default. But only works if the dataset fits in memory.

What is the purpose of a validation set?

It is used to compare models. It makes it possible to select the best model and tune the hyperparameters.

How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?

It performs well if it eliminates a lot of dimensions from the dataset without losing too much information. Measure this by applying the reverse transformation, and measure the reconstruction error.
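
A sketch of this measurement with PCA, whose inverse_transform() provides the reverse transformation. The data is synthetic and the helper reconstruction_mse is a hypothetical name; the error should shrink as more components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (hypothetical): 200 instances, 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))

def reconstruction_mse(X, n_components):
    """Reduce X with PCA, map it back, and return the mean squared error."""
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X)
    X_recovered = pca.inverse_transform(X_reduced)
    return float(np.mean((X - X_recovered) ** 2))

errors = {d: reconstruction_mse(X, d) for d in (1, 3, 5)}
print(errors)
```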

Why would you want to use Lasso instead of Ridge Regression?

It tends to push the weights of the least important features all the way to zero, leading to sparse models where all weights are zero except for the most important ones. This makes it a way to perform feature selection automatically.

What is the curse of dimensionality?

It refers to the fact that many problems that do not exist in low dimensional space arise in high dimensional space. The more dimensions that something has, the higher the risk is that instances will be far away from each other.

Why was the logistic activation function a key ingredient in training the first MLPs?

Its derivative is always nonzero, so Gradient Descent can always roll down the slope. When the activation function is a step function, Gradient Descent cannot move, because there is no slope at all.

Can you name two clustering algorithms that can scale to large datasets?

K-Means and BIRCH scale well to large datasets.

What are examples of a few clustering algorithms?

K-Means, DBSCAN, agglomerative clustering, BIRCH, Mean-Shift, affinity propagation, and spectral clustering.

If your training set contains 100,000 instances, will setting presort=True speed up training?

No, it will significantly slow it down. Presort only speeds up training if the dataset is smaller than a few thousand instances.

Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

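No. All weights should be sampled independently. The goal of sampling weights randomly is to break symmetry: if all the weights start with the same value, all the neurons in a layer remain identical during training.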

Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?

The primal form scales linearly with the number of training instances m, whereas the dual form scales between m squared and m cubed, so the primal form is the better choice here. This only applies to linear SVMs, as kernelized SVMs can only use the dual form.

Which Gradient Descent algorithm will reach the vicinity of the optimal solution the fastest?

SGD, as it only takes one training instance at a time.

Why is it important to scale the inputs when using SVMs?

SVMs try to fit the largest possible "street" between classes, so if the training set is not scaled, the SVM will tend to neglect small features.

Suppose the features in your training set have very different scales. What can you do about the problems this creates?

Scale the data before training the model.

Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?

Supervised

How would you define clustering?

The unsupervised task of grouping similar instances together.

What is out-of-core learning?

Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. They chop the data into mini-batches and use online learning techniques to learn from these mini-batches.

What is a support vector?

These are the instances located on the street, including its border. The decision boundary is entirely determined by these support vectors. Computing predictions only involves the support vectors, not the whole training set.

How do model-based learning algorithms make their predictions?

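They search for optimal values of the model parameters so that the model generalizes well to new instances. To make a prediction, the new instance's features are fed into the model's prediction function, using the parameter values found by the learning algorithm.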

What is label propagation?

This involves taking labeled instances and using these instances to label similar instances.

What is a test set, and why would you want to use it?

Used to estimate the generalization error that a model will make on new instances, before the model is launched in production.

In what cases would you use Incremental PCA?

Useful for larger datasets that don't fit in memory, but trains slower than regular PCA.

Can you think of a use case where active learning would be useful?

Useful when you have plenty of unlabeled instances, but labeling is costly.

How would you implement active learning?

Usually you use uncertainty sampling then make a human label the instances where it is uncertain.
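
One simple form of uncertainty sampling, as a sketch: rank instances by the model's highest predicted class probability and send the least confident ones to a human for labeling. The function least_confident and the probabilities are hypothetical.

```python
import numpy as np

def least_confident(probas, k):
    """Return the indices of the k instances with the lowest top probability."""
    confidence = probas.max(axis=1)
    return np.argsort(confidence)[:k]

# Hypothetical predicted class probabilities for four unlabeled instances.
probas = np.array([[0.90, 0.10],
                   [0.55, 0.45],
                   [0.60, 0.40],
                   [0.99, 0.01]])

to_label = least_confident(probas, k=2)
print(to_label)  # indices of the instances to hand to a human labeler
```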

Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function. What are the shapes of the output layer's weight vector W sub O and its bias vector b sub O?

W sub O = 50 x 3 b sub O = 3

Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function. What are the shapes of the hidden layer's weight vector W sub h and its bias vector b sub h?

W sub h = 10 x 50 b sub h = 50

Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function. What is the shape of the input matrix X?

X = m x 10 Where m represents the training batch size.

Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function. What is the shape of the network's output matrix Y?

Y = m x 3

Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function. Write the equation that computes the network's output matrix Y as a function of X, W sub h, b sub h, W sub O, and b sub O.

Y* = ReLU(ReLU(X W sub h + b sub h) W sub O + b sub O).
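
The shapes from the MLP cards above can be verified with a NumPy forward pass. This is a sketch with random weights and an arbitrary batch size m=4; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4  # batch size (arbitrary)

X = rng.standard_normal((m, 10))                          # input matrix: m x 10
W_h, b_h = rng.standard_normal((10, 50)), np.zeros(50)    # hidden layer
W_o, b_o = rng.standard_normal((50, 3)), np.zeros(3)      # output layer

def relu(z):
    return np.maximum(0, z)

hidden = relu(X @ W_h + b_h)  # shape (m, 50)
Y = relu(hidden @ W_o + b_o)  # shape (m, 3)
print(hidden.shape, Y.shape)
```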

If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

Yes, since this will constrain the model and thus regularize it.

What can go wrong if you tune hyperparameters using the test set?

You risk overfitting the test set, and the generalization error you measure will be optimistic.

