Machine learning finals
policy
The algorithm a software agent uses to determine its actions is called its ________.
multiply matrix
In the convolutional layer of a CNN, we apply the filter shown in red. What would be the new value of the cell in the center?
environment
One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working _________.
z = w1 x1 + w2 x2 + ⋯ + wn xn = x⊺ w
The TLU computes a weighted sum of its inputs
True
The feature maps of a CNN capture the result of applying the filters to an input image.
linear binary classification
A single TLU can be used for simple __________ (like a Logistic Regression or linear SVM classifier).
Return for Action B = 10 + 0.5 * (-30) + (0.5)^2 * (20) = 10 + (-15) + (5) = 0
A, B, C, D are actions that are taken consecutively by the agent. A -> B -> C -> D Action A -> Reward 50 Action B -> Reward 10 Action C -> Reward -30 Action D -> Reward 20 Discount Factor is 0.5 What is the return for Action B?
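The arithmetic in this card can be checked with a few lines of Python. This is only a sketch: the helper name `discounted_return` is made up for illustration, and it assumes the return of an action counts that action's own reward plus all later rewards, each discounted once per step.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, discounting each one by gamma per step into the future."""
    return sum(r * gamma**k for k, r in enumerate(rewards))


# Rewards for actions A, B, C, D, taken in that order.
rewards = [50, 10, -30, 20]
gamma = 0.5

# The return for action B starts at B's own reward and includes C and D.
return_b = discounted_return(rewards[1:], gamma)
print(return_b)  # 0.0
```

Calling the same helper on the full list gives the return for action A instead.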
ReLU
After the convolutional layer, we use the ________ function to remove all negative values.
• One (passthrough) input layer, • One or more layers of TLUs, called hidden layers, • One final layer of TLUs called the output layer.
An MLP is composed of:
d) All of the options listed
Artificial Neural Networks became popular several times in history, but their popularity faded each time. Artificial Neural Networks are popular again, and this time they have a better chance of staying popular and successful because: a) A huge quantity of data is available to train neural networks b) Computing power has increased with powerful GPU cards c) Computing power is accessible to everyone via cloud platforms d) All of the options listed
If one model is significantly better than the other, its feedback is not useful and the other model cannot learn from it. For example, if the discriminator is far superior, it labels everything the generator produces as 100% fake, which gives the generator no signal for updating its parameters to generate more realistic fake outputs. If the generator is far superior, the discriminator thinks that all of the generator's outputs are real and cannot give the generator feedback that would help it improve.
Briefly describe why having a discriminator or a generator that is much better than the other does not help during the training process.
minimizing
During Neural Network training, network learning means _________ the cost function.
True
During the training of the discriminator, both real and fake images are used. During the training of the generator, only fake images are used.
PyTorch
Facebook's ________ is a library, first released in 2016 (version 1.0 followed in 2018), that can be used for building, training, and executing neural networks.
softmax function
For multiclass classification using MLPs, we should use ___________ as the activation function for the output layer because it will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1 (which is required if the classes are exclusive).
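The two properties named in this card (every probability between 0 and 1, and all of them adding up to 1) are easy to see in a small example. A minimal sketch in plain Python, assuming the output layer produces three raw scores; the score values are invented for illustration.

```python
import math


def softmax(logits):
    """Turn raw output scores into probabilities that add up to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores from an output layer with three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score always gets the largest probability, so the predicted class is unchanged by applying softmax.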
Discriminator, Generator
A GAN has two neural networks competing with each other. One is known as the __________ and the other is known as the ________.
• Speech recognition: Input: audio, Output: text • Music generation: Input: empty or a starting note, Output: sequence of notes • Sentiment classification: Input: 'This movie is so poor quality.', Output: rating/stars • DNA sequence analysis: Input: ADCGGGCTGATTA, Output: protein sequence • Translation: Input: 'Merhaba, nasılsınız?', Output: 'Hello, how are you?' • Video activity recognition: Input: sequence of frames, Output: 'running' • Named entity recognition: Input: 'Andrew Ng explained the details of deep learning.'
Give examples of tasks that can be solved using RNNs.
They are learned by the convolutional network during training. A number of randomly initialized filters are passed over the image. Over time, the network learns the filters whose outputs give the best matches; this process is called feature extraction.
How do we find the filters in a CNN? What is feature extraction in CNNs?
In a Markov chain, the probability of moving from a state s to another state depends only on the state s itself, whereas in RL it depends on both the state and the chosen action. Also, in Reinforcement Learning some state transitions return a reward (positive or negative). A Markov chain extended in this way is a Markov decision process.
How do we need to modify Markov chains to use them for modeling Reinforcement Learning?
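from tensorflow import keras  # assumed import; the snippet relies on a `keras` name
# X_train is assumed to be the array of training features, defined elsewhere.
input_ = keras.layers.Input(shape=X_train.shape[1:])  # one input feeds both paths
hidden1 = keras.layers.Dense(30, activation="relu")(input_)  # deep path, layer 1
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)  # deep path, layer 2
concat = keras.layers.Concatenate()([input_, hidden2])  # join wide and deep paths
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])
How would you program the architecture of a deep and wide neural network using Keras, when a single input feeds both the wide path and the deep path?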
from tensorflow import keras  # assumed import; the snippet relies on a `keras` name
input_A = keras.layers.Input(shape=[5], name="wide_input")  # features for the wide path
input_B = keras.layers.Input(shape=[6], name="deep_input")  # features for the deep path
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])  # join wide and deep paths
output = keras.layers.Dense(1, name="main_output")(concat)
aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)  # extra output on the deep path
model = keras.Model(inputs=[input_A, input_B], outputs=[output, aux_output])
How would you program the architecture of a deep and wide neural network using Keras, when it has two inputs (one per path) and an auxiliary output?
from tensorflow import keras  # assumed import; the snippet relies on a `keras` name
input_A = keras.layers.Input(shape=[5], name="wide_input")  # features for the wide path
input_B = keras.layers.Input(shape=[6], name="deep_input")  # features for the deep path
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])  # join wide and deep paths
output = keras.layers.Dense(1, name="output")(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])
How would you program the architecture of a deep and wide neural network using Keras, when it has two inputs (one per path) and a single output?
agent, environment, rewards
In Reinforcement Learning, a software _______ makes observations and takes actions within an _______, and in return it receives _________.
Credit Assignment Problem
In Reinforcement Learning, the only guidance the agent gets is through rewards, and rewards are typically sparse and delayed. Therefore, the last action is not entirely responsible for getting the last reward. _______________ decides on how the reward credit should be distributed to the actions taken.
To learn the unknowns, the agent must: • Experience each state and each transition at least once to learn the rewards, • Experience them multiple times to get a reasonable estimate of the transition probabilities.
In Reinforcement Learning, when the agent is placed into a new environment, it initially has no idea what the transition probabilities T(s, a, s′) are, and it does not know what the rewards R(s, a, s′) are going to be either. In this case, how does it learn the unknowns?
small region, fully
In a CNN, each neuron in a layer is connected only to a ________ of the layer before it, instead of to all of its neurons in a _____ connected manner.
one or more, binary, at least two
McCulloch and Pitts proposed a very simple model of the biological neuron, called an artificial neuron. It has __________ binary (on/off) inputs and one _________ output. A neuron is activated when ______ of its inputs are active.
Bias neuron
An extra bias feature is generally added (x0 = 1). It is typically represented by a special type of neuron that outputs 1 all the time.
dynamic programming
Richard Bellman's algorithm for finding the optimal state values in a Markov decision process uses the __________ algorithm design technique, which breaks down a complex problem into tractable subproblems that can be tackled iteratively.
T(s, a, s′) is the transition probability from state s to state s′, when the agent chose action a. R(s, a, s′) is the reward that the agent gets when it goes from state s to state s′, given that the agent chose action a. γ is the discount factor.
Richard Bellman found a way to estimate the optimal state value of any state s. In the formula below, V*(s) is the optimal state value of state s, which is the sum of all discounted future rewards the agent can expect on average after it reaches state s, assuming it acts optimally. V*(s) = max_a Σ_s′ T(s, a, s′) [R(s, a, s′) + γ · V*(s′)] for all s.
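The Bellman optimality equation, V*(s) = max_a Σ_s′ T(s, a, s′) [R(s, a, s′) + γ · V*(s′)], can be applied repeatedly until the values settle (value iteration). The sketch below assumes a made-up two-state environment; the state names, action names, probabilities, and rewards are all hypothetical and exist only to make the update concrete.

```python
def value_iteration(transitions, gamma, n_iterations=100):
    """Repeatedly apply the Bellman optimality equation to every state."""
    values = {s: 0.0 for s in transitions}  # start with all estimates at zero
    for _ in range(n_iterations):
        values = {
            s: max(
                # Expected value of one action: sum over possible outcomes.
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            for s, actions in transitions.items()
        }
    return values


# Hypothetical MDP: transitions[state][action] = [(T, next_state, R), ...]
transitions = {
    "s0": {
        "stay": [(1.0, "s0", 1.0)],
        "go": [(0.8, "s1", 5.0), (0.2, "s0", 0.0)],
    },
    "s1": {"stay": [(1.0, "s1", 0.0)]},
}

values = value_iteration(transitions, gamma=0.5)
print(values)
```

In this toy case the value of `s0` converges to 40/9 (about 4.44), because choosing `go` satisfies V = 4 + 0.1 · V, which beats staying forever (value 2).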
Heaviside, sign
Step functions used in Perceptrons could be _______ step function or the ______ function.
· Initialize all layers' connection weights randomly. · For each training instance, the backpropagation algorithm first makes a prediction (forward pass). · It measures the error. · It goes through each layer in reverse to measure the error contribution from each connection (reverse pass). · It tweaks the connection weights to reduce the error (Gradient Descent step).
Summarize how the backpropagation algorithm works to train neural networks.
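The five steps in this card can be followed line by line in a tiny network written without any library. This is a sketch under several assumptions that are not in the card: one hidden layer of sigmoid units, a squared-error loss, one weight update per training instance, and the logical OR function as a toy dataset. The function name `train_tiny_mlp` is invented for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_tiny_mlp(inputs, targets, n_hidden=3, learning_rate=0.5,
                   n_epochs=2000, seed=0):
    rng = random.Random(seed)
    n_inputs = len(inputs[0])
    # Step 1: initialize the connection weights randomly (biases start at 0).
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = 0.0
    losses = []
    for _ in range(n_epochs):
        total = 0.0
        for x, target in zip(inputs, targets):
            # Step 2: forward pass, layer by layer.
            hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                      for ws, b in zip(w_hidden, b_hidden)]
            output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
            # Step 3: measure the error.
            error = output - target
            total += error ** 2
            # Step 4: reverse pass, using the chain rule for 0.5 * error**2.
            delta_out = error * output * (1 - output)
            delta_hidden = [delta_out * w * h * (1 - h)
                            for w, h in zip(w_out, hidden)]
            # Step 5: Gradient Descent step on every weight and bias.
            for j in range(n_hidden):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                b_hidden[j] -= learning_rate * delta_hidden[j]
                for i in range(n_inputs):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
            b_out -= learning_rate * delta_out
        losses.append(total / len(inputs))  # mean squared error this epoch
    return losses


# Toy dataset: the logical OR of two binary inputs.
or_inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
or_targets = [0.0, 1.0, 1.0, 1.0]
losses = train_tiny_mlp(or_inputs, or_targets)
print(losses[0], losses[-1])
```

Comparing the first and last entries of `losses` shows the error shrinking as the weights are tweaked.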
fit(), History
The _____ method of the model created by using Keras libraries returns a history object. The _____ object has information about the loss and extra metrics it measured at the end of each epoch on the training set and on the validation set. This information can be utilized to plot a learning curve to analyze the progress of the training of the neural network.
CartPole
The ______ is a very simple environment composed of a cart that can move left or right, and a pole placed vertically on top of it. The agent must move the cart left or right to keep the pole upright.
backpropagation
The _________ algorithm is used to learn the filters and weights in the convolutional and fully connected layers of CNNs.
optimal policy
The _________, noted π*(s), is: when the agent is in state s, it should choose the action with the highest Q-Value for that state: π*(s) = argmax_a Q*(s, a).
TRUE
The idea behind convolutional neural networks is to filter the images before training the deep network. After filtering, the features in the images come to the forefront, and the network can use them to identify what the image shows.
threshold logic unit, numbers
The inputs and output of a _________ (TLU) are _______ (instead of binary on/off values), and each input connection is associated with a weight.
Input neurons
The special passthrough neurons output whatever input they are fed and feed the perceptron. All the input neurons form the input layer.
TF-Agents, Reinforcement, TensorFlow
The _______ library is a _________ Learning library based on T_______, developed at Google and open sourced in 2018.
hw(x) = step(z), where z = x⊺ w.
Then applies a step function to that sum and outputs the result:
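Put together, the TLU cards describe two operations: a weighted sum z = x⊺ w, then a step function. A minimal sketch follows; the explicit `bias` argument and the AND-gate weights are assumptions chosen for illustration, not values from the cards.

```python
def heaviside(z):
    """Heaviside step function: 0 for negative input, 1 otherwise."""
    return 1 if z >= 0 else 0


def sign(z):
    """Sign function: -1, 0, or +1 depending on the sign of the input."""
    if z > 0:
        return 1
    if z < 0:
        return -1
    return 0


def tlu(inputs, weights, bias=0.0, step=heaviside):
    """Weighted sum of the inputs followed by a step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)


# Hypothetical weights and bias that make the TLU behave like an AND gate.
and_weights, and_bias = [1.0, 1.0], -1.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, tlu([a, b], and_weights, and_bias))
```

Because the output changes only where the weighted sum crosses zero, a single TLU draws one straight decision boundary, which is why it is limited to linear binary classification.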
feature anywhere in the image.
This systematic application of the same filter across an image is a powerful idea. If the filter is designed to detect a specific type of feature in the input, then the application of that filter systematically across the entire input image allows the filter an opportunity to discover that _____________
GridSearchCV, RandomizedSearchCV
To find the best hyperparameters of a neural network, we can use __________ or __________ to explore the hyperparameter space.
feature extraction
Using the backpropagation algorithm, we figure out the filters that will give the best output. This process is called _______
randomly
We initialize the weights and biases __________. Initially, the neural network performs very poorly.
True
We need to be careful about using an excessive number of pooling layers, since it can cause information loss.
Advantages: The model can easily be saved, cloned, and shared; Its structure can be displayed and analyzed; The framework can infer shapes and check types, so errors can be caught early (i.e., before any data ever goes through the model). It's also fairly easy to debug, since the whole model is a static graph of layers. Disadvantages: It is not dynamic: Cannot have loops, varying shapes, conditional branching, and other dynamic behaviors
What are the advantages and disadvantages of the declarative APIs (Sequential and Functional)?
1- In a CNN, each neuron in a layer is connected only to a small region of the layer before it, instead of to all of its neurons in a fully connected manner. This significantly decreases the computation needed to train CNNs. 2- A CNN compares images piece by piece; the pieces it looks for are called features. By finding rough feature matches in roughly the same positions in two images, a CNN achieves better accuracy.
What are the advantages of CNNs over neural networks with dense layers?
· Recurrent Neural Networks (RNNs) can process inputs of any length. · An RNN is designed to remember information across time steps, which is very helpful for any time series predictor. · The weights are shared across the time steps. · RNNs can use their internal memory to process arbitrary series of inputs, which is not the case with feedforward neural networks.
What are the advantages of RNNs (recurrent neural networks) over classical feedforward neural networks?
Unstable gradients (exploding or vanishing). To address: · Use a smaller learning rate, · Use a saturating activation function such as the hyperbolic tangent (which is the default), · Use gradient clipping, Layer Normalization, or dropout at each time step. A very limited short-term memory. To address: · Use LSTM (Long Short-Term Memory) layers, · Use GRU (Gated Recurrent Unit) layers.
What are the challenges of RNNs and how can they be addressed?
• Convolutions: number of features, size of features • Pooling: window size, window stride • Fully connected: number of neurons • Architecture: how many layers of each type, and in what order
What are the hyperparameters that can be tuned in a CNN?
For regression: • The output layer has a single neuron (since we only want to predict a single value), • It uses no activation function, • The loss function is the mean squared error.
What are the main differences in the neural network architecture when using the Sequential API of Keras for regression rather than for classification?
Pooling shrinks the input image size. This reduces the computational load, the memory usage, and the number of parameters. This also introduces some level of invariance to small translations, a small amount of rotational invariance, and slight scale invariance. However, pooling can be destructive and causes information loss.
What are the pros and cons of applying pooling in convolutional neural networks?
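Both sides of this trade-off show up in a small max pooling example: the output is four times smaller, and twelve of the sixteen input values are discarded. A minimal sketch in plain Python, assuming a 2 x 2 window with stride 2 and an invented 4 x 4 input.

```python
def max_pool_2d(image, size=2, stride=2):
    """Keep only the largest value inside each pooling window."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [
        [
            max(
                image[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 4 x 4 input; each 2 x 2 block collapses to a single value.
image = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2d(image)
print(pooled)  # [[6, 2], [2, 9]]
```

Moving the 6 or the 9 to a neighboring cell inside the same window would leave the output unchanged, which is the small translation invariance the card mentions.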
Action advantage is how much better or worse an action is, compared to the other possible actions, on average. To compute it, we play the game many times and then normalize all the action returns (by subtracting the mean and dividing by the standard deviation).
What is Action Advantage? How would you compute that?
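The two stages in this answer (compute each action's return, then normalize across many episodes) can be sketched as follows. The episode rewards and the discount factor are made up, and the helper names are chosen for illustration only.

```python
import statistics


def discount_rewards(rewards, gamma):
    """Return of each step: its reward plus the discounted later rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for step in reversed(range(len(rewards))):
        running = rewards[step] + gamma * running
        returns[step] = running
    return returns


def normalize_returns(all_returns):
    """Subtract the mean and divide by the standard deviation of all returns."""
    flat = [r for episode in all_returns for r in episode]
    mean = statistics.mean(flat)
    std = statistics.pstdev(flat)
    return [[(r - mean) / std for r in episode] for episode in all_returns]


# Hypothetical rewards from two short episodes.
episodes = [[10, 0, -50], [10, 20]]
gamma = 0.8

all_returns = [discount_rewards(ep, gamma) for ep in episodes]
advantages = normalize_returns(all_returns)
print(all_returns)
print(advantages)
```

Actions with a positive normalized value were better than average, and actions with a negative value were worse.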
Going through every training example for every Gradient Descent step takes a very long time. Instead, we shuffle the data and divide it into mini-batches (each one having, for example, 100 training examples). We compute the gradient on each mini-batch instead of on the whole data set, which gives an important computational speedup.
What is the Mini-batch Stochastic Gradient Descent approach for training neural networks?
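The shuffle, split, and update loop can be sketched on a problem much smaller than a neural network. The example below assumes a one-feature linear model y = w·x + b with a mean squared error loss and noise-free synthetic data; the function name and all hyperparameter values are illustrative choices.

```python
import random


def minibatch_sgd(xs, ys, batch_size=5, learning_rate=0.1,
                  n_epochs=500, seed=0):
    """Fit y = w * x + b, updating the parameters once per mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i]
                         for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i])
                         for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic data generated from y = 2x + 1, so the true answer is known.
xs = [i / 19 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

With 20 examples and a batch size of 5, each epoch performs four parameter updates instead of one, and each update only looks at a quarter of the data.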
As the filter is applied multiple times to the input array, the result is a two-dimensional array of output values that represent a filtering of the input. As such, the two-dimensional output array from this operation is called a "feature map".
What is a feature map in CNNs?
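A feature map can be produced by hand on a tiny image. The sketch below slides a 3 x 3 filter over a 5 x 5 input, multiplying element by element and summing at each position (the same operation that gives the value of a single output cell), then applies ReLU to drop the negative values. The image and the vertical-edge filter are invented for illustration, and no padding or stride is assumed.

```python
def convolve_2d(image, kernel):
    """Slide the kernel over the image; each position gives one output value."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            # Element-wise multiply the window by the kernel, then sum.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical 5 x 5 image: bright columns on the left and far right.
image = [[1, 1, 0, 0, 1] for _ in range(5)]

# Hypothetical filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve_2d(image, kernel)
print(feature_map)        # [[-3, -3, 3], [-3, -3, 3], [-3, -3, 3]]
print(relu(feature_map))  # [[0, 0, 3], [0, 0, 3], [0, 0, 3]]
```

The positive entries mark where the filter's pattern was found, which is the sense in which the output is a map of a feature.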
This architecture makes it possible for the neural network to learn both deep patterns (using the deep path) and simple rules (through the short path).
What is the benefit of the architecture shown above?
Their goal is to subsample (i.e., shrink) the input image. Pooling layers reduce the computational load and the memory usage.
What is the goal of a pooling layer? What is its effect on CNNs?
Fully connected layer, or a dense layer
When all the neurons in a layer are connected to every neuron in the previous layer (i.e., its input neurons), the layer is called a ________.
deep neural network (DNN).
When an Artificial Neural Network contains a deep stack of hidden layers, it is called a _______
This lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well.
When used for Reinforcement Learning, a neural network will estimate a probability for each action, and then we will select an action randomly, according to the estimated probabilities. Why are we picking a random action based on the probabilities rather than just picking the action with the highest score?
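The difference between the two selection rules is easy to see by counting which actions get picked. A minimal sketch, assuming three actions with invented probabilities; the helper names are hypothetical.

```python
import random


def greedy_action(probabilities):
    """Always pick the action with the highest estimated probability."""
    return max(range(len(probabilities)), key=lambda a: probabilities[a])


def sample_action(probabilities, rng):
    """Pick an action at random, weighted by the estimated probabilities."""
    actions = range(len(probabilities))
    return rng.choices(actions, weights=probabilities, k=1)[0]


rng = random.Random(42)
action_probs = [0.7, 0.2, 0.1]  # hypothetical network output

greedy_picks = {greedy_action(action_probs) for _ in range(1000)}
sampled_picks = [sample_action(action_probs, rng) for _ in range(1000)]
counts = [sampled_picks.count(a) for a in range(len(action_probs))]

print(greedy_picks)  # only action 0 is ever tried
print(counts)        # every action is tried, the likeliest one most often
```

The greedy rule never tries actions 1 and 2, so it can never discover that one of them is actually better; sampling keeps exploring them while still favoring the action that currently looks best.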
Input neurons: one per input feature (e.g., 28 x 28 = 784 for MNIST). Output neurons: 1 for binary classification; 1 per label for multilabel binary classification; 1 per class for multiclass classification.
When we use MLPs for classification, how do we set the number of input and output neurons?
Input neurons: one per input feature (e.g., 28 x 28 = 784 for MNIST). Output neurons: 1 per prediction dimension.
When we use MLPs for regression, how do we set the number of input and output neurons?
a) Evaluate an action based on the sum of all the rewards that come after it, usually applying a discount factor γ (gamma) at each step.
Which of the following is a common strategy for the Credit Assignment Problem?
· Something else....
Which of the following is an activation function that is (not) used · The logistic (sigmoid) function: σ(z) = 1 / (1 + exp(-z)) · The hyperbolic tangent function: tanh(z) = 2σ(2z) - 1 · The Rectified Linear Unit function: ReLU(z) = max(0, z) · Something else....
TopNet, ImgNet, AppLNet, MSNet
Which of the following is not a well-known CNN architecture? LeNet-5, AlexNet, GoogLeNet, ResNet, TopNet, ImgNet, AppLNet, MSNet
a) The policy could be a neural network b) The policy does not have to be deterministic c) In some cases it does not even have to observe the environment d) All of the options listed (correct answer)
Which of the following is true for a reinforcement learning policy?
e) All of the options
Which of the following is true for Perceptrons? a) Perceptrons are based on a threshold logic unit (TLU). b) The Perceptron is a simple Artificial Neural Network (ANN) architecture. c) A Perceptron is composed of a single layer of TLUs, with each TLU connected to all the inputs. d) The Perceptron was invented in 1957 by Frank Rosenblatt. e) All of the options
d) All of the options listed
Which of the following is true for Convolutional Neural Networks (CNNs)? a) They were inspired by the visual cortex b) In a convolutional layer, neurons are connected to a small region of the layer before it c) They include pooling layers to shrink the size of the image d) All of the options listed
• Change the learning rate • Try another optimizer • Try tuning model hyperparameters (number of layers, number of neurons per layer, types of activation functions) • Try other hyperparameters such as the batch size
Which of the following can be done to tune the hyperparameters if the validation accuracy of a neural network is not good?
a) Stochastic Policy Search b) Brute Force Policy Search c) Policy search using Genetic algorithms d) Policy search using optimization algorithms e) All of the options listed (correct answer)
Which of the following is a policy search technique?
e) TLU Layer
Which of these is not a layer in CNNs? a) Convolution Layer b) ReLU Layer c) Pooling d) Fully Connected e) TLU Layer
· All of the options above
Which one is correct for GANs? · The discriminator looks at real and fake images over time, makes guesses, and gets feedback on whether its guess was right or wrong. · The discriminator uses BCE (Binary Cross Entropy) as its loss function. · The generator gets feedback for the fake images it produced from the discriminator. · All of the options above
True
While Reinforcement Learning is an old branch of Machine Learning that has existed since the 1950s, its recent successes are mainly due to the application of the power of deep learning to the field.
Discriminative, Generative
While a __________ model learns how to distinguish between classes such as dogs and cats, a __________ model tries to learn how to make a realistic representation of some class.
zooming
While exploring a search space to find the best hyperparameters for a neural network, we can utilize the _________ technique that explores a region more when a region of the search space turns out to be good.
If you chain several linear transformations, all you get is a linear transformation. • Example: if f(x) = 2x + 3 and g(x) = 5x - 1, then chaining these two linear functions gives you another linear function: f(g(x)) = 2(5x - 1) + 3 = 10x + 1. If you don't have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can't solve very complex problems with that. A large enough DNN with nonlinear activations can theoretically approximate any continuous function.
Why do we need activation functions in neural networks?
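The worked example in this answer can be checked numerically. The sketch below uses the same f and g as the card, and adds a ReLU between them as an illustration of a nonlinearity; the helper `has_constant_slope` is invented here as a simple test of whether a sequence of outputs lies on a straight line.

```python
def f(x):
    return 2 * x + 3


def g(x):
    return 5 * x - 1


def relu(z):
    return max(0, z)


def has_constant_slope(values):
    """True if consecutive outputs always differ by the same amount."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d == diffs[0] for d in diffs)


xs = range(-3, 4)

# Chaining the two linear functions gives the single linear function 10x + 1.
chained = [f(g(x)) for x in xs]
single = [10 * x + 1 for x in xs]

# Putting a ReLU between them bends the line at the point where g(x) = 0.
with_relu = [f(relu(g(x))) for x in xs]

print(chained == single)              # True
print(has_constant_slope(chained))    # True
print(has_constant_slope(with_relu))  # False
```

Without the ReLU the two "layers" could be replaced by one; with it, the output is flat for small inputs and rising for large ones, which no single linear layer can reproduce.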
Discriminator, Generator
With feedback from the ___________ on whether a fake image looks real or fake, the ________ starts producing fake images that are more and more realistic.
Keras
______, developed by François Chollet and open-sourced in 2015, is a high-level Deep Learning API that allows you to easily build, train, evaluate, and execute all sorts of neural networks.
Pooling
The ________ layer of a CNN shrinks the image stack into a smaller size. ______ groups the pixels of the images and filters them down to a subset.
Discriminative
A ________ model is one typically used for classification in machine learning. Such models learn how to distinguish between classes such as dogs and cats, and are often called classifiers.
Generative
________ models try to learn how to make a realistic representation of some class, for instance a realistic picture of a dog. They take some random input represented by the noise
OpenAI Gym
________ is a toolkit for developing and comparing reinforcement learning algorithms.
Q-Value (Quality Value)
The ________ of the state-action pair (s, a) is the sum of discounted future rewards the agent can expect on average after it reaches the state s and chooses action a, but before it sees the outcome of this action, assuming it acts optimally after that action.
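Q-Values can be estimated with the same kind of repeated update used for state values, and the optimal policy then picks the action with the highest Q-Value in each state. The sketch below assumes the update Q(s, a) ← Σ_s′ T(s, a, s′) [R(s, a, s′) + γ · max_a′ Q(s′, a′)] and a made-up two-state environment; every state name, action name, probability, and reward is hypothetical.

```python
def q_value_iteration(transitions, gamma, n_iterations=100):
    """Estimate the optimal Q-Value of every (state, action) pair."""
    q = {s: {a: 0.0 for a in actions} for s, actions in transitions.items()}
    for _ in range(n_iterations):
        q = {
            s: {
                # Reward now, plus the best Q-Value available afterwards.
                a: sum(p * (r + gamma * max(q[s2].values()))
                       for p, s2, r in outcomes)
                for a, outcomes in actions.items()
            }
            for s, actions in transitions.items()
        }
    return q


def optimal_policy(q):
    """In each state, choose the action with the highest Q-Value."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in q.items()}


# Hypothetical MDP: transitions[state][action] = [(T, next_state, R), ...]
transitions = {
    "s0": {
        "stay": [(1.0, "s0", 1.0)],
        "go": [(0.8, "s1", 5.0), (0.2, "s0", 0.0)],
    },
    "s1": {"stay": [(1.0, "s1", 0.0)]},
}

q_values = q_value_iteration(transitions, gamma=0.5)
policy = optimal_policy(q_values)
print(q_values)
print(policy)
```

In this toy environment the Q-Value of `go` in `s0` (about 4.44) is higher than that of `stay` (about 3.22), so the policy chooses `go` there.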