RSM338 Final

Explain what is meant by "odds against" and "odds on." Why might they be useful concepts for interpreting logistic regression models?

"Odds against" shows the profit of $1 bet that an event will happen when the $1 is forfeited if it does not happen. "Odds on" shows the amount that must be staked to provide a $1 profit if an event happens and a total loss if it does not happen. The natural logarithms of odds on and odds against are linear in the features in a logistic regression.
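As a rough illustration of the two definitions above, the sketch below converts a probability into fair "odds on" and "odds against" and checks that their logarithms are equal and opposite. It assumes fair odds with no bookmaker margin; the function names and the example probability are illustrative only.

```python
import math

def odds_on(p):
    """Stake needed to earn a $1 profit on an event with probability p (fair odds assumed)."""
    return p / (1.0 - p)

def odds_against(p):
    """Profit on a $1 stake if an event with probability p happens (fair odds assumed)."""
    return (1.0 - p) / p

p = 0.8  # hypothetical probability of the event
log_odds_on = math.log(odds_on(p))
log_odds_against = math.log(odds_against(p))

# The two log odds differ only in sign, so if one is linear in the features, so is the other.
print(odds_on(p), odds_against(p), log_odds_on + log_odds_against)
```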

Give one advantage of logistic regression over (a) SVM and (b) the naïve Bayes classifier.

(a) Logistic regression provides a probability of a positive and negative sentiment while SVM does not. (b) Logistic regression does not require the assumption that the occurrence of one word is independent of the occurrence of another word.

How many different sequences have to be considered when Shapley values are calculated for four features?

4! = 24.
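The count can be checked by enumerating the orderings directly. This is a minimal sketch; the feature names are placeholders.

```python
from itertools import permutations
from math import factorial

features = ["A", "B", "C", "D"]  # placeholder feature names

# Each ordering is one sequence in which the features can be changed one at a time.
orderings = list(permutations(features))
num_orderings = len(orderings)

print(num_orderings, factorial(len(features)))
```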

How is a CNN used to predict time series?

A CNN can predict time series by using one-dimensional receptive fields, each consisting of several successive terms in the series.

What are the advantages and disadvantages of autoencoders and PCAs for determining a small number of variables that capture most of the information in a large number of features?

A PCA provides independent factors and its output indicates the importance of each. The underlying model is linear. In an autoencoder, the underlying model is in general non-linear.

What is a bag-of-words model?

A bag-of-words model represents text by the frequency with which words occur.

What is meant by a categorical feature?

A categorical feature is a non-numerical feature where data is assigned to one of a number of categories.

What is meant by (a) a factor loading and (b) a factor score?

A factor loading is the amount of each original feature in the factor. Each observation can be expressed as a linear combination of the factors. A factor score for an observation is the amount of the factor in the observation.

What is (a) a feature map, (b) a receptive field and (c) a filter in a CNN?

A feature map is one component of a layer in a CNN. A receptive field defines the points used to determine values in a feature map. A filter defines the weights applied to values in the receptive field.

Explain what a partial dependence plot is and how it is calculated.

A partial dependence plot shows what happens on average when one feature changes. Results are averaged over random changes in the other features.

What is a random forest?

A random forest is an ensemble of decision trees. The different decision trees are created by using a subset of features or a subset of observations or by changing threshold values.

What is meant by "sentiment analysis"?

A sentiment analysis involves processing textual data from such sources as social media and surveys to determine whether it is positive, negative, or neutral about a particular product, company, person, event, etc.

In predicting house prices, how would you handle a feature which is "yes" if a house has air conditioning and "no" if it does not?

A single dummy variable which equals one if the house has air conditioning and zero otherwise could be used.

Explain what is meant by a "trigram."

A trigram is a group of three words.

How can you tell whether a machine learning model is overfitting data?

A validation set is used. If the answers given by the validation set start to get worse as model complexity is increased there is overfitting.

What is a word vector?

A word vector is a set of numbers describing the meaning of a word. It does this by quantifying the extent to which the word tends to appear in close proximity to other words.

What is an encoder? What is a decoder?

An encoder calculates the latent variable values from the historical data. The decoder calculates the output (designed to match the input as far as possible) from the latent variables.

Why do the Euclidean distances between observations increase as the number of features increases? Suppose that you start with ten features and then by mistake create ten more features that are identical to the first ten. What effect does this have on the distance between two of the observations?

As the number of features increases, the sum of the squared differences between feature values has more terms and therefore tends to increase. When the ten additional features are created by mistake, the distance between two observations is multiplied by √2 because every squared difference is counted twice.
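The effect of duplicating features can be checked numerically. The sketch below uses two randomly generated observations (purely illustrative data) and compares the distance with ten features to the distance after the same ten features are appended again.

```python
import math
import random

random.seed(0)
x = [random.gauss(0, 1) for _ in range(10)]  # illustrative observation 1
y = [random.gauss(0, 1) for _ in range(10)]  # illustrative observation 2

d10 = math.dist(x, y)          # distance with the original ten features
d20 = math.dist(x + x, y + y)  # distance after duplicating all ten features
ratio = d20 / d10

# Every squared difference appears twice, so the ratio should be sqrt(2).
print(ratio, math.sqrt(2))
```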

Explain the difference between bagging and boosting.

Bagging involves sampling from observations or features so that the same algorithm is used on different training sets. Boosting involves creating models sequentially with each model attempting to correct the error made by the previous model.

"Bayes' theorem allows one to invert the conditionality." What is meant by this statement?

Bayes' theorem deals with the situation where we know the probability of X conditional on Y and we want the probability of Y conditional on X.

List five different types of data cleaning.

Data cleaning can involve (a) correcting for inconsistent recording, (b) removing observations that are not relevant, (c) removing duplicate observations, (d) dealing with outliers, and (e) dealing with missing data.

What is meant by deep Q-learning?

Deep Q-learning is when an ANN is used in conjunction with temporal difference learning.

Explain what is meant by (a) distribution-based clustering and (b) density-based clustering.

Distribution-based clustering involves assuming that observations are created by a mixture of distributions and using statistical methods to separate them. Density-based clustering involves adding new points to a cluster that are close to several points already in the cluster. It can lead to non-standard shapes for the clusters.

Explain how dynamic programming works.

Dynamic programming involves working from the horizon date back to the beginning, working out the best action for each of the states that can arise.

Explain why a reinforcement learning algorithm needs to involve both exploration and exploitation.

Exploitation involves taking the best action identified so far. Exploration involves randomly selecting a different action. If an algorithm just involved exploitation it might never find the best action. If it just involved exploration, it would not benefit from what it has learned.

Why is feature scaling important in unsupervised learning? Explain two methods for feature scaling. What are the advantages and disadvantages of each method?

Feature scaling is necessary in unsupervised learning to ensure that features are treated as being equally important. In the Z-score method, each feature is scaled so that it has a mean of zero and a standard deviation of one. In min-max scaling, each feature is scaled so that the lowest value is zero and the highest is one. Min-max scaling does not work well when there are outliers because the rest of the scaled values are then close together. But it may work better than the Z-score method when features have been measured on different scales with lower and upper bounds.
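The two scaling methods can be sketched as follows. The data are made up to include one outlier, and the use of the population standard deviation is an assumption (the sample standard deviation is an equally common choice).

```python
import statistics

def z_score(values):
    """Scale to mean 0 and standard deviation 1 (population SD assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Scale so the lowest value maps to 0 and the highest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical feature with one outlier
z = z_score(data)
mm = min_max(data)

# With the outlier present, the other min-max values are squeezed close to zero.
print(mm)
```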

How much is it worth to have an extra 5,000 square feet of back yard in Iowa?

From Table 9.1, each extra square foot of lot size is worth $0.3795. It follows that an extra 5,000 square feet is worth 5,000 × $0.3795, or $1,897.50.

What problems arise if the learning rate is too high or too low?

If the learning rate is too low the steepest descent algorithm will be too slow. When it is too high there are liable to be oscillations with the minimum not being found.

Explain the difference between a recurrent neural network and the plain vanilla neural networks discussed in Chapter 6.

In a recurrent neural network there is a time dimension. The output from time t is an input to the network used at time t + 1.

Explain the key difference between a CNN and a plain vanilla ANN.

In a regular ANN the value at a node in one layer is related to values at all nodes in the previous layer. In a CNN it is related to a small subset of the nodes in the previous layer.

Explain the key difference between a convolutional neural network and the plain vanilla neural networks discussed in Chapter 6.

In a regular ANN, the value at a node in one layer is related to the values at all nodes in the previous layer. In a CNN, a layer consists of several feature maps. But the grid points in each feature map are related to a small subset of values in the previous layer.

Explain the key difference between an RNN and a plain vanilla ANN.

In an RNN there is a time sequence to the data. The nodes in one layer are related to values calculated for the same nodes at the previous time as well as to the nodes in the previous layer.

How does hierarchical clustering work? What are its advantages and disadvantages relative to k-means?

In hierarchical clustering we start by putting every observation in its own cluster. At each step, we find the two closest clusters and join them to create a new cluster. The disadvantage is that it is slow. The advantage is that it identifies clusters within clusters.

How does reinforcement learning differ from supervised learning?

In reinforcement learning the objective is to calculate the best strategy for taking a sequence of decisions through time in a changing environment. Supervised learning involves one or more estimates being made from features at a single point in time.

What is meant by temporal difference learning?

In temporal difference learning each trial involves actions being taken with exploration and exploitation. The new observation on the value of taking a particular action in a particular state is determined by looking one period ahead and using the most recent estimate of the value of being in the state that will be reached.

Explain the Monte Carlo approach to reinforcement learning.

In the Monte Carlo approach each trial involves actions being taken with exploration and exploitation. The new observation on the value of taking a particular action in a particular state is the total future reward (possibly discounted) from the time of the action until the horizon date.

What is plotted in an ROC? Explain the trade-offs it describes.

In the ROC curve the true positive rate is plotted against the false positive rate. It shows the trade-off between correctly predicting positive outcomes and failing to predict negative outcomes.

How is the objective function changed for (a) Ridge regression, (b) Lasso regression, and (c) Elastic Net regression?

In the case of Ridge regression we add a constant times the sum of the squares of the coefficients to the mean squared error. In the case of Lasso regression we add a constant times the sum of the absolute values of the coefficients. In the case of Elastic Net regression we add a constant times the sum of the squares of the coefficients and a different constant times the sum of the absolute values of the coefficients.
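The three objective functions can be written out side by side. This is a sketch only: the data, coefficients, and penalty constants are invented, and it assumes the bias (constant term) is left out of the penalty.

```python
def mse(y_true, y_pred):
    """Mean squared error of the predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def ridge_objective(y_true, y_pred, coefs, lam):
    # MSE plus a constant times the sum of squared coefficients
    return mse(y_true, y_pred) + lam * sum(b ** 2 for b in coefs)

def lasso_objective(y_true, y_pred, coefs, lam):
    # MSE plus a constant times the sum of absolute coefficients
    return mse(y_true, y_pred) + lam * sum(abs(b) for b in coefs)

def elastic_net_objective(y_true, y_pred, coefs, lam1, lam2):
    # MSE plus both penalties, each with its own constant
    return (mse(y_true, y_pred)
            + lam1 * sum(b ** 2 for b in coefs)
            + lam2 * sum(abs(b) for b in coefs))

y_true, y_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]  # hypothetical targets and predictions
coefs = [2.0, -1.0]                                # hypothetical coefficients, bias excluded

ridge = ridge_objective(y_true, y_pred, coefs, lam=0.1)
lasso = lasso_objective(y_true, y_pred, coefs, lam=0.1)
enet = elastic_net_objective(y_true, y_pred, coefs, lam1=0.1, lam2=0.1)
print(ridge, lasso, enet)
```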

Suppose that Table 7.8 shows the current Q-values for Nim. In the next game you win because one match is always picked up by both you and your opponent. How would the table be updated for (a) the Monte Carlo approach and (b) the temporal difference learning approach?

In the case of the Monte Carlo approach we update as follows:
Q(8,1) = 0.272 + 0.05(1.000 − 0.272) = 0.308
Q(6,1) = 0.155 + 0.05(1.000 − 0.155) = 0.197
Q(4,1) = 0.484 + 0.05(1.000 − 0.484) = 0.510
Q(2,1) = 0.999 + 0.05(1.000 − 0.999) = 0.999
In the case of the temporal difference approach we update as follows:
Q(8,1) = 0.272 + 0.05(0.155 − 0.272) = 0.266
Q(6,1) = 0.155 + 0.05(1.000 − 0.155) = 0.197
Q(4,1) = 0.484 + 0.05(0.999 − 0.484) = 0.510
Q(2,1) = 0.999 + 0.05(1.000 − 0.999) = 0.999
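Both sets of updates apply the same rule, new Q = old Q + 0.05 × (target − old Q), with different targets. The sketch below reproduces the arithmetic. The current Q-values and the temporal difference targets are copied from the answer above rather than derived from Table 7.8, which is not reproduced here.

```python
ALPHA = 0.05  # step size used in the answer above

def update(q_old, target, alpha=ALPHA):
    """Move the old estimate a fraction alpha of the way toward the target."""
    return q_old + alpha * (target - q_old)

# Current Q-values for (matches remaining, matches picked up), as quoted in the answer
q = {(8, 1): 0.272, (6, 1): 0.155, (4, 1): 0.484, (2, 1): 0.999}

# Monte Carlo: every visited state-action pair is updated toward the final reward of 1
mc = {k: update(v, 1.0) for k, v in q.items()}

# Temporal difference: targets are the one-step-ahead values quoted in the answer
td_targets = {(8, 1): 0.155, (6, 1): 1.000, (4, 1): 0.999, (2, 1): 1.000}
td = {k: update(q[k], td_targets[k]) for k in q}

for k in q:
    print(k, round(mc[k], 3), round(td[k], 3))
```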

What are the main differences between the decision tree approach to prediction and the regression approach?

In the decision tree approach the features are considered one-by-one in order of importance whereas in the regression approach they are considered all at once. The decision tree approach does not assume linearity and is more intuitive. It is also less sensitive to outlying observations than linear regression.

Explain (a) the elbow method and (b) the silhouette method for choosing the number of clusters, k.

In the elbow method we look for the point at which the marginal improvement in inertia (i.e., within cluster sum of squares) when an additional cluster is introduced is small. In the silhouette method, we calculate for each value of k and for each observation i a(i): the average distance from other observations in its own cluster, and b(i): the average distance from observations in the nearest cluster. The observation's silhouette is 𝑠(𝑖) = (𝑏(𝑖) − 𝑎(𝑖)) / max{𝑎(𝑖), 𝑏(𝑖)} and the best value of k is the one for which the average silhouette across all observations is greatest.

Suppose that there are three words in a vocabulary and we wish to classify an opinion that contains the first two words, but not the third, as positive or negative using the naïve Bayes classifier. The training set is as follows (1 indicates that the opinion contains the word, 0 indicates that it does not): Estimate the probability that the opinion under consideration is (a) positive and (b) negative.

In this case, p1 = 0.667, p2 = 0.5, p3 = 0.667, q1 = 0.5, q2 = 0.25, and q3 = 0.75. Also Prob(Pos) = 0.6 and Prob(Neg) = 0.4. The probability of a positive sentiment is: (0.667 × 0.5 × 0.667 × 0.6) / (0.667 × 0.5 × 0.667 × 0.6 + 0.5 × 0.25 × 0.75 × 0.4) = 0.78. The probability of a negative sentiment is therefore 1 − 0.78 = 0.22.
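The calculation can be reproduced in a few lines. The conditional probabilities and class priors are taken directly from the answer above; the variable names are illustrative.

```python
from math import prod

# Conditional probabilities for the three words, as quoted in the answer
pos_likelihoods = [0.667, 0.5, 0.667]  # p1, p2, p3 (positive class)
neg_likelihoods = [0.5, 0.25, 0.75]    # q1, q2, q3 (negative class)
prior_pos, prior_neg = 0.6, 0.4

# Naive Bayes: multiply the per-word probabilities by the class prior, then normalize
pos_score = prod(pos_likelihoods) * prior_pos
neg_score = prod(neg_likelihoods) * prior_neg
prob_pos = pos_score / (pos_score + neg_score)
prob_neg = 1.0 - prob_pos

print(round(prob_pos, 2), round(prob_neg, 2))
```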

How is information gain measured?

Information gain is measured as the reduction in either entropy or the Gini measure.

How is entropy defined?

It is a measure of impurity or disorder in a set of data. When there are n alternative outcomes, entropy is −∑ p_i log(p_i), summed over i = 1 to n, where p_i is the probability of the ith outcome (the logarithm is commonly taken to base 2).

In logistic regression, what is the equation for the sensitivity of the probability of a negative outcome to a very small change in the feature value?

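It is minus the sensitivity of the probability of a positive outcome. Writing Q for the estimated probability of a positive outcome and b for the feature's coefficient (notation assumed here), the sensitivity of Q to the feature is bQ(1 − Q), so the sensitivity of the probability of a negative outcome, 1 − Q, is −bQ(1 − Q).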

Explain the LIME approach to model interpretability.

LIME finds a simple model that fits a complicated model for values of the features that are close to the currently observed values.

What problem is Laplace smoothing designed to deal with?

Laplace smoothing is designed to deal with the problem that a word appears in an opinion but not in one class of the training set observations. It changes the probability of that class from zero to a small number.

What is the difference between machine learning and artificial intelligence?

Machine learning is a branch of artificial intelligence where intelligence is created by learning from large data sets.

Why do negative words such as "not" cause a problem in a bag-of-words model?

Negatives such as "not" mean that wrong conclusions are liable to be reached if a bag-of-words model merely looks at the frequency with which a single word appears. An improvement is to look at word pairs (bigrams).

Which of the models introduced in this book are most difficult to interpret?

Neural networks, SVM models, and ensemble models are difficult to interpret.

Explain two types of predictions that are made in supervised learning.

One type of prediction is concerned with estimating the value of a continuous variable. The other is concerned with classification.

Under what circumstances is principal components analysis most useful for understanding data?

Principal components analysis (PCA) is useful when there are a number of highly correlated features. It has the potential to explain most of the variability in the data with a small number of new factors (which can be considered as manufactured features) that are uncorrelated with each other.

Explain the meaning of the term "regularization." What is the difference between L1 and L2 regularization?

Regularization is designed to avoid over-fitting by reducing the weights (i.e. coefficients) in a regression. L1 regularization is Lasso where a constant times the sum of the absolute values of the coef- ficients is added to the objective function. It makes some of the weights zero. L2 regularization is Ridge where a constant times the sum of the squares of the coefficients is added to the objective function. It reduces the absolute magnitude of the weights.

When is reinforcement learning appropriate?

Reinforcement learning is concerned with situations where a sequence of decisions has to be made in a changing environment.

What is the main advantage of (a) Ridge regression and (b) Lasso Regression?

Ridge regression reduces the magnitude of the coefficients when the correlation between features is high. Lasso regression sets to zero the values of the coefficients of variables that have little effect on prediction results.

When is semi-supervised learning appropriate?

Semi-supervised learning is concerned with making predictions where some of the available data have values for the target and some do not.

What are the advantages of using Shapley values in model interpretability?

Shapley values are designed so that when there is a change the sum of the contributions to the change of each feature equals the total change.

What is the difference between stemming and lemmatization?

Stemming involves removing suffixes such as "ing" and "s". Lemmatization involves searching for the root word.

Explain how TF and IDF are used in information retrieval.

TF of a word in a document is the number of times the word appears in the document divided by the number of words in the document. IDF of a word is log(N/n) where N is the total number of documents and n is the number of documents containing the word. In information retrieval, TF is multiplied by IDF to provide, for each document that might be retrieved, a score for each word in a search request.
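A minimal sketch of the TF and IDF definitions above, using three tiny made-up documents. The natural logarithm is an assumption, since the answer does not state a base, and the function names are illustrative.

```python
import math

# Three tiny hypothetical documents, already split into words
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices rose sharply".split(),
]

def tf(word, doc):
    """Times the word appears in the document divided by the document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(N/n); assumes the word appears in at least one document."""
    n = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Score each document for the search word "cat"
scores = [tf_idf("cat", d, docs) for d in docs]
print(scores)
```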

What is meant by the bias-variance trade-off? Does the linear model in Figure 1.5 give a bias error or a variance error? Does the fifth-order-polynomial model in Figure 1.2 give a bias error or a variance error?

The bias-variance trade-off is the trade-off between (a) under-fitting and missing key aspects of the relationship, and (b) over-fitting so that idiosyncrasies in the training data are picked up. The linear model is under-fitting and therefore gives a bias error. The fifth order-polynomial model is over-fitting and therefore gives a variance error.

What would be the center of a cluster consisting of the two observations in question 2.2?

The center is obtained by averaging feature values. It is the point that has values 4, 5.5, and 5.5 for the three features.

"Interactions between features create problems when the contributions of features to the change in a prediction is calculated." Explain this statement.

The contributions calculated for a feature usually assume that the feature changes with all other features remaining fixed. If there are interactions between features, it may be unrealistic to assume that one feature can change without other features changing.

"The decision tree algorithm has the advantage that it is transparent." Explain this comment.

The decision tree algorithm is transparent in that it is easy to see why a particular decision was made.

Suppose there are three features, A, B, and C. One observation has values 2, 3, and 4 for A, B, and C, respectively. Another has values 6, 8, and 7 for A, B, and C respectively. What is the Euclidean distance between the two observations?

The distance is √[(6 − 2)^2 + (8 − 3)^2 + (7 − 4)^2] = √50 = 7.07.

Explain what is meant by the dummy variable trap.

The dummy variable trap is the problem that when a categorical variable is one-hot encoded and there is a bias (constant term), there are many sets of parameters that give equally good best fits to the training set. The problem is solved with regularization.

What is an ensemble method?

The ensemble method is a way of combining multiple algorithms to make a single prediction.

Explain what the two networks in a generative adversarial network do.

The generator network tries to match the historical data and the discriminator network tries to distinguish the historical data from the generated data.

What is meant by the learning rate in a gradient descent algorithm?

The learning rate is the size of the step taken down the valley once the line of steepest descent has been identified.

What assumption is made when the naïve Bayes classifier is used in sentiment analysis?

The naïve Bayes classifier assumes that the occurrence of one word is independent of the occurrence of another word.

What is the assumption underlying the naïve Bayes classifier?

The naïve Bayesian classifier assumes that, for observations in a class, feature values are independent.

How many parameters are there when an ANN has five features, two hidden layers and ten neurons per hidden layer, and one target?

The number of parameters is 6 × 10 + 11 × 10 + 11 × 1 = 181.
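The count follows from each neuron having one weight per input from the previous layer plus one bias. The helper below is an illustrative sketch of that rule, not a function from any library.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network.

    layer_sizes lists the number of nodes per layer, from inputs to outputs.
    Each node has one weight per node in the previous layer, plus one bias.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Five features, two hidden layers of ten neurons each, one target
total = count_parameters([5, 10, 10, 1])
print(total)
```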

What is the objective function in a "plain vanilla" linear regression?

The objective in "plain vanilla" linear regression is to minimize the mean squared error of the forecasted target values.

What is the objective function in a logistic regression?

The objective in logistic regression is to maximize ∑ ln(Q) over the positive outcomes plus ∑ ln(1 − Q) over the negative outcomes, where Q is the estimated probability of a positive outcome.

What is the optimal strategy for playing Nim? To what extent has the Monte Carlo simulation found the best action after 1,000, 5,000, and 25,000 games in Tables 7.8 to 7.10?

The optimal strategy is to leave your opponent with 4n+1 matches, where n is an integer. When there are 8 matches the optimal strategy is to pick up 3 matches. After 1,000 games this has been identified as the best strategy, but not convincingly so. After 5,000 and 25,000 games, the best decisions become more clearly differentiated.

Explain how a stopping rule is chosen when an ANN is trained.

The results for the validation set are produced at the same time as the results for the training set. The algorithm is stopped when the results for the validation set start to worsen. This is to avoid overfitting.

What is the sigmoid function?

The sigmoid function is: f(y) = 1 / (1 + e^(−y))
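For reference, the sigmoid f(y) = 1 / (1 + e^(−y)) can be written as a one-line function. This straightforward form is a sketch intended for moderate values of y; very large negative inputs would overflow in math.exp.

```python
import math

def sigmoid(y):
    """Map any real y to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))

# The function is 0.5 at zero and symmetric: f(y) + f(-y) = 1
print(sigmoid(0.0), sigmoid(2.0) + sigmoid(-2.0))
```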

List five ways in which text can be pre-processed for an NLP application.

The text must be split into words. Punctuation must be removed. Very common words, such as "the", "a" and "and" (referred to as stop words) can be removed. Stemming can be applied to replace a word by its stem (e.g., "sleeping" by "sleep"). Lemmatization can be used to reduce a word to its root (e.g., "better" to "good"). Spelling mistakes can be corrected. Abbreviations can be replaced by the full word. Rare words can be removed.

How do you choose the thresholds for a numerical variable in a decision tree?

The threshold is the value that maximizes the information gain.

What is the key trade-off that must be made when a variational autoencoder is used?

The trade-off is between (a) the closeness with which the output matches the input and (b) the extent to which the distribution of latent variables matches a multivariate normal distribution.

What is the definition of (a) the true positive rate, (b) the false positive rate, and (c) the precision?

The true positive rate is the proportion of positive outcomes that are predicted correctly. The false positive rate is the proportion of negative outcomes that are predicted incorrectly. The precision is the proportion of positive predictions that are correct.
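The three definitions can be computed from the four cells of a confusion matrix. The counts below are invented for illustration.

```python
def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate, precision)."""
    tpr = tp / (tp + fn)        # positives predicted correctly
    fpr = fp / (fp + tn)        # negatives predicted incorrectly
    precision = tp / (tp + fp)  # positive predictions that are correct
    return tpr, fpr, precision

# Hypothetical confusion-matrix counts
tpr, fpr, precision = rates(tp=40, fn=10, fp=20, tn=130)
print(tpr, fpr, precision)
```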

What is the universal approximation theorem?

The universal approximation theorem states that any continuous non-linear function can be approximated to arbitrary accuracy using a neural network with one hidden layer.

Explain the role of the validation data set and the test data set.

The validation set is used to compare models so that one that has good accuracy and generalizes well can be chosen. The test set is held back to provide a final test of the accuracy of the chosen model.

In what ways is a linear model simpler to interpret than a non-linear model?

The weights of a linear model have a simple interpretation. They show the effect of changing the value of one feature while keeping the others the same.

What are the alternative ways of creating labels for text in a sentiment analysis?

There are some publicly available data sets where opinions have been labeled. These are sometimes used for training. Otherwise, it is necessary to manually label the opinions used for training and testing.

When is unsupervised learning appropriate?

Unsupervised learning is concerned with identifying patterns (clusters) in data.

Explain the steps in the k-means algorithm.

We choose k points as cluster centers, assign observations to the nearest cluster center, re-compute cluster centers, re-assign observations to cluster centers, and so on.
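The steps above can be sketched in plain Python. The data points and starting centers are made up, and the starting centers are fixed here for reproducibility (in practice they are usually chosen at random). The loop stops when the centers no longer move.

```python
import math

def k_means(points, centers, max_iter=100):
    """Alternate assignment and re-centering until the centers stop changing."""
    for _ in range(max_iter):
        # Assign each observation to the nearest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Re-compute each center as the mean of its assigned observations
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical two-feature observations forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```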

In predicting house prices, how would you handle a feature which describes the lot as "no slope", "gentle slope", "moderate slope", and "severe slope"?

We could use a single dummy variable which equals 0 for no slope, 1 for gentle slope, 2 for moderate slope, and 3 for severe slope.

What are the problems in using an autoencoder to generate new synthetic data? How are they solved with a variational autoencoder?

We want to sample new values for the latent variables that are different from those produced by the historical data. To do this, a VAE fits the latent variable values produced by the historical data to a distribution that can be sampled from.

In predicting house prices, how would you handle a feature which identifies the neighborhood of the house?

We would create a dummy variable for each neighborhood. The dummy variable equals one if the house is in the neighborhood and zero otherwise.

Explain how ANNs can be used in derivatives valuation.

When a derivative is normally valued using a computationally slow numerical procedure such as Monte Carlo simulation, an ANN can be created for valuation. Data relating the derivative's value to the inputs is created using the numerical procedure. The ANN is then trained on the data and used for all future valuations.

What activation function is suggested in the chapter for relating the target to the values in the final layer when the objective is (a) to predict a numerical variable and (b) to classify data?

When the target is numerical the suggested final activation function is linear. When observations are being classified, the suggested activation function is the sigmoid function.

Why is it sometimes necessary to use an artificial neural network in conjunction with reinforcement learning?

When there are many actions or many states (or both) the action/state matrix does not fill up quickly and values can be estimated using an ANN.

How is the Gini measure defined?

When there are n alternative outcomes, the Gini measure is defined as 1 − ∑ p_i^2, summed over i = 1 to n, where p_i is the probability of the ith outcome.
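As a sketch, the standard definitions of the two impurity measures, Gini = 1 − ∑ p_i^2 and entropy = −∑ p_i log(p_i), can be coded directly. The base-2 logarithm for entropy is an assumption (a common convention), and the example probabilities are illustrative.

```python
import math

def gini(probs):
    """Gini measure: 1 minus the sum of squared outcome probabilities."""
    return 1.0 - sum(p ** 2 for p in probs)

def entropy(probs, base=2):
    """Entropy: minus the sum of p * log(p); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Both measures are largest for an even split and zero when one outcome is certain
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))
```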

"A Long Short-Term Memory network is an extension of the idea underlying a recurrent neural network." Explain this statement.

Whereas an RNN remembers the output from the previous time period, an LSTM network can remember output from several periods ago. The LSTM network determines which outputs should be remembered and which can be forgotten.

Explain what is meant by (a) a hidden layer, (b) a neuron, and (c) an activation function.

(a) A hidden layer is a set of intermediate values used when the outputs are calculated from the inputs in a neural network. The set of inputs forms the input layer. The set of outputs forms the output layer. (b) A neuron is one element within a hidden layer for which a value is calculated. (c) An activation function is the function used for the calculation of the values at neurons in one layer from values at neurons in the previous layer.

Explain how a sigmoid function relates the values at the neurons in one layer to the values at neurons in the previous layer.

The sigmoid function for calculating the value at a neuron is f(y) = 1 / (1 + e^(−y)), where y is a constant (the bias) plus a linear combination of the values at the neurons in the previous layer.

