CertNexus CAIP
Hidden layer(s) can
- Have an arbitrary number of nodes/units
- Have an arbitrary number of links from input nodes and to output nodes (or to the next hidden layer)
- There can be multiple hidden layers
After training a random forest, your model returned an out-of-bag error of 0.14. What does this indicate? 14% of the decision trees in the forest incorrectly predicted their own out-of-bag samples. 14% of all samples were incorrectly placed out-of-bag. 14% of each decision tree's out-of-bag samples, on average, were predicted incorrectly. 14% of the majority vote predictions for all out-of-bag samples were incorrect.
14% of the majority vote predictions for all out-of-bag samples were incorrect. - This is what the out-of-bag error describes: the proportion of out-of-bag samples whose majority vote prediction (from the trees that did not train on them) was incorrect. "14% of all samples were incorrectly placed out-of-bag" is not what out-of-bag error describes, and the metric is not an average of each tree's individual incorrect predictions.
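A minimal scikit-learn sketch (synthetic data and arbitrary parameter values, not from the course) of reading the out-of-bag estimate directly:

    # Sketch: reading the out-of-bag estimate from a scikit-learn random forest.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # oob_score=True evaluates each sample using only the trees that did not
    # see it during bootstrap sampling.
    forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    forest.fit(X, y)

    # oob_score_ is the accuracy of the aggregated (majority vote) OOB predictions,
    # so an OOB error of 0.14 corresponds to an oob_score_ of about 0.86.
    print("Out-of-bag error:", 1 - forest.oob_score_)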
Which of the following artificial neural networks (ANNs) is best suited to producing artificially aged photographs of missing persons? Generative adversarial network (GAN) Convolutional neural network (CNN) Recurrent neural network (RNN) Multi-layer perceptron (MLP)
A GAN is designed so that a generator will create new images based on its training, which will then be analyzed by a discriminator in an attempt to identify forgeries. An RNN is best suited to natural language processing (NLP) and time series tasks, not image manipulation.
Which of the following is true of logistic regression as compared to k-nearest neighbor (k-NN) for classification? Logistic regression usually gives a worse model than k-NN. Logistic regression usually performs the same as k-NN. Logistic regression will sometimes be better than k-NN and sometimes not. Logistic regression usually gives a better model than k-NN.
Each algorithm may perform better under different circumstances, so it is useful to try both in order to compare their results. Logistic regression does not produce an intrinsically better model.
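A small sketch (synthetic data, placeholder hyperparameters) of trying both on the same split to compare their results, as the explanation above suggests:

    # Sketch: compare logistic regression and k-NN on the same train/test split.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                        ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))  # accuracy; neither is guaranteed to win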
Recurrent Neural Network (RNN)
An RNN models sequential interactions through a hidden state, or memory. It can take up to N inputs and produce up to N outputs. For example, an input sequence may be a sentence with the outputs being the part-of-speech tag for each word (N-to-N). An input could be a sentence, and the output a sentiment classification of the sentence (N-to-1). An input could be a single image, and the output could be a sequence of words corresponding to the description of that image (1-to-N). At each time step, an RNN calculates a new hidden state ("memory") based on the current input and the previous hidden state. The "recurrent" stems from the fact that at each step the same parameters are used and the network performs the same calculations based on different inputs.
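A toy NumPy sketch (weights and sizes invented for illustration) of that recurrent update, with the same parameters reused at every time step:

    # Sketch: one pass of a vanilla RNN over a sequence, reusing the same weights each step.
    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size, seq_len = 4, 3, 5

    W_xh = rng.normal(size=(hidden_size, input_size))   # input -> hidden weights
    W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden (recurrent) weights
    b_h = np.zeros(hidden_size)

    inputs = rng.normal(size=(seq_len, input_size))
    h = np.zeros(hidden_size)  # initial hidden state ("memory")

    for x_t in inputs:
        # New hidden state depends on the current input and the previous hidden state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        print(h)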
Which of the following are critical components of an effective machine learning presentation? (Select two.) An opportunity for discussion A technical explanation A business problem A hypothesis
A business problem - Correct. You need to justify to your audience why your machine learning model is valuable to the business, and a huge part of that is highlighting the business problem that the model is trying to solve. A hypothesis - Correct. You need to put your initial assumptions up front so that your audience knows how your thinking evolved over time and eventually led to your ultimate conclusions. An opportunity for discussion - This should not be selected. While some presentations can benefit from inviting the audience to discuss, this is not a critical component of all presentations.
You have a dataset of books, each of which is categorized according to its literary genre and the author's gender. Your null hypothesis is that author gender and literary genre have no significant effect on each other. Which of the following hypothesis testing methods would be most appropriate in evaluating the null hypothesis? A/B test Analysis of variance (ANOVA) t-test Chi-squared (χ²) test
A chi-squared test compares the effect of categorical variables. In this case, it can evaluate the question implied by the null hypothesis: Does an author's gender influence the literary genre they write in, and vice versa? ANOVA is used to compare distributions, which is not relevant to categorical variables, like those in the scenario.
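For illustration only (the counts below are invented), SciPy runs this test of independence on a genre-by-gender contingency table:

    # Sketch: chi-squared test of independence on a contingency table of counts.
    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: author gender; columns: literary genre (made-up counts).
    table = np.array([[30, 12, 18],
                      [25, 20, 15]])

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p_value:.3f}")  # small p-value -> reject the null hypothesis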
Backpropagation
A common method of training a neural net in which the initial system output is compared to the desired output, and the system is adjusted until the difference between the two is minimized.
memory cells
A component of an RNN that maintains a certain state over time.
Which of the following describes the relationship between a machine learning model and a machine learning algorithm? A machine learning model represents the input data before it is fed into a machine learning algorithm. A machine learning model generates a machine learning algorithm through training. A machine learning model is the sum of multiple machine learning algorithms. A machine learning model is created by applying a machine learning algorithm to input data.
A machine learning model is created by applying a machine learning algorithm to input data. The algorithm is a set of rules upon which the model is built. The model uses these rules to estimate something about the world through available data. Think of an algorithm as a blueprint or recipe, and the model as the finished building when given resources or the finished meal when given ingredients.
How does a multi-label classification perceptron differ from a binary perceptron? Multi-label perceptrons do not use the bias term. Multi-label perceptrons have multiple output neurons. Multi-label perceptrons use threshold logic units (TLUs). Multi-label perceptrons have multiple hidden layers.
A multi-label perceptron includes an output layer that has a neuron for each output class, whereas a binary perceptron has just one output neuron. Both multi-label and binary perceptrons use TLUs.
Multi-layer perceptron (MLP)
A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for training the network. MLP is a deep learning method.
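A small scikit-learn sketch (layer sizes and data are arbitrary choices, not prescribed values) of an MLP trained with backpropagation:

    # Sketch: a multi-layer perceptron classifier with two hidden layers.
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # hidden_layer_sizes defines the hidden layers; training uses backpropagation.
    mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
    mlp.fit(X, y)
    print(mlp.score(X, y))  # training accuracy, for illustration only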
Generative Adversarial Network (GAN)
A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid.
Pooling Layer
A type of layer in a convolutional neural network (CNN) that applies an aggregation function to input features in order to make a more efficient selection.
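A hand-rolled NumPy sketch (purely illustrative) of 2x2 max pooling, a common aggregation function used in pooling layers:

    # Sketch: 2x2 max pooling over a single-channel "image".
    import numpy as np

    image = np.arange(16).reshape(4, 4)  # toy 4x4 input

    # Split into non-overlapping 2x2 blocks and keep the maximum of each block.
    pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)  # 2x2 output: the input has been downsampled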
Which of the following cloud services is used as a teaching tool for reinforcement learning? AWS DeepRacer Cloud AutoML Watson OpenScale Azure Databricks
AWS DeepRacer AWS DeepRacer promotes reinforcement learning to beginners by enabling them to train a toy race car. Watson OpenScale is an enterprise-grade AI project management platform, not a service that teaches reinforcement learning to beginners. Cloud AutoML is a managed machine learning service that is beginner friendly, but it doesn't promote reinforcement learning in particular. Azure Databricks is a managed machine learning and deep learning service, not a service that teaches reinforcement learning to beginners.
Which of the following is an example of an inference attack? A customer database includes name and address information. An attacker is able to find the address of a specific customer just by looking up their name. An account database includes user names and cryptographically secured passwords. An attacker finds a database on the Internet that matches plaintext passwords with their secured versions, enabling the attacker to steal users' credentials. A donation contact list for a political candidate is filtered to only show people from a particular area code. However, an attacker is able to reverse this filter to see the entire list of contacts, regardless of area code. An employee database is sorted by salary, but those salary figures have since been removed. Still, an attacker is able to figure out how much money a specific employee makes.
An employee database is sorted by salary, but those salary figures have since been removed. Still, an attacker is able to figure out how much money a specific employee makes. The attacker is able to infer some information based on existing clues—in this case, the fact that the database has been sorted by salary. If the attacker knows the high and low end of salaries in the company, or even knows the salary of a single employee, they can estimate the salary for their target.
Which of the following data sampling techniques makes cross-validation unnecessary for random forests? Stratified k-fold k-fold Bagging "Random" splitting hyperparameter
Bagging, or bootstrap aggregating (sampling with replacement), randomly samples about two thirds of the data examples to train each tree, leaving the remaining one third out-of-bag for testing. Each tree in the forest sees a different set of training data, while the forest as a whole sees all of the training data. The bagging process ensures the sampling is representative of the data.
Which of the following gradient descent methods uses the entire dataset to calculate step-wise gradients? Batch gradient descent (BGD) Mini-batch gradient descent (MBGD) Stochastic gradient descent (SGD) Stochastic average gradient (SAG)
Batch gradient descent (BGD) calculates the gradient using every data example in the set at each step. Stochastic gradient descent (SGD) only calculates gradients for a random selection of data examples in the set.
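A NumPy sketch (single-weight linear regression with synthetic data, purely illustrative) contrasting one full-batch step with one stochastic step:

    # Sketch: one batch gradient step vs. one stochastic gradient step for linear regression (MSE cost).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=100)
    y = 3.0 * X + rng.normal(scale=0.1, size=100)

    w, lr = 0.0, 0.1

    # Batch gradient descent: gradient computed over the entire dataset.
    grad_batch = -2 * np.mean(X * (y - w * X))
    w_batch = w - lr * grad_batch

    # Stochastic gradient descent: gradient computed from one randomly chosen example.
    i = rng.integers(len(X))
    grad_sgd = -2 * X[i] * (y[i] - w * X[i])
    w_sgd = w - lr * grad_sgd

    print(w_batch, w_sgd)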
Why might big data actually be detrimental to the machine learning process? (Select two.) Big datasets can have a negative impact on predictive performance. Big datasets can be difficult for machine learning algorithms to process. Big datasets are difficult to obtain, which may result in lost time and resources to acquire them. Big datasets can have a negative impact on computing performance.
Big datasets can be difficult for machine learning algorithms to process. Correct. Given that big data can vary wildly in terms of structure and content, a machine learning algorithm may have a difficult time processing such unorganized data. Big datasets can have a negative impact on computing performance. Correct. In most cases, the more data you use in training, the more computing power and time you'll need. Big datasets are difficult to obtain, which may result in lost time and resources to acquire them. This should not be selected. This was true at one point, but now, acquiring incredibly large amounts of data is commonplace and not necessarily cost prohibitive.
You have a classification problem in which a model needs to classify a product as either a shirt or a sweater, AND classify that same product as either small, medium, or large. What type of classification problem is this? Multi-label classification Binary classification Multi-class classification Both multi-label and multi-class classification
Both multi-label and multi-class - The first condition of the example in this scenario (shirt or sweater) is an example of binary classification; but when combined with the second condition (small, medium, or large), the problem becomes both multi-label (because a product can be a sweater, and that sweater can be medium size) and multi-class (because the size of the product has three or more choices). In multi-label classification, a data example can be assigned multiple labels, which is not entirely what the scenario describes.
You are creating a classification model, and are selecting between logistic regression and k-nearest neighbor (k-NN). Which of the following correctly compares the two? Unlike k-NN, the output from logistic regression is the classification itself and not a probability. Unlike k-NN, logistic regression doesn't really improve its classification abilities through learning; it merely calculates the distance between data examples mapped to a feature space and determines the class. k-NN will typically take more time than logistic regression to actually make a prediction when there are many data examples and features. In a classification model, both will perform identically.
Calculating neighbors in k-NN can quickly become infeasible in large datasets. Logistic regression does, in fact, improve its abilities through learning. Logistic regression does, in fact, output a probability. In a classification model, it is not guaranteed that both will perform identically
Which of the following scenarios indicates the presence of noise in a dataset? Certain values are incorrect due to faulty measurements. Certain values fail to contribute to the model's pattern recognition abilities. Certain values are missing altogether in the dataset. Certain values strongly deviate from the dataset's normal distribution.
Certain values fail to contribute to the model's pattern recognition abilities. - Noise involves data that doesn't help the model make predictions, and in some cases, hinders those predictions. Certain values are incorrect due to faulty measurements - This describes errors, not noise.
You have a dataset of customer information, such as a customer's location, spending habits, product reviews, and so forth. While you don't have anything specific to predict, you want to engage in customer segmentation so that customers with similarities are considered a unified audience in your targeted marketing campaigns. Which type of machine learning outcome is most appropriate for this situation? Dimensionality reduction Classification Regression Clustering
Clustering Clustering is a type of unsupervised learning that groups together data points with similarities, and is therefore an excellent outcome when the goal is customer segmentation.
How does regularization help reduce overfitting in a machine learning model? By splitting data into groups that rotate during training. By transforming features to be on the same scale. By constraining the model parameters. By converting categorical variables into numeric variables.
Constraining the model parameters decreases variance and consequently minimizes overfitting. By transforming features to be on the same scale. - This is describing a technique like normalization or standardization, not regularization.
Cold-deck imputation or hot-deck imputation do this
Fill in missing values by copying values from similar records (hot-deck imputation uses records from the same dataset; cold-deck imputation uses records from a different dataset).
Which of the following is a cost function used to evaluate the performance of a softmax function in multinomial logistic regression? Cross-entropy Cluster sum of squares Coefficient of determination Log loss
Cross-entropy penalizes low probability scores for particular classes in a multinomial logistic regression model. Cluster sum of squares cost functions are used to evaluate clustering models, not multinomial logistic regression models.
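A NumPy sketch (with invented scores) of how cross-entropy penalizes a low softmax probability for the true class:

    # Sketch: softmax probabilities and the cross-entropy loss for one example.
    import numpy as np

    scores = np.array([2.0, 1.0, 0.1])   # raw model outputs (logits) for 3 classes
    true_class = 0

    # Softmax turns scores into class probabilities that sum to 1.
    probs = np.exp(scores) / np.sum(np.exp(scores))

    # Cross-entropy penalizes a low probability for the true class.
    loss = -np.log(probs[true_class])
    print(probs, loss)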
Which of the following refers to the entire process by which data is prepared and transformed into a more usable form? Feature engineering Dimensionality reduction Data wrangling Data cleaning
Data wrangling is the entire process of transforming data so that it becomes more usable. It can include many techniques, like data cleaning and feature engineering. Data Cleaning is one possible component of the preparation process, but not the entire process itself.
Leptokurtic distribution
Distribution curve is very tall, thin and peaked. (Memory: Leptokurtic leaps tall buildings in a single bound.) Kurtosis is greater than 3
How are the neurons in a multi-layer perceptron (MLP) hidden layer connected? Each neuron in a hidden layer is connected to all neurons in the output layer. Selected neurons in a hidden layer are connected to selected neurons in the output layer. Selected neurons in a hidden layer are connected to all neurons in the output layer. Each neuron in a hidden layer is connected to selected neurons in the output layer.
Each neuron in a hidden layer is connected to all neurons in the output layer. - These are referred to as fully connected layers. Selected neurons in a hidden layer are connected to selected neurons in the output layer. - This is not the case. Selected neurons in a hidden layer are connected to all neurons in the output layer. - This is not the case in hidden layers in MLPs.
Which of the following methods is used to train a multi-layer perceptron (MLP) network? Error calculations start at the end of the network and then work backwards. The activation function removes bad predictions. Error calculations start at the first layer of the network and then are propagated to the next layer. Neurons that make incorrect predictions have their weight increased.
Error calculations start at the end of the network and then work backwards. - This is called backpropagation. Starting from the last hidden layer, the neurons that would have led to a correct prediction have their weights increased. This repeats for each layer moving backwards through the network until the input layer is reached. Error calculations start at the first layer of the network and then are propagated to the next layer. - This is not the way error calculations are propagated.
You have a tabular dataset of employee records. When a fellow practitioner references salary = 67000, which type of data are they referring to? Attribute Example Feature Dimension
Feature - A feature most often refers to a specific data value held in an attribute—a column in tabular data—which is what your co-worker is referencing. An attribute most often refers to the columns in a tabular dataset, which is not what your co-worker is referencing. An example most often refers to a row or record in a tabular dataset, which is not what your co-worker is referencing
Which of the following is an advantage of iterative learning over closed-form solutions? Iterative learning is guaranteed to find the best model. Iterative learning is generally faster to perform. Iterative learning does not require feature scaling to be effective. Iterative learning enables learning from very large datasets.
Iterative learning does not require computations to be made with the entire dataset at once, and therefore enables learning on datasets that are too big to work with directly. Iterative learning requires repeated steps of computation and is therefore more expensive to perform than calculating closed-form solutions.
Which of the following kernels are most prone to overfitting the data? (Select two.) Linear Gaussian Sigmoid Polynomial
The Gaussian radial basis function (RBF) kernel may be prone to overfitting. A high polynomial degree may lead the polynomial kernel to overfit.
Which of the following methods for hyperparameter optimization uses a population of individual parameter combinations and evolves them generationally using a fitness function? Bayesian optimization Randomized search Genetic algorithms Grid search
Genetic algorithms are evolutionary algorithms that use a fitness function to "evolve" parameters. Bayesian optimization is a method in which priors are initialized and iteratively modified based on a loss function.
Which of the following kernel methods is particularly effective when applied to data that has many more examples than features? Linear kernel Gaussian radial basis function (RBF) kernel Polynomial kernel Sigmoid kernel
Gaussian radial basis function (RBF) kernel - The RBF kernel can project data into higher-dimensional space with minimal performance loss when there are many more data examples than features.
Which of the following data encoding schemes maps a string of text to a seemingly random, yet still algorithmically deterministic, value? Frequency-based encoding One-hot encoding Hash encoding Target mean encoding
Hash encoding takes a string of text and maps it to a value of fixed size according to the hash algorithm used. The same input will always result in the same hash output, making this type of encoding deterministic. Frequency-based encoding uses the frequency of a variable to assign it a weight. It doesn't map text to a deterministic value.
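A small sketch (the bucket count and hash function are arbitrary choices) showing that hashing the same string always yields the same value:

    # Sketch: deterministic hash encoding of strings into a fixed number of buckets.
    import hashlib

    def hash_encode(text, n_buckets=16):
        # md5 is deterministic: the same input always maps to the same digest.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        return int(digest, 16) % n_buckets

    print(hash_encode("red"), hash_encode("red"))   # identical outputs
    print(hash_encode("blue"))                      # a (seemingly random) different bucket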
What does skewness measure? The shape of the tails in a distribution. The difference between the most extreme outlier and either the minimum or maximum. How many data examples are two standard deviations above the mean. How much a distribution differs from the normal distribution.
How much a distribution differs from the normal distribution. Skewness measures a distribution's symmetry and how it deviates from the symmetrical bell shape of a normal distribution. The shape of the tails in a distribution - This is what kurtosis measures, not skewness.
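For illustration (data generated at random), SciPy reports both measures directly:

    # Sketch: skewness (asymmetry) vs. kurtosis (tail shape) of a sample.
    import numpy as np
    from scipy.stats import kurtosis, skew

    rng = np.random.default_rng(0)
    sample = rng.exponential(size=1000)  # right-skewed data

    print("skewness:", skew(sample))                    # > 0 for a right-skewed distribution
    print("kurtosis:", kurtosis(sample, fisher=False))  # a normal distribution has kurtosis 3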
In which of the following might a sigmoid kernel be used? Image classification Time series Natural language processing (NLP) Clustering
Image classification - A sigmoid kernel with SVMs is very similar to a neural network and can be used for image classification and object detection. Sigmoid kernels would not typically be used for time series or clustering tasks.
Hierarchical Divisive Clustering vs Hierarchical Agglomerative Clustering
In agglomerative clustering, each data example starts as its own cluster, and the closest clusters are then merged. In divisive clustering, all data examples start in one cluster, which is then split into smaller clusters.
Which of the following is a Python tool that provides a frontend environment to the TensorFlow deep learning library? SciPy Apache Spark MLlib Keras PyTorch
Keras Keras is a high-level interface into TensorFlow, and is particularly useful for beginners new to deep learning. SciPy is a component of the Python data science stack that provides mathematical functions, not a frontend to TensorFlow. PyTorch is a deep learning library that is similar to TensorFlow, but it is not a frontend to TensorFlow. Apache Spark MLlib is a framework for machine learning that leverages cluster computing, not a frontend to TensorFlow.
Convolutional neural networks (CNNs)
In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has been successfully applied to analyzing visual imagery.
Receiver operating characteristic (ROC)
In studies of signal detection, the graphical plot of the hit rate (true positive rate) as a function of the false alarm rate (false positive rate).
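A scikit-learn sketch (toy labels and scores, not real results) of the ROC relationship between the true positive rate and the false positive rate:

    # Sketch: ROC curve points and AUC from predicted scores.
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = [0, 0, 1, 1, 0, 1, 1, 0]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55]  # model's predicted probabilities

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false alarm rate vs. hit rate
    print(list(zip(fpr, tpr)))
    print("AUC:", roc_auc_score(y_true, y_score))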
It is taking too long to train a linear regression model using an iterative cost minimization technique. What might you change so it takes less time? Switch from stochastic average gradient (SAG) to batch gradient descent (BGD). Use a larger training dataset. Increase the learning rate. Switch from stochastic gradient descent (SGD) to batch gradient descent (BGD).
Increasing the learning rate of gradient descent decreases training time because it decreases the number of "steps" required to converge at a minimum. Switch from SAG to BGD - This will more likely add time to training, not reduce it.
Which of the following splitting metrics does the C4.5 decision tree algorithm use? Entropy Gini index Information gain Information gain ratio
Information gain ratio - C4.5 uses information gain ratio as its splitting metric. Like information gain, the information gain ratio is the entropy of a parent node minus the entropy of its child nodes; but with information gain ratio, if two decision nodes have the same information gain, the decision node with the fewest distinct values takes precedence.
You develop a model that is erring toward high bias. Which of the following best describes the situation you've created in this model? It may underfit the training set. It is more complex than it should be. It is more likely to be influenced by noise. It is skillful.
It may underfit the training set. - Models with high bias are more likely to underfit to the training data, i.e., they will not be able to find meaningful patterns. It is more complex than it should be - Models with a high bias tend to be more simplistic
Which of the following concepts refers to actionable intelligence? Wisdom Data Knowledge Information
Knowledge Knowledge is actionable intelligence because it can reveal some action that must be taken.
You've managed to find a spreadsheet of customer purchasing history for the business over a period of a few years. You plan to feed this data into your supervised machine learning model in order to predict what types of products will generate the most gross income in the future. The spreadsheet has rows for each purchase and columns for the ID of the customer who made the purchase; what category of product they purchased; the quantity of the purchase; the total price of the purchase; and the revenue generated from the purchase. Luckily, none of the cells have missing values. Considering the objective of the machine learning model, what crucial type of data is missing from this spreadsheet? Example Label Attribute Feature
Label - The model is supposed to predict gross income, which means gross income is the label, but no such column seems to exist. Since each column is defined, you have your attributes.
Which of the following is a cross-validation technique that is commonly used to minimize bias in small datasets? Leave-one-out Bagging Stratified k-fold Holdout
Leave-one-out cross-validation (LOOCV) is particularly effective at minimizing bias in small datasets, and is typically not applied to larger datasets due to variance and performance issues. Stratified k-fold cross-validation doesn't minimize bias in small datasets in particular. Bagging is related to cross-validation and is a sampling method typically used by random forests, but it doesn't necessarily minimize bias in small datasets.
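A scikit-learn sketch (small synthetic dataset, arbitrary estimator) of leave-one-out cross-validation:

    # Sketch: leave-one-out cross-validation, where each sample is its own test fold.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = make_classification(n_samples=40, n_features=5, random_state=0)  # small dataset

    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
    print(len(scores), scores.mean())  # 40 folds, one per sample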
With the classification and regression tree (CART) algorithm, which feature is chosen as the root decision node? The feature with the median purity. The feature with the least purity. The feature with the highest Gini index. The feature with the lowest Gini index.
The feature with the lowest Gini index - This feature has the greatest purity, so it is chosen as the root node. The feature with the highest Gini index would not be chosen as the root node.
Which of the following reasons makes mean squared error (MSE) often preferable over mean absolute error (MAE) in machine learning? MSE is easier to compute. MSE is differentiable. MSE always gives a positive value for error. MSE incorporates the size of the training set.
MSE is differentiable, and differentiable cost functions enable useful machine learning techniques.
How does a random forest used for classification determine the final prediction? Mode of all decision tree classifications Highest accuracy of all algorithms Mean of all decision tree predictions Weighted average of impurity reduction across all trees in forest
Mode of all decision tree predictions - A random forest is an ensemble method that uses majority voting to find the mode of the classifiers from all decision trees in the forest. Mean of all decision tree predictions - This describes how a decision tree can be used for regression. Weighted average of impurity reduction across all trees in forest - This approach can be used to identify feature variables of primary importance to aid dimensionality reduction. Highest accuracy - An evaluation metric like accuracy is not used to determine the final prediction.
Which of the following graphics processing units (GPUs) supports the proprietary Compute Unified Device Architecture (CUDA) used in applications like deep learning? Intel Xe series ARM Mali series Nvidia GeForce series AMD RX series
Nvidia GeForce series CUDA is a proprietary Nvidia standard, and is usable by modern GPUs in the GeForce line. AMD GPUs do not support CUDA.
Which of the following is a benefit of using a pooling layer in a convolutional neural network (CNN)? Pooling layers prepare data to be sent to a final dense (fully connected) layer by flattening the input. Pooling layers retain important information by adding zeroed pixels to the input, preserving its dimensions and enabling a filter to scan the entire image. Pooling layers reduce computation time by retrieving only the maximum value scanned by a filter. Pooling layers reduce computation time by increasing the distance between the filters, effectively downsampling the input.
Pooling layers reduce computation time by retrieving only the maximum value scanned by a filter. - In a pooling layer, an aggregation function makes a more efficient selection of features than a convolutional layer. Pooling layers prepare data to be sent to a final dense (fully connected) layer by flattening the input. - This describes a flattening layer, not a pooling layer. Pooling layers reduce computation time by increasing the distance between the filters, effectively downsampling the input. - This describes stride, not a pooling layer.
Regularization
Regularization is a technique used in an attempt to solve the overfitting problem in statistical models.
What function does ridge regression use for its regularization term? ℓ₁ norm Mean absolute error (MAE) Mean squared error (MSE) ℓ₂ norm
Ridge regression uses the ℓ₂ norm as its regularization term. Lasso regression, not ridge regression, uses the ℓ₁ norm as its regularization term.
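A quick scikit-learn sketch (alpha values and data chosen arbitrarily) contrasting the two penalties:

    # Sketch: ridge (l2 penalty) vs. lasso (l1 penalty) regression.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

    ridge = Ridge(alpha=1.0).fit(X, y)   # l2 norm shrinks coefficients toward zero
    lasso = Lasso(alpha=1.0).fit(X, y)   # l1 norm can drive some coefficients exactly to zero

    print("ridge:", ridge.coef_.round(2))
    print("lasso:", lasso.coef_.round(2))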
Why is standard deviation preferred over variance for explaining or reporting purposes? Standard deviation ensures that a descriptive measure like mean is in the same scale as the data itself. Standard deviation produces numbers that are small and easier to read. Standard deviation is able to demonstrate variability independent of the number of samples in the population. Standard deviation can be used to produce charts and other visualizations.
Standard deviation ensures that a descriptive measure like mean is in the same scale as the data itself. This is what happens when you reverse the squaring operation of variance. Doing so helps make the results more interpretable. Both standard deviation and variance make it easy to demonstrate variability independent of the number of samples in the population
What is an advantage that mean absolute error (MAE) has over mean squared error (MSE)?
MAE is less sensitive to outliers. Because MSE squares each error, taking the square makes values less than 1 smaller and values more than 1 larger, so a few large errors can dominate the cost; MAE weights all errors linearly.
Which of the following measures weighs both precision and recall? F₁ score Area under curve (AUC) Receiver operating characteristic (ROC) Accuracy
The F₁ score combines precision and recall to give a combined measure. ROC shows the relationship between the true positive rate and the false positive rate. AUC gives a measurement of aggregate performance across all decision thresholds for a classification model. Accuracy measures how frequently each prediction is correctly deemed positive or negative.
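A scikit-learn sketch (toy labels, not real results) showing how the F1 score combines precision and recall:

    # Sketch: precision, recall, and their harmonic mean, the F1 score.
    from sklearn.metrics import f1_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))  # harmonic mean of precision and recall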
The threshold logic unit (TLU) in a simple perceptron determines output using which of the following functions? Sum of weighted inputs Sigmoid activation function Heaviside step function Mean of input neuron weights
The Heaviside step function outputs a binary value 0 or 1 based on the weighted sum of the inputs plus the bias value. A simple perceptron does not use the sigmoid activation function.
Which logical operation cannot be modeled using a simple perceptron? OR AND XOR NOT
The XOR operation is not linearly separable and can't be modeled by a simple perceptron.
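A NumPy sketch (weights chosen by hand, not learned) of a TLU with a Heaviside step output; a single weight setting can model AND, but no setting can model XOR:

    # Sketch: a single threshold logic unit (TLU) with a Heaviside step output.
    import numpy as np

    def tlu(x, weights, bias):
        # Heaviside step: output 1 if the weighted sum plus bias is >= 0, else 0.
        return int(np.dot(weights, x) + bias >= 0)

    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

    # Hand-picked weights implementing AND (linearly separable).
    for x in inputs:
        print(x, "AND ->", tlu(np.array(x), weights=np.array([1.0, 1.0]), bias=-1.5))

    # XOR is not linearly separable, so no single TLU's weights can reproduce it.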
What can you tell by examining the cutoff line in a dendrogram? Which attributes in a linear regression model are not contributing to the outcome. The optimal number of clusters in a hierarchical clustering model. The line of best fit in a linear regression model. The number of hidden layers in a multi-layer perceptron.
The cutoff line tends to indicate a good stopping point for the hierarchical model; in other words, how many clusters there should be. Dendrograms are not applicable to linear regression.
Which of the following hyperparameters controls the width of the hyperplane in support-vector machines (SVMs) that solve linear regression problems? Regularization penalty (C) Gamma (γ) Epsilon (ε) Alpha (α)
The epsilon hyperparameter controls the width of the hyperplane in SVMs that solve linear regression problems, where larger epsilon values result in more errors.
How can you select which SVM kernel has the best skill for a particular dataset? Use the kernel that is automatically chosen for you Use a grid search function Use a score function Always use Gaussian
The grid search function allows all selected kernels and hyperparameters to be tried in combination, with the best combination reported by the function. While a score function can provide a comparative value for a kernel, each would have to be tested individually, which would not be efficient.
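A scikit-learn sketch (the parameter grid and data are illustrative only) of letting grid search compare kernels:

    # Sketch: grid search over SVM kernels and hyperparameters with cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    param_grid = {"kernel": ["linear", "rbf", "poly", "sigmoid"], "C": [0.1, 1, 10]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_, search.best_score_)  # best kernel/C combination found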
In a silhouette analysis, which of the following indicates the optimal choice of k clusters? A silhouette coefficient exactly at 0.5. A silhouette coefficient close to 1. A silhouette coefficient above 1. A silhouette coefficient close to 0.
The higher the silhouette coefficient (i.e., the closer to 1 it is), the better. Silhouette coefficients do not exceed 1.
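A scikit-learn sketch (synthetic blob data) of choosing k by the silhouette coefficient:

    # Sketch: pick the number of clusters k with the highest silhouette coefficient.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))  # the k whose score is closest to 1 is the best choice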
Which of the following gates in a long short-term memory (LSTM) cell identifies what information should be kept in long-term memory? Output gate Input gate Forget gate tanh gate
The input gate determines how much of the output from the tanh function will be combined with the existing long-term state. The forget gate determines what should be removed from the long-term memory, not what should be kept. The output gate determines what should be included in the next cell's short-term memory.
Which of the following is used as the cost function for training a logistic regression model? Mean absolute error (MAE) Mean squared error (MSE) The normal function Log loss
The log loss cost function is used for logistic regression models. MSE is used as a cost function for linear regression models
Which of the following is true regarding the downside of using the closed-form normal equation to solve linear regression problems? The normal equation leads to lower predictive skill if the data is not properly normalized before training. The normal equation is not useful in large datasets due to the memory issues involved with computing inverse matrices. The normal equation cannot be regularized and is therefore unable to account for overfitting issues. The normal equation is ineffective with most datasets because you cannot take the inverse of a non-square matrix.
The normal equation is not useful in large datasets due to the memory issues involved with computing inverse matrices. Computing inverse matrices takes a great deal of time and computing power, and the larger the dataset, the more such calculations need to take place. This makes the normal equation not ideal for large datasets.
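A NumPy sketch (synthetic data, made-up coefficients) of the closed-form normal equation, theta = (X^T X)^-1 X^T y; inverting X^T X is exactly the step that becomes costly on very large datasets:

    # Sketch: solving linear regression with the closed-form normal equation.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=100)

    X_b = np.c_[np.ones(len(X)), X]  # add a bias (intercept) column

    # theta = (X^T X)^-1 X^T y -- the matrix inversion is the expensive step.
    theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    print(theta)  # approximately [0.5, 2.0, -1.0]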
Forget gate
The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget gates maintain context by deciding which information to discard from the cell state.
Which of the following activation functions would be a good choice for the hidden layers in a network? tanh Heaviside ReLU Sigmoid
The rectified linear unit (ReLU) function outputs 0 if the input is negative, which helps make the network "sparse," increasing training performance. Use leaky ReLU to avoid the "dying ReLU" problem, in which neurons with negative inputs produce zero gradients that prevent error gradients from propagating backward.
What does it mean for a machine learning model to be "stochastic"?
The same input can produce different outputs over multiple training sessions.
Which of the following functions is used to train a multinomial logistic regression model? Softmax function Cost function Heaviside step function Rectified linear unit (ReLU) function
The softmax function is used to train multinomial logistic regression models by determining class probabilities. ReLU is a type of neural network activation function, and is not used to train multinomial logistic regression models.
In a confidence interval for a hypothesis test, you have a confidence level of 95% and a range of mean values that is (300, 750). What does this mean? There is a 5% chance that the true mean is not between 300 and 750. There is a 5% chance that the true mean is between 300 and 750. There is a 95% chance that the true mean is between 300 and 750. There is a 95% chance that the true mean is not between 300 and 750.
There is a 95% chance that the true mean is between 300 and 750. There is a 5% chance that the true mean is not between 300 and 750. - This is not what is indicated by the given confidence level and its associated range of values.
Which of the following enables the recurrent layers in a recurrent neural network (RNN) to be viewed in a time sequence? Unrolling Gated recurrent unit (GRU) Backpropagation through time (BPTT) Embedding
Unrolling places recurrent RNN layers in a time sequence, enabling easier visualization and training. Once unrolled, backpropagation through time (BPTT) is performed just as it would be on a normal ANN. BPTT does not enable layers to be viewed in sequence but is used to train the model once it has been unrolled. GRU is a simplified version of a long short-term memory (LSTM) cell with improved training speed.
Cost Function in Regression
Used to quantify the difference between the model's predicted values and the actual values.
What are two situations that SVMs are a better choice than other classification and regression algorithms?
When the data contains outliers and when there are many dimensions
activation function
a function that assigns an output signal on the basis of the total input
coefficient of determination
a measure of the amount of variation in the dependent variable about its mean that is explained by the regression equation
prejudice, bias
a result of training data that is influenced by cultural or other stereotypes
Artificial Neural Networks (ANNs)
computer systems that are intended to mimic human cognitive functioning
Which of the following tools is primarily used to create your own word embeddings using n-grams? Word2vec fastText Bag-of-words Doc2vec
fastText - This partitions words into n-grams and stores them in embedded space. Each n-gram is n letters long. Word2vec - This can create word embeddings where each word is a vector. It does not typically use n-grams.
embedding
in an RNN, the process of condensing a language vocabulary into vectors of relatively small dimensions. Pre-trained word embeddings that other machine learning practitioners have made public can be used to save on training time.
Attrition bias
occurs when participants drop out of a long-term experiment or study
LSTM (long short term memory cell)
preserves input that is significant to the training process, while forgetting input that is not
Gated Recurrent Unit (GRU)
simplified version of LSTM cell that can be used to lower training time
stride
the distance between filters in a convolution as they scan an image
padding
the practice of adding pixels around an input image to preserve its dimensions, enabling a convolutional layer's output to be the same size as the actual input
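A Keras sketch (filter count and sizes are arbitrary) showing how "same" padding preserves the spatial dimensions of the input while "valid" (no padding) shrinks it:

    # Sketch: 'same' padding keeps the output the same height/width as the input.
    import numpy as np
    import tensorflow as tf

    image = np.zeros((1, 28, 28, 1), dtype="float32")  # one 28x28 single-channel image

    same = tf.keras.layers.Conv2D(8, kernel_size=3, padding="same")(image)
    valid = tf.keras.layers.Conv2D(8, kernel_size=3, padding="valid")(image)

    print(same.shape)   # (1, 28, 28, 8) -- padded, dimensions preserved
    print(valid.shape)  # (1, 26, 26, 8) -- no padding, the output shrinks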
Perceptron
the simplest neural network possible: a computational model of a single neuron
