ML

Given the following Pandas dataframe: df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], columns=['A', 'B', 'C']) what does df.loc[0].iat[1] do? Returns the element '2' Returns the element '0' Removes row 0 Returns an 'out of bounds' exception None of these

Returns the element '2' The statement df.loc[0].iat[1] retrieves the element at row index 0 and column index 1 in the DataFrame df. In the given DataFrame df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], columns=['A', 'B', 'C']), the element at row index 0 and column index 1 is 2.
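
A quick sketch to verify this behavior, assuming pandas is imported as pd:

import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], columns=['A', 'B', 'C'])
row0 = df.loc[0]      # label-based lookup: the row with index label 0 (values 0, 2, 3)
print(row0.iat[1])    # integer-position lookup within that row -> 2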

Which of the following is used in LDA to find the maximum separation between categories? The eigenvalues of the between class scatter matrix. The eigenvalues of the ratio of between-class to within-class scatter matrices. The eigenvalues of the ratio of within-class to between-class scatter matrices. The eigenvalues of the covariance of the centered dataset features. None of these

The eigenvalues of the ratio of between-class to within-class scatter matrices. In Linear Discriminant Analysis (LDA), the goal is to maximize the separation between different class categories. This is achieved by finding the directions (eigenvectors) that maximize the ratio of between-class scatter to within-class scatter. The scatter matrices used in LDA are the between-class scatter matrix (Sb) and the within-class scatter matrix (Sw). The eigenvalues of the ratio (Sb/Sw) determine the importance of each eigenvector in maximizing the separation between classes. By selecting the eigenvectors corresponding to the largest eigenvalues, LDA can achieve the maximum separation between categories.

Which of the following represents the shape of the Weights for a layer in the DNN framework set up with multiple layers? shape of B (biases) for the current layer (#nodes in the current layer, #nodes in previous layer) shape of the Z for the current layer (#nodes in the previous layer, #nodes in current layer) None of these

(#nodes in the current layer, #nodes in previous layer) Each node in the current layer is connected to every node in the previous layer, so the weights matrix will have dimensions based on the number of nodes in the current layer and the number of nodes in the previous layer.

For our DNN framework set up as a Perceptron (1 node, 1 layer), what would be the shape of the weights matrix for an input image of 28x28 pixels and 80 examples? (784, 1) (62720, 1) (28, 80) (784, 80) None of these

(784, 1) In this case, each pixel in the 28x28 input image is considered as a separate input feature, resulting in a total of 784 input features. The weights matrix will have the same number of rows as the number of input features (784) and a single column for the weights associated with the single node in the layer.
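
A minimal sketch of the shapes involved, assuming the 80 images are flattened so that each example becomes a column of the input matrix X:

import numpy as np

images = np.random.rand(80, 28, 28)        # hypothetical batch: 80 examples of 28x28 pixels
X = images.reshape(images.shape[0], -1).T  # flatten each image: X has shape (784, 80)
W = np.zeros((784, 1))                     # one weight per input feature for the single node
b = 0.0
Z = np.dot(W.T, X) + b                     # pre-activation, shape (1, 80): one value per example
print(X.shape, W.shape, Z.shape)           # (784, 80) (784, 1) (1, 80)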

Consider being tested for cancer that occurs in 3% of the people your age and the test is 95% reliable. If you test positive how probable is it that you have cancer? (use your Bayes theorem assignment code) 0.27 0.37 0.47 0.57 None of these

0.37 Given: P(A) = 0.03 (probability of having cancer) and P(B|A) = 0.95 (probability of testing positive given cancer). We want P(A|B), the probability of having cancer given a positive test result. Bayes' theorem: P(A|B) = (P(B|A) * P(A)) / P(B). The denominator P(B) is the total probability of testing positive, whether you have cancer or not: P(B) = P(B|A) * P(A) + P(B|~A) * P(~A) = 0.95 * 0.03 + 0.05 * 0.97 = 0.0285 + 0.0485 = 0.077. Substituting back: P(A|B) = 0.0285 / 0.077 ≈ 0.37.
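
The same calculation as a short Python sketch, using the numbers from this problem:

# P(A): prior probability of having cancer
p_cancer = 0.03
# P(B|A): probability of a positive test given cancer
p_pos_given_cancer = 0.95
# P(B|~A): false-positive rate for a 95% reliable test
p_pos_given_healthy = 0.05

# Total probability of testing positive
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 2))   # 0.37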

What is the R-squared score for the following y = np.array([3, -0.5, 2, 7, 10, 4, -2.3]) y_hat = np.array([2.5, 0.0, 2, 8, 12.2, 3, -3.2])

0.9235807860262009
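
One way to reproduce this score, assuming scikit-learn is available:

import numpy as np
from sklearn.metrics import r2_score

y = np.array([3, -0.5, 2, 7, 10, 4, -2.3])
y_hat = np.array([2.5, 0.0, 2, 8, 12.2, 3, -3.2])

# R^2 = 1 - SS_res / SS_tot
print(r2_score(y, y_hat))   # approximately 0.9236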

With the distribution generated from the following code, how many outliers exist? from numpy.random import randn, seed; seed(1); data = 5 * randn(1000) + 50 0 6 9 22 None of these

22
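
A sketch of one common way to count outliers in this distribution; the exact count depends on the outlier rule used in class (the 1.5 * IQR rule is assumed here):

import numpy as np
from numpy.random import randn, seed

seed(1)
data = 5 * randn(1000) + 50

# Interquartile-range rule: flag points more than 1.5 * IQR outside the 25th/75th percentiles
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(len(outliers))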

For the warmup from 2/8, how many retests must you take to achieve a certainty of at least 99%? 2 3 4 5

3 Given that the probability of passing a single test is 0.9 (or 90%), we can calculate the probability of failing a test as 1 - 0.9 = 0.1 (or 10%). To achieve a certainty of at least 99%, we need to calculate the probability of failing all the tests and subtract it from 1. Let's calculate the probability of failing all the tests for different numbers of retests: 1 retest: Probability of failing = 0.1 2 retests: Probability of failing both = 0.1 * 0.1 = 0.01 3 retests: Probability of failing all three = 0.1 * 0.1 * 0.1 = 0.001 4 retests: Probability of failing all four = 0.1 * 0.1 * 0.1 * 0.1 = 0.0001 As we can see, the probability of failing all the tests decreases exponentially with each additional retest. To achieve a certainty of at least 99%, we need the probability of failing all the tests to be less than or equal to 1% (or 0.01).

Which of the following best describes "fully connected" in a DNN? All nodes in the current layer are connected to all nodes in the output layer. All nodes in the current layer are connected to all nodes in the previous or subsequent layer. All features from a dataset are flattened into a j x 1 vector. The number of outputs is equal to the number of inputs. None of these

All nodes in the current layer are connected to all nodes in the previous or subsequent layer. In a fully connected layer of a Deep Neural Network (DNN), also known as a dense layer, each neuron in the current layer is connected to every neuron in the previous or subsequent layer. This means that the output of each neuron in the current layer is determined by a weighted sum of inputs from all neurons in the previous layer. This connectivity pattern allows information to flow freely between layers, enabling the network to learn complex representations and capture non-linear relationships in the data.

Why should training inputs be scaled (standardized and normalized) when using KNN? The inputs do not need to be scaled for KNN. Because KNN is a density-based algorithm. Because KNN is a distance-measure algorithm. Because inputs to all Machine Learning algorithms should be scaled. None of these

Because KNN is a distance-measure algorithm. When using the K-nearest neighbors (KNN) algorithm, it is generally recommended to scale or normalize the training inputs. This is because KNN relies heavily on calculating distances between data points to determine their similarity. If the features have different scales or units, certain features with larger scales may dominate the distance calculation and influence the outcome more than others. Scaling the inputs ensures that all features contribute equally to the distance calculation. By scaling the inputs, you normalize the range of values for each feature, making them comparable and preventing any one feature from dominating the distance calculation. Common scaling techniques include standardization (subtracting the mean and dividing by the standard deviation) and normalization (scaling to a specified range, such as [0, 1]). Scaling the inputs can improve the performance and accuracy of KNN by providing more meaningful and balanced distance measures between data points.
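
A minimal sketch of scaling before KNN, assuming scikit-learn and a toy dataset where the two features have very different ranges:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# toy data: feature 0 is in the thousands, feature 1 is in single digits
X_train = np.array([[1000, 1.0], [1500, 2.0], [8000, 1.5], [8500, 2.5]])
y_train = np.array([0, 0, 1, 1])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)   # zero mean, unit variance per feature

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y_train)

# new points must be transformed with the same fitted scaler
X_new = scaler.transform(np.array([[2000, 2.2]]))
print(knn.predict(X_new))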

K-means attempts to iteratively calculate the position of which of the following? Mean of all the data point positions Volume of the clusters First data point location Centroid of the clusters None of these

Centroid of the clusters K-means attempts to iteratively calculate the positions of the centroids of the clusters. The centroids represent the mean or average position of the data points within each cluster.

In your own words, describe why the Chi Square test can be useful for data science or machine learning?

The Chi-Square test is useful for data science and machine learning because it quantifies the difference between the outcomes we expect and the outcomes we actually observe, and tells us whether that difference is larger than chance alone would explain (for example, when testing whether two categorical features are independent).

What is the main function/goal of the LDA algorithm? Construct new variables as linear combinations of the initial variables such that all information is compressed into the last new variable. Find Linear Discriminants based on new variables comprised of the transpose of the covariance matrix. Construct new variables as linear combinations of the initial variables such that most information is compressed into the first new variables. Find the Linear Discriminants with the least variance summed over the sample space. None of These

Construct new variables as linear combinations of the initial variables such that most information is compressed into the first new variables. The main function/goal of the Linear Discriminant Analysis (LDA) algorithm is to transform the original features into a new set of variables while maximizing the separation between different classes or categories in the data. It aims to find linear combinations of the original variables that best discriminate between different classes. The LDA algorithm achieves this by maximizing the ratio of between-class variance to within-class variance. The resulting linear combinations, called linear discriminants, are ordered in such a way that the first discriminant captures the most discriminative information, the second discriminant captures the second most discriminative information, and so on.

What is the main function/goal of the PCA algorithm? Construct new variables as linear combinations of the initial variables such that all information is compressed into the last variable. Find Principal components based on new variables comprised of the transpose of the covariance matrix. Construct new variables as linear combinations of the initial variables such that most information is compressed into the first variables. Find the Principal components with the least variance summed over the sample space. None of these

Construct new variables as linear combinations of the initial variables such that most information is compressed into the first variables. PCA aims to transform a high-dimensional dataset into a lower-dimensional space while retaining the most important information. It achieves this by identifying the principal components, which are new variables formed as linear combinations of the original variables. The first principal component captures the maximum variance in the data, followed by the second principal component capturing the second highest variance, and so on. By compressing the information into the first few principal components, PCA allows for dimensionality reduction while retaining as much of the original data's variance as possible. This reduction in dimensionality can help with data visualization, feature selection, and noise reduction, among other applications.

What does the following function do? X_train = X_train.reshape(X_train.shape[0], -1).T Converts the shape of X_train to (X_train.shape[0], X_train.shape[1].T) Converts the shape of X_train to (X_train.shape[0], X_train.shape[1]) Converts the shape of X_train to (X_train.shape[0], m) where m = number of examples Converts the shape of X_train to (X_train.shape[0], n) where n = product of all other dimensions of X_train None of these

Converts the shape of X_train to (X_train.shape[0], m) where m = number of examples In the given code, X_train is reshaped using the reshape function. The reshape function allows us to modify the shape of an array. X_train.shape[0] represents the number of examples in X_train, and by reshaping X_train to have a new shape of (X_train.shape[0], -1), we keep the number of examples unchanged and reshape the remaining dimensions to fit the new shape. The -1 in the reshape function indicates that the size of that dimension will be automatically determined based on the original shape and the other specified dimensions. The .T at the end performs a transpose operation, swapping the rows and columns of the reshaped array. This is often done to align the dimensions properly for further computations or operations.
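
A small shape check with hypothetical dimensions (80 examples of 28x28 pixels):

import numpy as np

X_train = np.random.rand(80, 28, 28)          # 80 examples, 28x28 pixels each
flat = X_train.reshape(X_train.shape[0], -1)  # (80, 784): -1 infers 28*28 automatically
X_train_T = flat.T                            # (784, 80): features x examples after the transpose
print(flat.shape, X_train_T.shape)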

Which of the following performs 'element-wise' matrix multiply (i.e. Hadamard product) and which performs standard matrix multiply (i.e. dot product)? Assume numpy has been imported as np. E.g. x1 = np.arange(9.0).reshape((3, 3)); x2 = -1*np.array([[0,1,2],[3,4,5],[6,7,8]]); #Y = np.multiply(x1, x2); #Y = x1*x2; #Y = x1@x2. Extension: Which multiply operation is performed with the numpy 'dot' function? Element-wise: 'np.multiply' and '@', Standard Matrix Multiply: '*' Element-wise: 'np.multiply' and '*', Standard Matrix Multiply: '@' and 'dot' Element-wise: '@' and '*', Standard Matrix Multiply: 'np.multiply' Element-wise: 'np.multiply', Standard Matrix Multiply: '@' and '*' None of these Cross-correlation: 'np.multiply' and '*', Standard Matrix Multiply: '@'

Element-wise: 'np.multiply' and '*', Standard Matrix Multiply: '@' and 'dot'. Both np.multiply(x1, x2) and x1*x2 compute the Hadamard (element-wise) product, while x1@x2 performs standard matrix multiplication. Regarding the extension question, the np.dot() function also performs standard matrix multiplication.
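
Verifying the operators on the arrays from the question:

import numpy as np

x1 = np.arange(9.0).reshape((3, 3))
x2 = -1 * np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

hadamard1 = np.multiply(x1, x2)   # element-wise (Hadamard) product
hadamard2 = x1 * x2               # same element-wise product
matmul1 = x1 @ x2                 # standard matrix multiply
matmul2 = np.dot(x1, x2)          # also standard matrix multiply

print(np.array_equal(hadamard1, hadamard2))   # True
print(np.array_equal(matmul1, matmul2))       # True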

What is one of the advantages of the DBSCAN algorithm over Kmeans? DBSCAN can identify more tightly packed clusters than Kmeans. DBSCAN is more computationally efficient and uses less memory than Kmeans. DBSCAN can identify oddly-shaped or overlapped clusters. There are no advantages of DBSCAN over Kmeans. None of these

DBSCAN can identify oddly-shaped or overlapped clusters. One of the advantages of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm over K-means is its ability to identify clusters of arbitrary shape. Unlike K-means, which assumes that clusters are spherical and have similar sizes, DBSCAN is capable of identifying clusters of different shapes and sizes, including irregularly shaped or overlapped clusters. DBSCAN achieves this by defining clusters based on density connectivity rather than geometric proximity alone.

Which of the following is the best definition of 'between-class scatter' (S_b) in LDA? Distances from each of the dataset class categories center to the overall dataset center Distance along the axis of maximum variation between all data in the dataset Distances of each sample in the dataset class to the mean of the dataset class Distance perpendicular to the dataset class categories center and the overall dataset center None of these

Distances from each of the dataset class categories center to the overall dataset center. The between-class scatter measures the spread or dispersion between different class categories in the dataset. It quantifies the separation between the class category centers and the overall dataset center. By calculating the distances between the class category centers and the overall dataset center, LDA aims to maximize the between-class scatter, as it indicates a better discrimination between different classes.

What do the eigenvectors and eigenvalues represent in PCA? Eigenvectors: covariance of the features along the diagonal, Eigenvalues: scaled covariance in the direction of the eigenvector Eigenvectors: amount of variance attached to each PC, Eigenvalues: direction of PC with most variance Eigenvectors: direction of PC with least variance, Eigenvalues: amount of variance attached to each PC Eigenvectors: direction of PC with most variance, Eigenvalues: amount of variance attached to each PC None of these

Eigenvectors: direction of PC with most variance, Eigenvalues: amount of variance attached to each PC In PCA, eigenvectors and eigenvalues play a fundamental role. The eigenvectors represent the directions or axes in the original feature space, while the eigenvalues represent the amount of variance or importance associated with each eigenvector. When performing PCA, the eigenvectors are computed from the covariance matrix (or correlation matrix) of the original data. These eigenvectors represent the principal components (PCs) of the data, and they indicate the directions along which the data varies the most. The eigenvectors are orthogonal to each other, and they form a new coordinate system. The eigenvalues correspond to the variance of the data along each eigenvector or PC. They indicate the amount of information or variability that is captured by each PC. Higher eigenvalues indicate that the corresponding PC explains more variance in the data.

Which of the following best describes how a cluster is formed in DBSCAN? Extend all the noise points into a cluster by adding any point that is within its neighborhood Group all core points together Extend all density-reachable data points with noise points Extend a cluster from a core point by adding other core points within the neighborhood None of these

Extend a cluster from a core point by adding other core points within the neighborhood In DBSCAN (Density-Based Spatial Clustering of Applications with Noise), clusters are formed by connecting core points to their density-reachable neighbors. A core point is a data point that has at least a specified number of other points within its neighborhood (defined by the epsilon parameter). Starting from a core point, the algorithm expands the cluster by adding other core points that are within the neighborhood. This process continues until no more core points can be added, and the cluster is formed. Noise points, which are points that do not have enough neighbors to be considered core points, are not included in any cluster.

Which of the following will vectorize the cost function J? J = (- 1 / m) * (Y*np.log(A + epsilon) + (1 - Y)*(np.log(1 - A + epsilon))) J = (- 1 / m) * (np.sum(Y*np.log(A + epsilon) + (1 - Y)*(np.log(1 - A + epsilon)))) J = (- 1 / m) * np.squeeze((Y*np.log(A + epsilon) + (1 - Y)*(np.log(1 - A + epsilon)))) J = (- 1 / m) * (Y*np.log(A) + (1 - Y)*(np.log(1 - A))) None of these

J = (- 1 / m) * (np.sum(Y*np.log(A + epsilon) + (1 - Y)*(np.log(1 - A + epsilon))))
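
A small sketch of this vectorized cost on made-up example values (the epsilon term guards against log(0)):

import numpy as np

m = 4
Y = np.array([[1, 0, 1, 0]])            # true labels, shape (1, m)
A = np.array([[0.9, 0.2, 0.7, 0.1]])    # predicted probabilities, shape (1, m)
epsilon = 1e-8

J = (-1 / m) * (np.sum(Y*np.log(A + epsilon) + (1 - Y)*(np.log(1 - A + epsilon))))
print(J)   # a single scalar cost for the whole batch (about 0.198 here)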

What type of machine learning model does Lasso regression revert to when the lambda (alpha) regularization parameter approaches zero? Linear Regression Logistic Regression Non-linear Regression Ridge Regression None of these

Linear Regression Lasso regression is a linear regression model with L1 regularization, where the lambda parameter controls the strength of the regularization. As the lambda parameter approaches zero, the penalty on the coefficients diminishes, and the L1 regularization term has less influence on the model. This effectively removes the sparsity-inducing property of Lasso regression, and the model becomes equivalent to ordinary Linear Regression. In Linear Regression, the objective is to minimize the sum of squared residuals between the predicted and actual values. There is no additional penalty term on the coefficients as in Lasso regression or Ridge regression. The model aims to find the best-fit line or hyperplane that minimizes the residual errors.

Which of the following biological neuron element is modeled by an electrical neuron model 'weight' parameter? Dendrite Soma Axon Synapse None of these

None of These Among the options provided, the 'weight' parameter in an electrical neuron model is not directly related to any specific biological neuron element. Instead, the 'weight' parameter is used in artificial neural networks (ANNs) to represent the strength or importance of the connection between neurons. It is an adjustable parameter that determines the contribution of a particular input to the overall activation of a neuron. In an ANN, the 'weight' parameter is multiplied by the input signal from the previous layer before being processed by the neuron. It helps regulate the influence of each input on the neuron's output, allowing the network to learn and make predictions based on the given data.

Which of the following gradients are needed to complete backpropagation in a DNN? dW, dB, dX only dZ, dX, dA only dB, dW only dA only None of these

None of These To complete backpropagation in a DNN (Deep Neural Network), the following gradients are needed: dW (gradient of the weights) dB (gradient of the biases) dZ (gradient of the pre-activation) dA (gradient of the activation) These gradients are necessary for updating the weights and biases of the neural network during the learning process. Therefore, the correct answer is None of these as all the mentioned gradients (dW, dB, dZ, dA) are needed for backpropagation in a DNN.

Given the list: T = [[1,2,3], [4,5,6], [7,8,9]], which of the following will be printed after executing: print(T[2]) [2, 5, 8] [1, 2, 3] [4,5,6] None of these

None of these The code T[2] accesses the third element in the list T, which is [7, 8, 9].

What does 'brr' contain after executing the following:? arr = np.arange(10) brr = arr.reshape((2,4)) [0 1 2 3 4 5 6 7 8 9] [[0 1 2 3 4], [5 6 7 8 9]] [[1 2 3 4], [5 6 7 8 ]] None of these - the code has an error

None of these - the code has an error np.arange(10) produces 10 elements, which cannot be reshaped into a (2, 4) array because that shape only holds 8 elements, so reshape raises a ValueError.
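
A quick check of why the code fails, and a shape that would work:

import numpy as np

arr = np.arange(10)            # 10 elements
try:
    brr = arr.reshape((2, 4))  # 2 * 4 = 8 elements: size mismatch
except ValueError as err:
    print('reshape failed:', err)

brr = arr.reshape((2, 5))      # 2 * 5 = 10 elements works
print(brr)                     # [[0 1 2 3 4]
                               #  [5 6 7 8 9]]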

Describe Bayes Theorem. What is the equation, why is it useful, and what are some applications?

P(A|B) = (P(B|A) * P(A)) / P(B) Bayes' theorem is a way to calculate conditional probabilities. It is useful because it does not require joint probabilities: P(A|B) can be computed from the reverse conditional P(B|A) and the prior probabilities P(A) and P(B). One way we can use it is to create a spam email filter.

Which of the following best describes what a Scree plot is and what it can be used for? Plot of covariance versus number of eigenvalues computed. Used to initialize the principal components. Plot of principal component (PCs) magnitudes versus PC directions. Used to eliminate largest variance data. Plot of variance or proportion of variance versus number of principal components (PCs). Used to determine which PC's to keep. Plot of variance or proportion of variance versus number of features in the data set. Used to determine which features to keep. None of these

Plot of variance or proportion of variance versus number of principal components (PCs). Used to determine which PC's to keep. A Scree plot is a graphical representation of the variance or proportion of variance explained by each principal component (PC) in a PCA analysis. It shows the eigenvalues (or variance) on the y-axis and the corresponding PC number on the x-axis. The plot is typically a line or a bar chart. The Scree plot is used to determine the number of principal components to retain in the analysis. It helps identify the "elbow" or point of diminishing returns, where adding more components does not significantly increase the explained variance. By examining the plot, one can decide how many principal components to keep based on the desired level of explained variance.

Which of the following is the primary goal of the K nearest neighbors (KNN) algorithm? Transform a dataset into K clusters Minimize the number of clusters in an unlabeled dataset Predict or classify a dependent variable based on some number of data points in the dataset Generate a cost metric which maximizes the number of distances in a dataset None of these

Predict or classify a dependent variable based on some number of data points in the dataset The primary goal of the K-nearest neighbors (KNN) algorithm is to predict or classify a dependent variable based on the features of the K nearest data points in the dataset. KNN is a supervised learning algorithm that works based on the assumption that similar data points tend to have similar labels. Given a new data point, KNN finds the K nearest neighbors (data points) based on a distance metric (e.g., Euclidean distance) and determines the class or value of the new data point based on the majority class or average value of its K nearest neighbors.

Which of the following best describes the definition of epsilon (eps) in DBSCAN? Count of the number of points in a neighborhood Radius around a data point which defines a neighborhood Threshold that defines density-reachable data points A neighborhood defining either a core or border point None of these

Radius around a data point which defines a neighborhood In DBSCAN (Density-Based Spatial Clustering of Applications with Noise), epsilon (eps) is the radius parameter that defines the distance within which neighboring points are considered part of the same neighborhood. It determines the size of the region or neighborhood around each data point that is used to determine its density and connectivity to other points. Any data point within the radius (eps) of another point is considered a part of its neighborhood.

What does a 'silhouette coefficient' evaluate? Ratio of the difference of the average inter-cluster and intra-cluster distances to the max difference of inter and intra distances. Ratio of the sum of the variances of each cluster to the max difference of the variances. Ratio of the total number of cluster centers found to the total variance of each cluster. Ratio of the max inter-cluster and intra-cluster distances to the min inter-cluster and intra-cluster distances. None of these

Ratio of the difference of the average inter-cluster and intra-cluster distances to the max difference of inter and intra distances. The silhouette coefficient evaluates the quality of clustering by measuring how well each sample in a cluster is separated from samples in other clusters. It is calculated as the difference between the average distance to samples in the same cluster (intra-cluster distance) and the average distance to samples in the nearest neighboring cluster (inter-cluster distance), divided by the maximum of these two values. A higher silhouette coefficient indicates better-defined and well-separated clusters.

Which of the following best describes the Numpy "squeeze" function? Broadcasts a NumPy array to the shape needed for a calculation (e.g. forward prediction equation) Transposes a NumPy array Flattens a NumPy array of any dimension. Removes axes of length 1 from a NumPy array. None of these

Removes axes of length 1 from a NumPy array. The "squeeze" function in NumPy is used to remove axes of length 1 from an array. An axis with length 1 represents a singleton dimension, meaning that it has only one element along that axis. The "squeeze" function eliminates these singleton dimensions, effectively reducing the dimensionality of the array. For example, if we have an array with shape (3, 1, 4), the "squeeze" function will remove the second dimension of length 1, resulting in a new array with shape (3, 4). This function is useful in cases where we want to remove unnecessary singleton dimensions from an array, as they can sometimes cause issues during computations or when working with algorithms that expect a certain array shape.

What does the following code snip do? data['horsepower'].replace(np.nan, data['horsepower'].astype('float').mean(axis=0), inplace=True) Returns the horsepower series with True for elements with nan. Replaces any nan's in the horsepower column with 0. Replaces any nan's with the mean of the horsepower feature. The code produces an exception error since axis=0 defines a row. None of these

Replaces any nan's with the mean of the horsepower feature. The code snippet data['horsepower'].replace(np.nan, data['horsepower'].astype('float').mean(axis=0), inplace=True) replaces any NaN (missing) values in the 'horsepower' column of the DataFrame 'data' with the mean of the non-missing values in the 'horsepower' column.

Increasing the lambda (alpha) regularization parameter in lasso regression serves what purpose? Selects more features to counter overfitting the model Selects less features to counter overfitting the model Selects less features to counter underfitting the model Selects more features to counter underfitting the model None of these

Selects less features to counter overfitting the model By selecting fewer features, Lasso regression helps to reduce overfitting, as it focuses on the most relevant features for prediction and discards or reduces the impact of irrelevant or redundant features. This can improve the generalization capability of the model and prevent it from fitting the noise or random fluctuations in the training data.

Logistic Regression uses which of the following functions to divide the data into two classes? Hyperbolic Tangent Sigmoid Sinc (sin(x)/x) Logistic None of these

Sigmoid Logistic regression uses the sigmoid function (also known as the logistic function) to divide the data into two classes. The sigmoid function is an S-shaped curve that maps any real-valued number to a value between 0 and 1. In logistic regression, the sigmoid function is applied to the linear combination of the input features and model coefficients to produce a probability value. This probability represents the likelihood of the input belonging to a particular class. By choosing a threshold (often 0.5), logistic regression classifies the data points based on whether the predicted probability is above or below the threshold. If the probability is above the threshold, the data point is assigned to one class, and if it is below the threshold, it is assigned to the other class. The sigmoid function allows logistic regression to model binary classification problems and separate the data into two distinct classes.

Summarize the differences between Simple and Multiple Linear Regression.

Simple linear regression is used for the relationship between one independent variable and one dependent variable while Multiple Linear Regression is used for the relationship between multiple (more than 1) independent variables and one dependent variable.

If a 3rd degree Polynomial Regression model has both training and test error of zero (i.e. perfectly predicts the output), what might happen if you apply the same dataset to a 4th degree model? Both the training error and test error are guaranteed to be zero Both the training error and test error will be non-zero The training error should be zero but the test error may be non-zero The training error should be non-zero but the test error will be zero None of these

The training error should be zero but the test error may be non-zero If a 3rd degree Polynomial Regression model has both training and test error of zero (i.e., perfectly predicts the output), it does not guarantee that the same dataset applied to a 4th degree model will have both training and test error of zero. In fact, applying a more complex model like a 4th degree polynomial to the same dataset can potentially lead to overfitting. Overfitting occurs when a model captures the noise and random fluctuations in the training data, resulting in poor generalization to unseen data. The additional complexity introduced by the higher degree polynomial may cause the model to fit the training data too closely, leading to a very low or zero training error but a higher test error.

Why does the logistic regression cost function contain a negative "correction" for the log functions (i.e. -log(P(x)) and -log(1 - P(x)))? To correct for rounding error in the cost equation. To correct the offset introduced by the mean square error residual bias. To correct for a left skewed probability density function. To correct for the log of the probabilities which will always be negative (log[0..1] --> negative) None of these

To correct for the log of the probabilities which will always be negative (log[0..1] --> negative) In logistic regression, the negative logarithm of the predicted probabilities (log(P(x))) and the negative logarithm of the complement of the predicted probabilities (log(1 - P(x))) are used in the cost function. This is because the probabilities predicted by the logistic regression model range between 0 and 1. Taking the logarithm of values between 0 and 1 will result in negative values. By including the negative logarithm terms in the cost function, the optimization algorithm aims to minimize the cost, which means maximizing the likelihood of the correct class labels given the input features. The negative sign is used to reverse the direction of optimization, turning it into a minimization problem. The negative logarithm terms penalize the model when it predicts low probabilities for the true class (log(P(x))) and high probabilities for the false class (log(1 - P(x))). This encourages the model to assign high probabilities to the correct class and low probabilities to the incorrect class, leading to better classification performance.

What can using K-fold cross validation help prevent in a Machine Learning model? Excessive CPU usage in training Underfitting the model Zero frequency problem in the model Runaway Mean Squared Error None of these

Underfitting the model Using K-fold cross-validation can help prevent underfitting the model. Cross-validation is a technique used to assess the performance of a machine learning model on unseen data. K-fold cross-validation involves dividing the dataset into K subsets (folds) and performing training and evaluation K times, each time using a different fold as the validation set and the remaining folds as the training set. By using K-fold cross-validation, we can obtain more reliable estimates of the model's performance by evaluating it on multiple subsets of the data. This helps to mitigate the risk of underfitting, where the model fails to capture the underlying patterns in the data and performs poorly on both the training and unseen data.

What determines when the K-means algorithm for a given placement of random centroids should stop? When the calculated centroid positions no longer move from one iteration to the next When the MSE is 0 over all iterations After 10 iterations of the algorithm When the sum of the squared distances is maximized from one iteration to the next None of these

When the calculated centroid positions no longer move from one iteration to the next The K-means algorithm stops when the calculated centroid positions no longer move significantly from one iteration to the next. This is determined by checking the convergence criteria, which is typically based on a threshold or tolerance value. When the centroids stabilize and their positions do not change beyond the specified threshold, the algorithm is considered to have converged and stops.

Given the list: my_list = [14, 2, 39, 45, 18, 4, 1, 94, 57, 18], which of the following is the output for print(my_list[ : : 2])? [39] [14, 2] [14, 39, 18, 1, 57] None of these

[14, 39, 18, 1, 57]

Given the list: my_list = [14, 2, 39, 45, 18, 4, 1, 94, 57, 18], which of the following is the output for print(my_list[ 1 : : 3])? [2, 39, 45] [2, 18, 94] [14, 45, 1, 18] None of these

[2, 18, 94]

Without using a calculator or computer program, what is the inverse of the following: np.array([[2, 0 , 0], [0, -0.25, 0], [0, 0, 10]])? Undefined, the determinant is 0 [ 0.5 0. 0. 0. -4. 0. 0. 0. 0.1] [[2, 0 , 0], [0, -0.25, 0], [0, 0, 10]] [[ 0.5 0. 0. ], [ 0. -4. 0], [ 0. 0. 0.1]] None of these

[[ 0.5 0. 0. ], [ 0. -4. 0], [ 0. 0. 0.1]] The matrix is diagonal, so its determinant is the product of the diagonal entries: det(A) = 2 * (-0.25) * 10 = -5. Since the determinant is not zero, the inverse exists. For a diagonal matrix, the inverse is simply the diagonal matrix of the reciprocals of the diagonal entries: 1/2 = 0.5, 1/(-0.25) = -4, and 1/10 = 0.1. Therefore A_inv = np.array([[0.5, 0, 0], [0, -4, 0], [0, 0, 0.1]]).
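
A quick numerical check with NumPy:

import numpy as np

A = np.array([[2, 0, 0], [0, -0.25, 0], [0, 0, 10]])
# for a diagonal matrix, the inverse is the reciprocal of each diagonal entry
print(np.linalg.inv(A))
# [[ 0.5  0.   0. ]
#  [ 0.  -4.   0. ]
#  [ 0.   0.   0.1]]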

Consider the attached classification diagram for the AND logic function. What is the slope and intercept for the decision boundary if modeled by a Perceptron (with 2 inputs)? Hint: Perceptron prediction equation is: y_hat = f(np.dot(W, X) + b) = f(w1x1 + w2x2 + b) slope: w1, intercept: w2 slope: w1*w2, intercept: b slope: -b, intercept: -w2/w1 slope: -w1/w2, intercept: -b/w2 None of these

slope: -w1/w2, intercept: -b/w2

What is the covariance of the following matrix after centering, where each matrix ROW is a feature? np.array([[3, 2, 9.1], [7, 10, 14], [3, 0.5, 17.2]]) [[14.8 11.4 34.6 ], [34.6 26.5 81.1 ], [11.4 12.3 26.5 ]] [[3 2 9.1], [7 10 14], [3 0.5 17.2]] [[-1.7 -3.3 -3.9 ], [-2.7 -0.3 -6.4 ], [ 4.4 3.7 10.3 ]] [[14.8 11.4 34.6 ], [11.4 12.3 26.5 ], [34.6 26.5 81.1 ]] None of these

[[14.8 11.4 34.6 ], [11.4 12.3 26.5 ], [34.6 26.5 81.1 ]] Because each ROW is a feature, we first center each row by subtracting its own mean (row means: 4.7, 10.33, 6.9), giving the centered matrix Xc = np.array([[-1.7, -2.7, 4.4], [-3.33, -0.33, 3.67], [-3.9, -6.4, 10.3]]). The covariance matrix is then Cov = (1 / (n - 1)) * Xc @ Xc.T, where n = 3 is the number of observations per feature, so the denominator is 2. Carrying out the multiplication gives approximately np.array([[14.8, 11.4, 34.6], [11.4, 12.3, 26.5], [34.6, 26.5, 81.1]]). Note that a covariance matrix is always symmetric, which rules out the non-symmetric options.
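
The same computation as a NumPy sketch; np.cov treats each row as a feature (variable) by default:

import numpy as np

X = np.array([[3, 2, 9.1], [7, 10, 14], [3, 0.5, 17.2]])

# center each row (feature) by its own mean
Xc = X - X.mean(axis=1, keepdims=True)

# covariance with n - 1 = 2 in the denominator
cov_manual = Xc @ Xc.T / (X.shape[1] - 1)
print(cov_manual)   # approximately [[14.8 11.4 34.6], [11.4 12.3 26.5], [34.6 26.5 81.1]]
print(np.cov(X))    # np.cov (rowvar=True by default) gives the same matrix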

What is the product of the following matrices? np.array([[8, 17.3, 7.8], [2.9, 1, 1.4], [5, 0.5, 1.2]]) and np.array([[3, 2], [7, 10], [3, 0.5]]) Not possible due to dimension mismatch [[168.5 192.9], [ 19.9 16.5], [ 22.1 15.6]] [[8. 17.3 7.8], [2.9 1 1.4], [5 0.5 1.2]] [[ 22.1 15.6], [168.5 192.9], [ 19.9 16.5]] None of these

[[168.5, 192.9], [19.9, 16.5], [22.1, 15.6]]

If a Python List is implemented as a 'queue' or FIFO (First In First Out) data structure, what functions can be used to add and remove items from the queue? push(), pop() append(), pop(-1) append(), pop(0) None of these

append(), pop(0)

What is the correct way to create a 2-D numpy array, where numpy is imported as np? arr = np.array([ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12] ]) arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]) arr = np.array([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]) arr = np.arange(12)

arr = np.array([ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12] ])

What is the correct way to create a 1-D numpy array, where numpy is imported as np? arr = np[1, 2, 3, 4] arr = np.array([1, 2, 3, 4]) arr = numpy(1, 2, 3, 4) arr = np.dim1([1, 2, 3, 4])

arr = np.array([1, 2, 3, 4])

Which of the following equations can be used to calculate the bias gradient in backpropagation? dB = (np.dot(dB, X))/m dB = (dZ * X.T)/m dB = (np.dot(X, dZ.T))/m dB = np.sum(dZ)/m None of these

dB = np.sum(dZ, axis=1, keepdims=True)/m In backpropagation, the gradient of the bias is computed by summing the gradient of the cost function with respect to the output (dZ) along the appropriate axis (axis=1 in this case) and then dividing by the number of training examples (m).

Which of the following equations can be used to calculate the weights gradient in backpropagation? dW = (np.dot(dB, X))/m dW = (dZ * X.T)/m dW = (np.dot(X, dZ.T))/m dW = np.sum(dZ)/m None of these

dW = (np.dot(dZ, X.T))/m In backpropagation, the gradient of the weights is computed by multiplying the gradient of the cost function with respect to the output (dZ) by the transpose of the input (X.T) and then dividing by the number of training examples (m).
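
A shape-level sketch of these gradient formulas for a single layer, assuming X holds one example per column (features x m), as in the framework described above:

import numpy as np

m, n_in, n_out = 80, 784, 1                 # examples, input features, nodes (hypothetical sizes)
X = np.random.rand(n_in, m)                 # inputs, one column per example
dZ = np.random.rand(n_out, m)               # gradient of the cost with respect to the pre-activation Z

dW = np.dot(dZ, X.T) / m                    # shape (n_out, n_in), matching W
dB = np.sum(dZ, axis=1, keepdims=True) / m  # shape (n_out, 1), matching B
print(dW.shape, dB.shape)                   # (1, 784) (1, 1)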

In backpropagation, which of the following is equivalent to the output gradient in layer 'L'? dX in layer L-1 dX in layer L+1 dW = in layer L+1 np.dot(W.T, dZ) in layer L-1 None of these

dX in layer L+1

What Pandas function could be used to drop a row with a missing element in a data frame? dropRow() dropna(axis=1) dropna(axis=0) drop() None of these

dropna(axis=0) The dropna() function is used to remove rows or columns with missing values (NaN) from a DataFrame. By specifying axis=0, it indicates that rows containing at least one missing value should be dropped.

Which of the following code snips could be used to generate 3 horizontally distributed plots.? fig, ax = plt.subplots(3) fig, ax = plt.subplots(1, 3) fig, ax = plt.subplots(3, 1) fig, ax = plt.subplots(3, horizontal) None of these

fig, ax = plt.subplots(1, 3) This code snippet creates a figure with 1 row and 3 columns of subplots, resulting in three horizontally distributed plots. The variable fig represents the figure object, and ax is an array of Axes objects corresponding to each subplot.

Which of the following will iterate through my_list2 and my_list3 lists at the same time? for x in my_list2, for y in my_list3: for x, y in zip(my_list2, my_list3): for i in my_list2, my_list3: None of these

for x, y in zip(my_list2, my_list3): The zip() function combines the elements from multiple iterables into tuples, and the for loop then iterates over these tuples, assigning each element to x and y respectively. This allows you to process the corresponding elements of my_list2 and my_list3 together in each iteration.

What does the following code snip do? Assume numpy has been imported as np. Assume numpy.random.uniform has been imported .Assume X is a 2-d numpy array of coordinates. min_ = list(np.min(X, axis=0)) max_ = list(np.max(X, axis=0)) d1 = np.array([random.uniform(min_[0], max_[0]), random.uniform(min_[1], max_[1])]) generates a random x,y coordinate between the min and max values in X generates a random integer between the min and max values in X generates a uniformly distributed dataset of len(X) between min and max generates an array containing the min and max coordinate of X None of these

generates a random x,y coordinate between the min and max values in X

What is a common data structure used to compute the conditional probability in Multinomial Naive Bayes Classifiers? heat map histogram ROC curve mean squared error None of these

histogram In Multinomial Naive Bayes, the feature vectors are often represented as histograms, where each bin represents a category or value, and the count in each bin represents the frequency or occurrence of that category or value in the document or data point. These histograms capture the distribution of the features across the different classes or categories. The conditional probability in Multinomial Naive Bayes is calculated based on the frequencies or counts of features in each class, which are typically stored in histograms. The conditional probability represents the likelihood of observing a particular feature value given a specific class. By utilizing these histograms, Multinomial Naive Bayes classifiers can compute the conditional probabilities and make predictions based on them.

Which of the following sklearn.cluster.KMeans functions can be used to find the optimum K on an elbow curve? cluster_centers_ n_iter_ labels_ inertia_ None of these

inertia_ The sklearn.cluster.KMeans function that can be used to find the optimum K on an elbow curve is inertia_. The inertia_ attribute represents the sum of squared distances of samples to their closest cluster center. It is a measure of how well the data points are clustered. By plotting the inertia values for different values of K and looking for the "elbow" point where the inertia starts to level off, we can determine the optimal number of clusters

What Pandas function could be used to find missing entries in a data frame? isnull() ismissing() all() dropna() None of these

isnull() The isnull() function returns a DataFrame of the same shape as the input DataFrame, where each element is a Boolean value indicating whether it is a missing (NaN) value or not.

Which of the following will create a list 'mult' comprised of 3 copies of the list 'my_list'? mult = 3 * len(my_list) mult = (3) * my_list mult = [3] * my_list None of these

mult = (3) * my_list Here (3) is just the integer 3, and multiplying a list by an integer concatenates that many copies of it: 3 * [1, 2] == [1, 2, 1, 2, 1, 2]. So mult contains three copies of my_list's elements in sequence. By contrast, [3] * my_list raises a TypeError because a list cannot be multiplied by another list.

What Pandas functions could be used to order a series of the number of records within the column 'myCol' with a specific value 'spec_value' in descending order? myCol['spec_value'].value_counts().sort_values() myCol['spec_value'].sort_values().value_counts() myCol[].value_counts('spec_value').sort_values() myCol['spec_value'].value_counts().sort_values(ascending=False) None of these

myCol['spec_value'].value_counts().sort_values(ascending=False) This code first uses the value_counts() function on the 'spec_value' column to count the occurrences of each value. Then, the sort_values() function is used with the ascending=False parameter to sort the counts in descending order.

How could we create a Pandas data frame with 'latitude' and 'longitude' from the attached data frame (called my_df)? new_df = my_df new_df = my_df['latitude', 'longitude'] new_df = my_df('latitude', 'longitude') new_df = my_df[['latitude', 'longitude']] None of these

new_df = my_df[['latitude', 'longitude']] This code selects the 'latitude' and 'longitude' columns from the my_df DataFrame using double brackets [['latitude', 'longitude']]. It creates a new DataFrame new_df with only the selected columns.

Which of the following code snips would calculate the Euclidean distance between a test coordinate and a numpy array of coordinates. Assume numpy has been imported as np Assume the following example data: -- testpoint = np.array([0,0]) -- testdata = np.array([[1,2], [3,4], [5,12], [8,15]]) np.sum((point - data)**2, axis=1) np.sqrt(np.sum((point - data)**2, axis=1)) np.sqrt((point - data)**2, axis=1) np.sqrt(np.sum((point - data)**2, axis=0)) None of these

np.sqrt(np.sum((point - data)**2, axis=1))
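
Applying the chosen expression to the example data (the names point/data in the options correspond to testpoint/testdata here):

import numpy as np

testpoint = np.array([0, 0])
testdata = np.array([[1, 2], [3, 4], [5, 12], [8, 15]])

# per-row Euclidean distance between the test point and every coordinate
distances = np.sqrt(np.sum((testpoint - testdata) ** 2, axis=1))
print(distances)   # approximately [ 2.24  5.  13.  17. ]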

Which of the following code snips from the scipy.stats.norm library could be used to calculate the P-value from the z-score? p-value = norm.sf(abs(z)) p-value = norm.pvalue(abs(z)) p-value = norm.survival(abs(z)) p-value = norm.sf((abs(p-Value)) None of these

p_value = norm.sf(abs(z)) In this code snippet, norm.sf() is used to calculate the survival function, which represents the area under the probability density function (PDF) of the normal distribution to the right of the given z-score. By taking the absolute value of the z-score with abs(z), we ensure that the P-value is calculated based on the magnitude of the z-score. The resulting P-value represents the probability of observing a value as extreme as the given z-score or more extreme, assuming a standard normal distribution.

What does the following code snip do? Assume matplotlib.pyplot has been imported as plt. plt.plot([x for x, _ in centers], [y for _, y in centers], '+', color='black') plots a line between each x,y coordinate in 'centers' plots the x values from 'centers', then plots the y values from 'centers' plots a black '-' for each x, y coordinate in 'centers' plots a black '+' for each x, y coordinate in 'centers' None of these

plots a black '+' for each x, y coordinate in 'centers'

In your own words, describe why the P-value is an important statistical measure for data science or machine learning?

The P-value is important because it tells us how likely it is to observe results at least as extreme as the ones we got, assuming the hypothesis (assumption) we are testing is true. A small P-value means the observed result would be unlikely under that assumption, which helps us decide whether an effect or model result is statistically significant or could simply be due to chance.

What is the centered result of the following, where each matrix ROW is a feature? np.array([[3, 2, 9.1], [7, 10, 14], [3, 0.5, 17.2]]) transpose([[-1.7 -3.3 -3.9 ] [-2.7 -0.3 -6.4 ] [ 4.4 3.7 10.3 ]]) [[168.5 192.9], [ 19.9 16.5], [ 22.1 15.6]] transpose([[-2.7 -0.3 - 6.4 ] [-1.7 -3.3 -3.9 ] [ 4.4 3.7 10.3 ]]) [[3 2 9.1], [7 10 14], [3. 0.5 17.2]] None of these

transpose([[-1.7 -3.3 -3.9 ] [-2.7 -0.3 -6.4 ] [ 4.4 3.7 10.3 ]]) Because each ROW is a feature, we center by subtracting each row's mean from that row. The row means are: [(3+2+9.1)/3, (7+10+14)/3, (3+0.5+17.2)/3] = [4.7, 10.33, 6.9]. Subtracting gives the centered rows: [3-4.7, 2-4.7, 9.1-4.7] = [-1.7, -2.7, 4.4]; [7-10.33, 10-10.33, 14-10.33] = [-3.33, -0.33, 3.67]; [3-6.9, 0.5-6.9, 17.2-6.9] = [-3.9, -6.4, 10.3]. The centered matrix np.array([[-1.7, -2.7, 4.4], [-3.33, -0.33, 3.67], [-3.9, -6.4, 10.3]]) is exactly the transpose of the matrix shown in the answer (rounded to one decimal place).

What is the decision (classification) boundary for a linearly separable dataset if the weights and bias in a Perceptron model are: w1=0.53, w2=0.86, b=0.73? y = 0.53x + 0.73 y = 0.73 y = -0.62x - 0.85 y = -0.85x - 0.62 None of these

y = -0.6163x - 0.8488 The decision (classification) boundary for a linearly separable dataset in a Perceptron model is represented by a linear equation of the form y = mx + b, where m is the slope and b is the intercept. In this case, the given weights and bias are: w1 = 0.53 w2 = 0.86 b = 0.73 To determine the slope and intercept, we can use the formula: slope = -w1 / w2 intercept = -b / w2 Substituting the given values: slope = -0.53 / 0.86 ≈ -0.6163 intercept = -0.73 / 0.86 ≈ -0.8488 Therefore, the decision boundary equation is approximately: y ≈ -0.6163x - 0.8488

Which of the following prediction equations is NOT a linear regression model? ŷ= 𝑏0+ 𝑏1𝑥1 ŷ= 𝑏0+ 𝑏1𝑏2^𝑥 ŷ= 𝑏0+𝑏1𝑥1+𝑏2𝑥2^10 ŷ= 𝑐0+𝑐1𝑥1^3+𝑐2𝑥1 None, they are all linear regression models

ŷ= 𝑏0+ 𝑏1𝑏2^𝑥. A model counts as linear regression when it is linear in its parameters (the b's or c's), even if the input variables are raised to powers. ŷ= 𝑏0+ 𝑏1𝑥1, ŷ= 𝑏0+𝑏1𝑥1+𝑏2𝑥2^10, and ŷ= 𝑐0+𝑐1𝑥1^3+𝑐2𝑥1 are all linear in their coefficients (the last two are polynomial regressions, which are still linear models and can be fit with ordinary least squares). In ŷ= 𝑏0+ 𝑏1𝑏2^𝑥, however, the parameter 𝑏2 appears inside an exponent, so the model is not linear in its parameters and is therefore not a linear regression model.

What is the derivative of σ(𝑥)=1/(1+𝑒^−𝑥) with respect to x? σ(𝑥)∙(1−σ(𝑥)) σ(𝑥) (σ(𝑥)-1)∙(1+σ(𝑥)) 1−σ(𝑥) None of these

σ(𝑥)∙(1−σ(𝑥)) Differentiating σ(x) = 1/(1+e^−x) with the chain rule gives dσ/dx = e^−x / (1+e^−x)^2, which can be rewritten as [1/(1+e^−x)] ∙ [e^−x/(1+e^−x)] = σ(x)∙(1−σ(x)).
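
A quick numerical check of the identity σ'(x) = σ(x)(1 − σ(x)):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central-difference derivative
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)   # both approximately 0.2217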

