MCQs Machine Learning

A and B are the two coins we have. The probability of receiving a head on each toss of coin A is 1/2, while the probability of getting heads on each toss of coin B is 1/3. Tosses of the same coin are independent of one another. We choose a coin at random and throw it till it comes up heads. The chance of picking coin A is 1/4 and that of selecting coin B is 3/4. What is the average number of tosses required to get the first heads? (A) 2.75 (B) 3.35 (C) 4.13 (D) 5.33
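
A short sketch of the arithmetic behind this question, assuming the number of tosses until the first head is geometric with mean 1/p for each coin (the dictionary layout and variable names are illustrative only):

```python
from fractions import Fraction

# Probability of picking each coin and of heads on one toss (from the question).
coins = {
    "A": {"pick": Fraction(1, 4), "heads": Fraction(1, 2)},
    "B": {"pick": Fraction(3, 4), "heads": Fraction(1, 3)},
}

# Law of total expectation: weight each coin's mean number of tosses (1/p)
# by the probability of picking that coin.
expected_tosses = sum(c["pick"] / c["heads"] for c in coins.values())
print(float(expected_tosses))
```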

Adding more basis functions in a linear model... (pick the most probable option) (A) Decreases model bias (B) Decreases estimation bias (C) Decreases variance (D) Doesn't affect bias and variance

Assume we computed the gradient of our cost function and saved it in a vector g. Given the gradient, what is the cost of one gradient descent update? (A) O(D) (B) O(N) (C) O(ND) (D) O(ND^2)

Assume we have a dataset that can be trained with 100 percent accuracy using a decision tree of depth 6. Consider the following points and select an option based on them. Note : All other hyperparameters are the same, and no other factors are impacted. (1). Depth 4 has a strong bias and a low variance. (2). Depth 4 will be low in bias and variance. (A) Only 1 (B) Only 2 (C) Only 1 and 2 (D) None of the above

Assume you're in the final round of the game show "Let's Make a Deal," and you have to pick between three doors - 1, 2, and 3. One of the three doors has a vehicle behind it, while the other two contain goats. Assume you select Door 1 and the host opens Door 3 to reveal a goat. Which of the following choices would you select to increase your chances of winning? (A) Switch your choice (B) Retain your choice (C) It doesn't matter; the probability of winning or losing is the same with or without revealing one door.
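
The two strategies can be compared by exact enumeration. This is a minimal sketch under the usual assumptions, which the question does not spell out: the car is placed uniformly at random and the host always opens a goat door other than the contestant's pick. The function name `win_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def win_probability(switch: bool) -> Fraction:
    """Exact win probability over all equally likely (car door, first pick) pairs."""
    cases = list(product(range(3), repeat=2))
    wins = 0
    for car, pick in cases:
        # The host removes a goat door, so switching wins exactly when the
        # first pick was wrong, and staying wins exactly when it was right.
        wins += (pick != car) if switch else (pick == car)
    return Fraction(wins, len(cases))

print("stay:", win_probability(False), "switch:", win_probability(True))
```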

Assume you're working on a classification issue with a severely unbalanced class. In the training data, the majority class is observed 99 percent of the time. After making predictions using test data, your model has 99 percent accuracy. In this situation, which of the following is true? (1). For unbalanced class issues, the accuracy metric is not a good concept. (2). For unbalanced class issues, an accuracy metric is a good idea. (3). Precision and recall metrics are useful for situations with unbalanced classes. (4). Precision and recall metrics are ineffective for situations with an unbalanced class. (A) 1 and 3 (B) 1 and 4 (C) 2 and 3 (D) 2 and 4

Batch Normalization is helpful because (A) It normalizes (changes) all the input before sending it to the next layer (B) It returns back the normalized mean and standard deviation of weights (C) It is a very efficient backpropagation technique (D) None of these

Bayesian methods can perform better than the other methods while validating the hypotheses that make probabilistic predictions. (A) True (B) False

Below are the 10 actual values of the target variable in the train file: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]. What is the entropy of the target variable? (A) -(6/10 log(6/10) + 4/10 log(4/10)) (B) 6/10 log(6/10) + 4/10 log(4/10) (C) 4/10 log(6/10) + 6/10 log(4/10) (D) 6/10 log(4/10) - 4/10 log(6/10)
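
As a sketch, the entropy of the listed values can be computed directly from the class frequencies. The logarithm base is an assumption here (base 2, giving bits), since the question leaves it unspecified; the helper name `entropy` is illustrative.

```python
import math
from collections import Counter

# The ten target values listed in the question: four 0s and six 1s.
target = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

def entropy(values, base=2):
    """Shannon entropy: -sum(p * log(p)) over the observed class frequencies."""
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(values).values())

h = entropy(target)
print(round(h, 4))
```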

Bootstrapping allows us to (A) choose the same training instance several times (B) choose the same test set instance several times (C) build models with alternative subsets of the training data several times (D) test a model with alternative subsets of the test data several times

Both supervised learning and unsupervised clustering necessitate the use of at least one (A) Hidden attribute (B) Output attribute (C) Input attribute (D) Categorical attribute

Computers excel at learning. (A) facts (B) concepts (C) procedures (D) principles

Cosine similarity is most popularly used in (A) Text classification (B) Image classification (C) Feature selection (D) None of the above

Cross-fertilizing a red and a white flower produces red flowers 25% of the time. Now we cross-fertilize five pairs of red and white flowers and produce five offspring. What is the probability that there are no red flower plants in the five offspring? (A) 23.7% (B) 37.2% (C) 22.5% (D) 27.3%

Engineering a good feature space is a crucial ___ for the success of any machine learning model. (A) Pre-requisite (B) Process (C) Objective (D) None of the above

Exploration of numerical data can be best done using (A) Boxplots (B) Histograms (C) Scatter plot (D) None of the above

HIV is still a terrifying condition for which to even get tested. When someone is recruited into the United States military, he or she is tested for HIV. Recruits are subjected to three rounds of Elisa (an HIV test) before being declared positive. The prior chance of anybody being infected with HIV is 0.00148. Elisa has a true positive rate of 93 percent and a true negative rate of 99 percent. What is the likelihood that a candidate has HIV if he tested positive on the first Elisa test? (A) 12% (B) 80% (C) 42% (D) 14%
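
The posterior after a single positive test follows from Bayes' rule. A minimal sketch using only the numbers given in the question (the variable names are illustrative):

```python
prior = 0.00148          # P(HIV)
true_positive = 0.93     # P(test positive | HIV)
true_negative = 0.99     # P(test negative | no HIV)
false_positive = 1 - true_negative

# Bayes' rule: P(HIV | positive) = P(positive | HIV) P(HIV) / P(positive)
evidence = true_positive * prior + false_positive * (1 - prior)
posterior = true_positive * prior / evidence
print(f"{posterior:.1%}")
```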

In LDA, intra-class and inter-class ___ matrices are calculated. (A) Scatter (B) Adjacency (C) Similarity (D) None of the above

In SVM, these functions take a lower dimensional input space and transform it to a higher dimensional space. (A) Kernels (B) Vector (C) Support Vector (D) Hyperplane

In ensemble learning, predictions for weak learners are aggregated so that an ensemble of these models predicts better than individual models. Which of the following assertions is/are correct for weak learners in an ensemble model? (1). They don't often overfit. (2). Because they have a large bias, they are unable to tackle complicated learning tasks. (3). They are frequently overfit. (A) 1 and 2 (B) 1 and 3 (C) 2 and 3 (D) None of the above

In feature extraction, some of the commonly used ___ are used for combining the original features. (A) Operators (B) Delimiters (C) Words (D) All of the above

Introducing a non-essential variable into a linear regression model may result in: (1). Increase in R-square. (2). Decrease in R-square. (A) Only 1 is correct (B) Only 2 is correct (C) Either 1 or 2 (D) None of these

Lasso can be interpreted as least-squares linear regression where (A) weights are regularized with the l1 norm (B) weights are regularized with the l2 norm (C) the solution algorithm is simpler

Let us consider two examples, say 'predicting whether a tumour is malignant or benign' and 'price prediction in the domain of real estate'. These two problems are same in nature. (A) True (B) False

Out of 200 emails, a classification model correctly predicted 150 spam emails and 30 ham emails. What is the error rate of the model? (A) 10% (B) 90% (C) 80% (D) none of the above

PCA is a technique for (A) Feature extraction (B) Feature construction (C) Feature selection (D) None of the above

Pearson captures how linearly dependent two variables are whereas Spearman captures the monotonic behavior of the relation between the variables. (A) TRUE (B) FALSE

SVM is an example of? (A) Linear Classifier and Maximum Margin Classifier (B) Non-linear Classifier and Maximum Margin Classifier (C) Linear Classifier and Minimum Margin Classifier (D) Non-linear Classifier and Minimum Margin Classifier

Simple regression assumes a ........... relationship between the input attribute and output attribute (A) linear (B) nonlinear (C) reciprocal (D) inverse

Six times a fair six-sided die is rolled. What is the probability of all outcomes being unique? (A) 0.01543 (B) 0.01993 (C) 0.23148 (D) 0.03333
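
A quick counting check for this question: of the 6^6 equally likely roll sequences, the ones with all outcomes distinct are the 6! permutations of the six faces.

```python
from fractions import Fraction
from math import factorial

# Favourable sequences: permutations of the six faces.
# Total sequences: six independent rolls with six faces each.
p_all_unique = Fraction(factorial(6), 6 ** 6)
print(p_all_unique, float(p_all_unique))
```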

Structured representation of raw input data to meaningful ___ is called a model. (A) pattern (B) data (C) object (D) none of the above

Supervised machine learning is as good as the data used to train it. (A) True (B) False

Support Vectors are near the hyperplane (A) True (B) False

Temperature is a (A) Interval data (B) Ratio data (C) Discrete data (D) None of the above

The average squared difference between classifier predicted output and actual output (A) mean squared error (B) root mean squared error (C) mean absolute error (D) mean relative error

The binomial distribution can be used to model the outcomes of coin tosses. (A) True (B) False

The new features created in PCA are known as (A) Principal components (B) Eigenvectors (C) Secondary components (D) None of the above

The probabilistic approach used in machine learning is closely related to: (A) Statistics (B) Physics (C) Mathematics (D) Psychology

The reason the Bayesian interpretation can be used to model the uncertainty of events is that it does not expect the long run frequencies of the events to happen. (A) True (B) False

This approach is quite similar to the wrapper approach as it also uses an inductive algorithm to evaluate the generated feature subsets. (A) Embedded approach (B) Filter approach (C) Pro Wrapper approach (D) Hybrid approach

This is the first step in the supervised learning model. (A) Problem Identification (B) Identification of Required Data (C) Data Pre-processing (D) Definition of Training Data Set

This type of interpretation of probability tries to quantify the uncertainty of some event and thus focuses on information rather than repeated trials. (A) Frequency interpretation of probability (B) Gaussian interpretation of probability (C) Machine learning interpretation of probability (D) Bayesian interpretation of probability

Time complexity of K-fold cross-validation is (A) linear in K (B) quadratic in K (C) cubic in K (D) exponential in K

Tony draws four cards at random from a 52-card deck and replaces them in the deck ( Any set of 4 cards is equally likely ). Then, Alex draws 8 cards at random from the same deck ( Any set of 8 cards is equally likely). Assume that Tony's selection of four cards and Alex's selection of eight cards are independent. What is the likelihood that all four cards chosen by Tony are included in the set of eight cards picked by Alex? (A) 48C4 x 52C4 (B) 48C4 x 52C8 (C) 48C8 x 52C8 (D) None of the above

We can define this probability as p(A|B) = p(A,B)/p(B) if p(B) > 0 (A) Conditional probability (B) Marginal probability (C) Bayes probability (D) Normal probability

What happens when we introduce more variables to a linear regression model? (A) The r squared value may increase or remain constant, the adjusted r squared may increase or decrease (B) The r squared may increase or decrease while the adjusted r squared always increases. (C) Both r square and adjusted r square always increase on the introduction of new variables in the model. (D) Both might increase or decrease depending on the variables introduced.

What is Machine learning? (A) The autonomous acquisition of knowledge through the use of computer programs (B) The autonomous acquisition of knowledge through the use of manual programs (C) The selective acquisition of knowledge through the use of computer programs (D) The selective acquisition of knowledge through the use of manual programs

What is a dead unit in a neural network? (A) A unit which doesn't update during training by any of its neighbour (B) A unit which does not respond completely to any of the training patterns (C) The unit which produces the biggest sum-squared error (D) None of these

What is a top-down parser? (A) Begins by hypothesizing a sentence (the symbol S) and successively predicting lower level constituents until individual preterminal symbols are written (B) Begins by hypothesizing a sentence (the symbol S) and successively predicting upper level constituents until individual preterminal symbols are written (C) Begins by hypothesizing lower level constituents and successively predicting a sentence (the symbol S) (D) Begins by hypothesizing upper level constituents and successively predicting a sentence (the symbol S)

What is the meaning of hard margin in SVM? (A) SVM allows very low error in classification (B) SVM allows high amount of error in classification (C) Underfitting (D) SVM is highly flexible

What would you do in PCA to get the same projection as SVD? (A) Transform data to zero mean (B) Transform data to zero median (C) Not possible (D) None of these

When doing least-squares regression with regularization (assuming that the optimization can be done exactly), increasing the value of the regularization parameter λ (A) will never decrease the training error. (B) will never increase the training error. (C) will never decrease the testing error. (D) may either increase or decrease the testing error.

When doing least-squares regression with regularization (assuming that the optimization can be done exactly), increasing the value of the regularization parameter(Lambda)? (A) will never decrease the training error (B) will never increase the training error (C) will never decrease the testing error (D) will never increase the testing error.

When you find many noises in data, which of the following options would you consider in kNN? (A) Increase the value of k (B) Decrease the value of k (C) Noise does not depend on k (D) k = 0

Which of the following K values will result in the lowest leave-one-out cross validation accuracy? (A) 1NN (B) 3NN (C) 4NN (D) All have same leave one out error

Which of the following algorithms is an example of the ensemble learning algorithm? (A) Random Forest (B) Decision Tree (C) NN (D) SVM

Which of the following approaches performs actions that are comparable to dropout in a neural network? (A) Bagging (B) Boosting (C) Stacking (D) None of these

Which of the following events is most likely? (A) At least one 6, when 6 dice are rolled (B) At least 2 sixes when 12 dice are rolled (C) At least 3 sixes when 18 dice are rolled (D) All the above have same probability
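
The three events can be compared with binomial tail probabilities. This is a sketch assuming fair, independent dice; the helper `at_least` is a hypothetical name.

```python
from math import comb

def at_least(k: int, n: int, p: float = 1 / 6) -> float:
    """P(at least k successes in n independent trials with success probability p)."""
    fewer_than_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1 - fewer_than_k

p_a = at_least(1, 6)    # at least one six in 6 dice
p_b = at_least(2, 12)   # at least two sixes in 12 dice
p_c = at_least(3, 18)   # at least three sixes in 18 dice
print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```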

Which of the following factors can aid in the reduction of overfitting in an SVM classifier? (A) Use of slack variables (B) High-degree polynomial features (C) Normalizing the data (D) Setting a very low learning rate

Which of the following is a common use of unsupervised clustering? (A) detect outliers (B) determine a best set of input attributes for supervised learning (C) evaluate the likely performance of a supervised learner model (D) determine if meaningful relationships can be found in a dataset

Which of the following is an example of a deterministic algorithm? (A) PCA (B) K-Means (C) KNN (D) None of the above

Which of the following is not an inductive bias in a decision tree? (A) It prefers longer tree over shorter tree (B) Trees that place nodes near the root with high information gain are preferred (C) Overfitting is a natural phenomenon in a decision tree (D) Prefer the shortest hypothesis that fits the data

Which of the following is the measure of cluster quality? (A) Purity (B) Distance (C) Accuracy (D) all of the above

Which of the following options is true about the kNN algorithm? (A) It can be used only for classification (B) It can be used only for regression (C) It can be used for both classification and regression (D) It is not possible to use for both classification and regression

Which supervised learning method can handle both numerical and categorical input attributes? (A) linear regression (B) Bayes classifier (C) logistic regression (D) backpropagation learning

While fitting a linear regression to the data, you see the following: As the amount of training data grows, the test error reduces but the training error increases. The train error is pretty low (nearly as low as you would anticipate), while the test error is significantly greater than the train error. What do you believe is the primary cause for this behavior? Select the most likely option. (A) High variance (B) High model bias (C) High estimation bias (D) None of the above

__________ has been used to train vehicles to steer correctly and autonomously on road. (A) Machine learning (B) Data mining (C) Robotics (D) Neural networks

__________ in terms of SVM means that an SVM is inflexible in classification (A) Hard Margin (B) Soft Margin (C) Linear Margin (D) Non-linear Classifier

__________ is a line that linearly separates and classifies a set of data. (A) Hyperplane (B) Soft Margin (C) Linear Margin (D) Support Vectors

Which of the following statements about convolutional neural networks (CNNs) for image processing is correct? (A) Filters in earlier layers tend to include edge detectors (B) Pooling layers reduce the spatial resolution of the image (C) They have more parameters than fully connected networks with the same number of layers and the same numbers of neurons in each layer (D) A CNN can be trained for unsupervised learning tasks, whereas an ordinary neural net cannot

A, B

Data can be broadly divided into the following two types (A) qualitative (B) Speculative (C) quantitative (D) None of the above

A, C

For feature selection, both PCA and Lasso can be utilized. Which of the following assertions is NOT correct? (A) Lasso selects a subset (not necessarily a strict subset) of the original features (B) PCA and Lasso both allow you to specify how many features are chosen (C) PCA produces features that are linear combinations of the original features (D) PCA and Lasso are the same if you use the kernel trick

A, C

We calculate a low-rank approximation to a term-document matrix in latent semantic indexing. Which of the following is the reason for the low-rank reconstruction? (A) Finding documents that are related to each other, e.g. of a similar genre (B) The low-rank approximation provides a lossless method for compressing an input matrix (C) In many applications, some principal components encode noise rather than meaningful structure (D) Low-rank approximation enables discovery of nonlinear relations

A, C

Which of the following are unsupervised problems? (A) Google news (B) Rain predictor (C) Customer segmentation (D) Tumor prediction

A, C

What approaches may be used to reduce overfitting in decision trees? (A) Pruning (B) Make sure each leaf node is one pure class (C) Enforce a minimum number of samples in leaf nodes (D) Enforce a maximum depth for the tree

A, C, D

Which of the following are true about generative models? (A) They model the joint distribution P(class = C AND sample = x) (B) The perceptron is a generative model (C) They can be used for classification (D) Linear discriminant analysis is a generative model

A, C, D

Which of the following are not classification problems? (A) Predicting price of house (B) Predicting patient has tumor (C) Predicting who will hold the title in football league (D) Predicting percentage of student for next semester

A, D

A 1-inch-diameter coin is tossed onto a table with a grid of lines spaced two inches apart. What is the probability that the coin will land inside a square without hitting any of the grid's lines? You can suppose that the person throwing the coin has no talent and is throwing it at random. (A) 1/2 (B) 1/4 (C) 0 (D) 1/3

A class of 60 pupils is randomly divided into three equal-sized classes. All partitions have an equal chance of occurring. Raj and Deep are two of the pupils in that group. What is the probability that Raj and Deep will be in the same class? (A) 1/3 (B) 19/59 (C) 18/58 (D) 1/2
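
One way to check this by counting, sketched under the assumption that each class is a uniformly random group of 20 pupils: a given class contains both Raj and Deep when its other 18 members come from the remaining 58 pupils, and there are three classes.

```python
from fractions import Fraction
from math import comb

# P(both pupils land in one particular class of 20 chosen from 60).
p_one_class = Fraction(comb(58, 18), comb(60, 20))

# The three classes are disjoint, so the probabilities add.
p_same_class = 3 * p_one_class
print(p_same_class)
```

The same value follows from a shorter argument: once Raj is placed, 19 of the remaining 59 seats are in his class.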

A computer program is said to learn from __________ E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with E. (A) Training (B) Experience (C) Database (D) Algorithm

A least squares regression study of weight (y) and height (x) yielded the following least squares line: y = 120 + 5x. This means that if the height is increased by one inch, the weight should increase by what amount? (A) increase by 1 pound (B) increase by 5 pounds (C) increase by 125 pounds (D) None of the above

A multiple regression model has: (A) Only one independent variable (B) More than one independent variable (C) More than one dependent variable (D) None of the above

A roulette wheel has 38 slots, 18 are red, 18 are black, and 2 are green. You play five games and always bet on red. What is the probability that you win all the 5 games? (A) 0.0368 (B) 0.0238 (C) 0.0526 (D) 0.0473

A roulette wheel has 38 slots, 18 of which are red, 18 of which are black, and 2 of which are green. You play five games and always place your bets on the red slots. How many games do you think you'll win? (A) 1.1165 (B) 2.3684 (C) 2.6316 (D) 4.7368

After performing the Z-test, what can we conclude ____? (A) Listening to music does not improve memory. (B) Listening to music significantly improves memory at p (C) The information is insufficient for any conclusion. (D) None of the above

Ahmed is participating in a lottery game in which he must select two numbers ranging from 0 to 9, followed by an English alphabet (from 26-letters). He has the option of selecting the same number both times. If his ticket matches the two numbers and one letter chosen in the correct order, he wins the big prize of $10405. He gets $100 if only his letter matches but one or both of the digits do not match. In every other case, he loses everything. He has to pay $5 to play the game. Assume he's selected 04R to play. What is the expected net profit from purchasing this ticket? (A) $-2.81 (B) $2.81 (C) $-1.82 (D) $1.82
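
The expected net profit can be sketched as below, assuming all digits and letters are drawn uniformly and independently and that the $5 ticket price is paid regardless of the outcome (these assumptions and the variable names are not stated in the question):

```python
from fractions import Fraction

p_digits = Fraction(1, 10) ** 2     # both digits match, in order
p_letter = Fraction(1, 26)          # letter matches

p_jackpot = p_digits * p_letter             # everything matches
p_letter_only = (1 - p_digits) * p_letter   # letter matches, digits do not

# Expected winnings minus the ticket price.
expected_net = 10405 * p_jackpot + 100 * p_letter_only - 5
print(round(float(expected_net), 2))
```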

An alternative name for an output attribute (A) predictive variable (B) independent variable (C) estimated variable (D) dependent variable

Assume you were interviewed for a technical position. Half of those who attended the initial interview were called back for a second interview. 95 percent of those who received a call for a second interview were pleased with their initial interview. Seventy-five percent of those who did not receive a second call were satisfied with their first interview. What is the likelihood that you will be called for a second interview if you feel well after your first? (A) 66% (B) 56% (C) 75% (D) 85%
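
This is another Bayes' rule calculation. A minimal sketch with the figures from the question, using exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

p_called = Fraction(1, 2)                 # P(second interview)
p_good_given_called = Fraction(95, 100)   # P(felt good | called back)
p_good_given_not = Fraction(75, 100)      # P(felt good | not called back)

# P(called | felt good) = P(felt good | called) P(called) / P(felt good)
p_good = p_good_given_called * p_called + p_good_given_not * (1 - p_called)
p_called_given_good = p_good_given_called * p_called / p_good
print(p_called_given_good, f"{float(p_called_given_good):.0%}")
```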

Assume you've been given a variable V, as well as its mean and median. Based on these numbers, you may determine if the variable "V" is skewed to the left or right given the criterion mean(V) > median (V) (A) TRUE (B) FALSE

Based on the joint distribution p(A,B) of two events, we can define this distribution as follows: p(A) = Σ_b p(A, B = b) = Σ_b p(A|B = b) p(B = b) (A) Conditional distribution (B) Marginal distribution (C) Bayes distribution (D) Normal distribution

Choose the proper alternatives for machine learning (ML) and artificial intelligence (AI). (1). ML is an alternative method of programming intelligent computers. (2). The aims of ML and AI are significantly different. (3). Machine learning is a collection of algorithms that converts a dataset into software. (4). Artificial intelligence (AI) is software that can simulate the human mind. (A) 1, 2 and 4 (B) 1, 3 and 4 (C) 2, 3 and 4 (D) All are correct

Choosing data in such a way that each class is represented equally in both the training and test sets (A) cross validation (B) stratification (C) verification (D) bootstrapping

Consider rolling a tetrahedral die twice. What is the chance that the first roll's number is strictly greater than the second roll's number? It should be noted that a tetrahedral die has just four sides (1, 2, 3, and 4). (A) 1/2 (B) 3/8 (C) 7/16 (D) 9/16
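
With only 16 equally likely outcomes, the answer can be checked by brute-force enumeration (a sketch assuming a fair die and independent rolls):

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4]
outcomes = list(product(faces, repeat=2))   # all (first roll, second roll) pairs

# Count the pairs in which the first roll is strictly greater than the second.
favourable = sum(1 for first, second in outcomes if first > second)
p_first_greater = Fraction(favourable, len(outcomes))
print(p_first_greater)
```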

Consider the problem of binary classification. Assume I trained a model on a linearly separable training set, and now I have a new labeled data point that the model properly categorized and is far away from the decision border. In which instances is the learnt decision boundary likely to change if I now add this additional point to my previous training set and re-train? When the training model is, (A) Perceptron and logistic regression (B) Logistic regression and Gaussian discriminant analysis (C) Support vector machine (D) Perceptron

Data used to build a data mining model (A) validation data (B) training data (C) test data (D) hidden data

Each data instance is assigned a conditional probability value using this method. (A) linear regression (B) logistic regression (C) simple regression (D) multiple linear regression

For a classification task, instead of random weight initializations in a neural network, we set all the weights to zero. Which of the following statements is true? (A) There will not be any problem and the neural network will train properly (B) The neural network will train but all the neurons will end up recognizing the same thing (C) The neural network will not train as there is no net gradient change (D) None of these

For supervised learning we have ____ model. (A) interactive (B) predictive (C) descriptive (D) prescriptive

For understanding relationship between two variables, ____ can be used. (A) Box plot (B) Scatter plot (C) Histogram (D) None of the above

Hamming distance between binary vectors 1001 and 0101 is (A) 1 (B) 2 (C) 3 (D) 4
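
The Hamming distance counts the positions at which two equal-length vectors differ. A minimal sketch (the function name `hamming_distance` is illustrative):

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming_distance("1001", "0101")
print(distance)
```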

If the correlation coefficient (r) between scores in a math test and amount of physical exercise by a student is 0.86, what percentage of variability in math test is explained by the amount of exercise? (A) 86% (B) 74% (C) 14% (D) 26%
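
For a simple linear relationship between two variables, the share of variability explained is the coefficient of determination, which is the square of the correlation coefficient. As a quick check with the value from the question:

```python
r = 0.86                 # correlation coefficient given in the question
r_squared = r ** 2       # coefficient of determination
print(f"{r_squared:.0%}")
```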

In neural networks, nonlinear activation functions such as sigmoid, tanh, and ReLU (A) speed up the gradient calculation in backpropagation, as compared to linear units (B) help to learn nonlinear decision boundaries (C) are applied only to the output units (D) always output values between 0 and 1

In statistical terms, this represents the weighted average score. (A) Variance (B) Mean (C) Median (D) Mode

In univariate linear least squares regression, the relationship between the correlation coefficient and the coefficient of determination is ______? (A) Both are unrelated (B) The coefficient of determination is the coefficient of correlation squared (C) The coefficient of determination is the square root of the coefficient of correlation (D) Both are same

An increase in which of the following hyperparameters results in overfitting in a random forest? (1). Number of Trees. (2). Depth of Tree. (3). Learning Rate. (A) Only 1 (B) Only 2 (C) 2 and 3 (D) 1, 2 and 3

It is better to utilize the nearest neighbor method... (A) with large-sized datasets (B) when irrelevant attributes have been removed from the data (C) when a generalized model of the data is desirable (D) when an explanation of what has been found is of primary importance

LOOCV in machine learning stands for (A) Love one-out cross validation (B) Leave-one-out cross-validation (C) Leave-object oriented cross-validation (D) Leave-one-out class-validation

Let A and B be events on the same sample space, with P (A) = 0.6 and P (B) = 0.7. Can these two events be disjoint? (A) Yes (B) No

Let's say your model is overfitting. Which of the following is NOT a suitable method for attempting to decrease overfitting? (A) Increase the amount of training data (B) Improve the optimization algorithm being used for error minimization. (C) Decrease the model complexity. (D) Reduce the noise in the training data.

Machine learning techniques differ from statistical techniques in that machine learning methods (A) typically assume an underlying distribution for the data (B) are better able to deal with missing and noisy data (C) are not able to explain their behavior. (D) have trouble with large-sized datasets

Naïve Bayes classifier makes the naïve assumption that the attribute values are conditionally dependent given the classification of the instance. (A) True (B) False

One main disadvantage of Bayesian classifiers is that they utilize all available parameters to subtly change the predictions. (A) True (B) False

Ordinal data can be naturally ____. (A) Measured (B) Ordered (C) Divided (D) None of the above

Out of 200 emails, a classification model correctly predicted 150 spam emails and 30 ham emails. What is the accuracy of the model? (A) 10% (B) 90% (C) 80% (D) none of the above

Price prediction in the domain of real estate is an example of? (A) Unsupervised Learning (B) Supervised Regression Problem (C) Supervised Classification Problem (D) Categorical Attribute

Principal component is a technique for (A) Feature selection (B) Dimensionality reduction (C) Exploration (D) None of the above

Regression trees are often used to model ........... data. (A) linear (B) nonlinear (C) categorical (D) symmetrical

Supervised learning differs from unsupervised clustering. Supervised learning requires (A) At least one input attribute (B) Input attributes to be categorical (C) At least one output attribute (D) Output attributes to be categorical

The association between the number of years an employee has been with a firm and the person's pay is 0.75. What can be stated regarding employee pay and years of service? (A) There is no relationship between salary and years worked (B) Individuals that have worked for the company the longest have higher salaries (C) Individuals that have worked for the company the longest have lower salaries (D) The majority of employees have been with the company a long time

The covariance between two random variables X and Y measures the degree to which X and Y are (linearly) related, which means how X varies with Y and vice versa. What is the formula for Cov (X,Y)? (A) Cov(X,Y) = E(XY)−E(X)E(Y) (B) Cov(X,Y) = E(XY)+ E(X)E(Y) (C) Cov(X,Y) = E(XY)/E(X)E(Y) (D) Cov(X,Y) = E(X)E(Y)/ E(XY)

The distance between the hyperplane and the data points is called: (A) Hyperplane (B) Margins (C) Error (D) Support Vectors

The k-means algorithm is a (A) Supervised learning algorithm (B) Unsupervised learning algorithm (C) Semi-supervised learning algorithm (D) Weakly supervised learning algorithm

The probability that a particular hypothesis holds for a data set based on the Prior is called (A) Independent probabilities (B) Posterior probabilities (C) Interior probabilities (D) Dependent probabilities

The standard error is defined as the square root of this computation. (A) The sample variance divided by the total number of sample instances (B) The population variance divided by the total number of sample instances (C) The sample variance divided by the sample mean (D) The population variance divided by the sample mean

This type of learning is used when there is no idea about the class or label of a particular data (A) Supervised learning algorithm (B) Unsupervised learning algorithm (C) Semi-supervised learning algorithm (D) Reinforcement learning algorithm

Training data run on the algorithm is called? (A) Program (B) Training (C) Training Information (D) Learned Function

Two common types of data issue are (A) Outlier (B) Missing value (C) Boundary value (D) None of the above

Two events A and B are called mutually exclusive if they can happen together. (A) True (B) False

What happens to the confidence interval when certain outliers are introduced into the data? (A) Confidence interval is robust to outliers (B) Confidence interval will increase with the introduction of outliers. (C) Confidence interval will decrease with the introduction of outliers. (D) We cannot determine the confidence interval in this case.

What sizes of training data sets are not best suited for SVM? (A) Large data sets (B) Very small training data sets (C) Medium size training data sets (D) Training data set size does not matter

What would be the relationship between the training time taken by 1-NN, 2-NN, and 3-NN? (A) 1-NN > 2-NN > 3-NN (B) 1-NN < 2-NN < 3-NN (C) 1-NN ~ 2-NN ~ 3-NN (D) None of these

Which neural network architecture would be most suited to handle an image identification problem (recognizing a dog in a photo)? (A) Multi Layer Perceptron (B) Convolutional Neural Network (C) Recurrent Neural network (D) Perceptron

Which of the following statements regarding the gradient of a continuous and differentiable function is false? (A) It is zero at a minimum (B) It is non-zero at a maximum (C) It is zero at a saddle point (D) It decreases as you get closer to the minimum

Which of the following will be Euclidean distance between the two data points A(4,3) and B(2,3)? (A) 1 (B) 2 (C) 4 (D) 8
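
The Euclidean distance is the square root of the summed squared coordinate differences. A minimal sketch, with `euclidean_distance` as an illustrative helper name:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

A = (4, 3)
B = (2, 3)
print(euclidean_distance(A, B))
```

For the two points in the question only the first coordinate differs, so the distance reduces to the absolute difference of the x-values.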

With Bayes classifier, missing data items are (A) treated as equal compares (B) treated as unequal compares (C) replaced with a default value (D) ignored.

n-gram of size 1 is called (A) Bigram (B) Unigram (C) Trigram (D) None of the above

Consider the problem of binary classification. Assume I trained a model on a linearly separable training set, and now I have a new labeled data point that the model properly categorized and is far away from the decision border. In which instances is the learnt decision boundary likely to change if I now add this additional point to my previous training set and re-train? (A) When my model is a perceptron. (B) When my model is logistic regression. (C) When my model is an SVM. (D) When my model is Gaussian discriminant analysis.

B, C

Feature selection tries to eliminate features which are (A) Rich (B) Redundant (C) Irrelevant (D) Relevant

B, C

In the kernelized perceptron algorithm with learning rate = 1, the coefficient a_i corresponding to a training example x_i represents the weight for K(x_i, x). Suppose we have a two-class classification problem with y_i ∈ {1, −1}. If y_i = 1, which of the following can be true for a_i? (A) a_i = −1 (B) a_i = 1 (C) a_i = 0 (D) a_i = 5

B, C, D

Which of the following are true about subset selection? (A) Subset selection can substantially decrease the bias of support vector machines (B) Subset selection can reduce overfitting (C) Ridge regression frequently eliminates some of the features (D) Finding the true best subset takes exponential time

B, D

Which of the following assertions about bias and variance is true? (A) Models which overfit have a high bias. (B) Models which overfit have a low bias (C) Models which underfit have a high variance (D) Models which underfit have a low variance.

B, D

A fair six-sided die is rolled twice. What is the probability of getting 2 on the first roll and not getting 4 on the second roll? (A) 1/36 (B) 1/18 (C) 5/36 (D) 1/6
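
Because the two rolls are independent, the probability is the product of the two single-roll probabilities. The sketch below checks this by enumerating all 36 equally likely outcomes; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of two rolls of a fair six-sided die.
outcomes = list(product(range(1, 7), repeat=2))

# Keep outcomes with a 2 on the first roll and anything but 4 on the second.
favourable = [(first, second) for first, second in outcomes
              if first == 2 and second != 4]

probability = Fraction(len(favourable), len(outcomes))
print(probability)
```

The enumeration agrees with the direct product (1/6) × (5/6).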

A term used to describe the case when the independent variables in a multiple regression model are correlated is (A) regression (B) correlation (C) multicollinearity (D) none of the above

Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples. This is called _____. (A) Hypothesis (B) Inductive hypothesis (C) Learning (D) Concept learning

Assume a life insurance firm sells a $240,000 one-year term life insurance policy to a 25-year-old woman for $210. The probability that she survives the year is 0.999592. Determine the insurance company's expected value for this policy. (A) $131 (B) $140 (C) $112 (D) $125
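
One way to set up this calculation: the company always collects the premium and pays the face value only if the policyholder dies during the year. The sketch below assumes exactly that model (no costs, no interest); `expected_profit` is an illustrative name.

```python
def expected_profit(premium, payout, p_survive):
    """Expected value of the policy from the insurer's point of view."""
    p_death = 1.0 - p_survive
    # The premium is collected in both cases; the payout only on death.
    return premium - payout * p_death

value = expected_profit(premium=210, payout=240_000, p_survive=0.999592)
print(round(value, 2))
```

Under these assumptions the expected payout is 240,000 × 0.000408 = 97.92, which is subtracted from the $210 premium.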

Assume you're in the business of selling sandwiches. 70% of those polled choose eggs, while the remaining 30% prefer chicken. What is the probability of selling two egg sandwiches to the next three customers? (A) 0.343 (B) 0.063 (C) 0.147 (D) 0.027

Assume you're participating in a game in which we throw a fair coin several times. You've already lost three times when you guessed heads but got tails. Which of the following statements is correct in this situation? (A) You should guess heads again since the tails has already occurred thrice and its more likely for heads to occur now (B) You should say tails because guessing heads is not making you win (C) You have the same probability of winning in guessing either, hence whatever you guess there is just a 50-50 chance of winning or losing (D) None of these

Classification problems are distinguished from estimation problems in that (A) classification problems require the output attribute to be numeric (B) classification problems require the output attribute to be categorical (C) classification problems do not allow an output attribute (D) classification problems are designed to predict future outcome

Computational complexity of Gradient descent is (A) linear in D (B) linear in N (C) polynomial in D (D) dependent on the number of iterations

Consider one layer of weights (edges) in a grayscale convolutional neural network (CNN), which connects one layer of units to the next layer of units. Which layer contains the fewest parameters that must be learnt during training? (Choose one.) (A) A convolutional layer with 10 3 × 3 filters (B) A convolutional layer with 8 5 × 5 filters (C) A max-pooling layer that reduces a 10 × 10 image to 5 × 5 (D) A fully-connected layer from 20 hidden units to 4 output units

Consider the regression line y = ax + b, where a represents the slope and b represents the intercept. If we know the value of the slope, how can we always get the value of the intercept? (A) Put the value (0,0) in the regression line (B) Put any value from the points used to fit the regression line and compute the value of b (C) Put the mean values of x & y in the equation along with the value a to get b (D) None of the above can be used

Conversion of a text corpus to a numerical representation is done using ___ process. (A) Tokenization (B) Normalization (C) Vectorization (D) None of the above

The correlation between two variables (Var1 and Var2) is 0.65. After adding 2 to all the values of Var1, the correlation coefficient will _______? (A) Increase (B) Decrease (C) None of these

Feature ___ involves transforming a given set of input features to generate a new set of more powerful features. (A) Selection (B) Engineering (C) Transformation (D) Re-engineering

For a box plot, the upper and lower whisker length depends on (A) Median (B) Mean (C) IQR (D) All of the above

For two real-valued attributes, the correlation coefficient is 0.85. What does this value indicate? (A) The attributes are not linearly related (B) As the value of one attribute increases the value of the second attribute also increases (C) As the value of one attribute decreases the value of the second attribute increases (D) The attributes show a curvilinear relationship

For unsupervised learning we have ____ model. (A) interactive (B) predictive (C) descriptive (D) prescriptive

A jar contains four marbles: 3 red and 1 white. Two marbles are drawn, with replacement after each draw. What is the probability of drawing the same color marble twice? (A) 1/2 (B) 1/3 (C) 5/8 (D) 1/8
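
Reading this question as two independent draws with replacement (an assumption, since the wording is ambiguous), the probability of a matching pair is the sum over colors of the squared single-draw probability. A small sketch with exact fractions:

```python
from fractions import Fraction

# Single-draw probabilities for a jar with 3 red and 1 white marble.
p_red = Fraction(3, 4)
p_white = Fraction(1, 4)

# With replacement the draws are independent, so P(same color)
# = P(red, red) + P(white, white).
p_same = p_red ** 2 + p_white ** 2
print(p_same)
```

Without replacement the second draw would depend on the first, and the result would differ.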

Grid search is (A) Linear in D and Polynomial in D (B) Polynomial in D (C) Exponential in D and Linear in N (D) Polynomial in D and Linear in N

HIV is still a very scary disease to even get tested for. The US military tests its recruits for HIV when they are recruited. They are tested on three rounds of ELISA (an HIV test) before they are declared positive. The prior probability of anyone having HIV is 0.00148. The true positive rate for ELISA is 93% and the true negative rate is 99%. What is the probability that a recruit has HIV, given that he also tested positive on ELISA the second time? (A) 20% (B) 42% (C) 93% (D) 88%
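
This is a repeated application of Bayes' rule: the posterior after the first positive test becomes the prior for the second. The sketch below assumes the two test results are conditionally independent given the true HIV status, which the question does not state explicitly; `bayes_update` is an illustrative helper.

```python
def bayes_update(prior, sensitivity, specificity):
    """Posterior probability of disease after one positive test result."""
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)

prior = 0.00148
after_first = bayes_update(prior, sensitivity=0.93, specificity=0.99)
after_second = bayes_update(after_first, sensitivity=0.93, specificity=0.99)

print(round(after_first, 3), round(after_second, 3))
```

Under these assumptions, a single positive result raises the probability to roughly 12%, and the second positive raises it to roughly 93%.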

How does a ridge regression estimator's bias-variance decomposition compare to that of ordinary least squares regression? (Choose one.) (A) Ridge has larger bias, larger variance (B) Ridge has smaller bias, larger variance (C) Ridge has larger bias, smaller variance (D) Ridge has smaller bias, smaller variance

How many types are available in machine learning? (A) 1 (B) 2 (C) 3 (D) 4

In ___ approach, identification of best feature subset is done using the induction algorithm as a black box. (A) Embedded (B) Filter (C) Wrapper (D) Hybrid

In a class of 30 students, what is the probability that at least two of them have their birthdays on the same day (assuming it is not a leap year)? For example, students with birthdays on January 3rd, 1993 and January 3rd, 1994 would count as a match. (A) 49% (B) 52% (C) 70% (D) 35%
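
The usual approach is to compute the complement: the probability that all birthdays are distinct, then subtract it from 1. The sketch assumes 365 equally likely birthdays and independence between students; `prob_shared_birthday` is an illustrative name.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(round(prob_shared_birthday(30), 3))
```

For 30 students this comes out at about 0.706.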

In language understanding, which of the following is not a level of knowledge? (A) Phonological (B) Syntactic (C) Empirical (D) Logical

It has been shown that there is a very clear association between math exam scores and the quantity of physical activity performed by a student on test day. What conclusions can you draw from this? (1). A strong correlation indicates that test scores are high following exercise. (2). Causation is not implied by correlation. (3). The strength of the linear association between the quantity of exercise and test results is measured by correlation. (A) 1 (B) 1 & 3 (C) 2 & 3 (D) All of the above

Knowing the weight and bias of each neuron in a neural network is the most essential stage. You can estimate any function if you can acquire the proper value of weight and bias for each neuron. What is the best method to go about this? (A) Assign random values and pray to God they are correct (B) Search every possible combination of weights and biases till you get the best value (C) After assigning a value, iteratively check how far you are from the best values, and slightly change the assigned values to make them better (D) None of these

Machine learning is ___ field (A) Inter-disciplinary (B) Single (C) Multi-disciplinary (D) All of the above

Predicting whether a tumour is malignant or benign is an example of? (A) Unsupervised Learning (B) Supervised Regression Problem (C) Supervised Classification Problem (D) Categorical Attribute

Ritesh has two children, and one of them is a girl. What is the probability that the other child is also a girl? You can assume that the world has an equal number of males and females. (A) 0.5 (B) 0.25 (C) 0.333 (D) 0.75
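
One common reading of this puzzle is "at least one of the two children is a girl". Under that assumption (which the wording does not settle), the conditional probability can be found by enumerating the four equally likely families:

```python
from fractions import Fraction
from itertools import product

# Ordered (older, younger) pairs; B = boy, G = girl, all equally likely.
families = list(product("BG", repeat=2))

# Condition on the information given: at least one child is a girl.
at_least_one_girl = [f for f in families if "G" in f]
both_girls = [f for f in at_least_one_girl if f == ("G", "G")]

p_both_girls = Fraction(len(both_girls), len(at_least_one_girl))
print(p_both_girls)
```

If the question instead identified a specific child (for example, the older one) as a girl, the same enumeration would give 1/2.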

Some test results are normally distributed, with a mean of 18 and a standard deviation of 6. What percentage of test takers scored between 18 and 24? (A) 20% (B) 22% (C) 34% (D) None of the above
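
The interval from 18 to 24 runs from the mean to one standard deviation above it, so the share is Φ(1) − Φ(0). A minimal sketch using the error function from the standard library; `normal_cdf` is an illustrative helper.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

share = normal_cdf(24, mean=18, std=6) - normal_cdf(18, mean=18, std=6)
print(round(share, 4))
```

This is half of the familiar "68% within one standard deviation" rule.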

The leaf nodes of a model tree are (A) averages of numeric output attribute values (B) nonlinear regression equations (C) linear regression equations (D) sums of numeric output attribute values

The multiple coefficient of determination is calculated as follows: (A) dividing SSR by SST (B) dividing SST by SSR (C) dividing SST by SSE (D) none of the above
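
For a least-squares fit with an intercept, the total sum of squares splits as SST = SSR + SSE, so SSR / SST and 1 − SSE / SST give the same value. The sketch below checks this on made-up data with a single predictor for brevity; the identity holds in the same way for multiple regression. The function name `r_squared` and the data are illustrative.

```python
def r_squared(xs, ys):
    """Return (SSR / SST, 1 - SSE / SST) for a simple least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    preds = [intercept + slope * x for x in xs]

    sst = sum((y - mean_y) ** 2 for y in ys)               # total
    ssr = sum((p - mean_y) ** 2 for p in preds)            # regression
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))     # error
    return ssr / sst, 1.0 - sse / sst

# Hypothetical, nearly linear data.
r2_from_ssr, r2_from_sse = r_squared([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(r2_from_ssr, 4), round(r2_from_sse, 4))
```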

The process of forming general concept definitions from examples of concepts to be learned (A) Deduction (B) abduction (C) induction (D) conjunction

There are eight marbles in all, two each of green, yellow, orange, and red. How many different ways can you choose one marble? (A) 1 (B) 2 (C) 4 (D) 8

This clustering algorithm initially assumes that each data instance represents a single cluster (A) agglomerative clustering (B) conceptual clustering (C) K-Means clustering (D) expectation maximization

This step cleans and transforms the data set in the supervised learning model. (A) Problem Identification (B) Identification of Required Data (C) Data Pre-processing (D) Definition of Training Data Set

This refers to the transformations applied to the identified data before feeding the same into the algorithm. (A) Problem Identification (B) Identification of Required Data (C) Data Pre-processing (D) Definition of Training Data Set

What is the name of a regression model in which more than one independent variable is utilized to predict the dependent variable? (A) a simple linear regression model (B) a multiple regression model (C) an independent model (D) none of the above

What would be the Type I error? (A) Concluding that listening to music while studying improves memory, and it's right. (B) Concluding that listening to music while studying improves memory when it actually doesn't. (C) Concluding that listening to music while studying does not improve memory but it does.

When the mean values computed for the current iteration of the procedure are equal to the mean values computed for the previous iteration, the unsupervised clustering process stops - Which algorithm has this property? (A) agglomerative clustering (B) conceptual clustering (C) K-Means clustering (D) expectation maximization

Which is a measure of the estimated regression equation's quality of fit? (A) multiple coefficient of determination (B) mean square due to error (C) mean square due to regression (D) none of the above

Which is a type of machine learning where a target feature, which is of categorical type, is predicted for the test data on the basis of the information imparted by the training data? (A) Unsupervised Learning (B) Supervised Regression (C) Supervised Classification (D) Categorical Attribute

Which of the following points would Bayesians and frequentists disagree on? (A) The use of a non-Gaussian noise model in probabilistic regression (B) The use of probabilistic modelling for regression (C) The use of prior distributions on the parameters in a probabilistic model (D) The use of class priors in Gaussian Discriminant Analysis

Which of the following will be Manhattan distance between the two data points A(8,3) and B(4,3)? (A) 1 (B) 2 (C) 4 (D) 8
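
The Manhattan (city-block) distance sums the absolute coordinate differences instead of squaring them. A minimal sketch, with `manhattan_distance` as an illustrative name:

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

distance = manhattan_distance((8, 3), (4, 3))
print(distance)
```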

You are evaluating papers for the World's Fanciest Machine Learning Conference and come across the following submissions. Which ones would you be willing to accept? (A) My method achieves a training error lower than all previous methods! (B) My method achieves a test error lower than all previous methods! (Footnote: When regularization parameter λ is chosen so as to minimize test error.) (C) My method achieves a test error lower than all previous methods! (Footnote: When regularization parameter λ is chosen so as to minimize cross-validation error.) (D) My method achieves a cross-validation error lower than all previous methods! (Footnote: When regularization parameter λ is chosen so as to minimize cross-validation error.)

___ approach uses induction algorithm for subset validation. (A) Filter (B) Hybrid (C) Wrapper (D) Embedded

Let's say you wish to divide a graph G into two subgraphs. Let L represent G's Laplacian matrix. Which of the following might assist you in locating a suitable split? (A) The eigenvector corresponding to the second largest eigenvalue of L (B) The left singular vector corresponding to the second-largest singular value of L (C) The eigenvector corresponding to the second smallest eigenvalue of L (D) The left singular vector corresponding to the second-smallest singular value of L

C, D

Neural networks (A) optimize a convex cost function (B) always output values between 0 and 1 (C) can be used for regression as well as classification (D) can be used in an ensemble

C, D

On which of the following would Bayesians and frequentists disagree? (A) The use of a non-Gaussian noise model in probabilistic regression. (B) The use of probabilistic modelling for regression. (C) The use of prior distributions on the parameters in a probabilistic model. (D) The idea of assuming a probability distribution over models.

C, D

Which of the following approaches is capable of producing zero training error on every linearly separable dataset? (A) Decision tree (B) 15-nearest neighbors (C) Hard-margin SVM (D) Perceptron

C, D

A model of language consists of categories that do not include which of the following? (A) Language units (B) Role structure of units (C) System constraints (D) Structural units

A researcher concludes from his analysis that a placebo cures AIDS. What type of error is he making? (A) Type 1 error (B) Type 2 error (C) No error (D) Cannot be determined

Assume we train a hard-margin linear SVM in R^2 on n > 100 data points, resulting in a hyperplane with precisely two support vectors. What is the maximum number of support vectors for the new hyperplane if we add one more data point and retrain the classifier (assuming the n + 1 points are linearly separable)? (A) 2 (B) n (C) 3 (D) n+1

Assume you're given three variables: X, Y, and Z. C1, C2, and C3 are the Pearson correlation coefficients for (X, Y), (Y, Z), and (X, Z), respectively. You have now added 2 to all X values (i.e. new values are X+2), removed 2 from all Y values (i.e. new values are Y-2) and Z remains the same. D1, D2, and D3 are the new coefficients for (X,Y), (Y,Z), and (X,Z). How do the values of D1, D2, and D3 relate to the values of C1, C2, and C3? (A) D1= C1, D2 < C2, D3 > C3 (B) D1 = C1, D2 > C2, D3 > C3 (C) D1 = C1, D2 > C2, D3 < C3 (D) D1 = C1, D2 = C2, D3 = C3
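
The Pearson coefficient is computed from deviations around the mean, so adding or subtracting a constant from a variable cancels out. The sketch below checks this numerically on made-up data; `pearson` and the sample values are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical data for the three variables.
X = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [2.0, 1.0, 5.0, 6.0, 9.0]
Z = [3.0, 8.0, 1.0, 4.0, 6.0]

X_shifted = [x + 2 for x in X]
Y_shifted = [y - 2 for y in Y]

before = (pearson(X, Y), pearson(Y, Z), pearson(X, Z))
after = (pearson(X_shifted, Y_shifted), pearson(Y_shifted, Z), pearson(X_shifted, Z))
print(before)
print(after)
```

The two printed tuples agree up to floating-point rounding.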

Assume you've discovered multi-collinear features. Which of the following actions do you intend to take next? (1). Both collinear variables should be removed. (2). Instead of deleting both variables, we can simply delete one. (3). Removing correlated variables may result in information loss. We may utilize penalized regression models such as ridge or lasso regression to keep such variables. (A) Only 1 (B) Only 2 (C) Either 1 or 3 (D) Either 2 or 3

Classification is a type of supervised learning where a target feature, which is of categorical type, is predicted for the test data on the basis of the information imparted by the training data. The target categorical feature is known as? (A) Object (B) Variable (C) Method (D) Class

Different learning methods do not include (A) Memorization (B) Analogy (C) Deduction (D) Introduction

Factors which affect the performance of a learner system do not include (A) Representation scheme used (B) Training scenario (C) Type of feedback (D) Good data structures

For bi-variate data exploration, _____ is an effective tool. (A) Box plot (B) Two-way cross-tab (C) Histogram (D) None of the above

For categorical data, ____ cannot be used as a measure of central tendency. (A) Median (B) Mean (C) quartile (D) None of the above

In a neural network, which of the following techniques is used to deal with overfitting? (A) Dropout (B) Regularization (C) Batch Normalization (D) All of these

In a perceptron, what is the order of the following tasks? (1). Perceptron weights should be randomly initialized (2). Proceed to the next dataset batch (3). Change the weights if the forecast does not match the output (4). Compute an output for a sample input. (A) 1, 2, 3, 4 (B) 4, 3, 2, 1 (C) 3, 1, 2, 4 (D) 1, 4, 3, 2
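
The loop below is a minimal sketch of the classic perceptron training procedure, with comments mapping each part to the numbered tasks in the question. It processes one sample at a time rather than batches, uses labels in {−1, +1}, and a fixed learning rate of 1; the function names, the random seed, and the toy AND data are assumptions made for the example.

```python
import random

def predict(weights, bias, x):
    """Task (4): compute an output for a sample input."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else -1

def train_perceptron(samples, labels, max_epochs=100, seed=0):
    rng = random.Random(seed)
    # Task (1): randomly initialize the weights (and bias).
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            # Task (3): change the weights if the prediction is wrong.
            if predict(weights, bias, x) != y:
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
            # Task (2): the loop then proceeds to the next sample.
        if mistakes == 0:  # every sample is classified correctly
            break
    return weights, bias

# Toy linearly separable data: logical AND with labels in {-1, +1}.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]
weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

Because the toy data is linearly separable, the perceptron convergence theorem guarantees that the loop stops after a finite number of updates.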

In the k-Means Algorithm, which of the following alternatives can be utilized to get global minima? (1). Experiment with different centroid initialization algorithms (2). Change the number of iterations (3). Determine the appropriate number of clusters. (A) 2 and 3 (B) 1 and 3 (C) 1 and 2 (D) All of above

In which neural net architecture, does weight sharing occur? (A) Convolutional neural Network (B) Recurrent Neural Network (C) Fully Connected Neural Network (D) Both A and B

Logistic regression is a ........... regression technique that is used to model data having a ........... outcome. (A) linear, numeric (B) linear, binary (C) nonlinear, numeric (D) nonlinear, binary

Regarding bias and variance, which of the following statements are true? (A) Models which overfit have a high bias and underfit have a high variance (B) Models which overfit have a high bias and underfit have a low variance (C) Models which overfit have a low bias and underfit have a high variance (D) Models which overfit have a low bias and underfit have a low variance

The adjusted multiple coefficient of determination accounts for (A) the number of dependent variables in the model (B) the number of independent variables in the model (C) unusually large predictors (D) none of the above

The average positive difference in values between computed and intended outcomes is known as _____ (A) root mean squared error (B) mean squared error (C) mean absolute error (D) mean positive error
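
To make the difference between the error measures listed here concrete, the sketch below computes the mean absolute error, mean squared error, and root mean squared error on a small made-up example. The function names and data are illustrative.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(actual, predicted))

# Hypothetical targets and predictions.
actual = [3, 5, 2]
predicted = [2, 5, 4]

print(mean_absolute_error(actual, predicted))
print(mean_squared_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
```

Squaring penalizes the larger error (2) more heavily than the smaller one (1), which is why the RMSE here exceeds the MAE.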

The kernel trick (A) can be applied to every classification algorithm (B) is commonly used for dimensionality reduction (C) changes ridge regression so we solve a d × d linear system instead of an n × n system, given n sample points with d features (D) exploits the fact that in many learning algorithms, the weights can be written as a linear combination of input points

The line described by the linear regression equation (OLS) attempts to ____? (A) Pass through as many points as possible. (B) Pass through as few points as possible (C) Minimize the number of points it touches (D) Minimize the squared distance from the points

The model learns and updates itself through reward/punishment in case of (A) Supervised learning algorithm (B) Unsupervised learning algorithm (C) Semi-supervised learning algorithm (D) Reinforcement learning algorithm

The q-learning algorithm is a (A) Supervised learning algorithm (B) Unsupervised learning algorithm (C) Semi-supervised learning algorithm (D) Reinforcement learning algorithm

There is no one model that works best for every machine learning problem. This is stated as (A) Fit gap model theorem (B) One model theorem (C) Free lunch theorem (D) No free lunch theorem

This clustering algorithm merges and splits nodes to help modify nonoptimal partitions. (A) agglomerative clustering (B) expectation maximization (C) conceptual clustering (D) K-Means clustering

This step of supervised learning determines 'the type of training dataset'. (A) Problem Identification (B) Identification of Required Data (C) Data Pre-processing (D) Definition of Training Data Set

What are the factors to select the depth of neural network? (1). Type of neural network (eg. MLP, CNN etc) (2). Input data (3). Computation power, i.e. Hardware capabilities and software capabilities (4). Learning Rate (A) 1, 2, 4, 5 (B) 2, 3, 4, 5 (C) 1, 3, 4, 5 (D) All of these

What is the link between the degree of significance and the level of confidence? (A) Significance level = Confidence level (B) Significance level = 1- Confidence level (C) Significance level = 1/Confidence level (D) None of the above

When is an event A independent of itself? (A) Always (B) If and only if P(A)=0 (C) If and only if P(A)=1 (D) If and only if P(A)=0 or 1

When the number of features increases (A) Computation time increases (B) Model becomes complex (C) Learning accuracy decreases (D) All of the above

Which data is used to tune the parameters of a supervised learning model? (A) training (B) test (C) verification (D) validation

Which is the correct sequence of steps for gradient descent algorithm: (1). Calculate the difference between the actual and predicted values (2). Iterate until you discover the optimal network weights (3). Pass an input via the network to obtain values from the output layer (4). Set up the random weight and bias (5). Change the values of each neuron that contributes to the mistake to minimize the error. (A) 1, 2, 3, 4, 5 (B) 5, 4, 3, 2, 1 (C) 3, 2, 1, 5, 4 (D) 4, 3, 1, 5, 2
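
The five steps can be sketched for the simplest possible "network": a single linear neuron trained with mean squared error. Comments map each line to the numbered steps in the question. The learning rate, stopping tolerance, seed, and toy data are assumptions chosen for the example, not values from the source.

```python
import random

def fit_linear_neuron(xs, ys, learning_rate=0.05, max_iters=5000,
                      tol=1e-10, seed=0):
    rng = random.Random(seed)
    # Step (4): set up a random weight and bias.
    weight = rng.uniform(-1.0, 1.0)
    bias = rng.uniform(-1.0, 1.0)
    n = len(xs)
    # Step (2): iterate until the parameters stop improving.
    for _ in range(max_iters):
        # Step (3): pass the inputs through the model to get outputs.
        preds = [weight * x + bias for x in xs]
        # Step (1): difference between predicted and actual values.
        errors = [p - y for p, y in zip(preds, ys)]
        # Gradient of the mean squared error w.r.t. weight and bias.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step (5): adjust the parameters to reduce the error.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
        if grad_w ** 2 + grad_b ** 2 < tol:
            break
    return weight, bias

# Toy data generated from y = 2x + 1.
weight, bias = fit_linear_neuron([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(weight, 3), round(bias, 3))
```

In a multi-layer network the gradient computation in the middle of the loop would be replaced by backpropagation; the surrounding structure stays the same.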

Which of the following elements has no effect on the performance of the learner system? (A) Representation scheme used (B) Training scenario (C) Type of feedback (D) Good data structures

Which of the following is a constant property of a kernel matrix? (A) Invertible (B) All the entries are positive (C) At least one negative eigenvalue (D) Symmetric

Which of the following is a performance measure for regression? (A) Accuracy (B) Recall (C) Error rate (D) RMSE

Which of the following is not TRUE for regression? (A) It relates inputs to outputs (B) It is used for prediction (C) It may be used for interpretation (D) It discovers causal relationships

Which of the following is not a benefit of using Grid search? (A) It can be applied to non-differentiable functions. (B) It can be applied to non-continuous functions (C) It is easy to implement (D) It runs reasonably fast for multiple linear regression

Which of the following is true about SVM? (A) It is useful only in high-dimensional spaces (B) It always gives an approximate value (C) It is accurate (D) Understanding SVM is difficult

Which of the following is true about SVM? (A) It is useful only in high-dimensional spaces (B) It requires less memory (C) SVM does not perform well when we have a large data set (D) SVM performs well when we have a large data set

Which of the following measures is not used for a classification model? (A) Accuracy (B) Recall (C) Error rate (D) Purity

Which of the following statements is/are true about "Type 1" and "Type 2" errors? (1). Type 1 is known as false positive and Type 2 is known as false negative. (2). Type 1 is known as false negative and Type 2 is known as false positive. (3). Type 1 error occurs when we reject a null hypothesis when it is actually true. (A) Only 1 (B) Only 2 (C) Only 3 (D) 1 and 3

Which of the following statements regarding outliers is correct? (A) Outliers should be identified and removed from a dataset (B) Outliers should be part of the training dataset but should not be present in the test data (C) Outliers should be part of the test dataset but should not be present in the training data (D) The nature of the problem determines how outliers are used

Which of the following statements regarding prediction are correct? (A) The output attribute must be categorical (B) The output attribute must be numerical (C) The resultant model is designed to determine future outcomes (D) The resultant model is designed to classify current behavior

p(A, B) = p(A ∩ B) = p(A | B) · p(B) is referred to as: (A) Conditional probability (B) Unconditional probability (C) Bayes rule (D) Product rule

___ are the data points (representing classes), the important component in a data set, which are near the identified set of lines (hyperplane). (A) Hard Margin (B) Soft Margin (C) Linear Margin (D) Support Vectors

Which of the following options cannot be the probability of any event? (1). -0.00001 (2). 0.5 (3). 1.001 (A) Only 1 (B) Only 2 (C) Only 3 (D) 1 and 2 (E) 2 and 3 (F) 1 and 3
