interview


What is the difference between the Skip-gram and CBOW models in the Word2vec algorithm?

Both models are used in the Word2vec algorithm: one is the CBOW model and the other is the skip-gram model. CBOW learns to predict a word from its context, i.e., it maximizes the probability of the target word given the surrounding words. This happens to be a problem for rare words. For example, given the context "yesterday was a really [...] day", a CBOW model will tell you that the word is most probably "beautiful" or "nice". A word like "delightful" gets much less attention from the model, because CBOW is designed to predict the most probable word, so the rare word is smoothed over by the many examples containing more frequent words. The skip-gram model, on the other hand, is designed to predict the context. Given the word "delightful", it must understand it and tell us that there is a high probability the context is "yesterday was a really [...] day", or some other relevant context. With skip-gram, the word "delightful" does not compete with the word "beautiful"; instead, delightful+context pairs are treated as new observations.
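As an illustrative sketch (assuming gensim 4.x, whose `sg` flag selects the architecture; the toy two-sentence corpus below is made up), both models can be trained with the same API:

```python
# Hedged sketch: gensim 4.x API, toy corpus for illustration only.
from gensim.models import Word2Vec

sentences = [["yesterday", "was", "a", "really", "delightful", "day"],
             ["yesterday", "was", "a", "really", "nice", "day"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram

# Rare words like "delightful" tend to get better vectors under skip-gram.
print(skipgram.wv.most_similar("delightful", topn=3))
```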

True-False: Overfitting is more likely when you have a huge amount of data to train on?

Solution: False. With a small training dataset, it's easier to find a hypothesis that fits the training data exactly, i.e., to overfit; with a huge amount of data, overfitting is less likely.

(True/False) Support vectors are the data points that lie closest to the decision surface.

Solution: True. Support vectors are the points closest to the hyperplane and the hardest ones to classify. They also have a direct bearing on the location of the decision surface.

Suppose we have N independent variables (X1, X2, ..., Xn) and the dependent variable is Y. Now imagine that you are applying linear regression by fitting the best-fit line using least square error on this data. You find that the correlation coefficient of one of the variables (say X1) with Y is -0.95. Which of the following is true for X1? A) The relation between X1 and Y is weak B) The relation between X1 and Y is strong C) The relation between X1 and Y is neutral D) Correlation can't judge the relationship

The absolute value of the correlation coefficient denotes the strength of the relationship. Since the absolute correlation (0.95) is very high, the relationship between X1 and Y is strong, so the answer is (B); the negative sign only indicates the direction of the relationship.

What's the difference between Sigmoid, Tanh, and ReLU?

The range of the sigmoid function is (0, 1). The range of the tanh function is (-1, 1). The range of the ReLU function is [0, ∞).
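A quick numpy sketch of the three functions and their ranges:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # range (0, 1)

def tanh(z):
    return np.tanh(z)                 # range (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # range [0, inf)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # [~0.007, 0.5, ~0.993]
print(tanh(z))     # [~-0.9999, 0.0, ~0.9999]
print(relu(z))     # [0.0, 0.0, 5.0]
```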

[True or False] The Pearson correlation between two variables is zero, but their values can still be related to each other.

True. Take Y = X² with X distributed symmetrically around zero: the variables are not merely associated, one is a function of the other, and yet the Pearson correlation between them is 0.
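This is easy to verify numerically; a minimal sketch with numpy (the symmetric grid of x values is an illustrative choice):

```python
import numpy as np

x = np.linspace(-1, 1, 1001)    # symmetric around zero
y = x ** 2                      # y is a deterministic function of x
print(np.corrcoef(x, y)[0, 1])  # ~0: Pearson sees no *linear* relationship
```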

Linear Regression is mainly used for Regression.

True. Linear regression has a dependent variable that takes continuous values.

Is Pearson coefficient sensitive to outliers?

Yes. Even a single outlier can change the magnitude and even the sign of the coefficient.

Why is Naive Bayes so 'naive'?

Naive Bayes is so 'naive' because it assumes that all of the features in a data set are equally important and independent of one another. As we know, these assumptions are rarely true in real-world scenarios.

True-False: Is it possible to design a logistic regression algorithm using a Neural Network Algorithm?

"Solution: True True, Neural network is a is a universal approximator so it can implement linear regression algorithm. "

Which of the following is true about Residuals ? A) Lower is better B) Higher is better C) A or B depends on the situation D) None of these

(A) Residuals are the error values of the model; therefore, lower residuals are desired.

Do correlation and dependency mean the same thing? In simple words, if two events have a correlation of zero, does this convey that they are not dependent, and vice versa?

"A non-dependency between two variable means a zero correlation. However the inverse is not true. A zero correlation can even have a perfect dependency. Must remember tip: Correlation quantifies the linear dependence of two variables. It cannot capture non-linear relationship between two variables. "

How would you explain the difference between correlation and covariance?

"Correlation is simply the normalized co-variance with the standard deviation of both the factors. This is done to ensure we get a number between +1 and -1. Co-variance is very difficult to compare as it depends on the units of the two variable. It might come out to be the case that marks of student is more correlated to his toe nail in mili- meters than it is to his attendance rate. "

Does causation imply correlation?

"No, because causation can also lead to a non-linear relationship. The pic shows density of water from 0 to 12 degree Celsius. We know that density is an effect of changing temperature. But, density can reach its maximum value at 4 degree Celsius. Therefore, it will not be linearly correlated to the temperature."

How to choose between Pearson and Spearman correlation?

"Pearson captures how linearly dependent are the two variables whereas Spearman captures the monotonic behavior of the relation between the variables. you should only begin with Spearman when you have some initial hypothesis of the relation being non-linear."

What's the difference between correlation and simple linear regression?

"The square of Pearson's correlation coefficient is the same as the one in simple linear regression SubsDcroibwen!load Brochure Neither simple linear regression nor correlation answer questions of causality directly. This point is important, because I've met people thinking that simple regression can magically allow an inference that X causes. That's preposterous belief. What's the difference between correlation and simple linear regression? Now let's think of few differences between the two. Simple linear regression gives much more information about the relationship than Pearson Correlation. Here are a few things which regression will give but correlation coefficient will not. The slope in a linear regression gives the marginal change in output/target variable by changing the independent variable by unit distance. Correlation has no slope. The intercept in a linear regression gives the value of target variable if one of the input/independent variable is set zero. Correlation does not have this information. Linear regression can give you a prediction given all the input variables. Correlation analysis does not predict anything."

What do you mean by a hard margin in SVM? A) The SVM allows very low error in classification B) The SVM allows high amount of error in classification C) None of the above

(A) A hard margin means that the SVM is very rigid in classification: it allows very little error and tries to work extremely well on the training set, which can cause overfitting.

Which of the following can act as possible termination conditions in K-Means? 1. For a fixed number of iterations. 2. Assignment of observations to clusters does not change between iterations. Except for cases with a bad local minimum. 3. Centroids do not change between successive iterations. 4. Terminate when RSS falls below a threshold.

All four conditions can be used as possible termination conditions in K-Means clustering: 1. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations. 2. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long. 3. This also ensures that the algorithm has converged at a minimum. 4. This criterion ensures that the clustering is of the desired quality after termination. Practically, it's good practice to combine it with a bound on the number of iterations to guarantee termination.
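A toy K-Means implementation wiring all four conditions together (a minimal sketch: random initial centers, no empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, max_iter=100, rss_tol=None, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = None
    for _ in range(max_iter):                            # 1. fixed number of iterations
        dists = np.linalg.norm(X[:, None] - centers, axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                        # 2. assignments unchanged
        labels = new_labels
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break                                        # 3. centroids unchanged
        centers = new_centers
        rss = ((X - centers[labels]) ** 2).sum()
        if rss_tol is not None and rss < rss_tol:
            break                                        # 4. RSS below a threshold
    return centers, labels
```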

What is the minimum no. of variables/ features required to perform clustering?

Ans: 1. At least a single variable is required to perform clustering analysis. Clustering analysis with a single variable can be visualized with the help of a histogram.

You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

Answer: Chances are, you might be tempted to say no, but that would be incorrect. Correlated variables have a substantial effect on PCA because, in their presence, the variance explained by a particular component gets inflated. For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit roughly twice the variance it would exhibit with uncorrelated variables. Also, adding correlated variables makes PCA put more importance on those variables, which is misleading.
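The inflation effect can be checked directly with scikit-learn (synthetic data; the 0.1 noise scale is an arbitrary choice to make two of the variables strongly correlated):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # b is strongly correlated with a
c = rng.normal(size=500)                 # c is independent

X = np.column_stack([a, b, c])
print(PCA().fit(X).explained_variance_ratio_)
# The first component absorbs the shared variance of a and b and dominates;
# with three uncorrelated variables the ratios would be roughly equal.
```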

Explain prior probability, likelihood and marginal likelihood in context of naiveBayes algorithm?

Answer: Prior probability is simply the proportion of each class of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: in a data set the dependent variable is binary (1 and 0), and the proportion of 1 (spam) is 70% while 0 (not spam) is 30%. Hence, we can estimate a 70% chance that any new email will be classified as spam. Likelihood is the probability of observing a given piece of evidence within a particular class. For example: the probability that the word 'FREE' appears in known spam messages is a likelihood. Marginal likelihood is the probability that the word 'FREE' appears in any message.

You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

Answer: Processing high-dimensional data on a limited-memory machine is a strenuous task, and your interviewer would be fully aware of that. The following are methods you can use to tackle such a situation: 1. Since we have limited RAM, we should close all other applications on the machine, including the web browser, so that most of the memory can be put to use. 2. We can randomly sample the data set: create a smaller data set, say with 1000 variables and 300,000 rows, and do the computations on it. 3. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated ones. For numerical variables, we'll use correlation; for categorical variables, the chi-square test. 4. We can also use PCA and pick the components which explain the maximum variance in the data set. 5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option. 6. Building a linear model using Stochastic Gradient Descent is also helpful. 7. We can also apply our business understanding to estimate which predictors can impact the response variable. But this is an intuitive approach; failing to identify useful predictors might result in a significant loss of information.

You are working on a time series data set. You manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than decision tree model. Can this happen? Why?

Answer: Time series data is known to possess linearity, whereas a decision tree algorithm works best at detecting non-linear interactions. The decision tree failed to provide robust predictions because it couldn't map the linear relationship as well as the regression model did. Therefore, we learned that a linear regression model can provide robust predictions when the data set satisfies its linearity assumptions.

A feature F1 can take the values A, B, C, D, E, and F, and represents the grade of students from a college. 1) Which of the following statements is true in this case? A) Feature F1 is an example of a nominal variable. B) Feature F1 is an example of an ordinal variable. C) It doesn't belong to any of the above categories. D) Both of these

B. Ordinal variables are variables whose categories have some order. For example, grade A should be considered a higher grade than grade B.

You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model's performance? What can you do about it?

If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might correspond to predicting only the majority class correctly, while our class of interest is the minority class (4%): the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier. If the minority-class performance is found to be poor, we can undertake the following steps: 1. We can use undersampling, oversampling or SMOTE to make the data balanced. 2. We can alter the prediction threshold by doing probability calibration and finding an optimal threshold using the ROC curve. 3. We can assign weights to classes such that the minority class gets a larger weight. 4. We can also use anomaly detection.
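To make the accuracy trap concrete, here is a sketch (assuming a recent scikit-learn for the zero_division argument) scoring a useless "always predict majority" model on a 96/4 split:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([0] * 96 + [1] * 4)  # 4% minority class, as in the question
y_pred = np.zeros(100, dtype=int)      # model that always predicts the majority class

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   :", (tp + tn) / 100)  # 0.96, despite missing every positive
print("sensitivity:", tp / (tp + fn))   # true positive rate = 0.0
print("specificity:", tn / (tn + fp))   # true negative rate = 1.0
print("F1         :", f1_score(y_true, y_pred, zero_division=0))  # 0.0
```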

What is the C parameter in SVM?

In an SVM you are searching for two things: a hyperplane with the largest minimum margin, and a hyperplane that correctly separates as many instances as possible. The problem is that you will not always be able to get both. The C parameter determines how strong your desire is for the latter. With a low C you get a pretty large minimum margin, but this requires neglecting outliers that we fail to classify correctly. With a high C you will not neglect the outliers, and thus end up with a much smaller margin. So which of these classifiers is the best? That depends on what the future data you will predict looks like, and most often you don't know that, of course.
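The contrast is easy to reproduce with scikit-learn (synthetic blobs; the C values are arbitrary extremes chosen for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

soft = SVC(kernel="linear", C=0.01).fit(X, y)    # low C: wide margin, tolerates errors
hard = SVC(kernel="linear", C=1000.0).fit(X, y)  # high C: narrow margin, chases outliers

print(len(soft.support_), len(hard.support_))    # the low-C model keeps more support vectors
```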

Which of the following methods do we use to find the best fit line for data in Linear Regression? A) Least Square Error B) Maximum Likelihood C) Logarithmic Loss D) Both A and B

In linear regression, we try to minimize the least square error of the model to identify the line of best fit.

You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

In such situations, we can use a bagging algorithm (like random forest) to tackle the high-variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. These samples are then used to generate a set of models using a single learning algorithm, and the model predictions are combined by voting (classification) or averaging (regression). Also, to combat high variance, we can: 1. Use regularization techniques, where higher model coefficients get penalized, lowering model complexity. 2. Use the top n features from a variable importance chart. Maybe, with all the variables in the data set, the algorithm is having difficulty finding the meaningful signal.

Is Correlation Transitive?

No, correlation is not transitive in general. Given two known correlations C(y,z) and C(z,x), the third correlation C(x,y) is only constrained to a range rather than determined: for some values of the known pair it must be positive, for others it must be negative, and in between nothing can be said about its sign. A very interesting insight here is that even if C(y,z) and C(z,x) are both 0.5, C(x,y) can actually be negative.

Which of the following evaluation metrics can be used to evaluate a model while modeling a continuous output variable? A) AUC-ROC B) Accuracy C) Logloss D) Mean-Squared-Error

Since linear regression gives output as continuous values, in such a case we use the mean squared error metric to evaluate the model's performance. The remaining options are used for classification problems. Keep in mind that least squares is a method for fitting the model, while MSE is a metric for evaluating the model's performance.

You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

Since the data is spread around the median, let's assume it's a normal distribution. In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which equals the median and mode), which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by the missing values.

Which of the following is an example of a deterministic algorithm? A) PCA B) K-Means C) None of the above

Solution: (A) A deterministic algorithm is one whose output does not change across different runs. PCA would give the same result if we ran it again; k-means would not.

Can decision trees be used for performing clustering? A. True B. False

Solution: (A) Decision trees can also be used to form clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.

Which of the following statement(s) is/are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)? 1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function. 2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration. 3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration. A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1, 2 and 3

Solution: (A) In SGD, each iteration uses a batch that generally contains a random sample of the data, whereas in GD each iteration uses all of the training observations.
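A minimal numpy sketch contrasting the two update rules on a linear regression problem (the synthetic data and learning rates are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w_gd = np.zeros(3)
for _ in range(100):                      # GD: every update uses ALL observations
    w_gd -= 0.1 * X.T @ (X @ w_gd - y) / len(y)

w_sgd = np.zeros(3)
for _ in range(1000):                     # SGD: every update uses ONE random sample
    i = rng.integers(len(y))
    w_sgd -= 0.01 * X[i] * (X[i] @ w_sgd - y[i])

print(w_gd, w_sgd)                        # both approach the true coefficients
```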

Which of the following algorithm is most sensitive to outliers? A. K-means clustering algorithm B. K-medians clustering algorithm C. K-modes clustering algorithm D. K-medoids clustering algorithm

Solution: (A) Out of all the options, K-Means clustering algorithm is most sensitive to outliers as it uses the mean of cluster data points to find the cluster center.

Suppose that you have a dataset D1 and you fit a degree-3 polynomial in linear regression and you find that the training and testing error is "0", or in other words it perfectly fits the data. What will happen when you fit a degree-4 polynomial in linear regression? A) There are high chances that the degree-4 polynomial will overfit the data B) There are high chances that the degree-4 polynomial will underfit the data C) Can't say D) None of these

Solution: (A) Since a degree-4 polynomial is more complex than the degree-3 model, it will again fit the training data perfectly: the training error will be zero, but the test error may not be. Hence there are high chances it will overfit the data.

Which of the following statement is true about outliers in Linear regression? A) Linear regression is sensitive to outliers B) Linear regression is not sensitive to outliers C) Can't say D) None of these

Solution: (A) The slope of the regression line will change due to outliers in most cases, so linear regression is sensitive to outliers.

Suppose you plotted a scatter plot between the residuals and predicted values in linear regression and you found that there is a relationship between them. Which of the following conclusions do you draw about this situation? A) Since there is a relationship, our model is not good B) Since there is a relationship, our model is good C) Can't say D) None of these

Solution: (A) There should be no relationship between the predicted values and the residuals. If such a relationship exists, it means that the model has not fully captured the information in the data.

Which of the following offsets, do we use in linear regression's least square line fit? Suppose horizontal axis is independent variable and vertical axis is dependent variable. A) Vertical offset B) Perpendicular offset C) Both, depending on the situation D) None of above

Solution: (A) We always consider residuals as vertical offsets: we calculate the direct differences between the actual values and the predicted Y values. Perpendicular offsets are useful in the case of PCA.

Is it possible that Assignment of observations to clusters does not change between successive iterations in K-Means A. Yes B. No C. Can't say D. None of these

Solution: (A) When the K-Means algorithm has reached a local or global minimum, it will not alter the assignment of data points to clusters between two successive iterations.

Suppose that you have a dataset D1 and you fit a degree-3 polynomial in linear regression and you find that the training and testing error is "0", or in other words it perfectly fits the data. What will happen when you fit a degree-2 polynomial in linear regression? A) There are high chances that the degree-2 polynomial will overfit the data B) There are high chances that the degree-2 polynomial will underfit the data C) Can't say D) None of these

Solution: (B) If a degree-3 polynomial fits the data perfectly, it's highly likely that a simpler model (a degree-2 polynomial) will underfit the data.

For two runs of K-Mean clustering is it expected to get same clustering results? A. Yes B. No

Solution: (B) The K-Means clustering algorithm converges to a local minimum, which might coincide with the global minimum in some cases but not always. Therefore, it's advised to run K-Means multiple times before drawing inferences about the clusters. Note, however, that it is possible to get the same clustering results from K-Means by setting the same seed value for each run; that simply makes the algorithm choose the same sequence of random numbers on each run.

Suppose Pearson correlation between V1 and V2 is zero. In such case, is it right to conclude that V1 and V2 do not have any relation between them? A) TRUE B) FALSE

Solution: (B) The Pearson correlation coefficient between two variables can be zero even when they have a relationship between them. If the correlation coefficient is zero, it just means they don't move together linearly. We can take examples like y = |x| or y = x².

Which of the following hyper parameter(s), when increased may cause random forest to over fit the data? 1. Number of Trees 2. Depth of Tree 3. Learning Rate A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1,2 and 3

Solution: (B) Usually, increasing the depth of the trees will cause overfitting. Learning rate is not a hyperparameter in random forest. Increasing the number of trees does not cause overfitting; it generally reduces variance.

Suppose that you have a dataset D1 and you fit a degree-3 polynomial in linear regression and you find that the training and testing error is "0", or in other words it perfectly fits the data. In terms of bias and variance, which of the following is true when you fit a degree-2 polynomial? A) Bias will be high, variance will be high B) Bias will be low, variance will be high C) Bias will be high, variance will be low D) Bias will be low, variance will be low

Solution: (C) Since a degree 2 polynomial will be less complex as compared to degree 3, the bias will be high and variance will be low.

Which of the following statements is true about the sum of residuals of A and B? The accompanying graphs show two fitted regression lines (A & B) on randomly generated data, and we want the sum of residuals in both cases A and B. Note: 1. The scale is the same in both graphs for both axes. 2. The X-axis is the independent variable and the Y-axis is the dependent variable. A) A has a higher sum of residuals than B B) A has a lower sum of residuals than B C) Both have the same sum of residuals D) None of these

Solution: (C) For a least squares fit that includes an intercept, the sum of residuals is always zero; therefore both have the same sum of residuals.

Let's say you are working with categorical feature(s) and you have not looked at the distribution of the categorical variable in the test data. You want to apply one-hot encoding (OHE) on the categorical feature(s). What challenges may you face if you have applied OHE on a categorical variable of the train dataset? A) All categories of the categorical variable are not present in the test dataset. B) The frequency distribution of categories is different in train as compared to the test dataset. C) Train and test always have the same distribution. D) Both A and B E) None of these

Solution: (D) Both A and B are true. OHE will fail to encode categories that are present in test but not in train, so this is one of the main challenges of applying OHE. The challenge given in option B is also true: you need to be more careful when applying OHE if the frequency distribution differs between train and test.
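A sketch of the option-A failure mode and the usual guard against it, using scikit-learn's OneHotEncoder (the handle_unknown="ignore" flag is the standard remedy):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["A"], ["B"], ["C"]])
test = np.array([["A"], ["D"]])  # "D" never appeared in train

enc = OneHotEncoder(handle_unknown="ignore").fit(train)
print(enc.transform(test).toarray())
# The unseen category "D" is encoded as an all-zeros row instead of raising an error.
```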

You are given two variables V1 and V2 that follow the two characteristics below. What can be said about the Pearson correlation between them? 1. If V1 increases then V2 also increases. 2. If V1 decreases then V2's behavior is unknown. A) Pearson correlation will be close to 1 B) Pearson correlation will be close to -1 C) Pearson correlation will be close to 0 D) None of these

Solution: (D) We cannot comment on the correlation coefficient using statement 1 alone; we need to consider both statements. Consider V1 as x and V2 as |x|: the correlation coefficient would not be close to 1 in such a case.

Which of the following is the most appropriate strategy for data cleaning before performing clustering analysis, given a less-than-desirable number of data points: 1. Capping and flooring of variables 2. Removal of outliers

Solution: 1 Removal of outliers is not recommended if the data points are few in number. In this scenario, capping and flooring of variables is the most appropriate strategy.

Which of the following clustering algorithms suffers from the problem of convergence at local optima? 1. K- Means clustering algorithm 2. Agglomerative clustering algorithm 3. Expectation-Maximization clustering algorithm 4. Diverse clustering algorithm

Solution: 1 + 3 Of the options given, only the K-Means and Expectation-Maximization clustering algorithms have the drawback of converging at local minima.

We can also compute the coefficients of linear regression with an analytical method called the "Normal Equation". Which of the following is/are true about the Normal Equation? 1. We don't have to choose the learning rate 2. It becomes slow when the number of features is very large 3. There is no need to iterate

Solution: 1, 2, 3 All three are true. Instead of gradient descent, the Normal Equation can be used to find the coefficients in closed form.
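A numpy sketch of the Normal Equation theta = (X^T X)^(-1) X^T y on synthetic data, illustrating points 1 and 3 (no learning rate, no iterations); point 2 follows because solving the system costs roughly O(n_features^3):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept column
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=0.1, size=100)

# Normal Equation: solve (X^T X) theta = X^T y in one shot.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [0.5, 1.0, -2.0]
```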

Sentiment Analysis is an example of: 1. Regression 2. Classification 3. Clustering 4. Reinforcement Learning

Solution: 1,2,4 Sentiment analysis at the fundamental level is the task of classifying the sentiments represented in an image, text or speech into a set of defined sentiment classes like happy, sad, excited, positive, negative, etc. It can also be viewed as a regression problem for assigning a sentiment score of say 1 to 10 for a corresponding image, text or speech. Another way of looking at sentiment analysis is to consider it using a reinforcement learning perspective where the algorithm constantly learns from the accuracy of past sentiment analysis performed to improve the future performance.

Movie Recommendation systems are an example of: 1. Classification 2. Clustering 3. Reinforcement Learning 4. Regression

Solution: 2,3 Generally, movie recommendation systems cluster the users in a finite number of similar groups based on their previous activities and profile. Then, at a fundamental level, people in the same cluster are made similar recommendations. In some scenarios, this can also be approached as a classification problem for assigning the most appropriate movie class to the user of a specific group of users. Also, a movie recommendation system can be viewed as a reinforcement learning problem where it learns by its previous recommendations and improves the future recommendations.

Below are the 8 actual values of target variable in the train file. [0,0,0,1,1,1,1,1] What is the entropy of the target variable? A) -(5/8 log(5/8) + 3/8 log(3/8)) B) 5/8 log(5/8) + 3/8 log(3/8) C) 3/8 log(5/8) + 5/8 log(3/8) D) 5/8 log(3/8) - 3/8 log(5/8)

Solution: A With five 1s and three 0s, the entropy is -(5/8 log(5/8) + 3/8 log(3/8)).
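The computation, spelled out in a few lines of Python:

```python
import math

values = [0, 0, 0, 1, 1, 1, 1, 1]
p1 = sum(values) / len(values)  # 5/8
p0 = 1 - p1                     # 3/8
entropy = -(p1 * math.log2(p1) + p0 * math.log2(p0))
print(entropy)                  # ~0.954 bits
```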

In the above question, which function would make p lie between (0, 1)? A) logistic function B) Log likelihood function C) Mixture of both D) None of them

Solution: A

True-False: Is it possible to apply a logistic regression algorithm on a 3-class Classification problem?

Solution: A Yes, we can apply logistic regression to a 3-class classification problem using the one-vs-all method. On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, there are clever extensions to logistic regression that do just that. In one-vs-rest logistic regression (OVR), a separate model is trained for each class, predicting whether an observation belongs to that class or not (thus making it a binary classification problem). It assumes that each classification problem (e.g. class 0 or not) is independent.
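A minimal scikit-learn sketch of one-vs-rest logistic regression on a 3-class problem (the iris dataset is used purely as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))        # 3 binary classifiers, one per class
```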

When the C parameter is set to infinite in SVM, which of the following holds true? A) The optimal hyperplane if exists, will be the one that completely separates the data B) The soft-margin classifier will separate the data C) None of the above

Solution: A At such a high misclassification penalty, the soft margin will not exist, as there is no room for error.

Which of the following algorithms do we use for Variable Selection? A) LASSO B) Ridge C) Both D) None of these

Solution: A In the case of lasso we apply an absolute (L1) penalty; as the penalty increases, some of the variables' coefficients become exactly zero.
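A quick demonstration of lasso zeroing out coefficients (synthetic data with only two informative features; alpha=0.1 is an arbitrary penalty strength):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # most coefficients are driven exactly to zero
```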

One of the very good methods to analyze the performance of Logistic Regression is AIC, which is similar to R-Squared in Linear Regression. Which of the following is true about AIC? A) We prefer a model with minimum AIC value B) We prefer a model with maximum AIC value C) Both but depend on the situation D) None of these

Solution: A We select the best model in logistic regression as the one with the least AIC value.

Imagine, you are working with "Analytics Vidhya" and you want to develop a machine learning algorithm which predicts the number of views on the articles. Your analysis is based on features like author name, number of articles written by the same author on Analytics Vidhya in past and a few other features. Which of the following evaluation metric would you choose in that case? 1. Mean Square Error 2. Accuracy 3. F1 Score

Solution: 1 The number of views of an article is a continuous target variable, which falls under a regression problem. So mean squared error will be used as the evaluation metric.

What do you mean by generalization error in terms of the SVM? A) How far the hyperplane is from the support vectors B) How accurately the SVM can predict outcomes for unseen data C) The threshold amount of error in an SVM

Solution: B Generalization error in statistics is generally the out-of-sample error, which measures how accurately a model can predict values for previously unseen data.

Which of the following methods do we use to best fit the data in Logistic Regression? A) Least Square Error B) Maximum Likelihood C) Jaccard distance D) Both A and B

Solution: B Logistic regression uses maximum likelihood estimation for training.

[True-False] Standardisation of features is required before training a Logistic Regression. A) TRUE B) FALSE

Solution: B Standardization isn't required for logistic regression. The main goal of standardizing features is to help the optimization technique converge.

Suppose you are using RBF kernel in SVM with high Gamma value. What does this signify? A) The model would consider even far away points from hyperplane for modeling B) The model would consider only the points close to the hyperplane for modeling C) The model would not be affected by distance of points from hyperplane for modeling D) None of the above

Solution: B The gamma parameter in SVM tuning signifies the influence of points near and far from the hyperplane. With a high gamma, each point's influence is local, so only the points close to the hyperplane effectively shape the model; with a low gamma, even far-away points influence the boundary, making the model smoother and more constrained.
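A sketch of the gamma effect with scikit-learn's RBF SVM (the two gamma values are arbitrary extremes chosen for contrast):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

smooth = SVC(kernel="rbf", gamma=0.1).fit(X, y)    # low gamma: far points also shape the boundary
wiggly = SVC(kernel="rbf", gamma=100.0).fit(X, y)  # high gamma: only nearby points matter

print(smooth.score(X, y), wiggly.score(X, y))      # the high-gamma model memorizes the training set
```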

Consider the following model for logistic regression: P(y=1|x, w) = g(w0 + w1x), where g(z) is the logistic function. In the above equation, P(y=1|x, w), viewed as a function of x, is what we obtain by varying the parameters w. What would be the range of p in such a case? A) (0, inf) B) (-inf, 0) C) (0, 1) D) (-inf, inf)

Solution: C As x ranges over the real numbers from −∞ to +∞, the logistic function gives output in the interval (0, 1).

SVMs are less effective when: A) The data is linearly separable B) The data is clean and ready to use C) The data is noisy and contains overlapping points

Solution: C When the data has noise and overlapping points, there is a problem in drawing a clear hyperplane without misclassifying.

Which of the following evaluation metrics cannot be applied to logistic regression output when comparing it with the target? A) AUC-ROC B) Accuracy C) Logloss D) Mean-Squared-Error

Solution: D Since logistic regression is a classification algorithm, its output is not a real-valued continuous quantity, so mean squared error cannot be used to evaluate it.

True-False: Is Logistic regression mainly used for Regression?

Solution: False Logistic regression is a classification algorithm; don't be confused by the name 'regression'.

Suppose you are using a Linear SVM classifier on a 2-class classification problem. You have been given data in which some points are circled red, representing the support vectors. If you remove the non-circled (non-support-vector) points from the data, will the decision boundary change?

Solution: False The decision boundary is determined by the support vectors; the rest of the points in the data won't affect it much.

True-False: Is Logistic regression a supervised machine learning algorithm? A) TRUE B) FALSE

Solution: True Logistic regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm has input variables (x) and a target variable (Y) when you train the model.

Suppose you are using a Linear SVM classifier on a 2-class classification problem. You have been given data in which some points are circled red, representing the support vectors. If you remove any one of the support vectors from the data, will the decision boundary change?

Solution: True The support vectors are positioned such that removing any one of them introduces slack in the constraints, so the decision boundary would completely change.

The effectiveness of an SVM depends upon: A) Selection of Kernel B) Kernel Parameters C) Soft Margin Parameter C D) All of the above

Solution: D The effectiveness of an SVM depends on the choice of kernel, the kernel parameters, and the soft margin parameter C together.

The minimum time complexity for training an SVM is O(n²). Given this fact, what sizes of datasets are not best suited for SVMs? A) Large datasets B) Small datasets C) Medium sized datasets D) Size does not matter

Solution: A Large datasets are not best suited for SVMs: with training complexity of at least O(n²), training time grows quickly with the number of samples.

(True/False)Lasso Regularization can be used for variable selection in Linear Regression

True. In the case of lasso regression we apply an absolute penalty, which drives some of the coefficients to exactly zero.

Is it possible to design a linear regression algorithm using a neural network?

True. A neural network can act as a universal approximator, so it can certainly implement a linear regression algorithm.

Pearson Coefficient

Used between two variables measured on a continuous scale (parametric). It is simply the ratio of the covariance of the two variables to the product of their standard deviations. It takes a value between +1 and -1. An extreme value on either side means the variables are strongly correlated. A value of zero indicates no linear correlation, but not independence.

Linear Regression is a supervised machine learning algorithm.

Yes. Linear regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm has an input variable (x) and an output variable (Y) for each example.

Is rotation necessary in PCA? If yes, Why? What will happen if you don't rotate the components?

Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variances captured by the components, which makes the components easier to interpret. Not to forget, that's the motive of doing PCA, where we aim to select fewer components (than features) that can explain the maximum variance in the data set. Rotation doesn't change the relative locations of the components; it only changes the actual coordinates of the points. If we didn't rotate the components, the effect of PCA would diminish and we'd have to select more components to explain the variance in the data set.

You are assigned a new project which involves helping a food delivery company save money. The problem is that the company's delivery team isn't able to deliver food on time, and as a result their customers get unhappy. To keep them happy, the company ends up delivering food for free. Which machine learning algorithm can save them?

You might have started hopping through the list of ML algorithms in your mind. But wait! Such questions are asked to test your machine learning fundamentals. This is not a machine learning problem; it is a route optimization problem. A machine learning problem consists of three things: 1. A pattern exists. 2. You cannot solve it mathematically (even by writing exponential equations). 3. You have data on it. Always look for these three factors to decide whether machine learning is the right tool to solve a particular problem.

