Data Science Interview


Error metric for classification?

AUC - an aggregated measure of a binary classifier's performance across all possible threshold values; it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
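A minimal sketch of computing AUC with scikit-learn; the synthetic dataset and logistic-regression model are illustrative assumptions, not part of the original answer:

```python
# Hedged sketch: ROC AUC for a binary classifier on illustrative data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# AUC is threshold-independent, so it is computed from predicted probabilities,
# not from hard 0/1 predictions.
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```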

What is an ensemble learning model?

Combines a set of learners (individual models) to obtain better predictive performance than any single model alone.

What is selection bias?

Bias introduced when the data are selected in a non-random way. If selection bias is not accounted for, conclusions drawn from the analysis may not be accurate.

What are the two types of ensemble methods?

Boosting (an iterative technique that adjusts the weight of each observation based on the accuracy of the previous prediction; can lead to overfitting) and bagging (trains similar learners on small bootstrap samples of the population and then averages or votes on their predictions).
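A minimal sketch contrasting the two with scikit-learn; the data, base learner, and estimator counts are illustrative assumptions:

```python
# Hedged sketch: bagging vs. boosting on illustrative data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Bagging: many trees trained on bootstrap samples, predictions averaged/voted.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
# Boosting: learners added sequentially, re-weighting observations the previous
# learners got wrong.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```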

What is Specificity?

True_Negative / (True_Negative + False_Positive): # samples correctly predicted as negative / # samples that are actually negative.

What is the true positive rate?

Also called sensitivity or recall; measures the percentage of actual positives that are correctly identified: True_Positives / (True_Positives + False_Negatives).

What is the curse of dimensionality?

As the number of dimensions grows, the volume of the feature space increases exponentially, making the data set highly sparse and increasing storage space and processing time for the modeling algorithm.

How will you define the number of clusters in a clustering algorithm?

For each k, plot the total within-cluster sum of squares (WSS), i.e. the variance of the observations within their clusters (the elbow curve). The point after which you no longer see a meaningful reduction in WSS is the bend, and is best taken as k.
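A minimal sketch of the elbow method with scikit-learn's KMeans; the blob data and the range of k values are illustrative assumptions:

```python
# Hedged sketch: elbow method for choosing k on illustrative data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# inertia_ is the within-cluster sum of squares (WSS) for the fitted model.
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)

# Plot k vs. wss and look for the "elbow" where the curve flattens out.
print(list(zip(range(1, 11), wss)))
```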

What is the Central Limit Theorem?

Given a population with mean μ and standard deviation σ, if you take sufficiently large random samples from the population with replacement, the distribution of the sample means will be approximately normal with mean μ (and standard deviation σ/√n), regardless of the population's distribution.
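A minimal simulation sketch of the CLT; the exponential population and sample sizes are illustrative assumptions:

```python
# Hedged sketch: CLT on a clearly non-normal (exponential) population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, mean close to 2

# Means of many random samples (with replacement) from the population.
sample_means = [rng.choice(population, size=50, replace=True).mean()
                for _ in range(5_000)]

# The sample means are centred near the population mean and look roughly normal,
# with spread close to sigma / sqrt(n).
print(np.mean(sample_means), population.mean())
print(np.std(sample_means), population.std() / np.sqrt(50))
```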

Bias-Variance tradeoff

High bias, low variance: consistent but inaccurate models; less complex, simple structure; tends to underfit. High variance, low bias: accurate but inconsistent models; more complex, flexible structure; tends to overfit.

How is k-NN different from k-means clustering?

K-means: an unsupervised algorithm that, given some data, clusters it into K clusters; part of the family of moving-centroid algorithms. K-NN: a supervised classification algorithm, where k describes the number of neighboring data points that influence the classification of a given observation.

What are the different kernel functions in SVM?

Linear kernel, polynomial kernel, radial basis function (RBF) kernel, sigmoid kernel.
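A minimal sketch trying each of these kernels with scikit-learn's SVC; the dataset and default hyperparameters are illustrative assumptions:

```python
# Hedged sketch: comparing the common SVM kernels on illustrative data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:8s} accuracy: {score:.3f}")
```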

Error metrics for regression?

Mean squared error (MSE): the average squared error between the predicted and actual values. Mean absolute error (MAE): the average absolute distance between the predicted and target values (fails to punish large errors in prediction). R²: the proportion of the variance in the response that is explained by the features.
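A minimal sketch computing the three metrics with scikit-learn; the values are illustrative:

```python
# Hedged sketch: MSE, MAE, and R^2 on illustrative predictions.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
```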

What is the p-value?

The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A low p-value means that, if the null hypothesis were true, a result this extreme would be very unlikely, so we reject the null.

What is precision?

True_Positive / (True_Positive + False_Positive): # samples correctly predicted as positive / # samples predicted as positive.
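A minimal sketch deriving precision, recall/TPR, and specificity from a confusion matrix with scikit-learn; the labels are illustrative assumptions:

```python
# Hedged sketch: precision, recall/TPR, and specificity from illustrative labels.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision  :", precision_score(y_true, y_pred))  # tp / (tp + fp)
print("recall/TPR :", recall_score(y_true, y_pred))      # tp / (tp + fn)
print("specificity:", tn / (tn + fp))                    # no direct sklearn helper
```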

What are the differences between supervised and unsupervised learning?

Unsupervised: uses unlabeled data as input; finds patterns or groupings; there is no single right solution and no feedback. Supervised: uses known, labeled data as input; there is feedback; learns a mapping from features to labels.

How can you avoid overfitting your model?

Use k-fold cross-validation. Use regularization techniques, such as LASSO, that penalize certain model parameters. Take fewer variables into account, using feature selection (e.g. permutation importance) or dimensionality reduction (e.g. principal component analysis).
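A minimal sketch of two of these ideas, cross-validation and an L1 penalty, with scikit-learn; the regression data and alpha value are illustrative assumptions:

```python
# Hedged sketch: cross-validation plus L1 regularization on illustrative data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=0)

# Cross-validation gives an honest estimate of out-of-sample performance,
# and the L1 penalty drives uninformative coefficients toward zero.
model = Lasso(alpha=1.0)
print("CV R^2:", cross_val_score(model, X, y, cv=5).mean())
print("non-zero coefficients:", (model.fit(X, y).coef_ != 0).sum())
```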

Can you explain the difference between a Test Set and a Validation Set?

The validation set is carved out of the training data and is used for hyperparameter tuning and to avoid overfitting; the test set is used for evaluating the performance of a trained machine learning model.

What is a decision tree?

A supervised learning algorithm used for both classification and regression that splits the data so as to maximize information gain: each internal node denotes a test on an attribute, each edge denotes an outcome of that test, and each leaf node holds a class label (or predicted value).

What is a confusion matrix?

A table used to evaluate the performance of a classification model. It tabulates the actual values against the predicted values; for binary classification it is a 2×2 matrix.

What is a random forest model?

An ensemble learning method for classification or regression that operates by constructing a multitude of decision trees (each trained on a random subset of samples and features) and outputting a prediction that may be the mode, mean, or median of the individual trees' predictions.

What are recommender systems?

An information-filtering system that predicts the "rating" or preference a user would give to an item.

What are the different methods of hyperparameter tuning?

GridSearchCV (exhaustive search over a specified parameter grid) and RandomizedSearchCV (samples a fixed number of parameter settings from specified distributions).
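A minimal sketch of both with scikit-learn; the random-forest model, data, and parameter ranges are illustrative assumptions:

```python
# Hedged sketch: grid search vs. randomized search on illustrative data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search tries every combination; random search samples a fixed number.
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=5).fit(X, y)
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=5, cv=5, random_state=0).fit(X, y)

print("grid  :", grid.best_params_, grid.best_score_)
print("random:", rand.best_params_, rand.best_score_)
```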

What is the difference between regression and classification in ML?

If the label data takes discrete values, it is a classification problem; if the labels are continuous values, it is a regression problem.

What is gradient descent?

An iterative optimization algorithm for finding the minimum of a function; in machine learning it is used to minimize the cost (error/loss) function by repeatedly stepping in the direction of the negative gradient.
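A minimal sketch of gradient descent on a one-parameter linear model; the data, learning rate, and iteration count are illustrative assumptions:

```python
# Hedged sketch: gradient descent minimizing mean squared error for y ~ w * x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true slope is 3

w, lr = 0.0, 0.1
for _ in range(200):
    grad = -2 * np.mean((y - w * x) * x)  # d/dw of the mean squared error
    w -= lr * grad                        # step against the gradient
print(w)  # converges toward roughly 3.0
```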

What is logistic regression?

Models the relationship between a binary dependent variable and one or more feature variables by estimating probabilities with its underlying logistic (sigmoid) function.

What is the variance of a model?

A high-variance model learns noise from the training dataset as well as signal and performs badly on the test dataset; it leads to high sensitivity to the training data and to overfitting.

What is overfitting?

A modeling error that occurs when a model is fit too closely to a limited set of data points. The model matches its training data much better than unseen data, so the training score is a lot greater than the test score.

What is A/B testing?

The process of showing two versions of the same web page to different segments of website visitors at the same time and comparing which version drives more conversions (the desired action).
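One common way to analyse A/B test results is a two-proportion z-test on conversion counts; this is a hedged sketch, and the conversion numbers and statsmodels approach are illustrative assumptions rather than the only valid analysis:

```python
# Hedged sketch: two-proportion z-test on illustrative A/B conversion counts.
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 150]   # version A, version B
visitors = [2400, 2500]

stat, p_value = proportions_ztest(conversions, visitors)
print("z =", stat, "p =", p_value)  # a small p suggests a real difference in conversion rate
```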

What is a false positive rate?

The ratio of actual negatives incorrectly categorized as positive to the total number of actual negatives: False_Positives / (False_Positives + True_Negatives).

What is the ROC curve?

shows the true positive rate against the false positive rate for various threshold values

What is the law of large numbers?

States that as the number of trials increases, the observed relative frequency converges to the theoretical probability.
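A minimal simulation sketch for a fair coin; the number of flips is an illustrative assumption:

```python
# Hedged sketch: law of large numbers for a fair coin (illustrative simulation).
import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)  # 1 = heads

# The running relative frequency of heads converges toward the theoretical 0.5.
running_freq = np.cumsum(flips) / np.arange(1, flips.size + 1)
print(running_freq[[9, 99, 999, 9999, 99999]])
```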

What is a linear regression?

A supervised learning algorithm that finds the linear relationship between variables: one or more predictors and a response variable.

What is TF/IDF vectorization ?

Term frequency-inverse document frequency: a numerical statistic intended to reflect how important a word is to a DOCUMENT within a COLLECTION. The value increases with the number of times the word appears in the DOCUMENT, offset by the frequency of the word across the COLLECTION.
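A minimal sketch with scikit-learn's TfidfVectorizer, assuming a recent scikit-learn; the documents are illustrative:

```python
# Hedged sketch: TF-IDF features for a tiny illustrative corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)         # rows = documents, columns = terms
print(vec.get_feature_names_out())
print(X.toarray().round(2))         # words common across the collection get lower weights
```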

What is bias?

The model makes simplifying assumptions to make the target function easier to learn; high bias leads to underfitting.

How does k-fold cross-validation work?

The sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data; this is repeated k times so that each subsample serves as the validation set exactly once.
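A minimal sketch of the loop using scikit-learn's KFold; the data and model are illustrative assumptions:

```python
# Hedged sketch: manual 5-fold cross-validation on illustrative data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=0)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # each fold is the validation set once
print(scores, sum(scores) / len(scores))
```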

What is regularization?

Adding a tuning parameter (penalty) to a model to induce smoothness and prevent overfitting; the penalty is typically the L1 norm (Lasso) or the L2 norm (Ridge) of the coefficients.
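A minimal sketch of the practical difference between the two penalties with scikit-learn; the data and alpha values are illustrative assumptions:

```python
# Hedged sketch: L1 (Lasso) vs. L2 (Ridge) penalties on illustrative data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives many coefficients exactly to zero; L2 only shrinks them.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```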

What is the difference between type I vs type II error?

A type I error occurs when the null hypothesis is true but is rejected (a false positive). A type II error occurs when the null hypothesis is false but erroneously fails to be rejected (a false negative).

What are imbalanced classes / data bias?

A type of error or bias in which certain classes or elements of a dataset are more heavily represented than others.

What is the chi-squared goodness-of-fit test?

Used to compare an observed sample distribution with an expected probability distribution; it determines how well the theoretical distribution fits the empirical distribution.
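A minimal sketch with SciPy for a fair-die hypothesis; the observed counts are illustrative assumptions:

```python
# Hedged sketch: chi-squared goodness-of-fit test for a die (illustrative counts).
from scipy.stats import chisquare

observed = [18, 22, 16, 14, 12, 18]  # counts from 100 rolls
expected = [100 / 6] * 6             # expectation under a fair die

stat, p_value = chisquare(observed, f_exp=expected)
print("chi2 =", stat, "p =", p_value)  # a large p gives no evidence against a fair die
```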

What is the F1 score?

The harmonic mean of the precision and recall scores. The F1 measure penalizes classifiers with imbalanced precision and recall: if F1 = 1, precision and recall are both perfect; the further F1 falls below 1, the worse precision and/or recall are.
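A tiny sketch of why the harmonic mean is used; the precision and recall values are illustrative assumptions:

```python
# Hedged sketch: F1 as the harmonic mean of precision and recall (illustrative values).
precision, recall = 0.9, 0.5

f1 = 2 * precision * recall / (precision + recall)
print(f1)                        # about 0.643, pulled toward the weaker score
print((precision + recall) / 2)  # 0.7, the arithmetic mean, which hides the imbalance
```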

