ds interview
Error metric for classification?
AUC - an aggregated measure of a binary classifier's performance over all possible threshold values; it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s
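A quick sketch with scikit-learn's roc_auc_score, on made-up labels and predicted probabilities:

```python
# Toy example: computing AUC with scikit-learn (values are made up).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # actual labels
y_score = [0.1, 0.4, 0.8, 0.65, 0.9, 0.3]   # predicted probabilities for class 1

# AUC aggregates performance over all thresholds; 0.5 = random, 1.0 = perfect.
print(roc_auc_score(y_true, y_score))
```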
what is an ensemble learning model
combines a set of learners (individual models) to obtain better predictive performance than any single one
What is selection bias ?
bias introduced by selecting data in a non-random way; if the selection bias is not accounted for, conclusions drawn from the analysis may not be accurate
What are the two types of ensemble methods
Boosting: an iterative technique that adjusts the weight of each observation based on the accuracy of the previous prediction; can lead to overfitting. Bagging: combines similar learners trained on small bootstrap samples of the population and then takes a mean of all the predictions.
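A minimal sketch contrasting the two with scikit-learn on a made-up dataset (BaggingClassifier bags decision trees by default; GradientBoostingClassifier is one common boosting implementation):

```python
# Sketch: bagging vs boosting on a toy classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)             # parallel learners on bootstrap samples
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)   # sequential learners, each correcting the previous ones' errors

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```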
What is Specificity?
True_Negative / (True_Negative + False_Positive): # samples correctly predicted as negative / # samples that are actually negative
What is the true positive rate?
(also called sensitivity or recall) measures the percentage of actual positives that are correctly identified: True_Positives / (True_Positives + False_Negatives)
What is the curse of dimensionality?
As the number of dimensions grows, the volume of the feature space increases exponentially, causing high sparsity in the data set and increasing storage space and processing time for the modeling algorithm
How will you define the number of clusters in a clustering algorithm?
For each k, plot the total within-cluster sum of squares (WSS), i.e. the variance of the observations within their clusters (elbow curve). The point after which you no longer see a significant reduction in WSS is the bending point, and that k is taken as the best.
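A sketch of the elbow curve with scikit-learn's KMeans on made-up blob data (inertia_ is its name for WSS):

```python
# Sketch of the elbow method on toy data.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # inertia_ = total within-cluster sum of squares

plt.plot(range(1, 10), wss, marker="o")  # look for the "elbow" bend
plt.xlabel("k"); plt.ylabel("WSS")
plt.show()
```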
What is the Central Limit Theorem?
Given a population with mean μ and standard deviation σ, if you take sufficiently large random samples from the population with replacement, the distribution of the sample means will be approximately normal, with mean μ (and standard deviation σ/√n)
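A simulation sketch with NumPy: sample means drawn from a skewed (exponential) population still come out approximately normal:

```python
# CLT simulation sketch on a made-up skewed population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, mean ≈ 2

sample_means = [rng.choice(population, size=50, replace=True).mean()
                for _ in range(5_000)]

print(np.mean(sample_means))   # ≈ population mean (2.0)
print(np.std(sample_means))    # ≈ σ/√n ≈ 2/√50
```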
Bias-Variance tradeoff
High bias, low variance: consistent but inaccurate models; less complex, simple structure; underfit. High variance, low bias: accurate but inconsistent models; more complex, flexible structure; overfit.
How is k-NN different from k-means clustering?
K-means: an unsupervised algorithm that, given some data, clusters it into k clusters; part of the family of moving-centroid algorithms. k-NN: a classification algorithm, where k describes the number of neighboring data points that influence the classification of a given observation.
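A side-by-side sketch on toy data; note that k-means ignores the labels while k-NN requires them:

```python
# Sketch: k-means (unsupervised) vs k-NN (supervised) on toy blobs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# k-means: no labels used; k = number of clusters to find
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# k-NN: labels used; k = number of neighbors that vote on each prediction
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))
```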
What are the different kernels functions in SVM ?
Linear kernel, polynomial kernel, radial basis (RBF) kernel, sigmoid kernel
Error metric for regression?
Mean squared error (MSE): average squared error between the predicted and actual values. Mean absolute error (MAE): average absolute distance between the predicted and target values (fails to punish large errors in prediction). R²: proportion of the variance in the response that is explained by the features.
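A sketch of all three with scikit-learn's metrics module, on made-up values:

```python
# Toy example: the three regression metrics in scikit-learn.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.4, 2.0, 6.5]

print(mean_squared_error(y_true, y_pred))   # MSE: punishes large errors
print(mean_absolute_error(y_true, y_pred))  # MAE: linear penalty
print(r2_score(y_true, y_pred))             # R²: variance explained
```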
What is the p-value?
Assuming the null hypothesis is true, the probability of obtaining a test statistic at least as extreme as the one observed. A low p-value means that, if the null hypothesis were true, a result this extreme would be very unlikely, which is evidence against the null.
What is precision?
True_Positive / (True_Positive + False_Positive): # samples correctly predicted as positive / # samples predicted as positive
What are the differences between supervised and unsupervised learning?
Unsupervised: uses unlabeled data as input; finds patterns or groupings; no "right" solution; no feedback. Supervised: uses known, labeled data as input; has feedback; learns a mapping from features to labels.
How can you avoid the overfitting your model?
Use k-fold cross-validation. Use regularization techniques, such as LASSO, that penalize certain model parameters. Take fewer variables into account, using feature selection (permutation importance) or dimensionality reduction (principal component analysis).
Can you explain the difference between a Test Set and a Validation Set?
The validation set is part of the training data, as it is used for hyperparameter tuning and to avoid overfitting; the test set is used for evaluating the performance of the trained machine learning model.
What is a decision tree?
a supervised learning algorithm used for both classification and regression that splits so as to maximize information gain; each node denotes a test on an attribute, each edge denotes an outcome of that test, and each leaf node holds a class label
What is a confusion matrix?
a table used to evaluate the performance of a classification model; it tabulates the actual values against the predicted values (a 2×2 matrix for a binary classifier)
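A sketch tying the matrix to the rate metrics defined above (precision, recall/TPR, specificity, FPR), on toy labels:

```python
# Toy example: deriving the rate metrics from a 2×2 confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("precision   =", tp / (tp + fp))
print("recall/TPR  =", tp / (tp + fn))
print("specificity =", tn / (tn + fp))
print("FPR         =", fp / (fp + tn))
```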
What is a random forest model?
ensemble learning method for classification or regression that operates by constructing a multitude of decision trees (each trained with random data/feature subsets) and outputting a prediction that may be the mode, mean, or median of the individual trees' predictions
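A minimal sketch with scikit-learn's RandomForestClassifier on the iris data:

```python
# Sketch: random forest on iris.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 trees sees a bootstrap sample and random feature subsets;
# the forest predicts the majority vote (mode) of the trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))
```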
What are recommender systems?
filtering system that predicts the "rating" a user would give to an item
different types of hyperparameter search?
GridSearchCV: exhaustive search over a specified parameter grid. RandomizedSearchCV: samples a fixed number of parameter settings from specified ranges or distributions.
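A sketch of both search strategies from scikit-learn, tuning a random forest (the parameter values below are arbitrary choices):

```python
# Sketch: grid search vs randomized search over the same parameter space.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
params = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, None]}

grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=5)
grid.fit(X, y)                    # tries all 9 combinations
print(grid.best_params_)

rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)                    # samples only 5 of the 9 combinations
print(rand.best_params_)
```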
What is the difference between regression and classification in ML?
If the label data has discrete values, it is a classification problem; if the values are continuous, it is a regression problem.
what is gradient descent
iterative optimization algorithm for finding the minimum of a function; in ML it is used to minimize the cost (error/loss) function by repeatedly stepping in the direction of the negative gradient
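A from-scratch sketch (not any library's implementation) minimizing MSE for a 1-D linear fit on made-up data:

```python
# Sketch: gradient descent for y = w*x + b, minimizing MSE.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                     # true relationship: w=2, b=1

w, b, lr = 0.0, 0.0, 0.01             # initial guess and learning rate
for _ in range(2000):
    y_hat = w * x + b
    # Gradients of MSE = mean((y_hat - y)^2) w.r.t. w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w                  # step against the gradient
    b -= lr * grad_b

print(w, b)                           # ≈ 2.0, 1.0
```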
What is a logistic regression?
models the relationship between a binary dependent variable and one or more feature variables, estimating probabilities with its underlying logistic (sigmoid) function
What is the variance of a model
the model learns the noise in the training dataset as well and performs badly on the test dataset; leads to high sensitivity (to the training data) and overfitting
What is overfitting
modeling error that occurs when a model fits a limited set of data points too closely; the model matches its training data much better than unseen data, so the training score is a lot greater than the test score
What is A/B Testing
process of showing two versions of the same web page to different segments of website visitors at the same time and comparing which version drives more conversions (the desired action)
What is a false positive rate?
the ratio of negative instances incorrectly classified as positive to the total number of actual negatives: False_Positives / (False_Positives + True_Negatives)
What is the ROC Curve
shows the true positive rate against the false positive rate for various threshold values
What is law of large numbers
states that as the number of trials increases, the observed relative frequency converges to the theoretical probability
What is a linear regression?
supervised learning algorithm that helps find the linear relationship between two variables: one is the predictor and the other is the response variable
What is TF/IDF vectorization ?
term frequency-inverse document frequency: a numerical statistic intended to reflect how important a word is to a DOCUMENT in a COLLECTION; the value increases with the number of times the word appears in the DOCUMENT, offset by the frequency of the word in the COLLECTION
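A sketch with scikit-learn's TfidfVectorizer on a tiny made-up corpus:

```python
# Sketch: TF-IDF weights on a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)     # rows = documents, cols = terms

# Words common to many documents (e.g. "the", "sat") get lower weights.
print(vec.get_feature_names_out())
print(tfidf.toarray().round(2))
```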
What is bias
the model makes simplifying assumptions to make the target function easier to learn; leads to underfitting
How do you do k-fold cross-validation?
the sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data; the process is repeated k times, with each subsample used exactly once for validation
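A sketch of 5-fold cross-validation in scikit-learn, shown both manually with KFold and via the cross_val_score shortcut:

```python
# Sketch: 5-fold cross-validation on iris.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])               # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold
print(scores)

print(cross_val_score(model, X, y, cv=kf))              # same thing in one call
```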
what is regularization
adding a tunable penalty term to a model to induce smoothness and prevent overfitting; the penalty is typically on the L1 norm of the coefficients (lasso) or the L2 norm (ridge)
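A sketch contrasting the two penalties with scikit-learn's Lasso and Ridge on synthetic data (alpha, the penalty strength, is an arbitrary value here):

```python
# Sketch: L1 (lasso) vs L2 (ridge) regularization.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can zero out coefficients (feature selection)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients toward zero

print(lasso.coef_.round(2))
print(ridge.coef_.round(2))
```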
What is the difference between type I vs type II error?
a type I error occurs when the null hypothesis is true but is rejected (false positive). A type II error occurs when the null hypothesis is false but erroneously fails to be rejected (false negative)
Imbalanced Classes/ data bias?
a type of bias in which certain classes of a dataset are much more heavily represented than others
chi squared goodness of fit test
used to compare the observed sample distribution with an expected probability distribution; determines how well a theoretical distribution fits the empirical distribution
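A sketch with scipy.stats.chisquare, testing made-up die-roll counts against a fair-die (uniform) expectation:

```python
# Sketch: chi-squared goodness-of-fit test on hypothetical die rolls.
from scipy.stats import chisquare

observed = [18, 22, 16, 25, 19, 20]   # made-up counts from 120 rolls
expected = [20] * 6                   # fair die: 120/6 per face

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat, p)                        # large p => no evidence against fairness
```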
What is the F1 score
harmonic mean of the precision and recall scores: F1 = 2 × (precision × recall) / (precision + recall). The F1 measure penalizes classifiers with imbalanced precision and recall scores. If F1 = 1, precision and recall are both perfect; values below 1 mean precision and/or recall fall short
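A sketch verifying the formula against scikit-learn's f1_score, on the same toy labels as the confusion-matrix example above:

```python
# Sketch: F1 as the harmonic mean of precision and recall.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(f1_score(y_true, y_pred))       # same as the formula below
print(2 * p * r / (p + r))
```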