ML Interview Questions Deck 1
what is dunn index in clustering
Dunn Index: Inertia doesn't take into account the property that different clusters should be as different from each other as possible. The Dunn index takes care of that by also accounting for the inter-cluster distance, i.e. the separation between different clusters. It is the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter, so a higher value indicates compact, well-separated clusters.
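A minimal sketch of computing it with NumPy (the function name dunn_index and the brute-force pairwise approach are my own, assuming Euclidean distance):

    import numpy as np

    def dunn_index(X, labels):
        # Dunn index = min inter-cluster distance / max intra-cluster diameter
        clusters = [X[labels == k] for k in np.unique(labels)]
        # Largest pairwise distance within any single cluster (diameter)
        diameter = max(np.linalg.norm(p - q)
                       for c in clusters
                       for i, p in enumerate(c) for q in c[i + 1:])
        # Smallest pairwise distance between points of different clusters
        separation = min(np.linalg.norm(p - q)
                         for i, a in enumerate(clusters)
                         for b in clusters[i + 1:]
                         for p in a for q in b)
        return separation / diameter  # higher = compact, well-separated clusters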
what is correlation
Correlation quantifies the strength and direction of the linear relationship between two random variables. It is dimensionless and ranges between -1 and 1, where -1 is a perfect negative linear relationship, 0 is no linear relationship, and 1 is a perfect positive linear relationship.
what is covariance
Covariance measures how two variables are related to each other and how one would vary with respect to changes in the other. If the value is positive, there is a direct relationship between the variables, and one would increase or decrease with an increase or decrease in the base variable, given that all other conditions remain constant.
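A quick NumPy check of both ideas (the numbers are made up):

    import numpy as np

    x = np.array([2.1, 2.5, 3.6, 4.0])
    y = np.array([8.0, 10.0, 12.0, 14.0])

    cov = np.cov(x, y)[0, 1]        # sign gives the direction of the relationship
    corr = np.corrcoef(x, y)[0, 1]  # covariance rescaled to [-1, 1] by the std devs
    print(cov, corr)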
why is sampling important
Data analysis often cannot be done on the whole volume of data at once, especially when it involves larger datasets. It becomes crucial to take data samples that can represent the whole population and then perform analysis on them. While doing this, it is very necessary to draw the sample carefully so that it truly represents the entire dataset.
what is elbow method in clustering
For the k-means clustering method, the most common approach for choosing the number of clusters is the so-called elbow method. It involves running the algorithm multiple times in a loop with an increasing number of clusters, then plotting a clustering score (such as inertia) as a function of the number of clusters. The "elbow", where the score stops improving rapidly, suggests the best k, but it can be hard to determine where the elbow is.
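A minimal sketch with scikit-learn and matplotlib (the blob data is synthetic, for illustration):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    ks = range(1, 10)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, marker="o")  # the "elbow" is where the curve flattens
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()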
what is hierarchy-based clustering
Hierarchy-based clustering algorithms work by creating a tree of clusters, constructed from a hierarchical relation between the different data points. In the beginning, each data point in the dataset is its own cluster; then the two closest clusters are merged to form a new cluster, repeatedly, until one cluster remains. Unlike partition-based algorithms, a hierarchy-based algorithm doesn't need a fixed number of clusters. The algorithms produce a visual representation of the resultant clusters (a dendrogram) for better interpretation and understanding.
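A minimal agglomerative sketch with SciPy (synthetic blob data; ward linkage is just one of several merge criteria):

    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt

    X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
    Z = linkage(X, method="ward")  # start with singletons, merge closest clusters
    dendrogram(Z)                  # the tree of merges, for visual interpretation
    plt.show()
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters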
what is inertia in clustering
Inertia: Inertia calculates the sum of squared distances between the points in a cluster and the centroid of that cluster. This distance within a cluster is called the intra-cluster distance, and the sum of these squared distances over all clusters is the final inertia value. For a good clustering, the inertia value should be as small as possible.
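A small check of that definition against scikit-learn, assuming random 2-D data (inertia_ is the attribute scikit-learn exposes):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.RandomState(0).rand(100, 2)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Sum of squared distances of each point to its assigned centroid
    manual = sum(np.sum((X[km.labels_ == k] - c) ** 2)
                 for k, c in enumerate(km.cluster_centers_))
    assert np.isclose(manual, km.inertia_)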
what is the difference between knn and k means clustering
K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn't and is thus unsupervised learning.
what is hard clustering
K-means is a hard clustering method, which means it associates each point with one and only one cluster. A limitation of this approach is that there is no uncertainty measure or probability that tells us how strongly a data point is associated with a specific cluster.
What is the difference between likelihood and probabilities
Probability measures the chance of an outcome given fixed model parameters, while likelihood measures how well particular parameter values explain observed data: probability is a function of the data with the parameters fixed, and likelihood is a function of the parameters with the data fixed (the two terms are often used interchangeably in casual speech, though).
explain one-hot encoding and label encoding
One-hot encoding is the representation of categorical variables as binary vectors. Label encoding is converting labels/words into numeric form. Using one-hot encoding increases the dimensionality of the data set; label encoding doesn't affect the dimensionality. One-hot encoding creates a new binary variable for each level of the variable, whereas in label encoding the levels of a variable get encoded as integers 0, 1, 2, and so on.
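A minimal sketch with pandas and scikit-learn (the "color" column is a made-up example):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    one_hot = pd.get_dummies(df["color"])               # one binary column per level
    labels = LabelEncoder().fit_transform(df["color"])  # one integer column: [2 1 0 1]
    print(one_hot)
    print(labels)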
what is partition based clustering
Partition-based algorithms — aka, centroid-based clustering — are a group of clustering algorithms that divide data into non-hierarchical clusters. e.g.kmeans,clara,pam,clarans
What Are the Different Types of Machine Learning?
Supervised Learning: In supervised machine learning, a model makes predictions or decisions based on past, labeled data. Labeled data refers to sets of data that are given tags or labels and thus made more meaningful.
Unsupervised Learning: In unsupervised learning, we don't have labeled data; a model can identify patterns, anomalies, and relationships in the input data.
Reinforcement Learning: Using reinforcement learning, the model learns based on the rewards it received for its previous actions.
What is Supervised Learning?
Supervised learning is where a model learns from labeled training data to map inputs to known outputs. Common supervised algorithms include Support Vector Machines, regression, Naive Bayes, decision trees, the k-nearest neighbour algorithm, and neural networks.
What is the F1 Score?
The F1 score is a metric that combines both precision and recall: it is the harmonic mean of the two, F1 = 2 * (P * R) / (P + R). The F1 score is 1 when both precision and recall are 1, and it is a good metric for imbalanced datasets.
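A quick check of the formula against scikit-learn (the labels are made up):

    import math
    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

    p = precision_score(y_true, y_pred)  # 2/3
    r = recall_score(y_true, y_pred)     # 2/3
    f1 = 2 * p * r / (p + r)             # harmonic mean of precision and recall
    assert math.isclose(f1, f1_score(y_true, y_pred))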
what is naive in naive bayes
The fundamental Naive Bayes assumption is that each feature makes an independent (all features are independent) and equal (all features contribute equally to the outcome) contribution to the outcome. E.g. word order is not taken into account when classifying text. Because of these simplifying assumptions, Naive Bayes has high bias but low variance, i.e. its performance generalises consistently across datasets.
what are the prediction errors
The prediction error for any machine learning algorithm can be broken down into three parts: bias error, variance error, and irreducible error.
what is T, P, E wrt ML?
Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
how do we apply ml to hardware
We build ML algorithms in SystemVerilog, which is a hardware description language, and then program them onto an FPGA to apply machine learning to hardware.
how to achieve optimal bias variance
dimensionality reduction (decreases variance)
regularisation
ensemble learning
choosing an optimal k for kNN
what is the bias and variance of linear vs non linear ml models
Linear machine learning algorithms often have a high bias but a low variance. Nonlinear machine learning algorithms often have a low bias but a high variance
what is low and high variance
Low Variance: Suggests small changes to the estimate of the target function with changes to the training dataset. High Variance: Suggests large changes to the estimate of the target function with changes to the training dataset. Low-bias models tend to have high variance, and high-variance models tend to have low bias.
what is marginal, joint and conditional probability
Marginal Probability: The probability of an event irrespective of the outcomes of other random variables, e.g. P(A). Joint Probability: Probability of two (or more) simultaneous events, e.g. P(A and B) or P(A, B). Conditional Probability: Probability of one (or more) event given the occurrence of another event, e.g. P(A given B) or P(A | B)
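A tiny worked example in plain Python (the joint probabilities are invented and sum to 1):

    # Joint distribution over two binary events A and B
    p_joint = {("A", "B"): 0.2, ("A", "notB"): 0.1,
               ("notA", "B"): 0.3, ("notA", "notB"): 0.4}

    p_a_and_b = p_joint[("A", "B")]                     # joint: P(A, B) = 0.2
    p_a = p_joint[("A", "B")] + p_joint[("A", "notB")]  # marginal: P(A) = 0.3
    p_b = p_joint[("A", "B")] + p_joint[("notA", "B")]  # marginal: P(B) = 0.5
    p_a_given_b = p_a_and_b / p_b                       # conditional: P(A|B) = 0.4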
what is the bias variance tradeoff
When bias increases, variance decreases: the model makes more simplifying assumptions, so it is less accurate on the training dataset but more consistent (more applicable to other datasets). When bias decreases, variance increases: the model is more accurate on the training data but less applicable to other data (test). The best case is low bias and low variance: low bias means fewer simplifying assumptions, and low variance means more consistent performance across different datasets.
what are the different clustering algorithms
partition-based, hierarchical, density-based, distribution-based, and fuzzy
what is precision and recall
precision is tp/(tp+fp) -- the percentage of predicted positives that were actually true positives. recall is tp/(tp+fn) -- the percentage of all actual positives that were predicted as positive.
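With some hypothetical confusion-matrix counts:

    tp, fp, fn = 80, 20, 40

    precision = tp / (tp + fp)  # 0.80 -- of predicted positives, how many were right
    recall = tp / (tp + fn)     # ~0.67 -- of actual positives, how many were found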
What is BIC?
The Bayesian Information Criterion (BIC) scores a model by its likelihood, penalized by the number of parameters; lower BIC is better. It can be used to choose the number of clusters, but only if we are willing to extend the clustering algorithm beyond k-means to the more generalized version, the Gaussian Mixture Model (GMM).
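A minimal sketch of choosing k by BIC with scikit-learn's GaussianMixture (synthetic blob data):

    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(1, 10)]
    best_k = bics.index(min(bics)) + 1  # lowest BIC balances fit against complexity
    print(best_k)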
What is statistical power?
Power is the probability that we will correctly reject the null hypothesis when it is false, i.e. 1 - beta, the probability of avoiding a Type II error.
What is a prior probability in Naive Bayes?
Prior probability refers to the probability of a class before any evidence is seen, typically estimated as an initial guess from class frequencies in the training data. E.g. if we want to classify whether an email is spam or normal, we first take a prior probability P(N), where N means normal, or P(S), where S means spam.
how do you do feature selection
There are various means to select important variables from a data set, including the following:
Identify and discard correlated variables before finalizing the important variables
Select variables based on p-values from linear regression
Forward, backward, and stepwise selection
Lasso regression
Random forest variable-importance plots
Select top features based on information gain for the available set of features
Explain the difference between L1 and L2 regularization.
L2 regularization tends to spread the penalty among all the terms, shrinking weights smoothly toward zero, while L1 is more binary/sparse, driving many weights exactly to zero. L1 corresponds to setting a Laplace prior on the terms, while L2 corresponds to a Gaussian prior.
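A small sketch of the sparsity difference with scikit-learn (alpha=1.0 is an arbitrary regularization strength):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=10, random_state=0)

    l1 = Lasso(alpha=1.0).fit(X, y)
    l2 = Ridge(alpha=1.0).fit(X, y)
    print(np.sum(l1.coef_ == 0))  # L1 drives many coefficients exactly to zero
    print(np.sum(l2.coef_ == 0))  # L2 shrinks coefficients but rarely to exactly zero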
explain the difference between AI, ML and DL
Artificial Intelligence (AI) is the domain of producing intelligent machines. ML refers to systems that can learn from experience (training data), and Deep Learning (DL) refers to systems that learn from experience on large data sets. ML can be considered a subset of AI, and DL is ML applied to large data sets. Something that is AI but not ML is a rule-based chatbot; examples that are AI, ML, and DL are NLP and ASR (automatic speech recognition).
what is bayes theorem
Bayes' Theorem: a principled way of calculating a conditional probability without the joint probability: P(A|B) = P(B|A) * P(A) / P(B). It is useful when the joint probability is challenging to calculate (which is most of the time), or when the reverse conditional probability P(B|A) is available or easy to calculate. When P(B) is not directly available, it can be computed via the law of total probability.
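A tiny worked example in plain Python (the spam-filter numbers are invented):

    # Bayes: P(A|B) = P(B|A) * P(A) / P(B)
    p_spam = 0.3                 # prior P(spam)
    p_word_given_spam = 0.6      # P("money" | spam)
    p_word_given_normal = 0.05   # P("money" | normal)

    # P("money") via the law of total probability, since it isn't given directly
    p_word = p_word_given_spam * p_spam + p_word_given_normal * (1 - p_spam)
    p_spam_given_word = p_word_given_spam * p_spam / p_word  # ~0.84
    print(p_spam_given_word)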
what is bias, give examples of high bias and low bias models
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn: high bias = more assumptions, low bias = fewer assumptions. Examples of low-bias machine learning algorithms include decision trees, k-nearest neighbors, and support vector machines. Examples of high-bias machine learning algorithms include linear regression, linear discriminant analysis, and logistic regression.
what is density based clustering
In density-based clustering, algorithms form the different clusters based on the density of the data region at any given location. Using this approach, arbitrary-shaped clusters are formed in dense areas of the data. These types of algorithms struggle with data of varying densities or high dimensionality. Density-based clustering (e.g. DBSCAN) has been used in applications such as contact tracing.
what are distribution-based clustering algorithms
In distribution-based algorithms, the same cluster's data points need to belong to the same probability distribution. The most commonly used distribution is Gaussian distribution. For example, one application can divide the dataset into various Gaussian distributions; each has its own properties.
what is a fuzzy-based algorithm
In fuzzy algorithms, data points are assigned to clusters based on their level of belonging. Instead of the hard discrete values {0 = doesn't belong, 1 = belongs}, the level of belonging is expressed as a continuous value in the interval [0, 1], which describes the belonging relationship more accurately.
What is Overfitting, and How Can You Avoid It?
Overfitting is a situation that occurs when a model learns the training set too well, taking up random fluctuations in the training data as concepts. These impact the model's ability to generalize and don't apply to new data. There are multiple ways of avoiding overfitting, such as:
Regularization, which adds a cost term for the features to the objective function
Making a simpler model: with fewer variables and parameters, the variance can be reduced
Cross-validation methods like k-fold (see the sketch after this list)
If some model parameters are likely to cause overfitting, regularization techniques like LASSO can be used to penalize these parameters
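A minimal k-fold cross-validation sketch with scikit-learn (iris data, logistic regression as an arbitrary model):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    # If training scores are much higher than these held-out fold scores,
    # the model is likely overfitting
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())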
what are some examples of sampling
Probability sampling techniques: Clustered sampling, Simple random sampling, Stratified sampling. Non-probability sampling techniques: Quota sampling, Convenience sampling, Snowball sampling, etc.
If applying naive bayes for e.g. to classify emails as spam or normal, why do we add 1 to the count of all words?
Procedure: get a histogram of each word in the normal messages vs the spam messages. E.g. if the message is "lunch money money money money", then to decide whether it is spam we calculate Pr(S | lunch money money money money), which is proportional to Pr(S) * Pr(lunch | S) * Pr(money | S)^4. If "lunch" does not appear in the training set of spam messages, then Pr(lunch | S) = 0 and any message containing "lunch" can never be classified as spam. This is known as the zero-frequency problem; adding 1 to the count of every word (Laplace smoothing) guarantees every conditional probability is non-zero.
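A minimal sketch of add-one (Laplace) smoothing in plain Python (the word counts are invented):

    # Hypothetical word counts from the spam training messages
    spam_counts = {"money": 40, "offer": 25}
    vocab = ["money", "offer", "lunch"]  # "lunch" never appears in spam

    total = sum(spam_counts.get(w, 0) for w in vocab)
    # Without smoothing, Pr(lunch | S) = 0 would zero out the whole product
    probs = {w: (spam_counts.get(w, 0) + 1) / (total + len(vocab)) for w in vocab}
    print(probs["lunch"])  # small but non-zero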
explain how an roc curve works
The ROC curve is a graphical representation of the contrast between the true positive rate and the false positive rate at various thresholds. It's often used as a proxy for the trade-off between the sensitivity of the model (true positives) and the fall-out, i.e. the probability it will trigger a false alarm (false positives).
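A minimal sketch with scikit-learn (synthetic data, logistic regression as an arbitrary classifier):

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_te, probs)  # one (FPR, TPR) point per threshold
    plt.plot(fpr, tpr)
    plt.xlabel("false positive rate")
    plt.ylabel("true positive rate")
    plt.show()
    print(roc_auc_score(y_te, probs))  # area under the curve summarizes the trade-off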
what is the silhouette coefficient
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of. We can compute the mean Silhouette Coefficient over all samples and use this as a metric to judge the number of clusters. It ranges from -1 to 1 (-1 = wrong clustering, 0 = indifferent/overlapping clusters, 1 = well-separated clusters): it approaches 1 when the intra-cluster distance a is 0 and the inter-cluster distance b is high, since (b - a)/b = 1; it is 0 when b = a; and it is negative when a > b.
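A minimal sketch of using the mean coefficient to pick k with scikit-learn (synthetic blob data):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))  # pick the k with the highest mean score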