Data Science Interview Questions
What do you understand by Imbalanced Data?
Data is said to be highly imbalanced if it is distributed unequally across different categories. Such datasets can degrade model performance and lead to inaccurate predictions.
Data Science vs Data Analytics: What's the Difference?
Data science uses insights extracted from data to solve specific business problems. Data analytics is a more exploratory practice of unearthing the correlations and patterns that exist in a dataset.
Which feature selection techniques do you know?
Here are some feature selection techniques: Principal Component Analysis, Neighborhood Component Analysis, and the ReliefF algorithm.
Differentiate between box plot and histogram.
Histograms are bar-chart representations of the frequency of a numerical variable's values; they are useful for estimating the probability distribution, variation, and outliers. Boxplots are used for communicating different aspects of a data distribution: the shape of the distribution is not shown, but insights can still be gathered. They are useful for comparing multiple distributions at the same time because they take up less space than histograms.
What is ensemble learning?
Sometimes a dataset is so complex that it is difficult for a single model to grasp the underlying trends in it. In such situations, we combine several individual models to improve performance. This is what is called ensemble learning.
What are the main parameters in the gradient boosting model?
learning_rate=0.1 (shrinkage). n_estimators=100 (number of trees). max_depth=3. min_samples_split=2. min_samples_leaf=1. subsample=1.0.
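As a minimal sketch (assuming scikit-learn; the dataset here is synthetic), these parameters map directly onto GradientBoostingClassifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)

model = GradientBoostingClassifier(
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    n_estimators=100,     # number of boosting stages (trees)
    max_depth=3,          # depth of each individual tree
    min_samples_split=2,  # minimum samples required to split a node
    min_samples_leaf=1,   # minimum samples required at a leaf node
    subsample=1.0,        # fraction of samples used to fit each tree
    random_state=42,
)
model.fit(X, y)
print(model.score(X, y))
```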
What is an ANOVA test?
Analysis of Variance (ANOVA) is a statistical test that we perform to examine the relationship between two or more groups. The ANOVA test is basically used to compare the means of different samples/groups so that we can conclude which group among all the groups is the most effective.
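A minimal sketch, assuming SciPy is available; the three groups below are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)
group_c = rng.normal(loc=7.0, scale=1.0, size=30)

# One-way ANOVA: tests whether the group means are all equal.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests that at least one group mean differs from the others.
```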
Explain What a Recommender System Does.
A recommender system uses historical behavior to predict how a user will rate a specific item.
What Is the Difference Between a Type I and Type II Error?
A type I error is a false positive, which means that a positive result was predicted but the result is negative. A type II error is a false negative, which means that a negative result was predicted but the actual result is positive.
What is bias in Data Science?
Bias is a type of error that occurs in a Data Science model because of using an algorithm that is not strong enough to capture the underlying patterns or trends that exist in the data. In other words, this error occurs when the data is too complicated for the algorithm to understand, so it ends up building a model that makes simple assumptions. This leads to lower accuracy because of underfitting. Algorithms that can lead to high bias are linear regression, logistic regression, etc.
What is clustering? When do we need it?
Clustering algorithms group objects such that similar feature points are put into the same groups (clusters) and dissimilar feature points are put into different clusters.
What Is Collaborative Filtering?
Collaborative filtering is a recommendation technique that uses similarities between different users to make recommendations.
What is Data Science?
Data Science is a field of computer science that explicitly deals with turning data into information and extracting meaningful insights out of it.
What defines the analysis of data objects not complying with general data behaviour?
Outlier analysis (anomaly detection)
When do we need to perform feature normalization for linear models? When it's okay not to do it?
Feature normalization is necessary for L1 and L2 regularization. The idea of both methods is to penalize all the features relatively equally, which cannot be done effectively if every feature is scaled differently. Linear regression without regularization can be used without feature normalization. Regularization can also make the analytical solution more stable: it adds the regularization matrix to the feature matrix before inverting it.
What is entropy in a decision tree algorithm?
In a decision tree algorithm, entropy is the measure of impurity or randomness. The entropy of a given dataset tells us how pure or impure the values of the dataset are. In simple terms, it tells us about the variance in the dataset.
Which regularization techniques do you know?
L1 Regularization (Lasso regularization) - Adds the sum of absolute values of the coefficients to the cost function. L2 Regularization (Ridge regularization) - Adds the sum of squares of coefficients to the cost function.
What does it mean when the p-values are high and low?
A low p-value (≤ 0.05) means that the null hypothesis can be rejected: the data is unlikely under a true null. A high p-value (≥ 0.05) indicates strength in favor of the null hypothesis: the data is likely under a true null. A p-value of exactly 0.05 is on the boundary, so the hypothesis could go either way.
What are MSE and RMSE?
MSE stands for Mean Square Error while RMSE stands for Root Mean Square Error. They are metrics with which we can evaluate models.
What are necessary conditions for weakly stationary time series data?
The mean does not depend on time. The autocovariance of moments s and t depends only on the difference |s - t|. A weakly stationary time series is a finite-variance process.
Difference between Normalisation and Standardization?
Normalization, also known as min-max scaling, is a technique where all the data values are rescaled so that they lie between 0 and 1. Standardization, in contrast, rescales the data to have a mean of 0 and a standard deviation of 1. So, while normalization squeezes the data into the range from 0 to 1, standardization centers and scales it around the mean.
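A minimal sketch contrasting the two, assuming scikit-learn and a small hypothetical array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

normalized = MinMaxScaler().fit_transform(X)      # rescaled into [0, 1]
standardized = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1

print(normalized.ravel())
print(standardized.ravel())
```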
How can we deal with outliers?
Outliers can be dealt with in several ways. One way is to drop them, but we should only drop outliers whose values are incorrect or extreme. In case the outliers are not that extreme, we can try: a different kind of model (for example, if we were using a linear model, we can choose a non-linear one); normalizing the data, which will shift the extreme values closer to the other data points; or using algorithms that are less affected by outliers, such as random forest.
What does a linear equation having 3 variables represent?
Plane
What is pruning in a decision tree algorithm?
Pruning a decision tree is the process of removing the sections of the tree that are not necessary or are redundant. Pruning leads to a smaller decision tree, which performs better and gives higher accuracy and speed.
What is the ROC curve? When to use it?
ROC stands for Receiver Operating Characteristic. It is a plot that contrasts the true positive rate with the false positive rate across classification thresholds. It is used when we need to predict the probability of a binary outcome.
What are support vector machines (SVMs)?
SVMs, or support vector machines, are used for predictive or classification tasks. They employ what are known as hyperplanes to separate two different classes.
Can you describe the various techniques used for data sampling?
Simple random sampling, systematic sampling, cluster sampling, purposive sampling, quota sampling, and convenience sampling. Sampling is more cost- and time-efficient than studying a full dataset: it lets you analyze a subset of the data, which is easier, while still providing insights into the whole dataset.
How do we handle categorical variables in decision trees?
Some decision tree algorithms can handle categorical variables out of the box, others cannot. However, we can transform categorical variables, e.g. with a binary or a one-hot encoder.
What is supervised learning? (Easy)
Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data.
What is the area under the PR curve? Is it a useful metric?
The Precision-Recall AUC is just like the ROC AUC, in that it summarizes the curve with a range of threshold values as a single score. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.
What Feature Selection Methods Are Used To Select the Right Variables?
The following are some of the techniques used for feature selection in data analysis: Pearson's correlation, chi-square, recursive feature elimination, backward elimination, lasso regression, and ridge regression.
How do we know how many trees we need in random forest?
The number of trees in a random forest is controlled by n_estimators, and a random forest reduces overfitting by increasing the number of trees. There is no fixed rule of thumb for deciding the number of trees; it is instead tuned on the data, typically by starting from a reasonable baseline and tuning until we get the optimal results.
How is time series different from the usual regression problem?
The principle behind causal forecasting is that the value to be predicted is dependent on the input features (causal factors). In time series forecasting, the value to be predicted is expected to follow a certain pattern over time.
Why is data with high dimensions considered hard to deal with?
The reason why data with high dimensions is considered so difficult to deal with is that it leads to high time consumption while processing the data and training a model on it. Reducing dimensions speeds up this process, removes noise, and also leads to better model accuracy.
Define and explain selection bias?
Selection bias occurs when the researcher has to decide which participants to study, and it is associated with research in which the selection of participants is not random. Sampling bias: as a result of a non-random population, some members of a population have a smaller chance of being included than others, resulting in a biased sample; this causes a systematic error known as sampling bias. Time interval: trials may be stopped early when an extreme value is reached, but if all variables are similar in variance, the variables with the highest variance have a higher chance of achieving the extreme value. Data: specific data is selected arbitrarily and the generally agreed criteria are not followed. Attrition: attrition in this context means the loss of participants, i.e. discounting those subjects that did not complete the trial.
What do you understand by Survivorship Bias?
This bias refers to the logical error of focusing on aspects that survived some process while overlooking those that did not because of their lack of prominence. This bias can lead to wrong conclusions.
How Can Time-Series Data Be Declared As Stationary?
Declaring time series data stationary implies that its statistical properties do not change over time. This may be because there are no time-based or seasonal trends in the data.
Precision-recall trade-off
A trade-off means that increasing one parameter leads to decreasing the other. The precision-recall trade-off occurs because, for the same model, increasing precision or recall comes at the expense of the other. In an ideal scenario with perfectly separable data, both precision and recall can reach their maximum value of 1.0. But in most practical situations, there is noise in the dataset and the data is not perfectly separable. There might be some points of the positive class closer to the negative class and vice versa. In such cases, shifting the decision boundary can increase either precision or recall, but not both.
What Is Variance in Data Science?
Variance measures how spread out the values in a dataset are around the mean; it is the average of the squared deviations of the values from their mean.
What is precision?
When we implement algorithms for the classification of data or the retrieval of information, precision is the proportion of predicted positives that are actually positive. In other words, it measures the accuracy of the positive predictions. Precision = true positives / (true positives + false positives).
How do we check if a variable follows the normal distribution?
1. Plot a histogram of the sampled data. If you can fit the bell-shaped "normal" curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution cannot be rejected. 2. Check the skewness and kurtosis of the sampled data. Skewness = 0 and kurtosis = 3 are typical for a normal distribution, so the farther away they are from these values, the more non-normal the distribution. 3. Use the Kolmogorov-Smirnov and/or Shapiro-Wilk tests for normality. They take into account both skewness and kurtosis simultaneously. 4. Check the quantile-quantile (Q-Q) plot, a scatterplot created by plotting two sets of quantiles against one another. A normal Q-Q plot places the data points in a roughly straight line.
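A minimal sketch of checks 2 and 3, assuming SciPy and a synthetic sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

print("skewness:", stats.skew(sample))                    # ~0 for normal data
print("kurtosis:", stats.kurtosis(sample, fisher=False))  # ~3 for normal data

shapiro_stat, shapiro_p = stats.shapiro(sample)
ks_stat, ks_p = stats.kstest(sample, "norm")
print("Shapiro-Wilk p-value:", shapiro_p)
print("Kolmogorov-Smirnov p-value:", ks_p)
# Large p-values mean we cannot reject the hypothesis of normality.
```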
Can you explain the difference between a Validation Set and a Test Set?
A validation set can be considered a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. A test set, on the other hand, is used for testing or evaluating the performance of a trained machine learning model. In simple terms, the training set is used to fit the parameters (i.e. the weights), and the test set is used to assess the performance of the model, i.e. to evaluate its predictive power and generalization.
What's the normal distribution? Why do we care about it?
A normal distribution is a probability distribution where the values are symmetric on either side of the mean of the data. This implies that values closer to the mean are more common than values that are further away from it. The normal distribution derives its importance from the Central Limit Theorem, which states that if we draw a large enough number of samples, their mean will follow a normal distribution regardless of the initial distribution of the sample. It is important that each sample is independent from the other. This is powerful because it helps us study processes whose population distribution is unknown to us.
What is the PR (precision-recall) curve?
A precision-recall curve (or PR Curve) is a plot of the precision (y-axis) and the recall (x-axis) for different probability thresholds. Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance.
What is sigmoid? What does it do?
A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.
What is the confusion table? What are the cells in this table?
A table that is used to estimate the performance of a model. True positives (TP): the actual class of the observation is 1 (True) and the prediction is 1 (True). True negatives (TN): the actual class of the observation is 0 (False) and the prediction is 0 (False). False positives (FP): the actual class of the observation is 0 (False) and the prediction is 1 (True). False negatives (FN): the actual class of the observation is 1 (True) and the prediction is 0 (False).
What is a time series?
A time series is a set of observations ordered in time usually collected at regular intervals.
Why Is A/B Testing Conducted?
A/B testing gives businesses the ability to peek into customers' minds and get an idea of their preferences. You can quantify the amount of interest that different offerings garner in test groups, which lets you go to market with the final product with confidence.
What kind of regularization techniques are applicable to linear models?
AIC/BIC, Ridge regression, Lasso, Elastic Net, Basis pursuit denoising, Rudin-Osher-Fatemi model (TV), Potts model, RLAD, Dantzig Selector, SLOPE
How to interpret the AU ROC score?
The AUC score is the value of the area under the ROC curve. An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has a poor measure of separability. When the AUC score is 0.5, the model has no class-separation capacity whatsoever.
What is AUC (AU ROC)? When to use it?
AUC stands for Area Under the ROC Curve. ROC is a probability curve and AUC represents the degree or measure of separability. It is used when we need to evaluate how well a model can distinguish between classes. The value is between 0 and 1; the higher, the better.
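A minimal sketch, assuming scikit-learn, with a synthetic binary classification dataset and a logistic regression model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
print("ROC AUC:", roc_auc_score(y_test, probs))
```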
Is accuracy always a good metric?
Accuracy is not a good performance metric when there is an imbalance in the dataset. For example, in binary classification with 95% of class A and 5% of class B, a constant prediction of class A would have an accuracy of 95%. For an imbalanced dataset, we need to choose precision, recall, or the F1 score depending on the problem we are trying to solve.
What is the difference between an error and a residual error?
An error is the difference between the observed values and the true values of a dataset, whereas a residual is the difference between the observed values and the predicted values. The reason we use the residual error to evaluate the performance of an algorithm is that the true values are never known. Hence, we use the observed values to measure the error using residuals. This gives us an accurate estimate of the error.
What Is a Multi-layer Perceptron(MLP)?
Like other neural networks, MLPs have an input layer, a hidden layer, and an output layer. An MLP has the same structure as a single-layer perceptron but with one or more hidden layers. A single-layer perceptron can classify only linearly separable classes with binary output (0, 1), whereas an MLP can also classify non-linear classes.
Can you cite some examples where a false negative is more important than a false positive?
Assume there is an airport 'A' which has received high-security threats and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan passengers being predicted as risk positives by their predictive model. What will happen if a true threat customer is being flagged as non-threat by airport model?
Explain bagging in Data Science.
Bagging is an ensemble learning method; it stands for bootstrap aggregating and is used to reduce the amount of variance in a noisy dataset. In this technique, we generate data using the bootstrap method, in which we take an existing dataset and generate multiple samples of size N from it. This bootstrapped data is then used to train multiple models in parallel, which makes the bagging model more robust than a simple model. Once all the models are trained, at prediction time we make predictions with all the trained models and then average the results in the case of regression; for classification, we choose the result generated by the models that has the highest frequency (majority vote).
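A minimal sketch, assuming scikit-learn, where the base model is a decision tree trained on bootstrapped samples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# By default each base model is a decision tree trained on a bootstrap sample.
bagging = BaggingClassifier(
    n_estimators=50,   # number of models trained on bootstrapped samples
    bootstrap=True,    # sample the training data with replacement
    random_state=0,
)
bagging.fit(X, y)
print(bagging.score(X, y))
```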
What is the bias-variance trade-off?
Bias is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model. Variance is a model sensitivity to changes in the training dataset. Bias-variance trade-off is a relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible. As a model complexity increases, the bias decreases and the variance increases which leads to overfitting. And vice versa, model simplification helps to decrease the variance but it increases the bias which leads to underfitting.
Explain boosting in Data Science.
Boosting is one of the ensemble learning methods and is used to strengthen weak learners. Unlike bagging, it does not train models in parallel. In boosting, we create multiple models and train them sequentially, combining weak models iteratively so that training a new model depends on the models trained before it. In doing so, we take the patterns learned by a previous model into account when training the new model. In each iteration, we give more importance to observations in the dataset that were incorrectly handled or predicted by previous models. Boosting is also useful for reducing bias in models.
What do we do with categorical variables?
Categorical variables must be encoded before they can be used as features to train a machine learning model. There are various encoding techniques, including: one-hot encoding, label encoding, ordinal encoding, and target encoding.
Can we use Chi square with Numerical dataset? If yes, give example. If no, give Reason?
Chi-square is mostly used on categorical features. It can be used on numerical features, but since the chi-square test is based on frequencies, the numerical data would first need to be split into categories, for example by binning.
Where we can use chi square and have used this test anywhere in your application?
The chi-square test is a statistical test that we use to understand the relationship between categorical features. In ML, we use this test as a feature selection technique.
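A minimal sketch of chi-square feature selection, assuming scikit-learn and the iris dataset (whose features are non-negative, as chi-square requires):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # all feature values are non-negative
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 best-scoring features
X_selected = selector.fit_transform(X, y)

print("chi-square scores:", selector.scores_)
print("selected shape:", X_selected.shape)
```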
What is classification? Which models would you use to solve a classification problem?
Classification problems are problems in which our prediction space is discrete, i.e. there is a finite number of values the output variable can be. Some models which can be used to solve classification problems are: logistic regression, decision tree, random forests, multi-layer perceptron, one-vs-all, amongst others.
What is Correlation?
Correlation is a technique for measuring and estimating the quantitative relationship between two variables. It measures how strongly two variables are related.
Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e. a validation dataset) in order to limit problems like overfitting and to get an insight into how the model will generalize to an independent dataset.
What Is Cross-Validation?
Cross-validation is a technique that demonstrates the efficacy of a machine learning model when analyzing unseen data. It tells us whether the model performs as well on test data as it did on training data.
Can you explain how cross-validation works?
Cross-validation is the process of separating your total training set into two subsets, a training set and a validation set, and evaluating your model on the validation set to choose the hyperparameters. You do this process iteratively, selecting different training and validation sets, in order to reduce the bias you would have by selecting only one validation set.
What are some of the techniques used for sampling? What is the main advantage of sampling?
Data analysis cannot be done on the whole volume of data at a time, especially when it involves large datasets. It becomes crucial to take data samples that can represent the whole population and then perform analysis on them. While doing this, it is very important to carefully take a sample that truly represents the entire dataset. Probability sampling techniques: clustered sampling, simple random sampling, stratified sampling. Non-probability sampling techniques: quota sampling, convenience sampling, snowball sampling, etc.
What is the curse of dimensionality? Why do we care about it?
Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Additional dimensions spread the data even further making high dimensional data extremely sparse. We care about it, because it is difficult to use machine learning in sparse spaces.
What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?
Prices are usually not normally distributed. Real-world or uncleaned datasets almost always have a certain skewness, and the same goes for price prediction: the price of houses, or of anything else under consideration, depends on a number of factors, so there is a good chance of skewed values, i.e. outliers in data science terms. A common pre-processing step is therefore to apply a log transformation to the prices to make their distribution closer to normal.
What is the difference between data analytics and data science?
Data science involves transforming data using various technical analysis methods to extract meaningful insights, which a data analyst can then apply to business scenarios. Data analytics deals with checking existing hypotheses and information, and answers questions for a better and more effective business-related decision-making process.
What's Deep Learning?
Deep learning is a subset of machine learning concerned with supervised, unsupervised, and semi-supervised learning based on artificial neural networks.
How do we evaluate classification models?
Depending on the classification problem, we can use the following evaluation metrics: accuracy, precision, recall, F1 score, logistic loss (also known as cross-entropy loss), and the Jaccard similarity coefficient score.
How do you approach tuning parameters in XGBoost or LightGBM?
Depending upon the dataset, parameter tuning can be done manually or by using hyperparameter optimization frameworks such as optuna and hyperopt. In manual parameter tuning, we need to be aware of max_depth, min_samples_leaf and min_samples_split so that our model does not overfit the data but instead predicts generalized characteristics of the data (basically keeping both variance and bias low for our model).
What is dimensionality reduction?
Dimensionality reduction is the process of converting a dataset with a high number of dimensions (fields) to a dataset with a lower number of dimensions. This is done by dropping some fields or columns from the dataset. However, this is not done haphazardly. In this process, the dimensions or fields are dropped only after making sure that the remaining information will still be enough to succinctly describe similar information.
What is the benefit of dimensionality reduction?
Dimensionality reduction reduces the dimensions and size of the entire dataset. It drops unnecessary features while retaining the overall information in the data intact. Reduction in dimensions leads to faster processing of the data.
How to select K for K-means?
Domain knowledge: an expert may know the value of k. Elbow method: compute the clusters for different values of k; for each k, calculate the total within-cluster sum of squares, plot the sum against the number of clusters, and use the bend (elbow) in the curve as the number of clusters. Average silhouette method: compute the clusters for different values of k; for each k, calculate the average silhouette of observations, plot the silhouette against the number of clusters, and select the maximum as the number of clusters.
What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?
Epoch - Represents one pass over the entire dataset (everything put into the training model). Batch - Refers to the practice of dividing the dataset into several batches when we cannot pass the entire dataset into the neural network at once. Iteration - one update step on a single batch; if we have 10,000 images as data and a batch size of 200, then an epoch runs 50 iterations (10,000 divided by 200).
What is the F1 score and how to calculate it?
The F1 score is the harmonic mean of precision and recall and summarizes the test's accuracy. If F1 = 1, both precision and recall are perfect. The closer F1 is to 0, the worse precision, recall, or both are.
What is feature selection? Why do we need it?
Feature Selection is a method used to select the relevant features for the model to train on. We need feature selection to remove the irrelevant features which leads the model to under-perform.
What is gradient boosting trees?
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
What is gradient descent? How does it work?
Gradient descent is an algorithm that uses the calculus concept of the gradient to try to reach a local or global minimum. It takes the negative of the gradient of a function at a point and repeatedly updates that point using the calculated negative gradient, until the algorithm reaches a local or global minimum, at which point further iterations return values that are equal or very close to the current point.
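A minimal sketch in plain Python, using a simple one-dimensional function whose minimum is known:

```python
# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the minimum is at x = 3.
def gradient(x):
    return 2 * (x - 3)

x = 10.0             # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * gradient(x)  # step in the direction of the negative gradient

print(x)  # ends up very close to 3
```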
Which hyper-parameter tuning strategies (in general) do you know?
Grid search is an exhaustive approach: for each hyper-parameter, the user manually provides a list of values for the algorithm to try. Grid search then evaluates the algorithm using each and every combination of hyper-parameters and returns the combination that gives the optimal result (e.g. the lowest MAE). Because grid search evaluates the given algorithm using all combinations, it can be quite computationally expensive, and it can lead to sub-optimal results since the user needs to specify the exact values of these hyper-parameters, which is prone to error and requires domain knowledge.
How is the grid search parameter different from the random search tuning strategy?
Grid search: every combination of a preset list of hyperparameters is tried out and evaluated. The search pattern is similar to searching in a grid, where the values form a matrix and a search is performed over it. Each parameter set is tried out and its accuracy is tracked. After every combination has been tried, the model with the highest accuracy is chosen as the best one. The main drawback is that the technique suffers as the number of hyperparameters increases: the number of evaluations can grow exponentially with each additional hyperparameter. This is called the problem of dimensionality in a grid search. Random search: in this technique, random combinations of hyperparameter values are tried and evaluated to find the best solution. To optimize the search, the function is tested at random configurations in the parameter space. Because the pattern followed is random, there is an increased chance of finding good parameters without exhaustively trying every combination. This search works best when there is a lower number of dimensions, as it then takes less time to find the right set.
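A minimal sketch contrasting the two strategies, assuming scikit-learn and SciPy, with a small hypothetical parameter space:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Grid search: tries every combination in the grid (9 combinations here).
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
print("grid search best:", grid.best_params_)

# Random search: samples only n_iter random combinations from the distributions.
param_dist = {"n_estimators": randint(50, 300), "max_depth": [3, 5, 7, None]}
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_dist, n_iter=5, cv=3, random_state=0
)
rand.fit(X, y)
print("random search best:", rand.best_params_)
```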
Is it good to do dimensionality reduction before fitting a Support Vector Model?
If the number of features is greater than the number of observations, then performing dimensionality reduction generally improves the SVM (support vector machine).
Why do we need one-hot encoding?
If we simply encode categorical variables with a label encoder, they become ordinal, which can lead to undesirable consequences: linear models will treat the category with id 4 as twice as large as the category with id 2. One-hot encoding allows us to represent a categorical variable in a numerical vector space in which the vectors of all categories are equidistant from each other. This approach is not suited for all situations, because using it with categorical variables of high cardinality (e.g. customer id) runs into problems caused by the curse of dimensionality.
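A minimal sketch, assuming pandas, on a hypothetical categorical column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "Tokyo", "Paris"]})
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
# Each category becomes its own 0/1 column, so no artificial ordering is implied.
```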
What is Covariance?
Covariance measures the extent to which two random variables vary together. It is a statistical term that describes the systematic relationship between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other.
What is k-fold cross-validation?
In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop over the entire dataset k times. In each iteration of the loop, one of the k parts is used for testing, and the other k − 1 parts are used for training. Using k-fold cross-validation, each one of the k parts of the dataset ends up being used for training and testing purposes.
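A minimal sketch, assuming scikit-learn, with k = 5 and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 equal parts
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())   # one score per fold, plus the average
```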
Can you cite some examples where both false positive and false negatives are equally important?
In the Banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses. Banks don't want to lose good customers and at the same point in time, they don't want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.
Can you cite some examples where a false positive is more important than a false negative?
In the medical field, assume you have to give chemotherapy to patients. Assume a patient comes to that hospital and he is tested positive for cancer, based on the lab prediction but he actually doesn't have cancer. This is a case of false positive. Here it is of utmost danger to start chemotherapy on this patient when he actually does not have cancer. In the absence of cancerous cell, chemotherapy will do certain damage to his normal healthy cells and might lead to severe diseases, even cancer.
Define fsck.
It is an abbreviation for "file system check." This command can be used to search for possible errors in the file system.
What is a recall?
Recall is the proportion of actual positive instances that are correctly predicted as positive. Recall helps us identify misclassified positive predictions. We use the formula: Recall = true positives / (true positives + false negatives).
What is K-fold cross-validation?
K-fold cross-validation is a method of cross-validation where we select a hyperparameter k. The dataset is divided into k parts. We take the 1st part as the validation set and the remaining k-1 parts as the training set; then we take the 2nd part as the validation set and the remaining k-1 parts as the training set, and so on. In this way, each part is used as the validation set once, while the remaining k-1 parts are taken together and used as the training set. It should not be used on time series data.
K-Means Clustering vs Linear Regression vs K-NN (K-Nearest Neighbor) vs Decision Trees: Which Machine Learning Algorithms Can Be Used for Imputing Missing Values of Both Categorical and Continuous Variables?
K-NN algorithms work best when it comes to imputing values for categorical and continuous data.
Define the terms KPI, lift, model fitting, robustness and DOE.
KPI: KPI stands for Key Performance Indicator that measures how well the business achieves its objectives. Lift: This is a performance measure of the target model measured against a random choice model. Lift indicates how good the model is at prediction versus if there was no model. Model fitting: This indicates how well the model under consideration fits given observations. Robustness: This represents the system's capability to handle differences and variances effectively. DOE: stands for the design of experiments, which represents the task design aiming to describe and explain information variation under hypothesized conditions to reflect variables.
What do you understand by a kernel trick?
Kernel functions are generalized dot-product functions used for computing the dot product of vectors x and y in a high-dimensional feature space. The kernel trick is a method for solving a non-linear problem with a linear classifier by transforming linearly inseparable data into data that is separable in a higher dimension.
How does L2 regularization look like in a linear model?
L2 regularization adds a penalty term to the cost function that is equal to the sum of squares of the model coefficients multiplied by a lambda hyperparameter. This technique keeps the coefficients close to zero and is widely used when we have many features that might correlate with each other.
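A minimal sketch, assuming scikit-learn, where the alpha parameter of Ridge plays the role of the lambda hyperparameter:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

model = Ridge(alpha=1.0)   # larger alpha shrinks the coefficients more strongly
model.fit(X, y)
print(model.coef_)         # coefficients are shrunk towards zero, but not exactly zero
```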
What is linear regression?
Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y). Simple: y = B0 + B1*x1. Multiple: y = B0 + B1*x1 + ... + Bn*xn.
What is logistic regression? When do we need to use it?
Logistic regression is a Machine Learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g. True and False, "spam" and "not spam", "churn" and "not churn", and so on. Such a variable is said to be "binary" or "dichotomous".
Differentiate between the long and wide format data.
Long format data: each row of the data represents one-time information about a subject; each subject has its data in multiple rows. The data can be recognized by considering rows as groups. Wide format data: the repeated responses of a subject are placed in separate columns. The data can be recognized by considering columns as groups.
How do you select the number of trees in the gradient boosting model?
Most implementations of gradient boosting are configured by default with a relatively small number of trees, such as hundreds or thousands. Using scikit-learn, we can perform a grid search over the n_estimators model parameter.
Explain Naive Bayes
Naive Bayes is a classification algorithm that works on the assumption that every feature under consideration is independent. It is called naive because of that very same assumption, which is often unrealistic for data in the real world.
What kind of problems neural nets can solve?
Neural nets are good at solving non-linear problems. Some good examples are problems that are relatively easy for humans (because of experience, intuition, understanding, etc), but difficult for traditional regression models: speech recognition, handwriting recognition, image identification, etc.
Can we use L2 regularization for feature selection?
No, because L2 regularization does not make the weights zero; it only makes them very small. L2 regularization can, however, be used to address multicollinearity, since it stabilizes the model.
What is the normal equation?
Normal equations are equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); normal equations allow one to estimate the parameters of a multiple linear regression.
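A minimal sketch in NumPy, estimating the coefficients with the normal equation beta = (X^T X)^(-1) X^T y on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Normal equation: beta = (X^T X)^(-1) X^T y
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta_hat)   # close to [1, 2, -3]
```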
How can we know which features are more important for the decision tree model?
Often, we want to find a split such that it minimizes the sum of the node impurities. The impurity criterion is a parameter of decision trees. Popular methods to measure the impurity are the Gini impurity and the entropy describing the information gain.
How to validate your models?
One of the most common approaches is splitting data into train, validation and test parts. Models are trained on train data, hyperparameters (for example early stopping) are selected based on the validation data, the final measurement is done on test dataset. Another approach is cross-validation: split dataset into K folds and each time train models on training folds and measure the performance on the validation folds. Also you could combine these approaches: make a test/holdout dataset and do cross-validation on the rest of the data. The final quality is measured on test dataset.
Drawbacks of Linear Regression
Only the mean of the dependent variable is taken into consideration. It assumes that the data is independent. The method is sensitive to outlier data values.
List down the conditions for Overfitting and Underfitting.
Overfitting: the model performs well only on the sample training data. If any new data is given as input, the model fails to produce accurate results. This condition occurs due to low bias and high variance in the model. Decision trees are more prone to overfitting. Underfitting: here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it does not perform well even on the training data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.
What is a p-value?
The p-value is a measure of the statistical significance of an observation: it is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. We compute the p-value from the test statistic of a model, and it helps us decide whether we can accept or reject the null hypothesis.
Do you know how K-means works?
1. Partition the points into k subsets. 2. Compute the seed points as the new centroids of the clusters of the current partitioning. 3. Assign each point to the cluster with the nearest seed point. 4. Go back to step 2, or stop when the assignments no longer change.
What's the difference between L2 and L1 regularization?
Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared. Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not. Computational efficiency: L2 has an analytical solution, while L1 does not. Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.
What are precision, recall, and F1-score?
Precision and recall are classification evaluation metrics: P = TP / (TP + FP) and R = TP / (TP + FN), where TP is true positives, FP is false positives and FN is false negatives. In both cases a score of 1 is the best: we get no false positives or false negatives, only true positives. F1 combines precision and recall in one score (their harmonic mean): F1 = 2 * P * R / (P + R). The maximum F1 score is 1 and the minimum is 0, with 1 being the best.
Do you know any dimensionality reduction techniques?
Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE).
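A minimal sketch of PCA, assuming scikit-learn and the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)       # share of variance kept by each component
```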
Why do we clean data?
Properly cleaned data increases the accuracy of the model and yields better predictions. If the dataset is very large, it becomes cumbersome to run models on it, and the data cleanup step can take a lot of time (around 80% of the total time). Cleaning the data before running the model therefore results in increased speed and efficiency of the model. Data cleaning also helps identify and fix structural issues in the data, removes duplicates, and helps maintain the consistency of the data.
What is RMSE?
RMSE stands for the root mean square error. It is a measure of accuracy in regression. RMSE allows us to calculate the magnitude of error produced by a regression model. The way RMSE is calculated is as follows: First, we calculate the errors in the predictions made by the regression model. For this, we calculate the differences between the actual and the predicted values. Then, we square the errors. After this step, we calculate the mean of the squared errors, and finally, we take the square root of the mean of these squared errors. This number is the RMSE, and a model with a lower value of RMSE is considered to produce lower errors, i.e., the model will be more accurate.
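A minimal sketch in NumPy following the steps above (errors, squared errors, their mean, and finally the square root), on hypothetical values:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.5, 5.0, 8.0, 11.0])

errors = actual - predicted          # differences between actual and predicted values
mse = np.mean(errors ** 2)           # mean of the squared errors
rmse = np.sqrt(mse)                  # square root of the MSE
print(mse, rmse)
```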
What is random forest?
Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem).
What are the problems with using trees for solving time series problems?
Random forest models are not able to extrapolate time series data or capture increasing/decreasing trends. If the validation data has values greater than those in the training data, the model will only predict values close to the average of the training data points.
What's the difference between random forest and gradient boosting?
Random Forests builds each tree independently while Gradient Boosting builds one tree at a time. Random Forests combine results at the end of the process (by averaging or "majority rules") while Gradient Boosting combines results along the way.
Why do we need randomization in random forest?
Random forest is an extension of the bagging algorithm: it takes random samples from the training dataset (with replacement), trains several models, and averages their predictions. In addition, each time a split in a tree is considered, random forest takes a random subset of m features out of the full set of n features and uses this subset as candidates for the split (for example, m = sqrt(n)). Training decision trees on random samples of the training data reduces variance; sampling features for each split decorrelates the trees.
What is better - random forest or multiple decision trees?
Random forest is better than multiple decision trees, as random forests are much more robust, more accurate, and less prone to overfitting: they are an ensemble method in which multiple weak decision trees combine to learn strongly.
What is regression? Which models can you use to solve a regression problem? (Easy)
Regression is a part of supervised ML. Regression models investigate the relationship between a dependent variable (target) and independent variable(s) (predictors). Models that can be used to solve regression problems include linear regression, ridge and lasso regression, decision trees, random forests, gradient boosting, and neural networks.
What is regularization? Why do we need it?
Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.
How do we select the right regularization parameters?
Regularization parameters can be chosen using a grid search. For example, in scikit-learn the regularization strength appears as the alpha parameter; alpha can be found by running a random search or a grid search over a set of values and selecting the alpha that gives the lowest cross-validation or validation error.
When is resampling done?
Resampling is a methodology used to sample data for improving accuracy and quantify the uncertainty of population parameters. It is done to ensure the model is good enough by training the model on different patterns of a dataset to ensure variations are handled. It is also done in the cases where models need to be validated using random subsets or when substituting labels on data points while performing tests.
How can we select an appropriate value of k in k-means clustering?
Selecting the correct value of k is an important aspect of k-means clustering. We can use the elbow method to pick an appropriate k value. To do this, we run the k-means algorithm for a range of values, e.g. 1 to 15. For each value of k, we compute a score called the inertia, i.e. the within-cluster sum of squares: the sum of squared distances of all points in a cluster to their cluster center. As k increases from a low value, we initially see a sharp decrease in the inertia. After a certain value of k, the drop in inertia becomes quite small; this is the value of k we should choose for the k-means clustering algorithm.
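A minimal sketch of the elbow method, assuming scikit-learn and a synthetic dataset with 4 blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia: within-cluster sum of squared distances
# Plotting inertia against k and looking for the "elbow" suggests k = 4 here.
```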
Which models do you know for solving time series problems?
Simple exponential smoothing: approximates the time series with an exponential function. Trend-corrected exponential smoothing (Holt's method): exponential smoothing that also models the trend. Trend- and seasonality-corrected exponential smoothing (Holt-Winters method): exponential smoothing that also models trend and seasonality. Time series decomposition: decomposes a time series into four components: trend, seasonal variation, cyclic variation, and an irregular component. Autoregressive models: similar to multiple linear regression, except that the dependent variable y_t depends on its own previous values rather than on other independent variables. Deep learning approaches (RNN, LSTM, etc.).
How do we train decision trees?
1. Start at the root node. 2. For each variable X, find the split set S that minimizes the sum of the node impurities in the two child nodes, and choose the split {X*, S*} that gives the minimum over all X and S. 3. If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn.
How Would You Approach a Dataset That's Missing More Than 30 Percent of Its Values?
The approach will depend on the size of the dataset. If it is a large dataset, then the quickest method would be to simply remove the rows containing the missing values. Since the dataset is large, this won't affect the ability of the model to produce results. If the dataset is small, then it is not practical to simply eliminate the values. In that case, it is better to calculate the mean or mode of that particular feature and input that value where there are missing entries. Another approach would be to use a machine learning algorithm to predict the missing values. This can yield accurate results unless there are entries with a very high variance from the rest of the dataset.
How do we select the depth of the trees in random forest?
The greater the depth, the more information is extracted from the tree. However, there is a limit to this: even though the algorithm is defensive against overfitting, it may learn complex features of the noise present in the data and, as a result, overfit on noise. Hence, there is no hard rule of thumb for deciding the depth, but the literature suggests a few tips on tuning the depth of the trees to prevent overfitting: limit the maximum depth of a tree; limit the number of test nodes; limit the minimum number of objects at a node required to split; do not split a node when at least one of the resulting subsample sizes is below a given threshold; stop developing a node if it does not sufficiently improve the fit.
Why do we need to split our data into three parts: train, validation, and test?
The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never "seen" before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.
What are the assumptions of linear regression?
There are several assumptions of linear regression. If any of them is violated, model predictions may be inaccurate and misleading. 1. There is a linear relationship between the features and the target variable. 2. Additivity: the effect of a change in one feature on the target variable does not depend on the values of the other features. 3. Features are not correlated (no collinearity), since it can be difficult to separate the individual effects of collinear features on the target variable. 4. Errors are independently and identically normally distributed (yi = B0 + B1*x1i + ... + errori): i. No correlation between errors (consecutive errors in the case of time series data). ii. Constant variance of errors (homoscedasticity); for example, in time series, seasonal patterns can increase errors in seasons with higher activity. iii. Errors are normally distributed, otherwise some features will have more influence on the target variable than others. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
How do we choose K in K-fold cross-validation? What's your favorite K?
There are two things to consider when deciding on K: the number of models we get and the size of the validation set. We do not want the number of models to be too small, like 2 or 3: at least 4 models give a less biased estimate of the metrics. On the other hand, we would want each validation set to be at least 20-25% of the entire data, so that a ratio of at least 3:1 between training and validation set is maintained. I tend to use K = 4 for small datasets and K = 5 for large ones.
What are the decision trees?
This is a type of supervised learning algorithm that is mostly used for classification problems. It works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes/independent variables, to make the groups as distinct as possible. A decision tree is a flowchart-like tree structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a value for the target variable. Various splitting criteria are used, such as Gini impurity, information gain, chi-square, and entropy.
How to detect if the time series data is stationary?
Time series data is considered stationary when its mean and variance are constant over time. If the variance and mean do not change over a period of time in the dataset, then we can conclude that, for that period, the data is stationary.
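Statistical tests are also commonly used for this check. A minimal sketch, assuming statsmodels is available, of the Augmented Dickey-Fuller (ADF) test on a synthetic random walk:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = rng.normal(size=200).cumsum()     # a random walk, i.e. non-stationary

statistic, p_value = adfuller(series)[:2]  # first two elements of the result tuple
print(f"ADF statistic = {statistic:.2f}, p-value = {p_value:.3f}")
# A large p-value means we cannot reject the null hypothesis of non-stationarity.
```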
How can we handle missing data?
To be able to handle missing data, we first need to know the percentage of data missing in a particular column so that we can choose an appropriate strategy to handle the situation. For example, if in a column the majority of the data is missing, then dropping the column is the best option, unless we have some means to make educated guesses about the missing values. However, if the amount of missing data is low, then we have several strategies to fill them in. One way would be to fill them all in with a default value or the value that has the highest frequency in that column, such as 0 or 1, etc. This may be useful if the majority of the data in that column contains these values. Another way is to fill in the missing values in the column with the mean of all the values in that column. This technique is usually preferred, as the missing values have a higher chance of being closer to the mean than to the mode. Finally, if we have a huge dataset and only a few rows have values missing in some columns, then the easiest and fastest way is to drop those rows. Since the dataset is large, dropping a few rows should not be a problem.
What Are the Steps Involved in Maintaining a Deployed Model?
Train the model using new data values. Choose additional or different features on which to retrain the data. In instances where the model begins to produce inaccurate results, develop a new model.
What is unsupervised learning?
Unsupervised learning aims to detect patterns in data where no labels are given.
What is variance in Data Science?
Variance is a type of error that occurs in a Data Science model when the model ends up being too complex and learns features from data, along with the noise that exists in it. This kind of error can occur if the algorithm used to train the model has high complexity, even though the data and the underlying patterns and trends are quite easy to discover. This makes the model a very sensitive one that performs well on the training dataset but poorly on the testing dataset, and on any kind of data that the model has not yet seen. Variance generally leads to poor accuracy in testing and results in overfitting.
If there's a trend in our series, how we can remove it? And why would we want to do it?
We can explicitly model the trend (and/or seasonality) with approaches such as Holt's method or the Holt-Winters method, or remove the trend by differencing the series. We want to remove the trend to reach the stationarity property of the data: many time series approaches require stationarity, and without it the interpretation of the results of these analyses is problematic.
What happens to our linear regression model if we have three columns in our data: x, y, z — and z is a sum of x and y?
We would not be able to perform the regression, because z is linearly dependent on x and y, so the feature matrix used in the regression would be singular (not invertible).
In which cases AU PR is better than AU ROC?
The difference is that AU ROC looks at the true positive rate (TPR) and the false positive rate (FPR), while AU PR looks at the positive predictive value (PPV) and the true positive rate (TPR). Typically, if true negatives are not meaningful to the problem or you care more about the positive class, AU PR is going to be more useful; otherwise, if you care equally about the positive and negative classes or your dataset is quite balanced, then going with AU ROC is a good idea.
What is information gain in a decision tree algorithm?
When building a decision tree, at each step, we have to create a node that decides which feature we should use to split data, i.e., which feature would best separate our data so that we can make predictions. This decision is made using information gain, which is a measure of how much entropy is reduced when a particular feature is used to split the data. The feature that gives the highest information gain is the one that is chosen to split the data.
What Will Happen If the Learning Rate Is Set inaccurately (Too Low or Too High)?
When the learning rate is too low, training of the model progresses very slowly, as we make only minimal updates to the weights; it will take many updates to reach the minimum point. If the learning rate is set too high, the loss function shows undesirable divergent behaviour due to drastic weight updates: the model may fail to converge or even diverge, because the updates are too chaotic for the network to train.
What is overfitting?
Overfitting is when your model performs very well on the training set but cannot generalize to the test set, because it has adjusted too closely to the training set.
How to deal with unbalanced binary classification?
While doing binary classification, if the dataset is imbalanced, the model's performance cannot be judged correctly using accuracy alone. Instead, use metrics such as precision, recall, F1 score, or AUC, and consider resampling the data (oversampling the minority class or undersampling the majority class) or using class weights.
If the weight for one variable is higher than for another, can we say that this variable is more important?
Yes - if your predictor variables are normalized. Without normalization, the weight represents the change in the output per unit change in the predictor. If you have a predictor with a huge range and scale that is used to predict an output with a very small range - for example, using each nation's GDP to predict maternal mortality rates - your coefficient should be very small. That does not necessarily mean that this predictor variable is not important compared to the others.
Is feature selection important for linear models?
Yes, it is. Feature selection can improve model performance by keeping the most important features and removing irrelevant ones before making a prediction, and it can also help avoid overfitting and underfitting and manage the bias-variance tradeoff.
Is logistic regression a linear model? Why?
Yes, Logistic Regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. Or in other words, the output cannot depend on the product (or quotient, etc.) of its parameters.
Can we use L1 regularization for feature selection?
Yes, because the nature of L1 regularization will lead to sparse coefficients of features. Feature selection can be done by keeping only features with non-zero coefficients.
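A minimal sketch, assuming scikit-learn, showing how Lasso drives some coefficients exactly to zero on synthetic data with only 3 informative features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of features with non-zero coefficients
print(lasso.coef_)
print("selected features:", selected)
```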
Can You Avoid Overfitting Your Model? If Yes, Then How?
Yes, it is possible to avoid overfitting data models. The following techniques can be used for that purpose. Bring more data into the dataset being studied so that it becomes easier to parse the relationships between input and output variables. Use feature selection to identify key features or parameters to be studied. Employ regularization techniques, which reduce the amount of variance in the results that a data model produces. In some cases, a small amount of noisy data is added to datasets to make them more stable; this is known as data augmentation.
Give me a scenario where you can use Z test and T test.
a) We use the Z test when the population standard deviation is known and the sample size is greater than or equal to 30. b) We use the T test when the population standard deviation is not known and the sample size is less than 30; here we use the sample standard deviation. Both tests are used for comparing means.
What are the benefits of a single decision tree compared to more complex models?
Easy to implement, fast training, fast inference, and good explainability.
What are the other clustering algorithms do you know?
k-medoids: takes the most central point instead of the mean value as the center of the cluster, which makes it more robust to noise. Agglomerative Hierarchical Clustering (AHC): builds hierarchical clusters by combining the nearest clusters, starting with each point as its own cluster. DIvisive ANAlysis Clustering (DIANA): hierarchical clustering starting with one cluster containing all points and splitting the clusters until each point describes its own cluster. Density-Based Spatial Clustering of Applications with Noise (DBSCAN): a cluster is defined as a maximal set of density-connected points.
What are the main parameters of the random forest model?
max_depth: the longest path between the root node and a leaf. min_samples_split: the minimum number of observations needed to split a given node. max_leaf_nodes: conditions the splitting of the tree and hence limits its growth. min_samples_leaf: the minimum number of samples in a leaf node. n_estimators: the number of trees. max_samples: the fraction of the original dataset given to any individual tree. max_features: limits the maximum number of features considered by each tree in the random forest model.
What are the main parameters of the decision tree model?
Maximum tree depth, minimum samples per leaf node, and the impurity criterion.
What method best depicts hierarchical data in nested format?
treemaps