Data Science Interview
P-value interpretation and significance
A p-value measures the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis (the parameter estimate is zero, i.e. the independent variable has no effect on the target) is true. The lower the p-value, the greater the statistical significance of the observed difference. A p-value ≤ 0.05 (the typical cutoff) indicates strong evidence against the null hypothesis, so you reject it. A p-value > 0.05 indicates weak evidence against the null hypothesis, so you fail to reject it. A p-value right at the 0.05 cutoff is considered marginal, meaning it could go either way.
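A minimal sketch of how a p-value is obtained and compared against the cutoff in Python, using a one-sample t-test on hypothetical data as the example test (the sample values and null mean are illustrative):

from scipy import stats
import numpy as np

sample = np.array([2.1, 1.8, 2.5, 1.9, 2.3, 2.0])  # hypothetical measurements
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)  # H0: population mean is 0

alpha = 0.05
if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} > {alpha}: fail to reject the null hypothesis")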
What is the difference between Regression and classification ML techniques?
Both regression and classification come under supervised machine learning. In supervised learning, we train the model using a labelled data set: while training, we explicitly provide the correct labels (target variable), and the algorithm tries to learn the mapping from input to output. If the labels are discrete values (e.g. A, B), it is a classification problem; if the labels are continuous values (e.g. 1.23, 1.333), it is a regression problem.
Cook's distance test for outlier
Cook's distance attempts to identify the points that have more influence than other points. Such influential points tend to have a sizable impact on the regression line: adding or removing them can completely change the model statistics. But can these influential observations be treated as outliers? That question can only be answered after looking at the data, so points with large Cook's distance values might require further investigation. Solution: for influential observations that are genuine outliers, if there are not many, you can remove those rows. Alternatively, you can cap the outlier at the maximum reasonable value in the data, or treat those values as missing values.
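A minimal sketch of computing Cook's distance with statsmodels on simulated data; the 4/n cut-off used to flag points is a common rule of thumb, not part of the answer above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)                      # hypothetical data
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)

model = sm.OLS(y, X).fit()
cooks_d, p_values = model.get_influence().cooks_distance

threshold = 4 / len(y)                              # rule-of-thumb cut-off
influential = np.where(cooks_d > threshold)[0]
print("Potentially influential observations:", influential)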
What is correlation and covariance in statistics?
Covariance and correlation are two widely used statistical concepts. Both establish the relationship and measure the dependency between two random variables, but they differ: covariance indicates the direction of the linear relationship and is expressed in the product of the variables' units, so its magnitude is hard to compare across data sets, whereas correlation is the standardized, unit-free version of covariance and always lies between -1 and +1.
What is the difference between a box plot and a histogram?
Histograms and box plots are graphical representations of the distribution of numeric data values. They aim to describe the data and explore its central tendency and variability before applying more advanced statistical techniques. Histograms are better for determining the underlying distribution of the data, while box plots allow you to compare multiple data sets more easily than histograms because they are less detailed and take up less space.
Write the equation and calculate the precision and recall rate.
Precision (positive predictive value) measures the model's ability to classify positive values correctly: Precision = True Positive / (True Positive + False Positive) = 262 / 277 = 0.94. Negative predictive value = TN / (FN + TN) = 347 / (347 + 26). Recall (sensitivity) measures the model's ability to find the actual positive values ("how many of the actual positives does the model catch?"): it is the true positives divided by the total number of actual positives. Recall = True Positive / (True Positive + False Negative) = 262 / 288 ≈ 0.91. F1-score is the harmonic mean of precision and recall; it is useful when you need to take both into account: F1 = 2 * Precision * Recall / (Precision + Recall).
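A short sketch computing these metrics directly from the confusion-matrix counts quoted above (TP = 262, FP = 15, FN = 26, TN = 347):

TP, FP, FN, TN = 262, 15, 26, 347

precision = TP / (TP + FP)                              # 262 / 277 ≈ 0.94
recall = TP / (TP + FN)                                 # 262 / 288 ≈ 0.91
npv = TN / (TN + FN)                                    # 347 / 373 ≈ 0.93
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, NPV={npv:.2f}, F1={f1:.2f}")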
How is logistic regression done?
Like a linear model, a logistic model describes the relationship between a dependent variable (what we want to predict) and independent variables (features). Unlike a linear model, the dependent variable is binary (0 or 1), and the model predicts the probability of the event: the log odds, ln(p / (1 - p)), has a linear relationship with the features, and the sigmoid function translates the log odds back into the probability of the event.
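A minimal sketch of fitting a logistic regression with scikit-learn on a hypothetical binary-outcome dataset; the coefficients are on the log-odds scale, and predict_proba applies the sigmoid:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # hypothetical data

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("coefficients (log-odds scale):", clf.coef_)
print("P(y=1) for first 3 rows:", clf.predict_proba(X[:3])[:, 1])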
How do you find RMSE and MSE in a linear regression model?
MSE is the Mean Squared Error. It is the average of the squared differences between the predicted and actual values over the data set, and measures how close a fitted line is to the actual data points. The smaller the MSE, the closer the fit is to the data; its units are the square of whatever is plotted on the vertical axis. RMSE is the Root Mean Squared Error, simply the square root of the MSE; the lower the RMSE, the better the model fits the data set. MSE = Σ(ŷi - yi)² / n. Worked example: MSE = ((14-12)² + (15-15)² + (18-20)² + (19-16)² + (25-20)² + (18-19)² + (12-16)² + (12-20)² + (15-16)² + (22-16)²) / 10 = 16. This tells us that the average squared difference between the model's predictions and the actual values is 16. The RMSE is then √MSE = √16 = 4. We use RMSE more often because it is measured in the same units as the response variable, whereas MSE is in squared units.
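The same worked example in Python (these values reproduce the MSE of 16 and RMSE of 4 above):

import numpy as np

y_pred = np.array([14, 15, 18, 19, 25, 18, 12, 12, 15, 22])
y_true = np.array([12, 15, 20, 16, 20, 19, 16, 20, 16, 16])

mse = np.mean((y_pred - y_true) ** 2)   # 16.0
rmse = np.sqrt(mse)                     # 4.0
print(mse, rmse)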
What is Machine Learning
There are several ML algorithms. Let me list out the most widely used ones: 1. Ensemble — Random Forest, Bagging and Boosting 2. Neural Networks — Perceptron, Back-Propagation 3. Regularization — LASSO and Ridge 4. Regression — Linear Regression, Logistic Regression, Least square methods 5. Clustering algorithms — k-Means Clustering, Expectation Maximization 6. Instance Based — k-Nearest Neighbour 7. Dimension Reduction — Principal Component Analysis, Linear Discriminant Analysis 8. Decision Tree — Classification and Regression Tree 9. Bayesian — Naive Bayes
How can you select k for k-means
We use the elbow method to select k for k-means clustering: calculate the Within-Cluster Sum of Squared errors (WSS) for a range of values of k, and choose the k at which the decrease in WSS first starts to level off. In the plot of WSS versus k, this is visible as an elbow. WSS = Σ over all data points of (xi - ci)², where xi is a data point and ci is the centroid of the cluster xi is assigned to.
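A minimal sketch of the elbow method with scikit-learn's KMeans on simulated data; inertia_ is the WSS of the fitted clustering:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # hypothetical data

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)    # within-cluster sum of squared distances to centroids

# Plot k vs. WSS and look for the "elbow" where the curve flattens
print(list(zip(range(1, 11), [round(w, 1) for w in wss])))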
What is the law of large numbers
as a sample size grows, its mean gets closer to the average of the whole population.
gradient descent methods
gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient leads toward a local maximum of the function; that procedure is known as gradient ascent.
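A minimal sketch of gradient descent on a simple one-dimensional function, f(x) = (x - 3)^2; the function, learning rate, and number of steps are illustrative choices:

# gradient of f(x) = (x - 3)^2 is 2 * (x - 3)
def grad(x):
    return 2 * (x - 3)

x = 10.0             # starting point
lr = 0.1             # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)   # step opposite to the gradient

print(round(x, 4))   # converges toward the minimum at x = 3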
We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case? Logistic Regression Linear Regression K-means clustering Apriori algorithm
logistic regression
Machine Learning types
supervised and unsupervised
What are the feature selection methods used to select the right variables?
the process of selecting the subset of features to be used for training a machine learning model.
What are various assumptions used in linear regression? What would happen if they are violated?
The sample data used for modeling represents the entire population (random sample). There exists a linear relationship between the X variables and the mean of the Y variable. The residual variance is the same for any value of X; this is called homoscedasticity. The observations are independent of one another (i.i.d.). Y is distributed normally for any value of X. Extreme violations of these assumptions lead to unreliable results; smaller violations result in greater variance or bias of the estimates.
k-means clustering
iterative analytics technique that seeks to allocate each observation to the cluster closest to it Given K clusters, Step 1: Choose K random points as cluster centers called centroids. Step 2: Assign each x(i) to the closest cluster by implementing euclidean distance (i.e., calculating its distance to each centroid) Step 3: Identify new centroids by taking the average of the assigned points. Step 4: Keep repeating step 2 and step 3 until convergence is achieved
What are the types of biases that can occur during sampling?
1. Selection bias is a problematic situation in which error is introduced due to a non-random population sample. 2. Undercoverage bias occurs when some members of the population are inadequately represented in the sample, i.e. a significant group goes unselected or has zero chance of getting into your sample. 3. Survivorship bias is the logical error of focusing on the entities that survived some selection process and overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.
What is the difference between the long format data and wide format data?
A dataset can be written in two different formats: wide and long. In a wide format, the values in the first (identifier) column do not repeat; in a long format, they do. As a rule of thumb, if you are analyzing data you will typically use a wide format, while if you are visualizing multiple variables in a plot using statistical software such as R, you typically must convert your data to a long format for the software to create the plot.
How do you build a random forest model?
A random forest is built from a number of decision trees. If you split the data into different packages and build a decision tree on each group of data, the random forest brings all those trees together. Steps to build a random forest model: Step 1: Select random samples (with replacement) from the training set. Step 2: Construct a decision tree for every random sample: 1. Randomly select 'k' features from the total 'm' features, where k << m. 2. Among the 'k' features, choose the node with the best split point. 3. Split the node into daughter nodes using the best split. 4. Repeat steps two and three until the leaf nodes are finalized. 5. Build the forest by repeating steps one to four for every random sample from step 1. Step 3: Each tree votes (for regression, the tree outputs are averaged). Step 4: Finally, select the most-voted prediction as the final result. This combination of multiple models is called an ensemble. Ensembles use two methods: Bagging: creating different training subsets from the sample training data with replacement; the final output is based on majority voting. This is a parallel process. Boosting: combining weak learners into a strong learner by creating sequential models such that the final model has the highest accuracy. This is a sequential process. Examples: AdaBoost, XGBoost. https://www.simplilearn.com/tutorials/machine-learning-tutorial/random-forest-algorithm
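A minimal sketch of training a random forest with scikit-learn on hypothetical data; max_features="sqrt" corresponds to the "k << m" random feature selection described above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # hypothetical data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of trees; each tree sees a bootstrap sample and a random subset of features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))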
How can you calculate accuracy using a confusion matrix?
Accuracy: The accuracy is used to find the portion of correctly classified values. It tells us how often our classifier is right. It is the sum of all true values divided by total values. Accuracy = (True Positive + True Negative) / Total Observations = (262 + 347) / 650 = 609 / 650 = 0.93 As a result, we get an accuracy of 93 percent.
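The same calculation in a few lines of Python, using the counts quoted above:

TP, FP, FN, TN = 262, 15, 26, 347
accuracy = (TP + TN) / (TP + FP + FN + TN)   # 609 / 650 ≈ 0.94
print(accuracy)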
Supervised Model Feature Selection 1. Wrapper feature selection methods
Backward selection starts with a full model comprising all available features; in subsequent iterations we remove one feature at a time, always the one whose removal yields the largest gain (or smallest loss) in a model performance metric, until we reach the desired number of features. Forward selection works in the opposite direction: we start from a null model with zero features and greedily add them one at a time to maximize the model's performance. Both are available through scikit-learn's SequentialFeatureSelector (set direction to "forward" or "backward"):

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=3, direction="forward")
sfs.fit(X, y)
X_selection = sfs.transform(X)

Recursive Feature Elimination, or RFE, is similar in spirit to backward selection: it also starts with a full model and iteratively eliminates features one by one. The difference is in how the features to discard are chosen: instead of relying on a model performance metric from a hold-out set, RFE makes its decision based on feature importance extracted from the model. This could be feature weights in linear models, impurity decrease in tree-based models, or permutation importance (which is applicable to any model type).

from sklearn.feature_selection import RFE
from sklearn.svm import SVC

svc = SVC(kernel="linear")
rfe = RFE(svc, n_features_to_select=3)
rfe.fit(X, y)
X_selection = rfe.transform(X)
What is a bias-variance trade-off
Bias: error introduced by overly simple assumptions made by a machine learning algorithm at training time in order to make the target function easier to learn; high bias leads to underfitting. Algorithms that are low on the bias scale include Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and decision trees; algorithms that are high on the bias scale include logistic regression and linear regression. Variance: error introduced when a complex machine learning algorithm learns even the noise in the training data set and therefore performs badly on the test data set; high variance leads to overfitting and hyper-sensitivity. While trying to overcome bias, we increase the complexity of the algorithm; this helps reduce bias, but past a certain point it produces overfitting, hence hyper-sensitivity and high variance. To conclude, bias and variance are inversely related: an increase in bias results in a decrease in variance, and an increase in variance results in a decrease in bias, which is why we speak of a trade-off.
Difference between Confidence Interval and Point Estimate
Confidence Interval: a range of values likely to contain the population parameter, together with a statement of how likely that particular interval is to contain it. The confidence coefficient (or confidence level) is denoted by 1 - alpha, where alpha is the level of significance. Point Estimate: a single value that estimates the population parameter. Popular methods for deriving point estimators of population parameters are the Maximum Likelihood estimator and the Method of Moments.
Violation of Classic linear regression assumptions - Normal Distribution of error term
If the error terms are non-normally distributed, confidence intervals may become too wide or too narrow. Once the confidence intervals become unstable, it becomes difficult to estimate coefficients based on least-squares minimization. The presence of a non-normal distribution suggests there are a few unusual data points that must be studied closely to build a better model. How to check: look at a Q-Q plot, or perform statistical tests of normality such as the Kolmogorov-Smirnov test or Shapiro-Wilk test. The Q-Q (quantile-quantile) plot is a scatter plot that helps validate the assumption of normally distributed errors: if the errors come from a normal distribution, the plot shows a fairly straight line, and deviations from the straight line indicate a lack of normality. If you are wondering what a 'quantile' is: think of quantiles as points in your data below which a certain proportion of the data falls; quantiles are often referred to as percentiles. For example, when we say the value of the 50th percentile is 120, it means half of the data lies below 120. Solution: if the errors are not normally distributed, a non-linear transformation of the variables (response or predictors) can improve the model.
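A minimal sketch of checking normality of residuals with the Shapiro-Wilk test and a Q-Q plot; the residuals here are simulated stand-ins for a fitted model's residuals:

import numpy as np
from scipy import stats

residuals = np.random.default_rng(0).normal(size=100)  # stand-in for model residuals

# Shapiro-Wilk test: a small p-value suggests the residuals are not normally distributed
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Q-Q plot against the normal distribution (points should fall on a straight line)
# import matplotlib.pyplot as plt
# stats.probplot(residuals, dist="norm", plot=plt); plt.show()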
A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
In the case of two children, there are 4 equally likely possibilities: BB, BG, GB and GG, where B = boy, G = girl, and the first letter denotes the first child. From the question, we can exclude the case BB. Of the remaining 3 possibilities (BG, GB and GG), only GG has two girls. Thus, P(two girls | at least one girl) = 1/3.
Data Measurement Levels
Nominal features, such as color ("red", "green" or "blue") have no ordering between the values; they simply group observations based on them. Ordinal features, such as education level ("primary", "secondary", "tertiary") denote order, but not the differences between particular levels (we cannot say that the difference between "primary" and "secondary" is the same as the one between "secondary" and "tertiary"). Interval features, such as temperature in degrees Celsius, keep the intervals equal (the difference between 25 and 20 degrees is the same as between 30 and 25). Finally, ratio features, such as price in USD, are characterized by a meaningful zero, which allows us to calculate ratios between two data points: we can say that $4 is twice as much as $2. In order to choose the right statistical tool to measure the relation between two variables, we need to think about their measurement levels.
In your choice of language, write a program that prints the numbers ranging from one to 50. But for multiples of three, print "Fizz" instead of the number, and for the multiples of five, print "Buzz." For numbers which are multiples of both three and five, print "FizzBuzz"
Note that the question asks for the numbers one to 50. In Python, range(1, 51) covers exactly this, because the upper end of a range is exclusive (range(51) would start at zero). One possible solution is sketched below.
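A minimal FizzBuzz solution in Python (checking i % 15 first catches multiples of both three and five):

for i in range(1, 51):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)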
What is pruning in a decision tree algorithm
Pruning simplifies a decision tree by reducing its rules; it helps avoid complexity and improves accuracy. Reduced-error pruning and cost-complexity pruning are among the different types of pruning. The simplest technique is to prune out the portions of the tree that contribute the least information gain. IG-based pruning requires us to identify "twigs": nodes whose children are all leaves. Pruning a twig removes all of the leaves which are the children of the twig and makes the twig a leaf. The algorithm for pruning is as follows: 1. Catalog all twigs in the tree. 2. Count the total number of leaves in the tree. 3. While the number of leaves in the tree exceeds the desired number, find the twig with the least information gain. 4. Remove all child nodes of the twig, relabel the twig as a leaf, and update the leaf count. Pseudocode for this pruning algorithm, cleaned up as Python-style code:

from heapq import heappush, heappop

# Count leaves in a tree
def count_leaves(tree):
    if tree.is_leaf:
        return 1
    return sum(count_leaves(child) for child in tree.children)

# Check if a node is a twig (all of its children are leaves)
def is_twig(tree):
    return all(child.is_leaf for child in tree.children)

# Make a heap of twigs keyed by node information gain
# (id() is used as a tie-breaker so equal gains never compare tree objects)
def collect_twigs(tree, heap=None):
    if heap is None:
        heap = []
    if is_twig(tree):
        heappush(heap, (tree.node_information_gain, id(tree), tree))
    else:
        for child in tree.children:
            collect_twigs(child, heap)
    return heap

# Prune a tree down to at most n_leaves leaves
# (heappop pops the twig with the smallest information gain)
def prune(tree, n_leaves):
    total_leaves = count_leaves(tree)
    twig_heap = collect_twigs(tree)
    while total_leaves > n_leaves:
        _, _, twig = heappop(twig_heap)
        # Trimming the twig removes its children as leaves,
        # but adds the twig itself back as a leaf
        total_leaves -= len(twig.children) - 1
        twig.children = []                 # remove the children
        twig.is_leaf = True
        twig.node_information_gain = 0
        # If the parent has now become a twig, push it onto the heap
        parent = twig.parent
        if parent is not None and is_twig(parent):
            heappush(twig_heap, (parent.node_information_gain, id(parent), parent))
What is an RNN (recurrent neural network)?
An RNN is a neural network designed for sequential data. RNNs are used in language translation, voice recognition, image captioning, etc. There are different types of RNN architectures such as one-to-one, one-to-many, many-to-one and many-to-many. RNNs have been used in Google's voice search and Apple's Siri.
Sampling and its types
Sampling is the selection of individual members or a subset of the population in order to estimate the characteristics of the whole population. Types of sampling: Probability sampling involves random selection, allowing you to make strong statistical inferences about the whole group. Non-probability sampling involves non-random selection based on convenience or other criteria, allowing you to collect data easily. https://www.scribbr.com/methodology/sampling-methods/#:~:text=There%20are%20two%20types%20of,you%20to%20easily%20collect%20data.
What is skewed Distribution & uniform distribution?
A skewed distribution occurs when the data is concentrated on one side of the plot, whereas a uniform distribution is one in which the data is spread equally across its range.
Violation of Classic linear regression assumptions - Heteroskedasticity:
Heteroskedasticity is the presence of non-constant variance in the error terms. Generally, non-constant variance arises in the presence of outliers or extreme leverage values; these values get too much weight and thereby disproportionately influence the model's performance. When this phenomenon occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow. How to check: look at the residual vs fitted values plot. If heteroskedasticity exists, the plot exhibits a funnel-shaped pattern. You can also use the Breusch-Pagan / Cook-Weisberg test or White's general test to detect this phenomenon.
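A minimal sketch of the Breusch-Pagan test with statsmodels on simulated data; a small p-value suggests heteroskedasticity:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)                       # hypothetical data
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=100)

model = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")     # small p-value -> evidence of heteroskedasticity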
Good fit in Machine Learning
To find the good fit model, you need to look at the performance of a machine learning model over time with the training data. As the algorithm learns over time, the error for the model on the training data reduces, as well as the error on the test dataset. If you train the model for too long, the model may learn the unnecessary details and the noise in the training set and hence lead to overfitting. In order to achieve a good fit, you need to stop training at a point where the error starts to increase in test dataset
How do you identify if a coin is biased
To identify this, we perform a hypothesis test. Under the null hypothesis, the coin is unbiased and the probability of flipping heads is 50%; under the alternative hypothesis, the coin is biased and the probability is not equal to 50%. Perform the following steps: flip the coin 500 times, calculate the p-value, and compare the p-value against alpha (for a two-tailed test at the 5% level, each tail gets 0.05/2 = 0.025). Two cases can occur: if the p-value > alpha, we fail to reject the null hypothesis and treat the coin as unbiased; if the p-value < alpha, we reject the null hypothesis and conclude that the coin is biased.
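A minimal sketch of this test in Python using an exact binomial test; scipy's binomtest is already two-sided here, so its p-value is compared to alpha directly, and the number of heads is a hypothetical outcome:

from scipy.stats import binomtest

n_flips, n_heads = 500, 270                         # hypothetical experiment outcome
result = binomtest(n_heads, n=n_flips, p=0.5, alternative="two-sided")

alpha = 0.05
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.4f}: reject H0 -> the coin appears biased")
else:
    print(f"p = {result.pvalue:.4f}: fail to reject H0 -> no evidence of bias")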
Difference between Normalisation and Standardization
When to standardize data? When we want all the variables on comparable scales; it is usually applied when the data has a Gaussian distribution. X' = (X - μ) / σ. When to normalize data? Normalization is a type of feature scaling and is typically used when the data distribution is unknown or not Gaussian. This scaling technique suits data with a diversified range and algorithms that make no assumptions about the data distribution, such as artificial neural networks. The main goal of normalization is to make the data homogeneous over all records and fields, which helps in linking the data and improving its quality. X' = (X - Xmin) / (Xmax - Xmin), where Xmin is the feature's minimum value and Xmax is the feature's maximum value.
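A minimal sketch of both scalings with scikit-learn on a small hypothetical feature matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # hypothetical features

X_std = StandardScaler().fit_transform(X)    # (X - mean) / std, per column
X_norm = MinMaxScaler().fit_transform(X)     # (X - min) / (max - min), per column
print(X_std)
print(X_norm)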
Underfit in Machine Learning
When a model has not learned the patterns in the training data well and is unable to generalize well on the new data, it is known as underfitting. An underfit model has poor performance on the training data and will result in unreliable predictions. Underfitting occurs due to high bias and low variance.
Why do you need to perform resampling?
1. Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of data points, or by using subsets of the accessible data. 2. Substituting labels on data points when performing significance tests (permutation tests). 3. Validating models by using random subsets (e.g. cross-validation).
association rules algorithm
Association rules analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association: support, confidence, and lift. https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html
Unsupervised feature selection methods
1. Zero or near-zero variance. Features that are (almost) constant provide little information to learn from and thus are irrelevant.

from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=0.05)
X_selection = sel.fit_transform(X)

2. Many missing values. While dropping incomplete features is not the preferred way to handle missing data, it is often a good start, and if too many entries are missing, it might be the only sensible thing to do, since such features are likely inconsequential.

X_selection = X.dropna(axis=1)

3. High multicollinearity. Multicollinearity means a strong correlation between different features, which might signal redundancy issues. A popular multicollinearity measure is the Variance Inflation Factor (VIF).

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_scores = [variance_inflation_factor(X.values, feature) for feature in range(len(X.columns))]
Embedded feature selection methods
1. Lasso regression: Lasso puts a constraint on the sum of the absolute values of the model parameters: the sum has to be less than a fixed value (upper bound). To do this, the method applies a shrinking (regularisation) process in which it penalises the coefficients of the regression variables, shrinking some of them to zero. The regularisation strength is controlled by the alpha parameter in the Lasso model: the higher the value of alpha, the more feature coefficients are shrunk to zero; when alpha is set to zero, Lasso regression produces the same coefficients as linear regression. https://medium.com/mlearning-ai/how-lasso-regression-is-a-valuable-feature-selection-tool-aac502819f99 2. Auto-encoders with a bottleneck layer force the network to disregard the least useful features of the input and focus on the most important ones. Other than that, there aren't many widely used examples. https://towardsdatascience.com/autoencoders-from-vanilla-to-variational-6f5bb5537e4a
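A minimal sketch of Lasso-based feature selection on simulated data; the alpha value is illustrative and would normally be tuned:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)      # Lasso is sensitive to feature scale

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.where(lasso.coef_ != 0)[0]   # features whose coefficients were not shrunk to zero
print("kept features:", selected)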
Violation of Classic linear regression assumptions - Linear relationship
1. Linear and additive: if you fit a linear model to a non-linear, non-additive data set, the regression algorithm fails to capture the trend mathematically, resulting in an inefficient model and erroneous predictions on unseen data. How to check: look at the residual vs fitted values plot; you can also include polynomial terms (X, X², X³) in your model to capture the non-linear effect. This scatter plot shows the distribution of residuals (errors) versus fitted (predicted) values. It is one of the most important plots to learn, revealing various useful insights including outliers, which are labeled by their observation number to make them easy to detect. There are two major things to look for: if there is any pattern (for example, a parabolic shape) in this plot, consider it a sign of non-linearity in the data, meaning the model doesn't capture non-linear effects; if a funnel shape is evident, consider it a sign of non-constant variance, i.e. heteroskedasticity. Solution: to overcome non-linearity, you can apply a non-linear transformation to the predictors, such as log(X), √X or X², or transform the dependent variable. To overcome heteroskedasticity, a possible way is to transform the response variable, such as log(Y) or √Y; you can also use the weighted least squares method.
Explain the steps in making a decision tree
1. Take the entire data set as input 2. Calculate entropy of the target variable, as well as the predictor attributes 3. Calculate your information gain of all attributes (we gain information on sorting different objects from each other) 4. Choose the attribute with the highest information gain as the root node 5. Repeat the same procedure on every branch until the decision node of each branch is finalized
Explain the steps for a Data analytics project
1. Understand the business problem. 2. Explore the data and study it carefully. 3. Prepare the data for modeling by finding missing values and transforming variables. 4. Run the model and analyze the results. 5. Validate the model with a new data set. 6. Implement the model and track the results to analyze the model's performance over a specific period.
What is ROC Curve?
A ROC curve is a plot of sensitivity versus (1 - specificity), where each point on the curve corresponds to a different cut-off used for calling a result positive. A test whose ROC curve lies along the diagonal is no better than random guessing; a curve that bows toward the top-left corner represents a better test, and an excellent test makes cut-off determination relatively simple, yielding a high true positive rate at a very low false positive rate (both sensitive and specific).
what is outlier, cause of outlier, How can outlier values be detected and treated
An outlier may be defined as an observation that deviates drastically from the central tendency (mean, median, mode) of the data set. An outlier may be caused: 1. simply by chance, or by measurement/experimental/human error (treatment: simply delete, cap, or impute with the median); 2. by a heavy-tailed distribution (genuine values that can't simply be deleted, capped, or median-imputed). So you need to figure out whether an extreme value is genuine, and assess its impact on the statistics of interest. Detecting: 1. box plots; 2. z-score, using the formula (Xi - mean) / std; 3. the Inter-Quartile Range (IQR) criterion: data points that lie more than 1.5 times the IQR (= Q3 - Q1) above Q3 or below Q1 are outliers. Handling outliers: 1. trimming/removing the outlier (random or measurement error); 2. quantile-based flooring and capping; 3. median imputation. For genuine values: 4. reducing the weights of outliers; 5. normalizing or log-transforming to induce normality (for heavy-tailed data); 6. fitting a nonlinear model; 7. using algorithms that are less affected by outliers, such as random forests.
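A minimal sketch of the IQR criterion on hypothetical values:

import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])  # hypothetical values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print("outliers:", outliers)   # flags the extreme value 102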
Artificial Neural Networks
Artificial Neural Networks, are computational models based on the structure and functions of biological neural networks. It is like an artificial human nervous system for receiving, processing, and transmitting information in terms of Computer Science. Each neuron's input signal is actually a weighted combination of potentially many input signals, and the weighting of each input means that that input can have a different influence on any subsequent calculations, and ultimately on the final output of the entire network. In addition, each neuron applies a function or transformation to the weighted inputs, which means that the combined weighted input signal is transformed mathematically prior to evaluating if the activation threshold has been exceeded. This combination of weighted input signals and the functions applied are typically either linear or nonlinear. These input signals can originate in many ways, with our senses being some of the most important, as well as ingestion of gases (breathing), liquids (drinking), and solids (eating) for example. A single neuron may receive hundreds of thousands of input signals at once that undergo the summation process to determine if the message gets passed along, and ultimately causes the brain to instruct actions, memory recollection, and so on. The 'thinking' or processing that our brain carries out, and the subsequent instructions given to our muscles, organs, and body are the result of these neural networks in action. In addition, the brain's neural networks continuously change and update themselves in many ways, including modifications to the amount of weighting applied between neurons. This happens as a direct result of learning and experience.
After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study? Choose the correct option: K-means clustering Linear regression Association rules Decision trees
As we are looking to group people by their similarity to four identified individual types, this indicates the value of k (k = 4). Therefore, K-means clustering (answer A) is the most appropriate algorithm for this study.
You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn't you be happy with your model performance? What can you do about it?
Cancer detection results in imbalanced data. On an imbalanced dataset, accuracy should not be used as a measure of performance: it is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial in cancer detection and can greatly improve a patient's prognosis. Hence, to evaluate model performance we should use sensitivity (true positive rate), specificity (true negative rate), and the F measure to determine the class-wise performance of the classifier.
What is a confusion matrix?
Classification models have multiple categorical outputs, and a model might misclassify some categories more than others; we cannot see this using a standard accuracy measure (total error). A confusion matrix presents a table layout of the different outcomes of the predictions of a classification problem and helps visualize them. True Positive: the number of times the actual positive values are predicted as positive; you predicted a positive value, and it is correct. False Positive: the number of times the model wrongly predicts negative values as positive; you predicted a positive value, but it is actually negative. True Negative: the number of times the actual negative values are predicted as negative; you predicted a negative value, and it is actually negative. False Negative: the number of times the model wrongly predicts positive values as negative; you predicted a negative value, but it is actually positive.
What are Eigenvectors and Eigenvalues?
An eigenvector of a matrix is a non-zero vector whose direction is unchanged when the matrix is applied to it; eigenvectors are often normalized to unit length (magnitude 1) and are also called right vectors. Eigenvalues are the coefficients attached to the eigenvectors: for an eigenvector v of a matrix A, Av = λv, where the eigenvalue λ gives the factor by which the vector's length or magnitude is scaled.
What are some examples where a false positive is more important than a false negative?
False positives are cases wrongly identified as an event when they are not; they are called Type I errors. False negatives are cases wrongly identified as non-events despite being events; they are called Type II errors. Some examples where false positives are more important than false negatives: In the medical field: consider a lab report that predicts cancer for a patient who does not have cancer. This is a false positive, and it is dangerous to start chemotherapy for that patient, since chemotherapy would damage healthy cells. In the e-commerce field: suppose a company starts a campaign giving $100 gift vouchers to customers who purchase $10,000 worth of items, assuming it will result in at least 20% profit on items sold above $10,000. If the vouchers are given to customers who haven't purchased anything but have been mistakenly marked as having purchased $10,000 worth of products, that is a false-positive error.
Give one example where false positives and false negatives are equally important.
In banking: lending is a main source of income for banks, but if the repayment rate isn't good, there is a risk of huge losses instead of profits. Giving out loans is therefore a gamble: banks can't risk losing good customers, but at the same time they can't afford to acquire bad customers. This is a classic example where false positives and false negatives are equally important.
What is k-fold cross-validation?
It is a data partitioning strategy so that you can effectively use your dataset to build a more generalized model which can perform well on unseen data. 1. Split training data into K equal parts 2. Fit the model on k-1 parts and calculate test error using the fitted model on the kth part 3. Repeat k times, using each data subset as the test set once. (usually k= 5~20)
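A minimal sketch of 5-fold cross-validation with scikit-learn on hypothetical data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)  # hypothetical data

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # 5-fold cross-validation
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))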
What is Deep Learning?
Machine Learning → Artificial Neural Networks → Deep Learning Deep Learning automates the task of predictions i.e. it helps design a model through which we can pass our dataset. They process this data through many layers of nonlinear transformations of the input data in order to calculate a target output. Unsupervised feature extraction is also an area where deep learning excels. Feature extraction is when an algorithm is able to automatically derive or construct meaningful features of the data to be used for further learning, generalization, and understanding. The burden is traditionally on the data scientist or programmer to carry out the feature extraction process in most other machine learning approaches, along with feature selection and engineering. Feature extraction usually involves some amount dimensionality reduction as well, which is reducing the amount of input features and data required to generate meaningful results. This has many benefits, which include simplification, computational and memory power reduction, and so on. More generally, deep learning falls under the group of techniques known as feature learning or representation learning. As discussed so far, feature extraction is used to 'learn' which features to focus on and use in machine learning solutions. The machine learning algorithms themselves 'learn' the optimal parameters to create the best performing model. Feature learning algorithms allow a machine to learn for a specific task using a well-suited set of features, and also learn the features themselves. In other words, these algorithms learn how to learn!
Describe Markov chains
A Markov chain is a process in which the probability of the future state depends only on the current state. A good example of a Markov chain is a word-recommendation system: the model recognizes and recommends the next word based only on the immediately preceding word and nothing before that. Such a model is trained on text similar to the training data sets and generates recommendations for the current text accordingly, based on the previous word.
Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use? One-way ANOVA K-means clustering Association rules Student's t-test
One-way ANOVA. Use a one-way ANOVA when you have collected data about one categorical independent variable and one quantitative dependent variable. The independent variable should have at least three levels (i.e. at least three different groups or categories). ANOVA tells you if the dependent variable changes according to the level of the independent variable. For example: * Your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out if there is a difference in hours of sleep per night. * Your independent variable is brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference in the price per 100ml. * Your independent variable is type of fertilizer, and you treat crop fields with mixtures 1, 2 and 3 to find out if there is a difference in crop yield. The null hypothesis (H0) of ANOVA is that there is no difference among group means. The alternate hypothesis (Ha) is that at least one group differs significantly from the overall mean of the dependent variable. One-way ANOVA R code:
one.way <- aov(yield ~ fertilizer, data = crop.data)
summary(one.way)
Tukey test R code:
TukeyHSD(one.way)
How can you avoid overfitting your model?
Overfitting refers to a model that is only set for a very small amount of data and ignores the bigger picture. It performs very well for training data but has poor performance with test (new) data. There are three main methods to avoid overfitting: 1. Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data 2. Use cross-validation techniques, such as k folds cross-validation 3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting
What is Principal Component Analysis? How do you do a Principal Component Analysis?
PCA reduces the number of variables in a data set while preserving as much information as possible. Principal components are new variables constructed as linear combinations (mixtures) of the initial variables. The principal components are uncorrelated, and most of the information in the initial variables is squeezed into the first components. https://www.youtube.com/watch?v=9z2OtPOi8T0&feature=emb_rel_end 1. Standardize the range of the continuous initial variables so that each one contributes equally to the analysis, because PCA is very sensitive to the variances of the initial variables. 2. Compute the covariance matrix to identify correlations: variables can be highly correlated in a way that makes them contain redundant information, and the covariance matrix reveals these correlations (the sign indicates a positive or negative correlation). 3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components. Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is, the lines that capture the most information in the data. The relationship between variance and information is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries. To put it simply, think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between observations are better visible (watch the YouTube video above). The eigenvectors of the covariance matrix are the directions of the axes with the most variance (most information), which we call the principal components; the eigenvalues are the coefficients attached to the eigenvectors, giving the amount of variance carried by each principal component. By ranking the eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance. 4. Create a feature vector to decide which principal components to keep: the feature vector is simply a matrix whose columns are the eigenvectors of the components we decide to keep, for example keeping only p principal components out of n initial variables. 5. Recast the data along the principal component axes: use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes (in terms of the initial variables) to the axes represented by the principal components (hence the name Principal Component Analysis). This is done by multiplying the transpose of the original data set by the transpose of the feature vector. The output is a projection of the original data onto the p eigenvectors, so the data dimension has been reduced to p. Suppose the original data set is m records (rows) by n variables (columns); its transpose is n x m, and the covariance matrix is n x n. If we select p components, the feature vector is n x p and its transpose is p x n; the product of the two transposes is p x m, which transposes back to m x p, so the dimension is reduced from n to p. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
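A minimal sketch of the whole procedure with scikit-learn; the iris data and the choice of p = 2 components are illustrative:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # 150 records x 4 variables
X_std = StandardScaler().fit_transform(X)   # step 1: standardize

pca = PCA(n_components=2)                   # keep p = 2 principal components
X_proj = pca.fit_transform(X_std)           # steps 2-5 handled internally

print("explained variance ratio:", pca.explained_variance_ratio_)
print("projected shape:", X_proj.shape)     # (150, 2): dimension reduced from 4 to 2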
Dimensionality Reduction Techniques
Principal Component Analysis. PCA extracts a new set of variables from an existing, more extensive set. The new set is called "principal components." Backward Feature Elimination. This five-step technique defines the optimal number of features required for a machine learning algorithm by choosing the best model performance and the maximum tolerable error rate. Forward Feature Selection. This technique follows the inverse of the backward feature elimination process. Thus, we don't eliminate features; instead, we find the best features that produce the highest increase in the model's performance. Missing Value Ratio. This technique sets a threshold level for missing values. If a variable exceeds the threshold, it's dropped. Low Variance Filter. Like the Missing Value Ratio technique, the Low Variance Filter works with a threshold. However, in this case, it's testing data columns. The method calculates the variance of each variable. All data columns with variances falling below the threshold are dropped, since low-variance features don't affect the target variable. High Correlation Filter. This method applies to two variables carrying the same information, thus potentially degrading the model. In this method, we identify the variables with high correlation and use the Variance Inflation Factor (VIF) to choose one. You can remove variables with a higher value (VIF > 5). Decision Trees. Decision trees are a popular supervised learning algorithm that splits data into homogeneous sets based on input variables. This approach solves problems like data outliers, missing values, and identifying significant variables. Random Forest. This method is like the decision tree strategy. However, in this case, we generate a large set of trees (hence "forest") against the target variable. Then we find feature subsets with the help of each attribute's usage statistics. Factor Analysis. This method places highly correlated variables into their own group, symbolizing a single factor or construct.
Consider a case where you know the probability of finding at least one shooting star in a 15-minute interval is 30%. Evaluate the probability of finding at least one shooting star in a one-hour duration?
Probability of finding at least 1 shooting star in 15 min = P(sighting in 15 min) = 30% = 0.3. Hence, probability of not sighting any shooting star in 15 min = 1 - 0.3 = 0.7. Probability of not finding a shooting star in 1 hour = 0.7^4 = 0.2401. Probability of finding at least 1 shooting star in 1 hour = 1 - 0.2401 = 0.7599.
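The same arithmetic in a couple of lines of Python:

p_15min = 0.3
p_none_hour = (1 - p_15min) ** 4        # 0.7**4 = 0.2401
p_at_least_one = 1 - p_none_hour        # 0.7599
print(p_at_least_one)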
Write a basic SQL query that lists all orders with customer information. Usually, we have order tables and customer tables that contain the following columns: Order Table Orderid customerId OrderNumber TotalAmount Customer Table Id FirstName LastName City Country
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM Order
JOIN Customer ON Order.CustomerId = Customer.Id
Sensitivity and Specificity
Sensitivity = TP / (TP + FN), the same as the recall rate = 262 / 288 ≈ 0.91: how well the test identifies observations with the true outcome (for example, with disease). Specificity = TN / (TN + FP) = 347 / (347 + 15): how well the test identifies observations without the true outcome (true negatives, for example, without disease).
Supervised Model Feature Selection 2. Filter feature selection methods
Filter methods simply analyze a feature's statistical relation with the model's target, using measures such as correlation or mutual information as a proxy for the model performance metric.
1. To keep the top 2 features with the strongest Pearson correlation with the target, we can run:
from sklearn.feature_selection import r_regression, SelectKBest
X_selection = SelectKBest(r_regression, k=2).fit_transform(X, y)
To keep the top 20% of features:
from sklearn.feature_selection import r_regression, SelectPercentile
X_selection = SelectPercentile(r_regression, percentile=20).fit_transform(X, y)
2. Spearman's Rho, Kendall Tau, and point-biserial correlation are all available in the scipy package:
from scipy import stats
rho_corr = [stats.spearmanr(X[:, f], y).correlation for f in range(X.shape[1])]
tau_corr = [stats.kendalltau(X[:, f], y).correlation for f in range(X.shape[1])]
pbs_corr = [stats.pointbiserialr(X[:, f], y).correlation for f in range(X.shape[1])]
3. Chi-Squared, Mutual Information, and ANOVA F-score are all in scikit-learn:
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import f_classif
chi2_corr = chi2(X, y)[0]
f_corr = f_classif(X, y)[0]
mi_reg_corr = mutual_info_regression(X, y)
mi_class_corr = mutual_info_classif(X, y)
4. Cramer's V can be obtained from a recent scipy version:
from scipy.stats.contingency import association
v_corr = [association(np.hstack([X[:, f].reshape(-1, 1), y.reshape(-1, 1)]), method="cramer") for f in range(X.shape[1])]
Measuring correlations for various data types
Spearman's rank correlation (Spearman's Rho), it only looks at the rank values, i.e. it compares the two variables in terms of the relative positions of particular data points within the variables. It is able to capture non-linear relations, but there are no free lunches: we lose some information due to only considering the rank instead of the exact data points. Kendall rank correlation (Kendall Tau). (Kendall's calculations are based on concordant and discordant pairs of values, as opposed to Spearman's calculations based on deviations). Kendall is often regarded as more robust to outliers in the data. If at least one of the compared variables is of ordinal type, Spearman's or Kendall rank correlation is the way to go. Due to the fact that ordinal data contains only the information on the ranks, they are both a perfect fit, while Pearson's linear correlation is of little use. Another scenario is when both variables are nominal. In this case, we can choose from a couple of different correlation measures: Cramer's V, which captures the association between the two variables into a number ranging from zero (no association) to one (one variable completely determined by the other). Chi-Squared statistic commonly used for testing for dependence between two variables. Lack of dependence suggests the particular feature is not useful. Mutual information a measure of mutual dependence between two variables that seeks to quantify the amount of information that one can extract from one variable about the other. ANOVA F-score, a chi-squared equivalent for the case when one of the variables is continuous while the other is nominal, Point-biserial correlation a correlation measure especially designed to evaluate the relationship between a binary and a continuous variable.
Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables? K-means clustering Linear regression K-NN (k-nearest neighbor) Decision trees
The K nearest neighbor algorithm can be used because it can compute the nearest neighbor and if it doesn't have a value, it just computes the nearest neighbor based on all the other features. When you're dealing with K-means clustering or linear regression, you need to do that in your pre-processing, otherwise, they'll crash. Decision trees also have the same problem, although there is some variance.
You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true? Choose the right answer: {banana, apple, grape, orange} must be a frequent itemset {banana, apple} => {orange} must be a relevant rule {grape} => {banana, apple} must be a relevant rule {grape, apple} must be a frequent itemset
The answer is the last option: {grape, apple} must be a frequent itemset. Since {banana, apple} => {grape} is a relevant rule, {banana, apple, grape} is a frequent itemset, and every subset of a frequent itemset (including {grape, apple}) must itself be frequent.
Name three disadvantages of using a linear model
1. The assumption of linearity of the errors. 2. It can't be used for binary or count outcomes. 3. There are plenty of overfitting problems that it can't solve.
How will you calculate eigenvalues and eigenvectors of the following 3x3 matrix?
-2  -4   2
-2   1   2
 4   2   5
The characteristic equation is det(A - λI) = 0. Expanding the determinant: (-2 - λ)[(1 - λ)(5 - λ) - 2x2] + 4[(-2)x(5 - λ) - 4x2] + 2[(-2)x2 - 4(1 - λ)] = 0, which gives -λ³ + 4λ² + 27λ - 90 = 0, i.e. λ³ - 4λ² - 27λ + 90 = 0. This is an algebraic equation whose roots are the eigenvalues. By trial: 3³ - 4x3² - 27x3 + 90 = 0, hence (λ - 3) is a factor: λ³ - 4λ² - 27λ + 90 = (λ - 3)(λ² - λ - 30) = (λ - 3)(λ + 5)(λ - 6), so the eigenvalues are 3, -5 and 6. Eigenvector for λ = 3: solve (A - 3I)v = 0. Taking X = 1: -5 - 4Y + 2Z = 0 and -2 - 2Y + 2Z = 0. Subtracting the two equations: -3 - 2Y = 0, so Y = -(3/2). Substituting back into the second equation: Z = -(1/2). Similarly, we can calculate the eigenvectors for -5 and 6.
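A quick numerical check of the eigenvalues with NumPy (the order returned by eig may differ, and eigenvectors are defined only up to a scalar multiple):

import numpy as np

A = np.array([[-2, -4, 2],
              [-2,  1, 2],
              [ 4,  2, 5]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)          # approximately 3, -5, 6 (order may vary)
print(eigenvectors[:, 0])   # the eigenvector paired with the first eigenvalue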
What is ensemble learning?
An ensemble is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are: Bagging, which trains similar learners on small sample populations (bootstrap samples) and combines their predictions, helping you make closer predictions; and Boosting, an iterative method that adjusts the weight of each observation depending on the last classification, decreasing the bias error and helping you build strong predictive models.
Walkthrough the probability fundamentals
The possibility of the occurrence of an event, among all the possible outcomes, is known as its probability. The probability of an event always lies between 0 and 1 (inclusive). Factorial: used to find the total number of ways n items can be arranged in n places without repetition; its value is n multiplied by all natural numbers down to 1, e.g. 5! = 5 x 4 x 3 x 2 x 1 = 120. Permutation: used when replacement is not allowed and the order of items is important; nPr = n! / (n - r)!, where n is the total number of items and r is the number of items being selected. Combination: used when replacement is not allowed and the order of items is not important; nCr = n! / (r! (n - r)!). Some rules of probability: Addition rule: P(A or B) = P(A) + P(B) - P(A and B). Conditional probability: the probability of event B occurring given that event A has already occurred; P(A and B) = P(A) . P(B|A). Central Limit Theorem: when we draw sufficiently large random samples from a population and take the means of those samples, the sample means form an approximately normal distribution.
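A short check of the factorial, permutation, and combination formulas using Python's math module:

import math

print(math.factorial(5))   # 120
print(math.perm(5, 2))     # 20  ordered selections: n! / (n - r)!
print(math.comb(5, 2))     # 10  unordered selections: n! / (r! * (n - r)!)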
Violation of Classic linear regression assumptions - Autocorrelation: errors not independently distributed
The presence of correlation in the error terms drastically reduces the model's accuracy. This usually occurs in time series models, where the next instant depends on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors. If this happens, confidence intervals and prediction intervals become narrower: a nominal 95% confidence interval would have less than 0.95 probability of containing the actual value of the coefficient. For example, the least squares coefficient of X¹ is 15.02 and its standard error is 2.08 without autocorrelation; in the presence of autocorrelation, the standard error reduces to 1.20, and as a result the prediction interval narrows from (12.94, 17.10) to (13.82, 16.22). Lower standard errors also make the associated p-values lower than they should be, which leads us to incorrectly conclude that a parameter is statistically significant. How to check: look at the Durbin-Watson (DW) statistic, which must lie between 0 and 4. DW = 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 indicates negative autocorrelation. You can also look at a residual vs time plot for seasonal or correlated patterns in the residual values.
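A minimal sketch of computing the Durbin-Watson statistic with statsmodels on simulated data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)                      # hypothetical data
X = sm.add_constant(rng.normal(size=(100, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

model = sm.OLS(y, X).fit()
dw = durbin_watson(model.resid)   # ~2 -> no autocorrelation; <2 positive; >2 negative
print(round(dw, 2))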
Types of Clustering
The various types of clustering are:
* Hierarchical clustering
* Partitioning clustering
Hierarchical clustering is further subdivided into agglomerative clustering and divisive clustering. Partitioning clustering is further subdivided into K-Means clustering.
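As an illustration, a minimal scikit-learn sketch of one method from each family, on hypothetical 2-D data:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Partitioning clustering: K-Means assigns each point to the nearest of k centroids
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering: points are merged bottom-up into clusters
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)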
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
Two types: Missing at Random (MAR): delete the rows if the dataset is large; use mean, median or mode imputation for a smaller dataset.
# impute by mean
missing_col = ['GPA']
for i in missing_col:
    df.loc[df.loc[:, i].isnull(), i] = df.loc[:, i].mean()
# impute by median
missing_col = ['IELTS']
for i in missing_col:
    df.loc[df.loc[:, i].isnull(), i] = df.loc[:, i].median()
Missing Not at Random (MNAR): for example, students who failed the exam have no attendance records. Here a regression-type imputation is more appropriate (see the sketch below).
https://www.analyticsvidhya.com/blog/2021/10/guide-to-deal-with-missing-values/ https://medium.com/analytics-vidhya/missing-values-in-data-science-8e3989fc5e79
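For the regression-type imputation mentioned above, one possible sketch (an assumption, not prescribed by the linked articles) uses scikit-learn's IterativeImputer, which models each incomplete feature as a function of the other features:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer

# Hypothetical dataframe with missing values in numeric columns
df = pd.DataFrame({"GPA": [3.2, np.nan, 3.8, 2.9], "IELTS": [7.0, 6.5, np.nan, 8.0]})

imputer = IterativeImputer(random_state=0)
df[["GPA", "IELTS"]] = imputer.fit_transform(df[["GPA", "IELTS"]])
print(df)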
Differentiate between univariate, bivariate, and multivariate analysis.
Univariate data contains only one variable. The purpose of univariate analysis is to describe the data (mean, standard deviation, skewness, range, mode, median and a histogram for continuous features; frequency counts for categorical features) and to find patterns that exist within it. Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships, and it is done to determine the relationship between the two variables; correlation is typically reported from this analysis. Multivariate analysis is similar to bivariate analysis but involves more than two variables. It goes further than bivariate analysis because it accounts for the correlations among the independent variables (features) as well as the correlations between the dependent and independent variables.
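A minimal pandas sketch of the three levels of analysis, on a small hypothetical dataframe:
import pandas as pd

df = pd.DataFrame({"age": [23, 31, 45, 52, 29],
                   "income": [30, 48, 80, 95, 42],
                   "spend": [12, 20, 35, 40, 18]})

# Univariate: describe one variable (mean, std, quartiles, etc.)
print(df["age"].describe())

# Bivariate: correlation between two variables
print(df["age"].corr(df["income"]))

# Multivariate: correlation matrix across all variables at once
print(df.corr())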
Feature Selection Methods
Unsupervised feature selection methods:
1. Zero or near-zero variance
2. Many missing values
3. High multicollinearity
Supervised feature selection methods:
1. Wrapper feature selection methods
2. Filter feature selection methods
3. Embedded feature selection methods
https://neptune.ai/blog/feature-selection-methods
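As an illustration of the unsupervised checks, a minimal sketch on a hypothetical feature matrix:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix
X = pd.DataFrame({"constant": [1, 1, 1, 1],
                  "useful": [1, 2, 3, 4],
                  "sparse": [2.0, None, None, None]})

# Drop features with many missing values (e.g. more than 50% missing)
X = X.loc[:, X.isnull().mean() <= 0.5]

# Drop zero-variance features
X_reduced = VarianceThreshold().fit_transform(X)

# High multicollinearity is usually checked with a correlation matrix or VIF
print(X.corr())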
A jar contains 1000 coins: 999 are fair and 1 is double-headed. You pick a coin at random, toss it 10 times, and see 10 heads. Estimate the probability of getting a head on the next toss of the same coin.
We know that there are two types of coins, fair and double-headed, so there are two possible ways of choosing a coin. P(selecting a fair coin) = 999/1000 = 0.999 and P(selecting the double-headed coin) = 1/1000 = 0.001. By the law of total probability, P(10 heads in a row) = P(fair coin) x P(10 heads | fair) + P(double-headed coin) x P(10 heads | double-headed) = A + B, where A = 0.999 x (1/2)^10 = 0.999/1024 = 0.000976 and B = 0.001 x 1 = 0.001. Using Bayes' rule, P(fair | 10 heads) = A / (A + B) = 0.000976 / 0.001976 = 0.4939 and P(double-headed | 10 heads) = B / (A + B) = 0.001 / 0.001976 = 0.5061. Therefore P(head on the next toss) = 0.4939 x 0.5 + 0.5061 x 1 = 0.7531.
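The same calculation as a quick Python check:
p_fair, p_double = 999 / 1000, 1 / 1000

# Likelihood of observing 10 heads under each coin
like_fair, like_double = 0.5 ** 10, 1.0

# Posterior probability of each coin given 10 heads (Bayes' rule)
evidence = p_fair * like_fair + p_double * like_double
post_fair = p_fair * like_fair / evidence
post_double = p_double * like_double / evidence

# Probability of a head on the next toss
print(post_fair * 0.5 + post_double * 1.0)  # ~0.7531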
How should I balance sensitivity with specificity?
Where results are given on a sliding scale of values, rather than a definitive positive or negative, sensitivity and specificity values are especially important. They allow you to determine where to draw the cut-off for calling a result positive or negative, or even suggest a grey area where a retest would be recommended. For example, by putting the cut-off for a positive result at a very low level, you may capture all positive samples, so the test is very sensitive. However, this may mean many samples that are actually negative are called positive, so the test would be deemed to have poor specificity. Finding a balance is therefore vital for an effective and usable test.
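As an illustration (not part of the original answer), a minimal scikit-learn sketch that tabulates sensitivity and specificity at every candidate cut-off, assuming hypothetical true labels and test scores:
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and continuous test results / model scores
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
sensitivity = tpr         # true positive rate
specificity = 1 - fpr     # true negative rate

# Each threshold is one possible cut-off; lowering it raises sensitivity but lowers specificity
for t, se, sp in zip(thresholds, sensitivity, specificity):
    print(f"cutoff={t:.2f}  sensitivity={se:.2f}  specificity={sp:.2f}")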
What is dimensionality reduction and what are its benefits?
Dimensionality is the number of independent variables (features) used to predict the target. Suppose your target is a quarter you dropped on your walkway: if you walked in a straight line (one feature) for 50 yards, you would probably find it quickly. But if your search area covers a square 50 yards by 50 yards (two features), the search could take days, and if it is a cube 50 by 50 by 50 yards (three features), you may as well say goodbye to that quarter. When many dimensions reside in the feature space, the volume of that space is huge, and the available rows of data may represent only a tiny, non-representative sample of it. This imbalance negatively affects machine learning algorithm performance and is known as "the curse of dimensionality." The bottom line: a data set with a vast number of input features complicates the predictive modeling task, putting performance and accuracy at risk. Dimensionality reduction addresses this by projecting or selecting the data into fewer features while retaining most of the information. https://www.simplilearn.com/what-is-dimensionality-reduction-article
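As an illustration, a minimal PCA sketch in scikit-learn that keeps only as many components as are needed to explain 95% of the variance:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)    # 64 features per sample

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # far fewer features, most information retained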
Below are the eight actual values of the target variable in the train file. What is the entropy of the target variable? [0, 0, 0, 1, 1, 1, 1, 1]
Entropy describes a dataset in terms of the probability distribution of observations belonging to one class or another, e.g. two classes in the case of a binary classification dataset. In a binary classification problem (two classes), we can calculate the entropy of the data sample as: Entropy = -(p(0) * log2(p(0)) + p(1) * log2(p(1))). Entropy can be used as a measure of the purity of a dataset, i.e. how balanced the distribution of classes happens to be. An entropy of 0 bits indicates a dataset containing one class (high purity); an entropy of 1 or more (depending on the number of classes) indicates a high level of disorder (low purity). For binary classification (0/1), an entropy of 1 means the data is half 0 and half 1. Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. It is commonly used in the construction of decision trees from a training dataset: the information gain of each variable is evaluated, and the variable that maximizes the information gain (which in turn minimizes the entropy) is selected to split the dataset into groups for effective classification. Here the positive class (1) appears 5 times out of 8 and the negative class (0) appears 3 times out of 8, so Entropy = -(5/8 * log2(5/8) + 3/8 * log2(3/8)) ≈ 0.954 bits.
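The same calculation in Python (log base 2, matching the entropy-in-bits convention above):
from math import log2

target = [0, 0, 0, 1, 1, 1, 1, 1]
p1 = target.count(1) / len(target)   # 5/8
p0 = target.count(0) / len(target)   # 3/8

entropy = -(p1 * log2(p1) + p0 * log2(p0))
print(entropy)  # ~0.954 bits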
For the given points, how will you calculate the Euclidean distance in Python? plot1 = [1,3] plot2 = [2,5]
from math import sqrt
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)  # = sqrt(5) ≈ 2.236 for plot1 = [1,3], plot2 = [2,5]
What are the feature selection methods used to select the right variables?
Feature selection is the process of selecting the subset of features to be used for training a machine learning model.
Filter feature selection methods:
To keep the top 2 features with the strongest Pearson correlation with the target, we can run:
from sklearn.feature_selection import r_regression, SelectKBest
X_selection = SelectKBest(r_regression, k=2).fit_transform(X, y)
To keep the top 30%:
from sklearn.feature_selection import r_regression, SelectPercentile
X_selection = SelectPercentile(r_regression, percentile=30).fit_transform(X, y)
Spearman's rho, Kendall's tau, and point-biserial correlation are available in scipy:
from scipy import stats
rho_corr = [stats.spearmanr(X[:, f], y).correlation for f in range(X.shape[1])]
tau_corr = [stats.kendalltau(X[:, f], y).correlation for f in range(X.shape[1])]
pbs_corr = [stats.pointbiserialr(X[:, f], y).correlation for f in range(X.shape[1])]
Chi-squared, mutual information, and the ANOVA F-score are all in scikit-learn:
from sklearn.feature_selection import chi2, f_classif, mutual_info_regression, mutual_info_classif
chi2_corr = chi2(X, y)[0]
f_corr = f_classif(X, y)[0]
mi_reg_corr = mutual_info_regression(X, y)
mi_class_corr = mutual_info_classif(X, y)
Cramer's V can be obtained from a recent scipy:
import numpy as np
from scipy.stats.contingency import association
v_corr = [association(np.hstack([X[:, f].reshape(-1, 1), y.reshape(-1, 1)]), method="cramer") for f in range(X.shape[1])]
Embedded feature selection methods: see the sketch below.
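For the embedded methods, one common sketch (an illustration, assuming the same X and y as in the snippets above) wraps an L1-penalised or tree-based model in SelectFromModel:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier

# L1 regularisation drives weak coefficients to zero; the surviving features are kept
lasso_selector = SelectFromModel(Lasso(alpha=0.01))
X_selected = lasso_selector.fit_transform(X, y)

# Tree-based feature importances can be used the same way
rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_selected_rf = rf_selector.fit_transform(X, y)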
Information Gain
Information gain measures the reduction in entropy obtained by splitting a dataset on an attribute: Gain = Entropy of parent − weighted average entropy of the children, where each child is weighted by its share of the parent's observations. https://www.section.io/engineering-education/entropy-information-gain-machine-learning/
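A minimal sketch of the same formula for binary labels, where each child's entropy is weighted by its share of the parent's rows:
from math import log2

def entropy(labels):
    # Binary entropy of a list of 0/1 labels, in bits
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(parent, children):
    # Entropy of the parent minus the size-weighted entropy of its children
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Example: splitting [0,0,0,1,1,1,1,1] into a mixed child and a pure child
print(information_gain([0, 0, 0, 1, 1, 1, 1, 1], [[0, 0, 0, 1], [1, 1, 1, 1]]))  # ~0.549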
K nearest neighbor
K-nearest neighbors (KNN) is one of the simplest supervised machine learning algorithms, used mostly for classification. It classifies a data point based on its neighbors' classifications: it stores all available cases and classifies new cases based on similar features. KNN fits best in the following scenarios: the data is labeled, the data is noise-free, and the dataset is small (KNN is a lazy learner that does all its work at prediction time).
Pros:
1. Since KNN requires no training before making predictions, new data can be added seamlessly without retraining.
2. KNN is very easy to implement; only two parameters are required, the value of K and the distance function (e.g. Euclidean, Manhattan, etc.).
Cons:
1. KNN does not work well with large datasets: the cost of calculating the distance between the new point and every existing point is huge, which degrades performance.
2. Feature scaling (standardization or normalization) is required before applying KNN to any dataset; otherwise, KNN may generate wrong predictions.
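A minimal scikit-learn sketch, including the feature scaling the answer calls out:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first, then classify each point by a majority vote of its 5 nearest neighbours
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))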
What does NLP stand for?
NLP is short for Natural Language Processing. It deals with how computers can be programmed to understand and process large amounts of natural language (text) data. A few popular NLP operations are stemming, sentiment analysis, tokenization, removal of stop words, etc.
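A small NLTK sketch of tokenization, stop-word removal and stemming (an illustration; it assumes the NLTK stopwords corpus can be downloaded):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

text = "Natural Language Processing helps computers understand human language"
tokens = text.lower().split()                                         # simple whitespace tokenization
tokens = [t for t in tokens if t not in stopwords.words("english")]   # stop-word removal
stems = [PorterStemmer().stem(t) for t in tokens]                     # stemming
print(stems)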
How can time-series data be declared stationary?
A time series is stationary when its mean and variance are constant over time. Plotted against time (X is the time axis, Y is the variable), a stationary series fluctuates around the same level with the same spread throughout. If the waves get bigger over time, the variance is changing with time and the series is non-stationary.
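A common numerical check (an illustration, not part of the original answer) is the Augmented Dickey-Fuller test from statsmodels:
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Hypothetical series: white noise (stationary) vs. a random walk (non-stationary)
rng = np.random.default_rng(0)
stationary_series = rng.normal(size=200)
random_walk = np.cumsum(rng.normal(size=200))

for name, series in [("stationary", stationary_series), ("random walk", random_walk)]:
    p_value = adfuller(series)[1]
    # p-value < 0.05 -> reject the unit-root null -> series can be treated as stationary
    print(name, round(p_value, 4))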
What is entropy of the target variable in decision tree?
What happens when all observations belong to the same class? In that case the entropy is always zero: E = -(1 x log2(1)) = 0. If we have a dataset with, say, two classes, half yellow and half purple, the entropy is one: E = -((0.5 x log2(0.5)) + (0.5 x log2(0.5))) = 1.
Violation of Classic linear regression assumptions - Multicollinearity:
Multicollinearity arises when the independent variables are moderately or highly correlated with each other. In a model with correlated variables, it becomes a tough task to figure out the true relationship of a predictor with the response variable; in other words, it becomes difficult to find out which variable is actually contributing to predicting the response. Moreover, with correlated predictors the standard errors tend to increase, and with large standard errors the confidence intervals become wider, leading to less precise estimates of the slope parameters. Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are included in the model, so you can end up with an incorrect conclusion that a variable strongly or weakly affects the target variable; even dropping one correlated variable from the model changes the estimated coefficients of the others. That's not good! How to check: you can use a scatter plot to visualize correlation among variables, or compute the VIF (variance inflation factor). A VIF value <= 4 suggests no serious multicollinearity, whereas a value >= 10 implies serious multicollinearity. A correlation table also serves the purpose.
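A minimal VIF sketch with statsmodels, on hypothetical correlated predictors:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1,
                  "x2": 0.9 * x1 + rng.normal(scale=0.1, size=100),
                  "x3": rng.normal(size=100)})

X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)  # values >= 10 indicate serious multicollinearity (expected for x1 and x2 here)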