Z-Data Science Essentials
Explain what resampling methods are and why they are useful. Also explain their limitations.
Classical parametric statistical tests compare an observed statistic to a theoretical sampling distribution. Resampling is a data-driven, not theory-driven, methodology based on repeated sampling from within the same sample. Resampling refers to methods for doing one of the following:
1) Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of the available data (jackknifing) or by drawing randomly with replacement from the set of data points (bootstrapping).
2) Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests).
3) Validating models by using random subsets (bootstrapping, cross-validation).
Limitations: resampling can be computationally expensive, and its results are only as reliable as the original sample; if the sample is small, biased, or unrepresentative of the population, resampling cannot correct for that.
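A minimal numpy sketch of the first two ideas, using hypothetical data and group labels; the statistics chosen and the number of iterations are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)          # hypothetical sample

# Bootstrap: approximate the sampling distribution of the median
# by repeatedly drawing with replacement from the observed sample.
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(5000)
])
print("sample median:", np.median(data))
print("bootstrap 95% CI:", np.percentile(boot_medians, [2.5, 97.5]))

# Permutation test: is the observed difference in means between two groups
# larger than what shuffling the labels alone would produce?
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(0.3, 1.0, 100)
observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

perm_diffs = []
for _ in range(5000):
    shuffled = rng.permutation(pooled)                 # exchange the labels
    perm_diffs.append(shuffled[:100].mean() - shuffled[100:].mean())

p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print("permutation p-value:", p_value)
```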
What is Bias and Variance Tradeoff?
Conceptual definition:
Error due to bias: the error due to bias is the difference between the expected (or average) prediction of our model and the correct value we are trying to predict. Of course you only have one model, so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model-building process more than once: each time you gather new data and run a new analysis, creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off, in general, these models' predictions are from the correct value.
Error due to variance: the error due to variance is the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model-building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.
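A minimal simulation sketch of this idea, assuming a hypothetical sine-shaped true function and repeating the model-building process on fresh noisy samples; the polynomial degrees, sample sizes, and noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)   # hypothetical "correct" function
x0 = 0.5                                   # point at which we inspect predictions

def bias_and_variance(degree, n_repeats=200, n_points=30, noise=0.3):
    """Repeat the whole model-building process many times and collect
    the prediction at x0 from each realization of the model."""
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_points)
        y = true_f(x) + rng.normal(0, noise, n_points)   # a fresh noisy data set
        coefs = np.polyfit(x, y, degree)                 # fit a polynomial model
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias = preds.mean() - true_f(x0)    # how far off the models are on average
    variance = preds.var()              # how much predictions vary across models
    return bias, variance

for degree in (1, 15):                  # rigid (high-bias) vs flexible (high-variance)
    b, v = bias_and_variance(degree)
    print(f"degree {degree}: bias^2 = {b**2:.4f}, variance = {v:.4f}")
```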
Explain the difference between test sensitivity and test specificity in the context of a medical diagnosis.
In medical diagnosis, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).
What is the difference between "long" ("tall") and "wide" format data?
In most data mining / data science applications there are many more records (rows) than features (columns); such data is sometimes called "tall" (or "long") data. In some applications, like genomics or bioinformatics, you may have only a small number of records (patients), e.g. 100, but perhaps 20,000 observations for each patient. The standard methods that work for "tall" data will overfit such "wide" data, so special approaches are needed. The problem is not just reshaping the data between the two layouts, but avoiding false positives by reducing the number of features to find the most relevant ones.
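As a side note on the reshaping sense of the terms, here is a minimal pandas sketch with hypothetical gene-expression columns, converting between a wide layout (one column per measurement) and a long layout (one row per measurement):

```python
import pandas as pd

# Wide format: one row per patient, one column per (hypothetical) gene.
wide = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "gene_A": [2.3, 1.1, 0.7],
    "gene_B": [0.4, 3.2, 1.9],
})

# Long ("tall") format: one row per (patient, gene) measurement.
long = wide.melt(id_vars="patient_id", var_name="gene", value_name="expression")
print(long)

# And back to wide again.
wide_again = long.pivot(index="patient_id", columns="gene", values="expression")
print(wide_again)
```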
Explain type I and type II errors.
In statistics there are type I errors and type II errors. Relative to true positive and false positive terminology, a type I error occurs when you reject the null hypothesis (as false) when it is actually true, which by convention corresponds to a false positive. A type II error occurs when you accept the null hypothesis (as true) when it is actually false, which by convention corresponds to a false negative.
What is cardinality in database?
In the context of databases, cardinality refers to the uniqueness of data values contained in a column. High cardinality means that the column contains a large percentage of totally unique values. For example, high-cardinality column values are typically identification numbers, email addresses, or user names. Low cardinality means that the column contains a lot of "repeats" in its data range.
How would you screen for outliers and what should you do if you find one?
Interquartile range (IQR): an outlier is a data point that lies more than 1.5 IQRs below the first quartile (Q1) or more than 1.5 IQRs above the third quartile (Q3) in a given data set. Upper fence = Q3 + 1.5 * IQR; lower fence = Q1 - 1.5 * IQR. When you find outliers, you should not remove them without a qualitative assessment, because doing so alters the data. It is important to understand the context of the analysis and, most importantly, the "why" question: why is this outlier different from the other data points? The reason is critical. If an outlier is attributable to error you may throw it out, but if it signifies a new trend or pattern, or reveals a valuable insight into the data, you should retain it.
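A minimal numpy sketch of the IQR screen on hypothetical values; note that it only flags points, and the keep-or-drop decision stays a contextual judgment:

```python
import numpy as np

values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102.0])   # hypothetical data

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("fences:", (lower_fence, upper_fence))
print("flagged outliers:", outliers)    # investigate *why* before removing anything
```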
Is it better to have too many false positives, or too many false negatives? Explain.
It depends on the question as well as on the domain for which we are trying to solve the question. In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent when it is actually present, which sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. So here too many false positives are preferable to too many false negatives. For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false positives is a much more demanding task. So there we prefer too many false negatives over too many false positives.
What is Multiple Regression Analysis?
Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables). For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender. Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.
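A minimal multiple-regression sketch with statsmodels on hypothetical exam data; the column names, coefficients, and noise level are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: predict exam score from revision time, anxiety, attendance.
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "revision_hours": rng.uniform(0, 40, n),
    "test_anxiety":   rng.uniform(1, 10, n),
    "attendance_pct": rng.uniform(40, 100, n),
})
df["exam_score"] = (30 + 1.2 * df["revision_hours"] - 2.0 * df["test_anxiety"]
                    + 0.3 * df["attendance_pct"] + rng.normal(0, 5, n))

X = sm.add_constant(df[["revision_hours", "test_anxiety", "attendance_pct"]])
model = sm.OLS(df["exam_score"], X).fit()
print(model.summary())   # coefficients (relative contribution) and R-squared (overall fit)
```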
You created a predictive model for a quantitative outcome variable using multiple regression. How would you validate this model?
Proposed methods for model validation: 1) If the values predicted by the model are far outside the range of the response variable, this immediately indicates poor estimation or model inaccuracy. 2) If the values seem reasonable, examine the parameters; any of the following would indicate poor estimation or multicollinearity: coefficient signs opposite to expectations, unusually large or small coefficient values, or inconsistency when the model is fed new data. 3) Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a measure of model validity. 4) Use data splitting to form one dataset for estimating model parameters and a separate one for validating predictions. 5) Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).
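A minimal sketch of point 4 (data splitting) with scikit-learn on synthetic data; the metrics reported are R squared and MSE on the held-out split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # hypothetical predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

pred = model.predict(X_test)
print("held-out R^2:", r2_score(y_test, pred))
print("held-out MSE:", mean_squared_error(y_test, pred))
# Sanity checks from the list above: predictions within the response range,
# coefficient signs matching expectations, etc.
```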
Explain what precision and recall are. How do they relate to the ROC curve?
RECALL (a.k.a. SENSITIVITY, TRUE POSITIVE RATE) = True Positives / (True Positives + False Negatives)
PRECISION = True Positives / (True Positives + False Positives)
FALSE POSITIVE RATE (FPR) = False Positives / (False Positives + True Negatives)
SPECIFICITY = True Negatives / (True Negatives + False Positives) = 1 - FPR
The ROC curve plots sensitivity (recall) against the false positive rate, i.e. 1 - specificity (not precision), and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more representative picture of performance.
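A minimal scikit-learn sketch computing these quantities from hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score,
                             roc_auc_score, roc_curve, precision_recall_curve)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)                                  # hypothetical labels
y_score = np.clip(y_true * 0.4 + rng.uniform(0, 0.8, 1000), 0, 1)  # noisy scores
y_pred = (y_score >= 0.5).astype(int)                              # hard predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))

fpr, tpr, _ = roc_curve(y_true, y_score)                # ROC: TPR (recall) vs FPR
prec, rec, _ = precision_recall_curve(y_true, y_score)  # PR: precision vs recall
print("ROC AUC:  ", roc_auc_score(y_true, y_score))
```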
What is the difference between Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves in the context of Machine Learning?
Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance.
Explain what regularization is and how it works.
Regularization is the process of adding a penalty, controlled by a tuning parameter, to a model in order to induce smoothness and prevent overfitting. This is most often done by adding to the loss a constant multiple of a norm of the weight vector; the norm is usually the L1 norm (lasso) or the L2 norm (ridge), but in principle it can be any norm. The model is then trained to minimize the mean of the loss function plus this penalty over the training set.
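A minimal scikit-learn sketch contrasting the L2 (ridge) and L1 (lasso) penalties on synthetic data where only one feature carries signal; the alpha values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # more features than the signal needs
y = X[:, 0] * 3.0 + rng.normal(0, 1, 100)      # only the first feature matters

ols   = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)             # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty drives many to exactly 0

print("OLS   nonzero coefs:", np.sum(np.abs(ols.coef_) > 1e-6))
print("Ridge nonzero coefs:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("Lasso nonzero coefs:", np.sum(np.abs(lasso.coef_) > 1e-6))
```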
What is root cause analysis?
Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event's outcome, but is not a root cause.
What is selection bias, why is it important and how can you avoid it?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases were made up of a 60/20/15/5 split across 4 classes that actually occur in roughly equal numbers in the population, then a model may wrongly learn these skewed class proportions as a predictive signal. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting can be introduced to help deal with the situation.
Which of the following is not possible in a boosting algorithm? A) Increase in training error. B) Decrease in training error C) Increase in testing error D) Decrease in testing error E) Any of the above
Solution: A Boosting fits each new estimator to the errors left by the previous estimators, so training error always decreases and never increases; testing error, however, may increase or decrease.
Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. A) True B) False
Solution: A Boosting is an ensemble technique and can be applied to various base algorithms.
The data scientists at "BigMart Inc" have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and of each store have also been defined. The aim is to build a predictive model and find out the sales of each product at a particular store during a defined period. Which learning problem does this belong to? A) Supervised learning B) Unsupervised learning C) Reinforcement learning D) None
Solution: A Supervised learning is the machine learning task of inferring a function from labeled training data. Here historical sales data is our training data and it contains the labels / outcomes.
Which methodology does Decision Tree take to decide on first split? A) Greedy approach B) Look-ahead approach C) Brute force approach D) None of these
Solution: A The process of top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data.
Decision Trees are not affected by multicollinearity in features: A) TRUE B) FALSE
Solution: A The statement is true. For example, if there are two 90% correlated features, decision tree would consider only one of them for splitting.
While creating a Decision Tree, can we reuse a feature to split a node? A) Yes B) No
Solution: A Yes. A decision tree considers all the features again at every node, so a feature can be reused for splitting; a continuous feature, for instance, may be split at different thresholds at different nodes.
There are 24 predictors in a dataset. You build 2 models on the dataset: 1. Bagged decision trees and 2. Random Forest. Let the number of predictors used at a single split in the bagged decision trees be A and in the Random Forest be B. Which of the following statements is correct? A) A >= B B) A < B C) A >> B D) Cannot be said, since different iterations use different numbers of predictors
Solution: A Random Forest uses a subset of the predictors at each split, whereas bagged trees consider all the features at every split.
Boosting is said to be a good classifier because: A) It creates all ensemble members in parallel, so their diversity can be boosted. B) It attempts to minimize the margin distribution C) It attempts to maximize the margins on the training data D) None of these
Solution: B A: trees in boosting are built sequentially, not in parallel. B: boosting attempts to minimize the residual error, which reduces the margin distribution. C: as noted in B, margins are minimized, not maximized. Therefore B is correct.
While tuning the parameters "Number of estimators" and "Shrinkage parameter" / "Learning rate" of a boosting algorithm, which of the following relationships should be kept in mind? A) Number of estimators is directly proportional to the shrinkage parameter B) Number of estimators is inversely proportional to the shrinkage parameter C) Both have a polynomial relationship
Solution: B It is generally seen that smaller learning rates require more trees to be added to the model and vice versa. So when tuning parameters of boosting algorithm, there is a trade-off between learning rate and number of estimators.
To reduce underfitting of a Random Forest model, which of the following methods can be used? A) Increase the minimum samples per leaf B) Increase the depth of the trees C) Increase the value of minimum samples to split D) None of these
Solution: B Only option B is correct. A: increasing the minimum number of samples per leaf will reduce the depth of a tree, indirectly increasing underfitting. B: increasing depth will help reduce underfitting. C: increasing the number of samples required to split will have no effect, as the same information is given to the model.
Let's say we have m estimators (trees) in a boosted tree model. Now, how many intermediate trees will work on a modified (re-weighted) version of the data set? A) 1 B) m-1 C) m D) Can't say E) None of the above
Solution: B The first tree in boosted trees works on the original data, whereas all the rest work on modified version of the data.
Predictions of individual trees of bagged decision trees have higher correlation in comparison to individual trees of random forest. A) TRUE B) FALSE
Solution: A The statement is true: Random Forest considers only a random subset of the features at each split, so the individual trees it generates are less correlated with one another, whereas bagged trees consider all the features at every split and therefore tend to be more highly correlated.
Which splitting algorithm is better with categorical variable having high cardinality? A) Information Gain B) Gain Ratio C) Change in Variance D) None of these
Solution: B When a categorical variable has high cardinality, gain ratio is preferred over the other splitting criteria, because it penalizes splits that create many branches.
Which of the following statements is correct about XGBOOST parameters (may be more than one): A) Learning rate can go upto 10 B) Sub Sampling / Row Sampling percentage should lie between 0 to 1 C) Number of trees / estimators can be 1 D) Max depth can not be greater than 10
Solution: B and C A and D are wrong statements: the learning rate is typically kept between 0 and 1, and max depth can be greater than 10. B and C are correct.
Let's say we have m number of estimators (trees) in a XGBOOST model. Now, how many trees will work on bootstrapped data set? A) 1 B) m-1 C) m D) Can't say E) None of the above
Solution: C All the trees in XGBoost will work on bootstrapped data. Therefore, option C is true.
For parameter tuning in a boosting algorithm, which of the following search strategies may give best tuned model: A) Random Search. B) Grid Search. C) A or B D) Can't say
Solution: C For a given search space, random search picks hyperparameter values at random and generally requires much less time to converge. Grid search deterministically tries every combination in the grid; it is a brute-force approach and takes much more time to produce output. Either random search or grid search may give the best-tuned model; it depends on how much time and how many resources can be allocated to the search.
Generally, in terms of prediction performance from highest to lowest, which of the following arrangements are correct: A) Bagging>Boosting>Random Forest>Single Tree B) Boosting>Random Forest>Single Tree>Bagging C) Boosting>Random Forest>Bagging>Single Tree D) Boosting >Bagging>Random Forest>Single Tree
Solution: C Generally speaking, boosting algorithms will perform better than bagging algorithms. In terms of bagging vs random forest, random forest usually works better in practice because its trees are less correlated than those of bagging. And ensembles of algorithms are generally better than single models.
Random Forests (while solving a regression problem) have higher variance of predicted results in comparison to Boosted Trees (assumption: both the Random Forest and the Boosted Trees are fully optimized). A) True B) False C) Cannot be determined
Solution: C It completely depends on the data, the assumption cannot be made without data.
Which of the following is a mandatory data pre-processing step(s) for XGBOOST? A) Impute Missing Values B) Remove Outliers C) Convert data to numeric array / sparse matrix D) Input variable must have normal distribution E) Select the sample of records for each tree/ estimators
Solution: C XGBoost doesn't require most of these pre-processing steps; among the steps listed, only converting the data to a numeric array / sparse matrix is required.
Why do we prefer information gain over accuracy when splitting? A) Decision Tree is prone to overfit and accuracy doesn't help to generalize B) Information gain is more stable as compared to accuracy C) Information gain chooses more impactful features closer to root D) All of these
Solution: D All the above options are correct
Suppose we have missing values in our data. Which of the following method(s) can help us to deal with missing values while building a decision tree? A) Let it be. Decision Trees are not affected by missing values B) Fill dummy value in place of missing, such as -1 C) Impute missing value with mean/median D) All of these
Solution: D All the options are correct.
Below is a list of parameters of Decision Tree. In which of the following cases higher is better? A) Number of samples used for split B) Depth of tree C) Samples for leaf D) Can't Say
Solution: D For all three options A, B and C, a higher parameter value does not necessarily improve performance. For example, with a very large tree depth the resulting tree may overfit the data and fail to generalize, while with a very small depth the tree may underfit. So we can't say for sure that "higher is better".
Which of the following tree based algorithm uses some parallel (full or partial) implementation? A) Random Forest B) Gradient Boosted Trees C) XGBOOST D) Both A and C E) A, B and C
Solution: D Only Random Forest and XGBoost have parallel implementations. Random Forest is very easy to parallelize: all trees grow in parallel and their outputs are ensembled at the end. XGBoost cannot run multiple trees in parallel the way Random Forest does, because the predictions from each tree are needed to update the gradients for the next one; instead it parallelizes WITHIN a single tree, creating branches independently.
In Random Forest, which of the following is randomly selected? A) Number of decision trees B) Features to be taken into account when building a tree C) Samples to be given to train individual tree in a forest D) B and C E) A, B and C
Solution: D Option A is false because the number of trees has to be decided when building the forest; it is not random. Options B and C are true.
Which of the following are the disadvantage of Decision Tree algorithm? A) Decision tree is not easy to interpret B) Decision tree is not a very stable algorithm C) Decision Tree will over fit the data easily if it perfectly memorizes it D) Both B and C
Solution: D Option A is false, as decision trees are very easy to interpret. Option B is true, as decision trees are highly unstable models. Option C is true, as a decision tree also tends to memorize noise.
Assume everything else remains same, which of the following is the right statement about the predictions from decision tree in comparison with predictions from Random Forest? A) Lower Variance, Lower Bias B) Lower Variance, Higher Bias C) Higher Variance, Higher Bias D) Lower Bias, Higher Variance
Solution: D The predicted values in Decision Trees have low Bias but high Variance when compared to Random Forests. This is because random forest attempts to reduce variance by bootstrap aggregation.
Imagine a two variable predictor space having 10 data points. A decision tree is built over it with 5 leaf nodes. The number of distinct regions that will be formed in predictors space? A) 25 B) 10 C) 2 D) 5
Solution: D The predictor space will be divided into 5 regions. Therefore, option D is correct.
In which of the following application(s), a tree based algorithm can be applied successfully? A) Recognizing moving hand gestures in real time B) Predicting next move in a chess game C) Predicting sales values of a company based on their past sales D) A and B E) A, B, and C
Solution: E Option E is correct as we can apply tree based algorithm in all the 3 scenarios.
Give an example of how you would use experimental design to answer a question about user behavior.
Step 1: Formulate the research question: what is the effect of page load time on user satisfaction ratings?
Step 2: Identify variables: we identify cause and effect. Independent variable: page load time. Dependent variable: user satisfaction rating.
Step 3: Generate the hypothesis: lower page load time will increase user satisfaction ratings for the web page. The factor we analyze is page load time.
Step 4: Determine the experimental design. We consider experimental complexity, i.e. vary one factor at a time or multiple factors at once, in which case we use a factorial design (2^k design). A design is also selected based on the type of objective (comparative, screening, response surface) and the number of factors. Here we also decide between within-participants, between-participants, and mixed designs. For example: there are two versions of a page, one with the Buy button (call to action) on the left and the other with the button on the right. Within-participants design: both user groups see both versions. Between-participants design: one group of users sees version A and the other group sees version B.
Step 5: Develop the experimental task and procedure: a detailed description of the steps involved in the experiment, the tools used to measure user behavior, and the goals and success metrics should be defined. Collect quantitative data about user engagement to allow statistical analysis.
Step 6: Determine manipulation and measurements. Manipulation: one level of the factor will be controlled and the other will be manipulated. We also identify the behavioral measures: latency, the time between a prompt and the occurrence of a behavior (how long it takes a user to click Buy after being presented with products); frequency, the number of times a behavior occurs (the number of times a user clicks on a given page within a time window); duration, the length of time a specific behavior lasts (the time taken to add all products); intensity, the force with which a behavior occurs (how quickly the user purchased a product).
Step 7: Analyze the results: examine the user behavior data and decide whether it supports or contradicts the hypothesis, e.g. how the satisfaction ratings of the majority of users compare across page load times (see the sketch below).
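A minimal analysis sketch for Step 7, assuming a between-participants design and hypothetical satisfaction ratings for a fast-loading and a slow-loading version of the page; a two-sample t-test is one reasonable choice here, not the only one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical satisfaction ratings (1-10): group A saw the fast-loading page,
# group B the slow-loading page (between-participants design).
ratings_fast = rng.normal(7.8, 1.2, 250)
ratings_slow = rng.normal(7.3, 1.3, 250)

t_stat, p_value = stats.ttest_ind(ratings_fast, ratings_slow)
print(f"mean difference: {ratings_fast.mean() - ratings_slow.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # small p supports the hypothesis
```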
What do you need to calculate statistical power?
Test value (the value to compare the sample average to), sample average (the value measured from the sample or expected from it), sample size (the size of the sample or the desired number of respondents), the standard deviation of the sample, and the significance level (alpha, the probability of incorrectly rejecting the null hypothesis that there is no difference in the average values; alpha is the threshold a p-value is compared against, not the p-value itself). An alpha of 5% corresponds to a 95% confidence level. Together, the test value, sample average, and standard deviation determine the effect size, which combines with the sample size and alpha to give the power.
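A minimal sketch using statsmodels' power calculator for a two-sample t-test (assuming statsmodels is available); the standardized effect size of 0.5 and the other numbers are illustrative assumptions:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Standardized effect size d = (difference between the test value and the
# sample average) / standard deviation of the sample.
effect_size = 0.5

# Power achieved with 100 respondents per group at alpha = 0.05.
achieved = analysis.solve_power(effect_size=effect_size, nobs1=100, alpha=0.05)
print(f"power with n=100 per group: {achieved:.2f}")

# Sample size per group required to reach 80% power.
needed_n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"n per group for 80% power: {needed_n:.1f}")
```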
What is RMSE?
The root-mean-square error (RMSE) is a measure of the differences between values predicted by a model or an estimator and the values actually observed. RMSE is used to evaluate regression models. The lower the RMSE, the better the model. The formula is : rmse = (sqrt(sum(square(predicted_values - actual_values)) / number of observations))
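A minimal numpy sketch of the formula above on hypothetical predicted and observed values:

```python
import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical observed values
predicted = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical model predictions

# Square root of the mean squared difference between predictions and observations.
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(f"RMSE = {rmse:.3f}")
```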
What is a recommendation engine? How does it work?
A recommendation engine is a system that suggests items (products, movies, articles, etc.) a user may be interested in. Such engines typically produce recommendations in one of two ways: collaborative filtering or content-based filtering. Collaborative filtering methods build a model based on users' past behavior (items previously purchased, movies viewed and rated, etc.) and on decisions made by the current and other users. This model is then used to predict items (or ratings for items) that the user may be interested in. Content-based filtering methods use features of an item to recommend additional items with similar properties. These approaches are often combined in hybrid recommender systems.
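A minimal user-based collaborative filtering sketch on a tiny hypothetical rating matrix; the similarity measure (cosine) and the weighting scheme are simple illustrative choices:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, cols = items, 0 = unrated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def predict_rating(user, item):
    """Score an unrated item as a similarity-weighted average of
    other users' ratings for that item (user-based collaborative filtering)."""
    others = [o for o in range(len(R)) if o != user and R[o, item] > 0]
    sims = np.array([cosine_sim(R[user], R[o]) for o in others])
    ratings = np.array([R[o, item] for o in others])
    return (sims @ ratings) / (sims.sum() + 1e-9)

print("predicted rating of item 2 for user 0:", round(predict_rating(0, 2), 2))
```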
What is statistical power?
Wikipedia defines the statistical power (or sensitivity) of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. Put another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is one).
Explain what is overfitting (aka high variance and low bias) and how would you control for it.
Your model is overfitting your training data when you see that the model performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples. If your model is overfitting the training data, it makes sense to take actions that reduce model flexibility. To reduce model flexibility, try the following: feature selection (consider using fewer feature combinations, decreasing n-gram size, and decreasing the number of numeric attribute bins), and increasing the amount of regularization used.
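A minimal scikit-learn sketch of detecting overfitting by comparing training and held-out accuracy, and of controlling it by constraining model flexibility (here via tree depth); the dataset and depth values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 4):   # None = fully grown (flexible), 4 = constrained (less flexible)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
# A large gap between train and test accuracy signals overfitting;
# constraining depth (or adding regularization) narrows it.
```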
Explain what is underfitting (aka High Bias) and how would you control for it.
Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y). Poor performance on the training data could be because the model is too simple (the input features are not expressive enough) to describe the target well. Performance can be improved by increasing model flexibility. To increase model flexibility, try the following: add new domain-specific features and more feature Cartesian products, change the types of feature processing used (e.g., increase n-gram size), and decrease the amount of regularization used.