Data Science Review Questions

Generative vs. discriminative models

Generative models model the joint distribution of inputs and labels (from which the conditional can be derived), while discriminative models model only the conditional distribution of the label given the inputs. Generative examples: Naive Bayes, hidden Markov models. Discriminative examples: linear/logistic regression. Discriminative models are generally better at pure prediction tasks, but generative models tell you more about the underlying data-generating process (DGP) and can also be used to impute missing values, generate simulated data, or compress the data.

Good data vs. good model

Good data is definitely more important than a good model, which is why organizations care so much about collecting the right type of data and maintaining its integrity. Good data can lead to a good model, but with bad data there is only so well your model can perform.

Visualization principles (Tufte)

Good visualization...
1. consists of complex ideas communicated with clarity, precision, and efficiency.
2. gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
3. is nearly always multivariate.
4. requires telling the truth about the data.
General tips:
- "Show the data."
- Encourage the eye to compare different pieces of data.
- Simplify by maximizing the "data-ink ratio."
- Leverage color, shapes, and facets to highlight multivariate data.
- Annotate your figures with context.

How to handle data missingness

Hard to argue with an approach that does the following: 1. quantify the completeness of covariate data 2. present and discuss patterns of or reasons for missing data 3. provide details about your approach for handling missing data

Resampling methods

Repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. Example: repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. The most common methods are cross-validation and the bootstrap. Cross-validation: random sampling without replacement; used for evaluating model performance and for model selection (selecting the appropriate level of flexibility). Bootstrap: random sampling with replacement; mostly used to quantify the uncertainty associated with a given estimator or statistical learning method.
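As a rough illustration of the bootstrap, the sketch below estimates the standard error of a sample mean by resampling with replacement. The data values, the function name `bootstrap_se_of_mean`, and the number of resamples are made up for the example.

```python
import random
import statistics

def bootstrap_se_of_mean(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the sample mean by bootstrapping."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n observations WITH replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 4.0]  # illustrative values
boot_se = bootstrap_se_of_mean(data)
analytic_se = statistics.stdev(data) / len(data) ** 0.5
print(f"bootstrap SE: {boot_se:.3f}, analytic SE: {analytic_se:.3f}")
```

For the mean, the bootstrap estimate should land close to the textbook formula s/√n; the method becomes more useful for estimators that have no simple formula.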

Correlation Tests

If both continuous variables are normally distributed, we can use the (parametric) Pearson correlation coefficient (r) to test the strength of the linear relationship between them. If one or both is not normally distributed, we instead use the (non-parametric) Spearman's rho to test the strength of the monotonic relationship between them. Monotonicity is defined as (1) as the value of one variable increases, so does the value of the other variable; or (2) as the value of one variable increases, the other variable value decreases. This is less strict than linear. Spearman's can be used for ordinal, interval or ratio variables.
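To make the linear vs. monotonic distinction concrete, here is a small sketch that computes both coefficients by hand. It assumes no tied values (so ranks are simply 1..n), and the data are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n, assuming there are no ties in the data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but not linear
r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because y increases whenever x does, Spearman's rho is 1 even though the relationship is curved and Pearson's r falls short of 1.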

Missing data

If data are missing completely at random (MCAR): deletion introduces no bias, but decreases the power of the analysis by decreasing the effective sample size. Recommended: KNN imputation, Gaussian mixture imputation, MICE (multiple imputation by chained equations, available as an R package).

Inliers

Inlier:
- Observation lying within the general distribution of other observed values
- Doesn't perturb the results but is non-conforming and unusual
- Simple example: observation recorded in the wrong unit (°F instead of °C)
Identifying inliers:
- Mahalanobis distance: used to calculate the distance between two random vectors; unlike Euclidean distance, it accounts for correlations
- Discard them

Cross-validation

Model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. Mainly used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to define a data set to test the model in the training phase (i.e. validation data set) in order to limit problems like overfitting, and get an insight on how the model will generalize to an independent data set.

A/B testing

Two-sample hypothesis testing based on a randomized experiment with two variants: A (control) and B (variation). Example from user-experience design: identify changes to web pages that increase clicks on a banner. Current website: control, corresponding to the null hypothesis (the change has no effect). New version: variation, corresponding to the alternative hypothesis.
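One common way to analyze such an experiment is a two-proportion z-test on click-through rates. The sketch below assumes large samples (normal approximation) and uses made-up click counts; the function name is hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative numbers: 200/5000 clicks on A (control), 260/5000 on B (variation).
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value is evidence against the null hypothesis that both versions have the same click rate.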

Normal distribution

A z-score indicates how many standard deviations from the mean an observation is. Use the Shapiro-Wilk test to determine whether a single continuous variable is normally distributed (the null hypothesis is that it is).

Mixture model

A model used for representing the presence of subgroups within an overall population, without requiring that the data identify which sub-population an observation belongs to. Frequently makes use of latent variables.

Random forest

- Underlying principle: several weak learners combined provide a strong learner
- Builds several decision trees on bootstrapped training samples of data
- On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates, out of all p predictors
- Rule of thumb: at each split m=√p
- Predictions: by majority vote
Why is it good?
- Very good performance (decorrelates the trees)
- Can model non-linear class boundaries
- Generalization error for free: no cross-validation needed, since the out-of-bag error gives an estimate of the generalization error as the trees are built
- Generates variable importance

Bayesian vs. Frequentist statistics

Frequentists use the likelihood to approximate a sampling distribution. Bayesians modify the likelihood based on prior belief to create a posterior distribution.

Why stepwise regression is bad

- The choice of predictive variables is carried out using an automatic procedure
- Usually, it takes the form of a sequence of F-tests, t-tests, adjusted R-squared, AIC, BIC
- At any given step, the model is fit using unconstrained least squares
- Can get stuck in local optima
- Better: Lasso

Generalized Linear Models

An extension of linear models where residuals are not assumed to be normally distributed. Utilizes some link function. Types include logistic regression (predict binary outcomes), multinomial logistic regression (predict the levels of a factor), and Poisson regression (predict count data where DGP assumed to follow a Poisson distribution.)

Linear regression assumptions

1) The data used in fitting the model is representative of the population. 2) The true underlying relation between x and y is linear. 3) Variance of the residuals is constant (homoscedastic, not heteroscedastic). 4) The residuals are independent. 5) The residuals are normally distributed. What we can do with each assumption: Predict y from x: 1) + 2) Estimate the standard error of predictors: 1) + 2) + 3) Get an unbiased estimation of y from x: 1) + 2) + 3) + 4) Make probability statements, hypothesis testing involving slope and correlation, confidence intervals: 1) + 2) + 3) + 4) + 5)

Decision tree

1. Take the entire data set as input.
2. Search for a variable and condition to split on that maximizes the "separation" or purity of the classes. A split is any test that divides the data in two (e.g. if variable2>10).
3. Apply the split to the input data (divide step).
4. Re-apply steps 1 to 3 to the divided data.
5. Stop when you meet some stopping criteria.
(Optional) Clean up the tree when you went too far doing splits (called pruning, to avoid overfitting).
Finding a split: methods vary, from greedy search (e.g. C4.5) to randomly selecting attributes and split points (random forests).
Purity measure: information gain, Gini impurity, Chi-squared values.
Stopping criteria: methods vary, from minimum size to a particular confidence in prediction or a purity threshold.
Pruning: reduced error pruning, out-of-bag error pruning (ensemble methods).
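The split-search step can be sketched as a greedy scan over candidate thresholds on a single numeric variable, scoring each by weighted Gini impurity. This is a simplified illustration, not the exact procedure of C4.5 or any particular library; the data and function names are invented.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' test."""
    n = len(values)
    best = (None, float("inf"))
    # Every distinct value except the largest is a candidate threshold.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

variable2 = [2, 4, 6, 8, 12, 14]
classes = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(variable2, classes)
print(f"split on variable2 <= {threshold}, weighted Gini = {impurity}")
```

A full tree would apply this search across all variables and then recurse on the two resulting subsets until a stopping criterion is met.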

Bias-variance trade-off

A fundamental trade-off in models where bias is equivalent to underfitting (the model is overly concerned with its own erroneous assumptions) and variance is equivalent to overfitting (the model is fit too well to one set of data at the cost of generalization to new data it has not seen).

Logistic regression

A generalized linear model that fits an equation of the form 1/(1 + e^-(beta_0 + beta_1x_1 + ... + beta_nx_n)). Used for predicting binary outcome variables, and extends to a multinomial framework for multilevel factor variables. Outputs a predicted probability of a "positive" (or 1) response, which can then be converted to 0 or 1 based on a specified cut-off level. Useful as a baseline for binary classification tasks, as it is widely used and easily interpretable compared to other methods like KNN or random forest.
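A minimal sketch of the prediction step, assuming the coefficients have already been fitted. The coefficient values and the 0.5 default cut-off are arbitrary choices for illustration.

```python
import math

def predict_probability(x, beta_0, betas):
    """Logistic function applied to the linear predictor beta_0 + sum(beta_i * x_i)."""
    linear = beta_0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-linear))

def classify(probability, cutoff=0.5):
    """Convert a predicted probability into a 0/1 label using a cut-off."""
    return 1 if probability >= cutoff else 0

# Hypothetical fitted coefficients and one observation with two predictors.
p = predict_probability([2.0, 1.0], beta_0=-1.0, betas=[0.8, -0.5])
print(f"P(y = 1) = {p:.3f}, label at 0.5 cut-off: {classify(p)}")
```

Raising the cut-off trades false positives for false negatives, so the same probability can map to different labels.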

Law of large numbers

A theorem that describes the result of performing the same experiment a large number of times. Forms the basis of frequentist thinking. It says that the sample mean, the sample variance and the sample standard deviation converge to what they are trying to estimate. Example: roll a die; the expected value is 3.5. For a large number of rolls, the average converges to 3.5.
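The die example can be simulated directly. This sketch uses a seeded pseudo-random generator; the sample sizes are arbitrary.

```python
import random

def average_die_roll(n_rolls, rng):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

rng = random.Random(42)
for n in (10, 1_000, 100_000):
    print(f"{n:>7} rolls: average = {average_die_roll(n, rng):.3f}")
```

The averages for small n can be far from 3.5, while those for large n should cluster tightly around it.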

Type 1/Type 2 error

A type 1 error is rejecting the null hypothesis when it is actually true (a false positive), while a type 2 error is failing to reject the null hypothesis when the alternative hypothesis is actually true (a false negative).

Latent variable

A variable that is not directly observed, but inferred from other variables that are observed through a mathematical model. Many examples in psychology and economics including modeling personal traits from clinical trial observations and employee traits like happiness, business confidence, and morale.

T-test

Any statistical hypothesis test where the test statistic is assumed to follow a Student's t-distribution under the null hypothesis. One sample: determines whether the mean of a population is different from a value specified in the null hypothesis. Two sample: tests the null hypothesis that two population means are equal. Known as Student's t-test when the variances of the two populations are assumed to be equal, and Welch's t-test if they are not.

Ridge Regression vs. Lasso

Both introduce a regularization term that penalizes large regression coefficients in order to control overfitting, but ridge regression penalizes the sum of squared coefficients (L2 penalty) whereas lasso penalizes the sum of absolute coefficients (L1 penalty). The consequence of this is that ridge regression will tend to shrink the large weights strongly while shrinking the smaller weights only a little, and it never sets them exactly to zero. In LASSO regression, depending on the value of the tuning parameter λ (usually chosen by cross-validation), some feature weights can reach exactly zero, meaning that these features will not be included in the model at all. That is the built-in feature selection of LASSO regression. In other words, ridge regression will try to find a good model with coefficients as small as possible, while LASSO regression will try to find a model with as few features as possible.
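The difference in shrinkage behavior is easiest to see in a one-coefficient simplification, where both penalized problems have closed-form solutions. The sketch below assumes a single unpenalized estimate `b` and the scaled objectives given in the docstrings; it is not a general solver.

```python
def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero

lam = 0.5
for b in (2.0, 0.3, -1.0):
    print(f"b = {b:+.1f}: ridge -> {ridge_shrink(b, lam):+.3f}, "
          f"lasso -> {lasso_shrink(b, lam):+.3f}")
```

Ridge scales every coefficient toward zero without reaching it, while lasso subtracts a fixed amount and zeroes out anything smaller than the threshold, which is where the feature selection comes from.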

Support Vector Machines

Classification or regression algorithm attempting to separate the data into groups using hyperplanes by locating a mi

Non-Gaussian distributions

Common non-Gaussian distributions include the binomial, Poisson, and geometric distributions (discrete), as well as exponential family distributions like the beta, gamma, and Dirichlet that are used often in Bayesian statistics.

Correlation

Correlation allows us to determine the strength of the linear relationship that exists between two variables, but we cannot make any causal inferences from it.

LOOCV

Leave-one-out cross-validation: the model is trained on the whole data set except a single observation, then tested on that observation; this is repeated n times, once per row in the data. Not generally advised because we average the outputs of n fitted models, each of which is trained on an almost identical set of observations, making the outputs highly correlated. Since the variance of a mean of quantities increases when the correlation between these quantities increases, the test error estimate from LOOCV has higher variance than the one obtained with k-fold cross-validation.

F-test

Evaluates the null hypothesis that a model where all regression coefficients are equal to 0 (intercept only) fits just as well as our model, versus the alternative hypothesis that our model fits significantly better. If the null is rejected, this indicates that the R^2 value is reliable. Generally, when the number of data points is large, an F-statistic that is only a little bit larger than 1 is already sufficient to reject the null hypothesis. Conversely, if the number of data points is small, a large F-statistic is required to be able to ascertain that there may be a relationship between predictor and response variables.

Life cycle of a data science project

1. Data acquisition: acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction routines should be in place, and new sources, once identified, would be acquired following the established processes.
2. Data preparation: also called data wrangling; cleaning the data and shaping it into a suitable form for later analyses. Involves exploratory data analysis and feature extraction.
3. Hypothesis & modelling: like in data mining, but with all the data instead of samples. Applying machine learning techniques to all the data. A key sub-step is model selection. This involves preparing a training set for model candidates, and validation and test sets for comparing model performances, selecting the best performing model, gauging model accuracy and preventing overfitting.
4. Evaluation & interpretation: steps 2 to 4 are repeated a number of times as needed; as the understanding of data and business becomes clearer and results from initial models and hypotheses are evaluated, further tweaks are performed. These may sometimes include step 5 and be performed in a pre-production environment.
5. Deployment
6. Operations: regular maintenance and operations. Includes performance tests to measure model performance, and can alert when performance goes beyond a certain acceptable threshold.
7. Optimization: can be triggered by failing performance, by the need to add new data sources and retrain the model, or by the need to deploy new versions of an improved model.
Note: with increasing maturity and well-defined project goals, pre-defined performance criteria can help evaluate the feasibility of the data science project early enough in the data-science life cycle. This early comparison helps the team refine hypotheses, discard the project if non-viable, or change approaches.

R^2 Value

Describes the proportion of the total variation in the response variable that is explained by the model. Since it never decreases when adding new variables, adjusted R-squared incorporates the model's degrees of freedom for a more accurate estimate. Can be useful to validate a regression model.
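Both quantities follow from the residual and total sums of squares. The sketch below uses the standard formulas R² = 1 − SS_res/SS_tot and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), with invented observed and predicted values and an assumed single predictor.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R^2 for the number of predictors used in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

y_true = [3.0, 5.0, 7.0, 9.0, 11.0]   # illustrative observations
y_pred = [2.8, 5.3, 6.9, 9.4, 10.6]   # illustrative model predictions
r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=1)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```

The adjusted value is always at most R², and the gap widens as predictors are added without a matching gain in fit.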

Confounding variables

Extraneous variable in a statistical model that correlates directly or inversely with both the dependent and the independent variable. A spurious relationship is a perceived relationship between an independent variable and a dependent variable that has been estimated incorrectly. The estimate fails to account for the confounding factor.

Cleaning data

First: detect anomalies and contradictions.
Common issues:
- Tidy data (Hadley Wickham paper):
  - column names are values, not names, e.g. <15-25, >26-45...
  - multiple variables are stored in one column, e.g. m1534 (male aged 15-34)
  - variables are stored in both rows and columns, e.g. tmax, tmin in the same column
  - multiple types of observational units are stored in the same table, e.g. song dataset and rank dataset in the same table
  - a single observational unit is stored in multiple tables (can be combined)
- Data-type constraints: values in a particular column must be of a particular type: integer, numeric, factor, boolean
- Range constraints: numbers or dates fall within a certain range. They have minimum/maximum permissible values
- Mandatory constraints: certain columns can't be empty
- Unique constraints: a field must be unique across a dataset: each person must have a unique SS number
- Set-membership constraints: the values for a column must come from a set of discrete values or codes: a gender must be female, male
- Regular expression patterns: for example, a phone number may be required to have the pattern (999)999-9999
- Misspellings
- Missing values
- Outliers
- Cross-field validation: certain conditions that utilize multiple fields must hold. For instance, in laboratory medicine, the percentages of the different white blood cell types must sum to 100. In a hospital database, a patient's date of discharge can't be earlier than the admission date
Clean the data using:
- Regular expressions: misspellings, regular expression patterns
- KNN-impute and other missing-value imputation methods
- Coercing: data-type constraints
- Melting: tidy data issues
- Date/time parsing
- Removing observations

Types of data missingness

MCAR: Missing completely at random (no data related to missingness). MAR: Missing at random (observed data related to missingness). MNAR: Missing not at random (unobserved data related to missingness). Techniques: Stratify missingness by different factor levels of the dataset to judge whether associations exist. Little's MCAR test tests the null hypothesis that the data are MCAR against the alternative hypothesis that they are not. MNAR is harder to detect systematically.

Missing data imputation

Mean imputation is a bad practice in general because it leads to an underestimate of the standard deviation and distorts relationships between variables by "pulling" estimates of the correlation toward zero. Other options: median imputation, KNN imputation, EM imputation, etc. NEVER impute the response variable!
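The effect on the standard deviation can be checked with a few lines. The observed values and the number of missing entries below are made up for illustration.

```python
import statistics

observed = [4.0, 7.0, 5.0, 9.0, 6.0, 8.0]  # illustrative observed values
n_missing = 4                               # pretend four values were missing

# Mean imputation: fill every missing entry with the observed mean.
mean_value = statistics.fmean(observed)
imputed = observed + [mean_value] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)
print(f"SD of observed values:    {sd_observed:.3f}")
print(f"SD after mean imputation: {sd_imputed:.3f}")
```

The imputed entries add no spread around the mean but do increase the sample size, so the estimated standard deviation shrinks.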

Wilcoxon Rank Sum Test

Non-parametric equivalent of the two-sample t-test, used to judge whether a statistically significant difference exists between the distributions of two independent samples.

Kruskal-Wallis Test

Non-parametric extension of ANOVA that tests whether a statistically significant difference exists between the distributions of a continuous variable across the levels of a categorical variable with more than 2 levels. Like ANOVA, it doesn't tell you which category the difference is in, just that it exists.

Cosine similarity

Numeric observations are treated as vectors in n-dimensional space and the cosine of the angle between them is calculated. In general the similarity lies in [-1,1]; for vectors with non-negative components (e.g. word counts) it lies in [0,1], where 0 means the vectors are orthogonal and 1 means the vectors point in the same direction.
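A minimal sketch of the computation, using the standard definition cos(θ) = u·v / (‖u‖‖v‖). The example vectors are arbitrary, and zero-length vectors are not handled.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))   # same direction, different length
print(cosine_similarity([1, 0], [-1, 0]))  # opposite directions
```

Note that the second pair scores as maximally similar even though the vectors are not identical: only direction matters, not magnitude.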

Outliers

Outliers:
- An observation point that is distant from other observations
- Can occur by chance in any distribution
- Often, they indicate measurement error or a heavy-tailed distribution
- Measurement error: discard them or use robust statistics
- Heavy-tailed distribution: high skewness, can't use tools assuming a normal distribution
- Two-sigma rule (normally distributed data): 1 in 22 observations will differ by at least twice the standard deviation from the mean
- Three-sigma rule: 1 in 370 observations will differ by at least three times the standard deviation from the mean
Identifying outliers:
- No rigid mathematical method
- Subjective exercise: be careful
- Boxplots
- QQ plots (sample quantiles vs. theoretical quantiles)
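The "1 in 22" and "1 in 370" figures come from the two-sided tail probability of a standard normal distribution, which can be computed with the complementary error function. The helper name below is made up for the example.

```python
import math

def two_sided_tail_probability(k):
    """P(|Z| > k) for a standard normal Z, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3):
    p = two_sided_tail_probability(k)
    print(f"{k}-sigma: P = {p:.5f}, about 1 in {1 / p:.0f}")
```

These rates only hold for normally distributed data; heavy-tailed data will produce such observations far more often.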

Parametric vs. non-parametric statistics

Parametric methods assume some underlying distribution of the data (often Gaussian), whereas non-parametric methods make no such assumption. Some parametric methods are linear regression, t-test, and ANOVA, while non-parametric methods include k-nearest-neighbors and kernel density estimation. Semiparametric methods exist that use both parametric and nonparametric components. In general, parametric methods are more easily interpretable.

K-fold cross validation

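Partition the data into k folds; the model is trained on k-1 folds and tested on the held-out fold, rotating so that each fold serves as the test set once, and the k test errors are averaged. With k=5 or k=10 it will usually give empirically better results than LOOCV.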
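The partitioning itself can be sketched as a generator of train/test index lists. This is a simplified version (shuffle once, then deal indices into k folds); the function name and seed are assumptions for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal shuffled indices into k folds of (nearly) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print(f"train on {sorted(train)}, test on {sorted(test)}")
```

Each observation appears in exactly one test fold, so every row is used for evaluation once and for training k-1 times.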

Metrics in classification

Recall/sensitivity (true positive rate: TP/(TP + FN)). Specificity (true negative rate: TN/(TN + FP)). Precision (positive predictive value: TP/(TP + FP)). Accuracy ((TP + TN)/(TP + TN + FP + FN)) and its complement, the misclassification rate. ROC/AUC. Optimal metric will depend on domain.
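These metrics can all be computed from the four cells of a confusion matrix. The sketch below uses the standard definitions (specificity as TN / (TN + FP)) and made-up counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "recall": tp / (tp + fn),        # true positive rate / sensitivity
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
    }

m = classification_metrics(tp=40, fp=10, tn=45, fn=5)  # illustrative counts
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```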

Curse of dimensionality

Refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. Common theme: when the number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse. This is an issue for any method that requires statistical significance: the amount of data needed to support the result grows exponentially with the dimensionality. It is also an issue when algorithms don't scale well on high dimensions, typically when O(n^kn). Points become far apart from one another and difficult to organize.

Confidence intervals

Represent the uncertainty surrounding point estimates of a parameter. A 95% confidence interval is interpreted as follows: if the experiment were repeated infinitely many times, about 95% of the intervals constructed this way would contain the true parameter value. If the confidence interval of a model parameter contains 0, we cannot conclude that the parameter is significantly different from 0, given that it does not have a demonstrably positive or negative value, but rather could be either depending on the particular experiment.

Robust vs. accurate models

Robust models are optimized for different types of data, perhaps at the cost of accuracy. General rule of thumb: Occam's razor to avoid overfitting: simpler models are preferred if more complex models do not significantly improve the quality of the description for the observations. Ensemble learning helps bias/variance trade-off.

Robustness

Robustness: - Statistics with good performance even if the underlying distribution is not normal - Statistics that are not affected by outliers - A learning algorithm that can reduce the chance of fitting noise is called robust - Median is a robust measure of central tendency, while mean is not - Median absolute deviation is also more robust than the standard deviation
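A quick way to see the difference is to replace one value with an extreme one and compare how each statistic reacts. The data are invented, and `mad` here is the raw median absolute deviation without any scaling constant.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

clean = [10, 11, 12, 13, 14]
contaminated = [10, 11, 12, 13, 140]  # last value replaced by an outlier

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(f"{name:>12}: mean = {statistics.fmean(data):.1f}, "
          f"median = {statistics.median(data)}, "
          f"SD = {statistics.stdev(data):.1f}, MAD = {mad(data)}")
```

The mean and standard deviation jump with a single bad value, while the median and MAD stay where they were.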

Metrics in regression

Root mean square error (RMSE), mean absolute error (MAE), root mean squared logarithmic error (RMSLE). RMSE is useful when large errors are particularly undesirable, while MAE is more robust to outliers. RMSLE penalizes under-prediction more heavily than over-prediction, and penalizes differences less when both the predicted and actual values are huge numbers. Optimal metric will depend on domain.
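The three metrics can be written in a few lines each. This sketch uses the common RMSLE definition based on log(1 + y), which assumes non-negative values; the example numbers are arbitrary.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (values must be non-negative)."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )

actual = [100.0]
under, over = [50.0], [150.0]  # same absolute error in both directions
print(f"RMSE  under: {rmse(actual, under):.1f}, over: {rmse(actual, over):.1f}")
print(f"RMSLE under: {rmsle(actual, under):.3f}, over: {rmsle(actual, over):.3f}")
```

RMSE treats the two predictions identically, whereas RMSLE scores the under-prediction as the worse error.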

Statistical power

The probability that the null hypothesis will be rejected given that it is false (i.e. the alternative hypothesis is true). As power increases, the chance of a type 2 error decreases. Used in experimental design to calculate the smallest sample size needed to detect an effect of a given size in a given study. Can also be used to compare tests (e.g. parametric vs. non-parametric tests of the same hypothesis).

Biases and how to control for them

Selection bias:
- An online survey about computer use is likely to attract people more interested in technology than is typical
Under-coverage bias:
- Sampling too few observations from a segment of the population
Survivorship bias:
- Observations at the end of the study are a non-random set of those present at the beginning of the investigation
- In finance and economics: the tendency for failed companies to be excluded from performance studies because they no longer exist
How to control for them:
- Choose a representative sample, preferably by a random method
- Choose an adequate sample size
- Identify all confounding factors if possible
- Identify sources of bias and include them as additional predictors in statistical analyses
- Use randomization: by randomly recruiting or assigning subjects in a study, all our experimental groups have an equal chance of being influenced by the same bias
Notes:
- Randomization: in randomized controlled trials, research participants are assigned by chance, rather than by choice, to either the experimental group or the control group
- Random sampling: obtaining data that is representative of the population of interest

PCA

Statistical method that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components. If the variables are correlated, PCA can achieve dimension reduction. If not, PCA just orders them according to their variances.

Hypothesis testing

Statistical significance can be assessed using hypothesis testing:
- State a null hypothesis, which is usually the opposite of what we wish to show (classifiers A and B perform equivalently, treatment A is equal to treatment B)
- Choose a suitable statistical test and the test statistic used to reject the null hypothesis
- Choose a critical region for the statistic to lie in that is extreme enough for the null hypothesis to be rejected (set by the significance level)
- Calculate the observed test statistic from the data and check whether it lies in the critical region. If so, we can reject the null hypothesis.

Types of Sampling

Stratified - population broken into groups based on specific explanatory variable of interest (e.g. gender) then samples are drawn from each group. Useful to judge the effect of an experiment on each group split by. Clustered - population broken into N groups, and a random selection of these groups is selected to test on. Useful when natural groupings are present in the data.

Supervised vs. unsupervised learning

Supervised learning:
- Inferring a function from labeled training data
- Predictor measurements are associated with a response measurement; we wish to fit a model that relates both, either to better understand the relation between them (inference) or to accurately predict the response for future observations (prediction)
- Methods: support vector machines, neural networks, linear regression, logistic regression, extreme gradient boosting
- Examples: predict the price of a house based on its area and size; churn prediction; predict the relevance of search engine results
Unsupervised learning:
- Inferring a function to describe the hidden structure of unlabeled data
- We lack a response variable that can supervise our analysis
- Methods: clustering, principal component analysis, singular value decomposition
- Examples: find customer segments; image segmentation; group US senators by their voting

Central limit theorem

The CLT states that the arithmetic mean of a sufficiently large number of independent, identically distributed random variables will be approximately normally distributed regardless of the underlying distribution, i.e. the sampling distribution of the sample mean is approximately normal. - Used in hypothesis testing - Used for confidence intervals - Requires the underlying distribution to have finite variance
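A small simulation can illustrate this with a clearly non-normal source: the exponential distribution with rate 1 (mean 1, standard deviation 1). The sample size of 50 and the 5000 repetitions are arbitrary choices for the sketch.

```python
import math
import random
import statistics

def sample_means(n_per_sample, n_samples, seed=0):
    """Means of repeated samples drawn from an Exponential(1) distribution."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n_per_sample)])
        for _ in range(n_samples)
    ]

n = 50
means = sample_means(n_per_sample=n, n_samples=5000)
theoretical_sd = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1

# For a normal distribution, about 95% of values lie within 1.96 SD of the mean.
within = sum(abs(m - 1.0) <= 1.96 * theoretical_sd for m in means) / len(means)
print(f"mean of sample means: {statistics.fmean(means):.3f} (expected 1.0)")
print(f"SD of sample means:   {statistics.stdev(means):.3f} (expected {theoretical_sd:.3f})")
print(f"share within 1.96 SD: {within:.3f} (normal theory: 0.95)")
```

Even though individual exponential draws are strongly skewed, the sample means behave close to what normal theory predicts.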

Experimental design

The design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation. In its simplest form, an experiment aims at predicting the outcome by changing the preconditions, the predictors. - Selection of the suitable predictors and outcomes - Delivery of the experiment under statistically optimal conditions - Randomization - Blocking: an experiment may be conducted with the same equipment to avoid any unwanted variations in the input - Replication: performing the same combination run more than once, in order to get an estimate for the amount of random error that could be part of the process - Interaction: when an experiment has 3 or more variables, the situation in which the interaction of two variables on a third is not additive Differences between observational and experimental data: - Observational data: measures the characteristics of a population by studying individuals in a sample, but doesn't attempt to manipulate or influence the variables of interest - Experimental data: applies a treatment to individuals and attempts to isolate the effects of the treatment on a response variable

Intercept in a regression model

The mean value of the response variable when all predictors in the model are equal to 0. Including it guarantees that the residuals have a zero mean and that the least squares slope estimates are unbiased.

Degrees of freedom

The number of observations in the dataset minus the number of variables used in the model. Represents the number of independent ways by which a dynamic system can move without violating any constraint imposed on it. In other words, the number of degrees of freedom can be defined as the minimum number of independent coordinates that can specify the position of the system completely.

Selection bias

Types:
- Sampling bias: systematic error due to a non-random sample of a population causing some members to be less likely to be included than others
- Time interval: a trial may be terminated early at an extreme value (ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all the variables have similar means
- Data: "cherry picking", when specific subsets of the data are chosen to support a conclusion (citing examples of plane crashes as evidence of airline flight being unsafe, while ignoring the far more common example of flights that complete safely)
- Studies: performing experiments and reporting only the most favorable results
Consequences:
- Can lead to inaccurate or even erroneous conclusions
- Statistical methods can generally not overcome it
Why can missing data handling make it worse?
- Example: individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys
- Missing data handling will increase this effect, as it is based mostly on HIV-negative respondents
- Prevalence estimates will be inaccurate

Z-test

Used to determine whether two population means are different given that the sample size is large and the population variances are known. The test statistic is assumed to have a normal distribution.

Regularization in linear models

Used to prevent overfitting and improve the generalization of the model. Decreases the complexity of the model by adding a regularization term to the loss function. Examples: LASSO, ridge regression. Useful for tasks with many variables.

Chi-square test

Used to test independence between two categorical variables. If dealing with sparse cell counts (e.g. <5 in any expected count), use Fisher's Exact Test instead. Specifically, it calculates the expected joint probabilities in each cell from the row/column marginals under the assumption that the variables are independent. Then, a hypothesis test is conducted under the null hypothesis that these expected values match the data (i.e. the variables are independent); if the observed values differ significantly from them, we conclude that the variables are dependent.
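The expected counts and the test statistic can be computed by hand from a contingency table. The sketch below uses Pearson's statistic, the sum over cells of (observed − expected)² / expected, and stops short of the p-value, which would require the chi-square distribution; the tables are invented.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a 2-D table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: product of marginals / total.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

stat, dof = chi_square_statistic([[30, 10], [20, 40]])      # associated variables
stat_indep, _ = chi_square_statistic([[10, 20], [20, 40]])  # exactly independent
print(f"chi-square = {stat:.3f} with {dof} degree(s) of freedom")
print(f"chi-square for the independent table = {stat_indep:.3f}")
```

A statistic of zero means the observed counts match the independence assumption exactly; larger values are stronger evidence of dependence.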

ANOVA test

Used to test whether the mean of a continuous variable is equal across the levels of a categorical variable with 2+ levels, under the null hypothesis that all group means are equal. Tells you that a difference exists, but not between which levels of the variable it exists. Also used in model validation to judge the way in which explanatory variables reduce uncertainty in the model (applied sequentially one-by-one, so the order in which you do the ANOVA test matters).

Validation of regression models

Validation using R^2- keep in mind R^2 always increases for more variables so use adjusted R^2. F test helps indicate that the R^2 value is reliable. Cross-validation. Residual analysis: - Heteroskedasticity (relation between the variance of the model errors and the size of an independent variable's observations) - Scatter plots residuals Vs predictors - Normality of errors

False Positive/False Negative

When false positives are more important than false negatives: - In a non-contagious disease, where treatment delay doesn't have any long-term consequences but the treatment itself is grueling - HIV test: psychological impact When false negatives are more important than false positives: - If early treatment is important for good outcomes - In quality control: a defective item passes through the cracks! - Software testing: a test to catch a virus has failed

Collinearity

When two or more variables in a multiple regression model are highly correlated, and thus redundant, leading to unstable coefficient estimates and potential overfitting. To remove it, drop/combine the affected variables or use PCA/ridge regression.

