Statistics
What does the Cochran-Mantel-Haenszel statistic measure?
The relationship between two variables after accounting for (stratifying on) a third variable. It is an attempt to control for confounding.
What is the central limit theorem?
Sample means taken from any population with finite variance follow an approximately Normal distribution if the sample size is large enough (a common rule of thumb is n >= 30).
In Python how do you perform: t-test, regression (not sklearn), ANOVA?
scipy.stats.ttest_1samp() - tests whether a population mean equals a value. scipy.stats.ttest_ind() - tests across two independent populations. from statsmodels.formula.api import ols - regression, gives a SAS-like table. statsmodels.stats.anova.anova_lm() or scipy.stats.f_oneway() - ANOVA.
What is the test statistic for the Mean?
t = (sample mean - hypothesized mean) / (STD/SQRT(n))
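The formula above can be sketched with the standard library only. The function name `one_sample_t` and the data are made up for illustration; in practice scipy.stats.ttest_1samp() also returns the p-value.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t = (sample mean - hypothesized mean) / (STD / SQRT(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (std / math.sqrt(n))


sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]  # hypothetical measurements
t_stat = one_sample_t(sample, mu0=5.0)
print(t_stat)
```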
The F-test can tell you if one of the means is significantly different, how can you tell which mean is the different one?
t-tests for all comparisons - lsmeans in proc glm
If you have 20 obs in your ANOVA and you calculate residuals, what do they sum to?
0
What are the bounds to the Odds Ratio?
0 to inf. 1 is no association. 0 - 1: Group B has higher odds of the event. 1 - inf: Group A has higher odds of the event.
Explain Variance Inflation Factor
VIFj = 1/(1 - Rj^2), where Rj^2 is the R^2 from regressing predictor j on all the other predictors. A VIF > 10 is very problematic.
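A minimal sketch of the VIF formula for the special case of exactly two predictors, where Rj^2 is simply the squared Pearson correlation between them. The helper names and data are hypothetical; with more predictors, Rj^2 has to come from a regression of predictor j on the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - Rj^2); with two predictors Rj^2 equals r^2."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


x1 = [1, 2, 3, 4, 5]  # made-up predictor values
x2 = [2, 1, 4, 3, 5]
print(vif_two_predictors(x1, x2))
```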
What are some characteristics of the F-distribution?
1. Bounded below by zero 2. Right skewed 3. Two sets of degrees of freedom
What methods can you use to test for multicollinearity between variables?
1. Correlation 2. Variance Inflation Factor (VIF)
How do you fix multicollinearity?
1. Drop one of the problem variables 2. Biased Regression (PCA regression or Ridge Regression) 3. If polynomial model, center the variables first by subtracting the mean
If you do not have a linear problem, what are two SAS procs to fit a model?
1. Fit a non-linear regression model - proc NLIN 2. Fit a non-parametric regression model - proc LOESS
What are the assumptions of Logistic Regression?
1. Independence of Observations (not errors) 2. The logit function is the correct transformation
What are the assumptions and hypothesis of ANOVA?
1. Independent Errors 2. Normal Errors 3. Error Variance equal for all groups Ho: All means Equal Ha: At least one is different
What are the assumptions of a t-test?
1. Independent Errors 2. Normally distributed 3. Equal Variance
What is affected if each of the assumptions of Multiple Linear Regression are broken?
1. Linear Model - misspecified model - results aren't meaningful 2. Normal Distribution of Errors, Constant Variance, Independence of Errors - does not bias the B's, but the standard errors (and therefore the tests and CIs) are no longer reliable 3. No Perfect Multicollinearity - OLS breaks.
What are the assumptions of Linear Regression?
1. Linearity - the means of y for all values of x can be connected with a linear relationship 2. Normality of errors - the distribution of errors is normal 3. Equal Variance of Errors - errors have equal spread regardless of the values of x (Homoscedasticity of Errors) 4. Independence of Errors - errors in model are independent of each other (randomly occurring) 5. No perfect multicollinearity - one variable cannot be a linear combination of other variables
What are methods of detecting heteroscedasticity?
1. Plot Residuals and look for a pattern 2. Formal Tests - White General Test and Breusch-Pagan Test (preferred b/c it identifies specific variables) 3. Spearman Rank Correlation - rank correlation coefficient should be close to zero - should be no correlation b/w rank of abs residual and predicted values
What are methods to fix heteroscedasticity?
1. Robust Statistics - White Standard Errors - HCCME option in proc model 2. Weighted Least Squares - divide by variable 3. Feasible Generalized Least Squares - use function of the variable instead of the variable itself as in WLS.
What are 4 tests to identify outliers and influential observations and describe what they test?
1. Standardized/Studentized Residuals - identify obs whose residuals are many standard deviations from zero. The drawback is that the outliers are skewing the STD. 2. Cook's D - Cook's Distance: measures the influence of one observation on all estimated coefficients simultaneously (n calculations) 3. DFFITS - Difference in Fit: measures the difference in the y-predictions without the specific observation (n calculations) 4. DFBetas - Difference in Betas: measures the difference of all coefficients INDIVIDUALLY with and w/o the observation (n*k calculations)
What are the 5 main steps of Hypothesis Testing?
1. State the hypothesis 2. Test Statistic 3. P-Value (assuming H0) 4. Decision Rule (How rare is rare? usually alpha = 0.05) 5. Conclusion - state in words a description of the results.
What are 4 characteristics of the least squares regression line?
1. Sum of residuals equals zero (mean of residuals is zero) 2. Sum of squared residuals is minimized 3. Regression line passes through (xbar, ybar) 4. Coefficient estimates of Bo and B1 are unbiased
What does a 95% confidence interval mean?
A 95% confidence interval is a range of values computed from a sample by a method that captures the true population mean 95% of the time. One interpretation is that if you drew 100 different samples from the population and calculated 100 intervals, approximately 95 of them would contain the population mean.
What is a probability distribution?
A collection of ordered data values with how often each data value occurs with respect to all the others. Describes the distribution of probabilities for all possible outcomes of the random variable of interest.
What is correlation? (Pearson's)
A relationship exists if an increase in X corresponds to an increase (or decrease) in Y in a linear way. Assumes a straight line relationship and quantitative values
What is Analysis of Variance?
A test of equality of means. Is one group different from another?
What is an interaction?
An interaction exists if the dependent variable has a different relationship with an independent variable at different levels of another variable.
In logistic regression, how do we test for linearity of the X's relative to the logit?
Box-Tidwell transformation is a power transformation on the X's. Tests assumption that continuous variable and logit of the response are linear. If not it provides the method to transform the variable to make linear. logit(p) = Bo + B1X1 ^ alpha1 + ... + BkXk ^ alphak
In categorical analysis what statistic is used to detect interactions? Or that Odds Ratios are equal?
Breslow-Day statistic. Ho: ORa = ORb. Zelen's Exact Test is used for an exact p-value.
Are linear models always straight?
No. They can have non-linear functions of the independent variables. Linear model with linear effects: B0 + B1X1 + B2X2. Linear model with non-linear effects: B0 + B1X1 + B2X2^2
What types of analysis are used to analyze categorical response variables?
Categorical Predictor, Categorical Response - Tests of Association (Chi-Square). Continuous or Cat/Cont Predictor, Categorical Response - Logistic Regression (or machine learning).
What 4 measures describe a distribution?
Center (mean, median, percentiles) Spread (Std, variance, range, IQR) Shape (skewness, Kurtosis) Outliers or Anomalous Observations
What is a classification table and how is it different than a frequency table?
Classification Tables are created from predicted and actual 1's and 0's and are used to create ROC curves to select model cutoff values - generates accuracy statistics based on model cutoff. Frequency Tables are tables of frequencies that are used to identify associations between variables. - generates significance of association or purity of nodes in decision trees.
How do you add categorical variables to a regression model?
Create dummy variables that encode the variable in binary form. Example: Gender: one dummy variable, 1=F, 0=M. Months: 11 dummies (12 categories minus one reference category).
What is the most common method used for modeling ordinal logistic regression?
Cumulative Logit. The intercept changes from one logit to the next, the slope stays constant. Use the proportional odds test to check if slopes are equal (Ho: equal slopes)
In model cutoff selection, what does the K-S statistic determine?
Determines if there is a difference between two cumulative distribution functions.
Two types of random variables?
Discrete: random variable that can take a countable number of values (histogram with levels or buckets) - Binomial and Poisson Distributions Continuous: random variable that can take an uncountable number of values - Normal, ChiSquare, Exponential Distributions
What test is used to test for correlation between residuals?
Durbin-Watson Test - time series independence test. Ho: independence (no residual correlation). Ha: residual correlation.
What are the two coding methods and what is different?
Dummy Coding: The category left out is called the reference category and is all zeros - shift from reference variable. Effects Coding: Average difference between category j and the overall average across all categories holding all else constant - reference category has -1 across the board.
What is the problem with making many comparisons and how is it resolved?
Experiment-wise error rate goes way up. EER = 1 - (1 - alpha)^#of comparisons Use Tukey-Kramer Test to compare to all pairs Use Dunnett to compare to a control level.
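A small illustration of how quickly the experiment-wise error rate grows, using the EER formula above. The function name is hypothetical, and the formula assumes the comparisons are independent.

```python
def experimentwise_error_rate(alpha, n_comparisons):
    """EER = 1 - (1 - alpha)^(number of comparisons)."""
    return 1 - (1 - alpha) ** n_comparisons


for k in (1, 3, 10):
    # Chance of at least one false positive across k comparisons at alpha = 0.05
    print(k, round(experimentwise_error_rate(0.05, k), 4))
```

With alpha = 0.05, ten comparisons already give roughly a 40% chance of at least one false positive.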
How is the F statistic Calculated?
F = MS Model/MS Error
What is the hypothesis test for Multiple Regression?
F-Test Ho: B1 = B2 = ... = Bk = 0 Ha: at least one B is not = 0
There is no target for Logistic Regression, what does it do instead?
Finds the probabilities that maximize the likelihood of observing the data we have.
What test is used to test the equality of variance between groups?
Folded F test: F = larger sample variance / smaller sample variance. If the p-value < 0.05 the variances differ, so use the Satterthwaite (unequal variance) t-test; if the p-value > 0.05 use the pooled variance t-test.
What is an association between categorical variables?
Frequency tables are used to identify associations. An association exists if the distribution of one of the variables changes when the level of the other variable changes.
In Logistic Regression what is Generalized R^2?
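GR^2 = 1 - (Lo/L1)^(2/n), where Lo is the likelihood of the intercept-only model, L1 is the likelihood of the fitted model and n is the sample size.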
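Software usually reports log-likelihoods rather than likelihoods, so a sketch of the same formula on the log scale may be handy: (Lo/L1)^(2/n) = exp(2 * (logLo - logL1) / n). The function name and the log-likelihood values below are made up.

```python
import math


def generalized_r2(loglik_null, loglik_model, n):
    """GR^2 = 1 - (Lo/L1)^(2/n), computed from log-likelihoods."""
    return 1 - math.exp(2 * (loglik_null - loglik_model) / n)


# Hypothetical log-likelihoods for an intercept-only and a fitted model
gr2 = generalized_r2(loglik_null=-70.0, loglik_model=-55.0, n=100)
print(round(gr2, 4))
```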
What is Hypothesis Testing?
HT uses evidence from data to test some claim about the population parameter. Relaxes the assumption that we know the population parameters as we did for z-values.
What visual methods are used to see distribution of data?
Histograms, Normal Probability Plots, Box Plots
How is skewness and kurtosis checked in Python?
scipy.stats.kurtosis() and scipy.stats.skew(). Alternatively, using a dataframe df, df.skew() will give the skewness of each column.
What does an odds ratio indicate?
How much more likely an event occurs in one group relative to its occurrence in another group Odds are a ratio of probabilities (Prob Yes/Prob No) Odds Ratio is a ratio of odds between groups (Odds Yes in A)/(Odds Yes in B)
Describe Likelihood Ratio Chi-Square test.
Instead of the differences between observed and expected counts, it uses their ratio: G^2 = 2 * SUM(observed * ln(observed/expected)).
What are some tests of Normality and how can lack of normality be overcome?
Tests: Kolmogorov-Smirnov (KS) - General Test. Anderson-Darling - Test specific to the Normal Distribution. Fixes: 1. Transform Target Variable (Box-Cox Transformation) 2. Robust Regression - use if outliers are the problem
What is Margin of Error?
Largest possible sampling error at the specified level of confidence
What is used to test significance of model inference for Logistic Regression?
Likelihood Ratio Test. The null hypothesis is that all betas equal zero. Compare the likelihood at zero to the likelihood at its peak; the further apart they are, the more likely it is that at least one of the betas is different from zero. LRT = -2(LogLo - LogL1) - follows a Chi-square distribution.
Describe what is meant by a linear model?
Linear models are linear in the parameters: the terms are combined only by addition and subtraction, each multiplied by a coefficient.
For Regression, what can be determined from Residual Analysis?
Linearity, Normality, Equal Variance and Independence.
What is used to test an ordinal association?
Mantel-Haenszel Chi-Square Test Ho: No LINEAR Association b/w Variables Ha: LINEAR association b/w Variables Qmh = (n-1)r^2
What is adjusted R2?
Mathematically R2 will increase with each additional variable. This is because SSE does not go up with additional variables but degrees of freedom goes down. Adjusted R2 accounts for this: Adjusted R2 = 1 - (1 - R2)[(n-1)/(n-k-1)]
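The adjusted R2 formula can be checked with a few lines. The function name and the example numbers are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1).

    n is the number of observations, k the number of predictors.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same R2, more predictors: the adjusted value drops
print(round(adjusted_r2(0.80, n=50, k=5), 4))
print(round(adjusted_r2(0.80, n=50, k=20), 4))
```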
What is the median? What is the median in range of 1-6?
Median is the center value that divides the numerically ordered data collection into two halves. Median of 1,2,3,4,5 is 3. Median of 1,2,3,4,5,6 is 3.5 (average of the two middle values).
What is Cp (Mallows and Hocking) used for?
Model Selection
Tests of association tell you if an association exists but not how strong the association is. What is used to measure the strength of association?
Nominal: Cramer's V Statistic (2x2: -1 to 1, nxn: 0 to 1). Ordinal: 2x2 - Odds Ratio or Spearman Correlation, nxn - Spearman Correlation
Why is Normality of errors important and how is it checked?
Normality is important for testing variables, since it underlies t-tests and F-tests, but not for estimating coefficients. If not met, the significance tests and CIs are not valid. Can be checked with a Histogram of Residuals or a Normal Probability plot, looking for departure due to skewness or kurtosis.
Why can't you use OLS to predict a binary outcome?
OLS selects a model based on minimization of squared errors. The OLS process does not translate to prediction of a binary outcome b/c the fitted values are not bounded between 0 and 1 and the errors cannot be normal with constant variance.
How do you remove variables with high VIF?
One at a time. Multicollinearity hides significance so you can not remove all at once.
What's the difference between a one-sided and two-sided test?
One-sided is a hypothesis test in which the entire rejection region is located on one side of the distribution of possibilities. In a two-sided test the rejection region is split across both sides of the distribution.
What is OLS?
Ordinary Least Squares Regression - attempts to identify a linear relationship between continuous variables through minimization of the sum of squared errors.
What is the difference between an outlier and an influential observation?
Outlier: obs with a large residual (far from the regression line in the vertical direction). Influential: obs that drastically affects the regression line. An obs can be an outlier but not influential, influential but not an outlier, or both.
In Python, how would you go about printing histograms, density plots , box plots and correlation scatter plots?
Pandas dataframe df with 9 variables (import matplotlib.pyplot as plt): Histograms: df.hist() plt.show() Density Plots: df.plot(kind='density', subplots=True, layout=(3,3), sharex=False) plt.show() Box Plots: df.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False) plt.show() Correlation scatter plots: from pandas.plotting import scatter_matrix scatter_matrix(df) plt.show()
What is the equation of a confidence interval?
Point Estimate +/- Critical Value * Standard Error
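A minimal sketch of the interval formula for a mean. It assumes a Normal (z) critical value, which is only reasonable for large samples; for small samples a t critical value with n-1 degrees of freedom would be used instead. The function name and data are made up.

```python
import math
from statistics import NormalDist, mean, stdev


def z_confidence_interval(sample, confidence=0.95):
    """Point Estimate +/- Critical Value * Standard Error (z-based)."""
    n = len(sample)
    point_estimate = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence
    critical_value = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = critical_value * standard_error
    return point_estimate - margin, point_estimate + margin


lower, upper = z_confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
print(lower, upper)
```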
What is the relationship between the population, sample, parameters and statistics?
Population is set of all individuals of interest Sample is a subset of the population Parameter is a descriptive measure of the population. Statistic is a descriptive measure of the sample
Why is the logit used?
Probabilities are bounded between one and zero. The logit creates a linear, non-bounded relationship between the predictors and the logit.
What is the equation for the odds ratio from probabilities and from the parameters of the log(odds)?
From the probability of an event: Odds = P_event/(1 - P_event). From the log(odds) parameters: Odds(larger) = e^(Bo + B1(Age + 1)) = e^L and Odds(smaller) = e^(Bo + B1(Age)) = e^S. Odds Ratio = e^L/e^S, which simplifies to e^B1.
What are some SAS procedures to model with non-constant variance?
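The simplification to e^B1 can be verified numerically. The coefficients Bo and B1 and the ages below are invented for the illustration.

```python
import math


def odds(p):
    """Odds = P_event / (1 - P_event)."""
    return p / (1 - p)


def logistic_probability(b0, b1, x):
    """Probability implied by log(odds) = Bo + B1 * x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


b0, b1 = -3.0, 0.08  # hypothetical logistic regression coefficients
p_smaller = logistic_probability(b0, b1, 40)  # Age
p_larger = logistic_probability(b0, b1, 41)   # Age + 1

odds_ratio = odds(p_larger) / odds(p_smaller)
print(odds_ratio, math.exp(b1))  # the two values agree
```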
Proc GENMOD and GLIMMIX
How do you use Robust Regression in Python and SAS?
Python: sklearn.linear_model.RANSACRegressor() SAS: proc ROBUSTREG
What is the difference between a Quantitative and Categorical Variable?
Quantitative data is numeric AND arithmetic makes intuitive sense. (zipcodes are not Quantitative Data). Qualitative Data is data where the scale is defined by categories. Nominal - categories w/o logical order (colors) Ordinal - categories w/ logical order (small,medium, large)
What is the coefficient of determination?
R2 = SSM/SST = 1 - SSE/SST. proportion of variance accounted for by the model.
Describe the confusion matrix and ROC Chart. What is Sensitivity, Specificity, Precision, Recall, True Positive Rate, False Positive Rate, True Negative Rate?
ROC Chart: 2D Graph, both axes bounded b/w 0 and 1. Y-axis = Sensitivity (True Positive Rate), X-axis = 1 - Specificity (False Positive Rate). Want the line to push into the Top/Left Corner. Confusion matrix: Sensitivity/Recall/True Positive Rate: Proportion of Positives correctly Predicted. Specificity/True Negative Rate: Proportion of Negatives correctly Predicted. Precision: Proportion of predicted Positive values that were actually positive. False Positive Rate: 1 - Specificity (x-axis)
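These rates can be computed directly from actual and predicted 1's and 0's. A small sketch with made-up labels; the function name is hypothetical.

```python
def classification_metrics(actual, predicted):
    """Confusion-matrix rates from actual and predicted 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),          # recall, true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
    }


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]     # made-up labels
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # predictions at some cutoff
metrics = classification_metrics(actual, predicted)
print(metrics)
```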
What is RANSAC?
Random Sample Consensus (RANSAC) is used to estimate parameters for data containing outliers. Uses a random sampling method of data along with a voting system to find the optimal fit of the data. Based on two assumptions: 1. that outliers will not vote consistently 2. there are enough features to agree on a good model
What are some measures of Spread?
Range IQR Variance Standard Deviation
What is range and IQR?
Range is one number - difference between maximum and minimum values in the dataset. Interquartile Range is the difference between the first and third quartiles: IQR = Q3-Q1 (middle 50% of data)
How are assumptions checked for regression?
Residual Plots, QQ-Plots
How can you check the Linearity assumption, why is it important?
Residual plot - any pattern is a cause for concern. If the model is misspecified it will be inaccurate.
If you have 20 obs in your ANOVA and you calculate the squared residuals, what do they sum to?
SSE
What are two measures of shape?
Skewness (non-symmetry) and Kurtosis (thickness of tails)
What is the difference between standard deviation and standard error?
Standard Deviation is a measure of the variability of the data Standard Error is a measure of the estimated variability of the sample means - measures the variability of your estimate. SE = STD/SQRT(n)
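A quick numeric contrast between the two, using SE = STD/SQRT(n) on made-up data.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

std = statistics.stdev(data)               # variability of the data
std_error = std / math.sqrt(len(data))     # variability of the sample mean

print(round(std, 4), round(std_error, 4))
```

The standard error shrinks as n grows, while the standard deviation does not.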
What are the 5 characteristics of a normal distribution?
Symmetric Fully characterized by mean and standard deviation Bell Shaped/Unimodal Mean = Median = Mode Asymptotic to X-axis (bounds are -inf to +inf)
What is the assumption of a sample?
That the sample is representative of the population
What is slope?
The average increase in your prediction with a one unit increase in x, all else being held equal.
What is Cardinality?
The cardinality of a set is a measure of the number of elements in a set. A = {2, 4, 6}, it contains 3 elements, A has cardinality of 3.
What is significance level?
The maximum probability of rejecting the null hypothesis given the null hypothesis is true.
What is the best guess prediction of y with no information?
The mean of y
What is a sign of Multicollinearity?
The model is significant but the variables are not significant.
What is a p-value?
The probability of observing a test statistic at least as extreme as the one observed, given the null hypothesis is true.
What is a percentile?
The pth percentile in a collection of ordered data is a value that divides the data set into two parts. The lower segment contains at least p% and the upper segment contains at least (100-p)% of the data. Quartiles are a special case of percentiles focusing only on the 25th, 50th and 75th percentiles.
What's the relationship between slope and correlation?
They will both have the same sign
What is the difference between Type I and Type II Error?
Think Pregnancy Example: Type I: you've rejected the null when it was true. False Positive - doctor tells a man he's pregnant. Type II: fail to reject the null when it's false. False Negative - doctor tells a pregnant woman she is not pregnant.
What is the difference between time series and cross sectional data?
Time Series is a set of ordered data values observed at successive points in time. Cross Sectional data values are observed across individuals at a fixed point in time or where time doesn't matter.
Are you more likely to reject the null hypothesis in a one- or two-sided test?
Two tailed tests require a more extreme value to reject the null hypothesis. It takes less evidence to reject the null hypothesis in a one-sided test.
What is Youden's Index and when should it be used?
Use only if false negatives and false positives have equal costs. This is usually not true. Youden's Index = Sensitivity + Specificity - 1 Want to maximize
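A sketch of picking a model cutoff by maximizing Youden's Index. The labels, scores and candidate cutoffs are invented, and the rule "predict 1 when score >= cutoff" is an assumption of this example.

```python
def youden_index(actual, scores, cutoff):
    """Youden's Index = Sensitivity + Specificity - 1 at a given cutoff."""
    predicted = [1 if s >= cutoff else 0 for s in scores]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


actual = [0, 0, 0, 1, 0, 1, 1, 1]                      # made-up labels
scores = [0.1, 0.2, 0.35, 0.55, 0.4, 0.6, 0.8, 0.9]    # model probabilities
cutoffs = [0.3, 0.5, 0.7]

best_cutoff = max(cutoffs, key=lambda c: youden_index(actual, scores, c))
print(best_cutoff, youden_index(actual, scores, best_cutoff))
```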
What is the Mann-Whitney-Wilcoxon Rank Sum test?
Useful to determine if two distributions are significantly different. Does not assume the data are normally distributed, potentially providing a more accurate assessment of the data sets. Ho: the two distributions are the same.
Describe Variance and Standard Deviation, what are their equations?
Variance and Standard Deviation both measure how far numbers are spread out from the average value. Sample variance: s^2 = SUM((xi - xbar)^2)/(n - 1). Standard deviation: s = SQRT(s^2).
What is model misspecification?
We picked a linear model when in fact it's a polynomial relationship. Overspecification: adding variables to a model that don't belong there - adds noise to the model
For tests of association when do you not use Pearson Chi-square? What do you use?
When more than 20% of cells have an expected count < 5. You need to use exact p-values - Fisher's Exact p-values. The exact p-value is the sum of probabilities of all tables with Chi-square values as great or greater than that of the observed table.
What is Quasi-Complete Separation and how can it be fixed?
When one or more variables are close to predicting the response perfectly - happens with lots of variables and lots of interactions - get cell counts of zero. Fixed by: 1. Grouping variables together. 2. Penalized Likelihood (FIRTH) 3. Exact Methods - only on small samples.
What is the equation for ANOVA?
Y = overall_mean + category1 adjustment + cat2 adjustment + interaction adjustment + error
Can multicollinearity cause problems for Logistic regression as it does in linear regression?
Yes - parameter estimates don't make sense and get huge standard errors
Does sample size influence p-value?
Yes, coin flip example: 6H, 4T: p-value = 0.75 240H, 160T: p-value = <0.0001
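The coin flip numbers can be reproduced with an exact two-sided binomial test against a fair coin. This sketch assumes the "sum of all outcomes no more likely than the observed one" definition of the two-sided p-value; the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(heads, n, p=0.5):
    """Exact two-sided binomial p-value under Ho: P(heads) = p."""
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[heads]
    # Sum every outcome that is no more likely than the observed one
    # (small tolerance guards against floating point ties)
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))


p_small_sample = binomial_two_sided_p(6, 10)     # 6H, 4T
p_large_sample = binomial_two_sided_p(240, 400)  # 240H, 160T
print(round(p_small_sample, 2), p_large_sample)
```

Both samples show 60% heads, but only the larger one gives strong evidence against a fair coin.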
Explain Concordant/Discordant pairs and how the c-statistic is calculated.
You have model probabilities and actual outcomes. Each pair is made of one actual 1 and one actual 0. If the 1 has the higher model probability it is a concordant pair, if it has the lower probability it is a discordant pair. If the probabilities are equal it's a tie. First need to calculate the number of pairs to test: Frequency(1) * Frequency(0) - every 1 is paired with every 0 - this is a lot of pairs. c-statistic = %concordant + 0.5 * %tied. 0.5 <= c <= 1
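A brute-force sketch of the pair counting described above. The function name, labels and probabilities are made up, and the double loop is only practical for small data sets.

```python
def c_statistic(actual, probabilities):
    """c = (concordant + 0.5 * tied) / (Frequency(1) * Frequency(0))."""
    ones = [p for a, p in zip(actual, probabilities) if a == 1]
    zeros = [p for a, p in zip(actual, probabilities) if a == 0]
    concordant = tied = 0
    for p_one in ones:          # every 1 is paired with every 0
        for p_zero in zeros:
            if p_one > p_zero:
                concordant += 1
            elif p_one == p_zero:
                tied += 1
    n_pairs = len(ones) * len(zeros)
    return (concordant + 0.5 * tied) / n_pairs


actual = [1, 1, 1, 0, 0]                    # made-up outcomes
probabilities = [0.9, 0.6, 0.4, 0.6, 0.2]   # made-up model probabilities
print(c_statistic(actual, probabilities))   # 4 concordant, 1 tied, 1 discordant
```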
How can correlation be calculated in Python?
df.corr(method='pearson') Methods available are pearson, spearman and kendall (kendall Tau correlation coefficient)
How can you check standard statistics from data in a dataframe?
df.describe() gives count, mean, std, min, 25%, 50%, 75%, max. df.describe(include=[object]) gives statistics only on string (object) columns: count, unique, top, freq. df.describe(include='all') gives all stats.
How can you check the balance of a class target in Python?
df.groupby('class').size()
In Python how can you detect outliers?
linear_model.RANSACRegressor() - gives the set of inliers, then find the outliers. covariance.EllipticEnvelope - fits a robust covariance estimate to the data, which fits an ellipse to the central points. ensemble.IsolationForest - isolates observations by recursive partitioning as it builds a tree structure. The number of splits required to isolate a sample is equivalent to a path length. Shorter paths identify anomalies. neighbors.LocalOutlierFactor (LOF) - computes a score reflecting the degree of abnormality of the obs. It measures the local density deviation of a data point with respect to its neighbors. Detects obs that have substantially lower density than their neighbors.
What is maximum likelihood estimation (MLE)?
method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.
What is Information Criterion used for? What are some examples?
Model selection. Examples: AIC, AICC, BIC, SBC - smaller is better
What are degrees of freedom?
Number of independent values available to estimate the population parameter. For a Confidence Interval for the Mean, using the t-distribution, df = n-1.
How can you get percentiles in Python?
numpy.percentile