Stats Final

binomial distribution

If p represents the probability of success, (1 - p) the probability of failure, n the number of independent trials, and k the number of successes: P(k successes in n trials) = C(n, k) x p^k x (1 - p)^(n - k), where C(n, k) = n! / (k! (n - k)!)
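
A minimal sketch of the binomial probability, using only the Python standard library. The function name binomial_pmf and the numbers in the example (10 trials, p = 0.5, 3 successes) are illustrative assumptions, not part of the original card.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    # comb(n, k) counts the ways to place k successes among n trials.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical example: 3 successes in 10 trials with p = 0.5.
prob = binomial_pmf(3, 10, 0.5)
print(round(prob, 4))  # 0.1172
```

Summing binomial_pmf(k, n, p) over k = 0, ..., n gives 1, which is a quick sanity check on any inputs.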

If the variable under study (X) is normally distributed, what happens to the sampling distribution?

If the variable under study (X) is normally distributed, the sampling distribution of the sample mean is also normally distributed and inferences can be made using the N(0,1) distribution

What sampling distribution should you use for calculating p-values, confidence intervals, and hypothesis tests involving a population mean µ? (with df) Why use t(df)?

In every case, use the standardized sample mean T = (x̄ - µ) / (s/√n) ~ t(df = n - 1): X normally distributed with a large sample (n > 30); X normally distributed with a small sample (n < 30); X not normally distributed with a large sample. Why t(df)? Because it is an easy default: in large samples it is essentially N(0,1), so you don't have to decide between the two.

What sampling distribution should you use for calculating p-values, confidence intervals, and hypothesis tests involving a population mean µ? remember 𝜎 is unknown!

X normally distributed, large sample (n > 30): standardized x̄ ~ N(0,1). X normally distributed, small sample (n < 30): standardized x̄ ~ t(df). X not normally distributed, large sample: standardized x̄ ~ N(0,1), because the CLT (central limit theorem) usually applies.

Simplest case: standard normal distribution formula for n>30

X ~ N(µ, σ). We estimate µ with the sample mean x̄ and σ with the sample standard deviation s. The standard error (SE) of x̄ is σ/√n, which can be estimated using s/√n. When n > 30, Z = (x̄ - µ) / (s/√n) is approximately N(0,1), so we can use the standard normal distribution for hypothesis testing and the calculation of p-values.

the normal distribution

X ~ N(μ, σ) Read as "X is normally distributed with mean μ and standard deviation σ"

categorical data test statistic

Z = (p̂ - p) / SE(p̂) = (p̂ - p) / √(p(1 - p)/n). Note: in HT, when estimating the SE, we use the value of p under H0.

forward selection

add variables one at a time until we cannot find any more variables that improve the model

margin of error (in confidence interval)

an amount (usually small) that is allowed for in case of miscalculation or change of circumstances. In a confidence interval: z* x SE is called the margin of error, and for a given sample, the margin of error changes as the confidence level changes. In order to change the confidence level we need to adjust z* in the above formula. ● Commonly used confidence levels in practice are 90%, 95%, 98%, and 99%. ● For a 95% confidence interval, z* = 1.96. ● However it is possible, using the standard normal (z) distribution, to find the appropriate z* for any confidence level.
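
As a rough illustration of finding z* for any confidence level, the sketch below uses statistics.NormalDist from the Python standard library. The helper name z_star is an assumption made for this example.

```python
from statistics import NormalDist


def z_star(confidence: float) -> float:
    """Critical value z* leaving (1 - confidence) / 2 in each tail of N(0, 1)."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


# The commonly used confidence levels listed on the card.
for level in (0.90, 0.95, 0.98, 0.99):
    print(level, round(z_star(level), 3))
```

Multiplying the returned value by the standard error gives the margin of error for that confidence level.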

implications of collinearity

1. if you are simply trying to explain Y from some Xs, it might be more parsimonious to include fewer variables 2. if you are trying to tease apart causality (from observational data), you really want to think about (and include) many other possibly explanatory variables

model selection: general assumptions

1. residuals are nearly normal 2. variability of the residuals is nearly constant (homoscedasticity) - plot absolute standardized residuals against fitted values 3. residuals are independent - plot standardized residuals in order of collection 4. each predictor variable is linearly related to the response variable - plot standardized residuals by each Predictor - if linear fit is correct then no pattern should be detected 5. In some cases, we would like the predictors to be independent of each other or, at least, not linearly dependent

Key domains to consider when describing the distribution of a single variable

center (mean, median) shape (modality, skewness) spread/ variability (variance, standard deviation, IQR)

blocking variables

characteristics that the experimental units come with, that we would like to control for. They are observed, not manipulated.

both/bidirectional selections

combination of both forward and backward selection

comparisons - continuous vs categorical variable

comparisons between two continuous measures on the same person comparisons between two independent groups - both sample means and proportions

factors (treatments)

conditions we can impose on experimental units. In order to determine if A causes B, we need to be able to manipulate it. If A cannot be manipulated, then causality cannot be determined.

standardizing

converting X values into Z scores. To do so: ● Subtract the mean of the scores (µ) ● Divide by the standard deviation of the scores (σ), i.e., Z = (X - µ) / σ **particularly useful if the original observations are normally distributed

correlation: two continuous variables

correlation describes the strength of the linear association between two variables. It takes values between -1 (perfect negative) and +1 (perfect positive); a value of 0 indicates no linear association

Types of outliers

influential high leverage

regression

line of best fit: a residual is the difference between the observed (yi) and the expected (ŷi) value; for the line of best fit we want a line that has small residuals

analyzing paired data

look at the difference in outcomes of each pair of observations *subtract using a consistent order *EX: differences in scores = read-write

stratified sample

made up of similar observations. we take a simple random sample from each stratum.

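standard error: two samples

mean: SE = √(s1²/n1 + s2²/n2) proportion: SE = √(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2) (the one-sample term appears once per sample, and the square root is taken over the whole sum)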

Standard error: one sample

mean: SE=s/√n proportion: SE = √(p(1-p)/n )

Degrees of freedom (t-distribution)

n-1 describes shape of the t-distribution the larger the degrees of freedom, the more closely the distribution approximates the normal model When df >30, the t distribution is nearly indistinguishable from the normal distribution

what is a null hypothesis

null hypothesis (H0) is the opposite of our alternative hypothesis (HA)

high leverage points

outliers that lie extremely far away from the center of the cloud

what graph do you use when variables are categorical?

pie charts and bar plots (more preferable)

confidence interval formula with point estimate

point estimate ± ME

outliers

points that lie away from the cloud of points

Influential points: def and how are they determined

points that strongly affect the graph of the regression line. Does the slope of the line change considerably? If so, then the point is influential. If not, then it's not an influential point.

inferential statistics

population parameter: µ, variance

census

population count problems: - difficult to complete a census - populations rarely stand still

histogram

provides a view of the data density. Higher bars represent where the data are relatively more common. Convenient for describing the shape of the data distribution. The chosen bin width can alter the story the histogram is telling

simple random sample

randomly select cases from the population, where there is no implied connection between the points that are selected.

standard error: means

rare that standard deviation is known, so we usually use s

box plot

the box represents the middle 50% of the data, and the thick line in the box is the median

observational study

researchers collect data in a way that does not directly interfere with how the data arise, i.e., they merely observe and can only establish an association between the explanatory and response variables the groups may be different to begin with, there are many possible reasons for the differences - not just the treatment

experiment

researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables. Using very simple statistical methods, researchers can determine if a treatment causes an outcome.

normal probability plot and skewness

right skew - points bend up and to the left of the line left skew - points bend down and to the right of the line short tails - points follow an S shaped curve long tails - points start below the line, bend to follow it, and end above it.

skewness

right skewed, left skewed, symmetric

pre-post

same units are measured twice, often before and after an intervention, randomized or not

mean

sample mean is denoted as x̄ population mean is computed the same way as the sample mean, but is denoted as µ the sample mean is a sample statistic and serves as a point estimate of the population mean

random sampling techniques

simple stratified cluster

multiple linear regressions

single numerical response variable; y multiple numerical or categorical explanatory/ predictor variables; x1,x2,...,xk

simple linear regression

single numerical response variable; y single numerical or categorical explanatory/predictor variable; x

Backward selection

start with the model that includes all potential predictor variables. Variables are eliminated one at a time from the model until we cannot improve the model.

variance

the average squared deviation from the mean.

comparison vs group means

the intercept is the average Y in the reference group The "slopes" are differences between the average Y in a group vs. the reference

residuals

the leftovers from the model fit: Data = Fit + Residual Residual is the difference between the observed (yi) and expected ŷi. ei = yi - ŷi

what is a reference category

the level excluded from the model.

hypothesis test formula explained

the null value is often 0 since we are usually checking for a relationship between the explanatory and response variable

what is significance?

the p-value associated with the estimate (e.g., the sample mean) is less than the significance level (usually 0.05)

p-value

the probability that we would observe a value at least as extreme as the one in our sample if in fact the null hypothesis were true.

confidence interval formula explained

the regression output gives b1, SEb1, and two tailed p-value for the t-test for the slope where the null value is 0

standard deviation

the square root of the variance useful way to describe variability in the data, but does not tell you about the shape of the distribution

What is different from the t-distribution vs the normal distribution What is the t-distribution formula

The t distribution looks like the normal distribution but has heavier tails. In comparison: N(0,1) has mean = 0 and sd = 1; t(df) has mean = 0 and sd = √(df/(df - 2)), where df = degrees of freedom (this sd is defined for df > 2). In the case of the sample mean x̄, we can show that T = (x̄ - µ) / (s/√n) ~ t(df = n - 1). This T has mean = 0 and sd = √((n - 1)/(n - 3))

random assignment

the treatment and control groups are the same on average on all variables, both observed and unobserved. The only difference between the groups is that one group received the treatment and the other did not.

sample statistic

the value of a variable that is estimated from a sample

median

the value that splits the data in half when ordered in ascending order

percentile

the value/cutoff for which a given proportion of observations are less than it. graphically, the percentile is the value/cutoff that has a given area under the distribution curve to its left.

group comparisons

A treatment and a control group are compared (a manipulated design leading to causal inferences). A treatment and a control group are compared where treatment is chosen (an observational design leading to associations). Females and males are compared (an observed design leading to associations). *In each case, sample statistics are calculated twice and compared, and the inference is about whether they differ in the population

collinearity

two predictor variables (x's) are correlated with each other

modality

unimodal: single prominent peak bimodal/multimodal: several prominent peaks uniform: no apparent peaks

choose function formula

C(n, k) = n! / (k! (n - k)!); useful for calculating the number of ways to choose k successes in n trials

dot plot

useful for visualizing one numerical variable. Darker colors represent areas where there are more observations. The mean (average) is one way to measure the center of a distribution of data

scatterplot

useful for visualizing the relationship between 2 numerical variables

cluster sample

usually not made up of homogenous observations, and we take a simple random sample from a random sample of clusters.

how do you determine if a point is influential?

visualize the regression line with and without the point and ask yourself: Does the slope of the line change considerably?

Dummy variable example X = {east, west}

We could create 2 dummy variables, east and west, where east equals 1 when X = "east" and 0 otherwise (and west is defined similarly). When a categorical variable has 2 levels (east, west) you can create up to 2 new variables, but you can include only 2 - 1 = 1 of them in your regression model. It is impossible to include both, because they are essentially the same variable (one is the opposite of the other). You could write your model as: Y = β0 + β1·west (or east, depending on which one you include in the model)

simple regression: Y continuous, X categorical

When X is categorical, it needs to be recoded into something numeric: a "dummy" variable, i.e., a variable that indicates if X = a certain value. (With two levels, only one dummy variable can be included in the regression model, because the two are essentially the same variable with one the opposite of the other.) ** In the one-variable case: correlation is to regression with continuous variables as a 2-group comparison of means is to regression with a dummy variable

double-blind

when both the experimental units and the researchers who interact with the patients do not know who is in the control and who is in the treatment group.

blinding

when experimental units do not know whether they are in the control or treatment group

why is precision important?

● An estimator could be biased, i.e., it might not estimate the population parameter correctly (on average). ● One estimator might be more or less precise than another estimator. ● Precision increases with sample size: estimates from bigger samples are more precise, and estimates from smaller samples are less precise (the SE shrinks in proportion to 1/√n). Precision helps us understand how close our sample estimate might be to the true population parameter.

inference using difference of 2 small sample means: Conditions

-Conditions: independence within groups and between groups (i.e., samples collected randomly and each sample size less than 10% of their respective population, n<10%) no extreme skew in either group/population

How to increase power in a given study, and hence decrease the Type 2 error rate

1. Reduce the standard error: a. Increase the sample size, or b. Decrease the standard deviation of the sample. This is difficult to ensure but cautious measurement process and limiting the population so that it is more homogenous may help. 2. Increase the Type 1 error, α. a. This will make it more likely to reject H0.

Recap: Hypothesis testing for a population mean

1. Set the hypotheses H0: μ = null value HA: μ < or > or ≠ null value 2. Calculate the point estimate 3. Check assumptions and conditions ● Independence: random sample/assignment, 10% condition when sampling without replacement ● Normality: nearly normal population or n ≥ 30, no extreme skew -- or use the t distribution (Ch 5) 4. Calculate a test statistic and a p-value (draw a picture!) 5. Make a decision, and interpret it in context ● If p-value < α, reject H0, data provide evidence for HA ● If p-value > α, do not reject H0, data do not provide evidence for HA

Recap: Hypothesis testing frameworks

1. Set the hypotheses. 2. Check assumptions and conditions. 3. Calculate a test statistic and a p-value. 4. Make a decision, and interpret it in context of the research question.

Steps for using t distribution

1. construct a confidence interval (same as normal) 2. calculate a p-value (same as normal) 3. conduct a hypothesis test (same as normal). The only difference is that you use the t-distribution instead of the normal: in R, instead of pnorm(value) use pt(value, df); in R, instead of qnorm(value) use qt(value, df); instead of the Z-table, use the t-table (which requires specifying df); instead of a normal distribution web app, use a t-distribution web app

What is the equivalent confidence level for a one-sided hypothesis test at α = 0.05?

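90%. A one-sided test at α = 0.05 puts 5% in one tail; the matching two-sided confidence interval leaves 5% in each tail, so it is a 90% interval.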

sampling bias: non-response

if only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.

decision errors (type 1 and type 2)

A Type 1 Error is rejecting the null hypothesis when H0 is true. A Type 2 Error is failing to reject the null hypothesis when HA is true.

T-distribution

A bell shaped distribution symmetrical about its median (centered at 0) used to make confidence intervals with small samples (<30) and unknown population variance; Degrees of freedom = # of Observations - 1 Addresses the uncertainty of the standard error Observations are more likely to fall beyond two SDs from the mean

Confidence intervals for nearly normal point estimates

A confidence interval based on an unbiased and nearly normal point estimate is point estimate ± z* SE where z* is selected to correspond to the confidence level, and SE represents the standard error. Remember that the value z* SE is called the margin of error.

CI Example: Average number of exclusive relationships A random sample of n=10 college students were asked how many exclusive relationships they have been in so far. Estimate the true average number of exclusive relationships using this sample: x̄ = 3.2 s = 1.74 n = 10

A confidence interval for µ is defined as x̄ ± t(9)* x SE where SE = s / √n = 1.74 / √10 ≈ 0.55 If a 95% CI is desired, then t(9)*=2.26 and x̄ ± 2.26 x SE → 3.2 ± 2.26 x 0.55 → (3.2 - 1.24, 3.2 + 1.24) → (1.96, 4.44)
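
The arithmetic in this example can be reproduced with a short sketch. The critical value t* = 2.26 is taken from the card rather than computed (the Python standard library has no t-distribution quantile function); the variable names are illustrative.

```python
from math import sqrt

# Sample summary from the card.
x_bar, s, n = 3.2, 1.74, 10

# 95% critical value for t(df = 9), as stated on the card (not computed here).
t_star = 2.26

se = s / sqrt(n)                 # standard error of the sample mean
margin_of_error = t_star * se
lower, upper = x_bar - margin_of_error, x_bar + margin_of_error

print(round(se, 2), round(lower, 2), round(upper, 2))  # 0.55 1.96 4.44
```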

normal distribution

A function that represents the distribution of variables as a symmetrical bell-shaped graph.

Homoscedasticity

A regression in which the variances in y for the values of x are equal or close to equal. the variability of residuals around the 0 line should be roughly constant

effect size

An alternative to providing only measures of statistical significance (p-values) is to also provide a measure of the effect estimated. ● In some cases the sample estimator is enough (e.g., the sample mean, the correlation). ● In other cases, results need to be scaled (standardized) so as to be interpretable across different outcomes. For example, the Standardized Mean Difference (SMD) moves results from an arbitrary scale (e.g., test score) to a standard-deviation scale: SMD = (x̄1 - x̄2) / s, where s is a standard deviation of the outcome (often the pooled SD of the two groups).

what is extrapolation

Applying a model estimate to values outside of the realm of the original data is called extrapolation.

type 1 error rate

As a general rule we reject H0 when the p-value is less than 0.05, i.e. we use a significance level of 0.05, α = 0.05.

Categorical data distribution formula

Bin(n,p)

inference using difference of 2 small sample means

if the population standard deviations are unknown, the difference between the sample means follows a t distribution

inference for a single proportion: categorical data

Distribution Studied: Bin(n,p)

inference for a single mean: continuous data

Distribution studied: N(µ, σ). Sampling distribution for the estimate of µ: if X ~ N(µ, σ), then Z ~ N(0,1) when σ is known (unlikely), T ~ t(df) when σ is unknown and n is large, and T ~ t(df) when σ is unknown and n is small. If the distribution of X is unknown, then T ~ t(df) when σ is unknown and n is large. This sampling distribution can be used to construct HT & CI for means and comparisons of means; the CLT makes this possible

placebo effect

Experimental units showing improvement simply because they believe they are receiving a special treatment

68-95-99.7 rule

For a nearly normally distributed variable/data, ● about 68% falls within 1 SD of the mean, ● about 95% falls within 2 SD of the mean, ● about 99.7% falls within 3 SD of the mean
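
A quick check of the rule against the standard normal distribution, using statistics.NormalDist from the Python standard library. The helper name within is an assumption made for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1


def within(k: float) -> float:
    """Probability that a normal variable falls within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```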

The GSS found that 571 out of 670 (85%) of Americans answered the question on experimental design correctly. Do these data provide convincing evidence that more than 80% of Americans have a good intuition about experimental design?

H0 (null hypothesis): p = 0.8. HA (alternative hypothesis): p > 0.8. SE = √(0.8 x 0.2 / 670) = 0.0154. Z = (0.85 - 0.8) / 0.0154 = 3.25. p-value = 1 - 0.9994 = 0.0006. Since the p-value is low, we reject H0. The data provide convincing evidence that more than 80% of Americans have a good intuition about experimental design
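
The same one-proportion test can be sketched with the Python standard library. The variable names are illustrative; because the SE is not rounded before dividing, Z comes out slightly below the hand calculation (about 3.24 rather than 3.25), while the p-value still rounds to 0.0006.

```python
from math import sqrt
from statistics import NormalDist

# Values from the card: observed proportion, null value, sample size.
p_hat, p0, n = 0.85, 0.80, 670

# In a hypothesis test the SE uses the null value p0, not p_hat.
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# One-sided (upper tail) p-value, since HA is p > 0.8.
p_value = 1 - NormalDist().cdf(z)

print(round(se, 4), round(z, 2), round(p_value, 4))
```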

P-value Example: College applications A survey at Duke asked students: How many colleges did you apply to? A sample of n=20 students responded. In the sample: x̄ = 9.7 college applications s = 7 college applications College Board website states that counselors recommend students apply to roughly 8 colleges. Do these data provide convincing evidence that the average number of colleges all Duke students apply to is higher than recommended? Given this, what is the probability that we'd observe a sample of 20 students with at least 9.7 applications?

H0: Students at Duke applied to the number of colleges suggested or fewer (μ ≤ 8). HA: Students at Duke applied to more colleges than suggested (μ > 8). To test this, we need to compare our data to other (theoretical) data that would be generated under the null hypothesis (i.e., a sampling distribution). Assume H0 is true, i.e., μ = 8. Then the sample mean is distributed x̄ ~ N(µ = 8, SE = σ/√n). In order to make inferences, we can standardize this to use the standard normal distribution. But we don't know σ. Therefore, we use the sample sd s as an estimate for σ; note that s is based on only n = 20 observations, so T = (x̄ - µ) / (s/√n) ~ t(df = 19) (the t distribution formula!). Now we calculate: P(x̄ > 9.7 | µ = 8) = P((x̄ - 8) / (7/√20) > (9.7 - 8) / (7/√20)) = P(T > 1.086). Use R: P(T > 1.086) = 1 - pt(1.086, 19) = 0.1455. In comparison, if we had instead incorrectly used the normal distribution we would have gotten about 0.139
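
For readers without R at hand, here is a rough sketch of the same calculation in plain Python. The standard library has no t-distribution functions, so the upper-tail probability is approximated by numerically integrating the t density with Simpson's rule; the helper names t_pdf and t_upper_tail are assumptions of this example and are not a substitute for R's pt.

```python
from math import gamma, pi, sqrt


def t_pdf(t: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric about 0, so P(T > 0) = 0.5.
    return 0.5 - total * h / 3


# Values from the card.
x_bar, mu0, s, n = 9.7, 8, 7, 20
t_stat = (x_bar - mu0) / (s / sqrt(n))
p_value = t_upper_tail(t_stat, n - 1)

print(round(t_stat, 3), round(p_value, 4))
```

The result should agree with 1 - pt(1.086, 19) from R to about three decimal places.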

Recap: inference using difference of two small samples which formula do you use? what are the conditions? How do you test hypothesis? Confidence interval?

If σ1 or σ2 is unknown, the difference between the sample means follows a t-distribution with SE = √(s1²/n1 + s2²/n2). Conditions: independence within groups and between groups (i.e., samples collected randomly and each sample size less than 10% of its respective population, n < 10%); no extreme skew in either group/population. Hypothesis testing: T(df) = (point estimate - null value) / SE, where df = min(n1 - 1, n2 - 1). Confidence interval: point estimate ± t*(df) x SE

categorical data confidence intervals

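In general, we construct a CI using point estimate ± z* x SE, i.e., p̂ ± z* x SE(p̂). How should you estimate the SE? One approach: use the estimated values (p-hats) in the SE. Another approach: use p = 0.50 in the SE (the most conservative choice, since p(1 - p) is largest at p = 0.5)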

Example: The weights of diamonds are measured in carats, where 1 carat = 100 points, 0.99 carats = 99 points, etc The difference between the size of a 0.99 carat diamond and a 1 carat diamond is undetectable to the naked human eye. Does the price per point of a 1 carat diamond tend to be higher than the price per point of a 0.99 diamond? What is the parameter of interest? point estimate? precision of estimates?

In order to be able to compare equivalent units, we divide the prices of 0.99 carat diamonds by 99 and 1 carat diamonds by 100, and compare the average point prices. Parameter of interest: average difference between the point prices of all 0.99 carat and 1 carat diamonds. Point estimate: average difference between the point prices of sampled 0.99 carat and 1 carat diamonds. Precision of estimates: each sample mean has its own standard error, SE(x̄pt99) = s_pt99/√n_pt99 and SE(x̄pt100) = s_pt100/√n_pt100

How to calculate from sample to population

In practice, you will: 1. Estimate your sample model (in R using lm()) 2. Check your assumptions (based on residuals) 3. Then see what inferences you can make to the population. We will need H0 and HA We will use a t-test We will use a t-distribution to determine p-value

Recap: Line Fitting, Residuals, and Correlation

Inference for the slope for a single-predictor linear regression model: Hypothesis test: T = (b1 - null value) / SEb1, with df = n - 2. Confidence interval: b1 ± t*(df = n - 2) x SEb1. The null value is often 0 since we are usually checking for a relationship between the explanatory and the response variable. The regression output gives b1, SEb1, and the two-tailed p-value for the t-test for the slope where the null value is 0. We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope.

dummy variable

A categorical X cannot be entered into a regression directly. Instead, you need to create a new "dummy" variable - a variable that indicates if X = a certain value.

Example: Poverty vs. Region Explain the formula: poverty = β0 + β1·west

Intercept (β0): The intercept is the average value of Y when west = 0. west = 0 indicates that we are in the 'east' region. We call 'east' the reference category since it is the level excluded from the model. We can interpret the intercept as: the average poverty level for eastern states. Slope (β1): The slope is now the difference between the average Y values when west = 1 vs when west = 0. (Note that west = 0 here means 'east'.) We can interpret the slope as: the average difference in poverty levels between western and eastern states.

Assumptions (simple regression)

Linearity - only an assumption since you can't do anything else yet. The relationship between the explanatory and the response variable should be linear. Nearly normal residuals - this condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data. Constant variability - the variability of points around the least squares line should be roughly constant; this implies that the variability of residuals around the 0 line should be roughly constant as well (homoscedasticity). No single observation carries undue weight (no influential points) - check for outliers and high leverage points

What is the appropriate 𝙩* for a confidence interval for the average difference between the point prices of 0.99 and 1 carat diamonds?

Look at where the confidence level is 95% and α = 0.05

Continuous data distribution formula

N(𝞵, σ)

Types of variables

Numerical: - continuous - discrete Categorical - nominal (unordered categorical) - ordinal (ordered categorical) EX: gender --> categorical nominal sleep --> numerical continuous bedtime --> categorical ordinal countries --> numerical discrete telephone area code --> categorical nominal

Conditions: constant variability

The variability of points around the least squares line should be roughly constant. This implies that the variability of residuals around the 0 line should be roughly constant as well. Also called homoscedasticity.

binomial distribution in sampling distribution

P(k < # | p) can be approximated with N(µ, σ), where µ = np and sd = √(np(1 - p))
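
A small sketch comparing an exact binomial probability with its normal approximation, using only the Python standard library. The values n = 100, p = 0.5 and the cutoff 45 are hypothetical and chosen purely for illustration; no continuity correction is applied, so the two numbers agree only roughly.

```python
from math import comb, sqrt
from statistics import NormalDist

# Hypothetical example values (not from the card).
n, p, cutoff = 100, 0.5, 45

mu = n * p                   # mean of the approximating normal
sd = sqrt(n * p * (1 - p))   # standard deviation of the approximating normal

# Exact P(k < cutoff) by summing binomial probabilities for k = 0, ..., cutoff - 1.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(cutoff))

# Normal approximation N(mu, sd), evaluated at the cutoff.
approx = NormalDist(mu, sd).cdf(cutoff)

print(round(exact, 4), round(approx, 4))
```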

Recap - comparing two proportions formulas? Conditions?

Population parameter: (p1 - p2); point estimate: (p̂1 - p̂2). Conditions: independence within groups (random sample and 10% condition met for both groups); independence between groups; at least 10 successes and failures in each group (if not → randomization, Section 6.4). Standard error: for CI, use p̂1 and p̂2; for HT when H0: p1 = p2, use the pooled proportion p̂_pool = (# successes1 + # successes2) / (n1 + n2); for HT when H0: p1 - p2 = (some value other than 0), use p̂1 and p̂2

Recap- Inference for one proportion what are the conditions?

Population parameter: p, point estimate: p̂ Conditions independence - random sample and 10% condition at least 10 successes and failures Standard error: for CI: use p̂ for HT: use p0

Type 2 error rate and power

Power (1 - Type 2 error rate) depends on: ● A measure of the effect (e.g., how big the true difference is in the population) ● Sometimes other sample statistics ● Sample size (n) ● The Type I error of the test (𝛼).

Hypothesis testing possibilities: power

Power of a test is the probability of correctly rejecting H0, and the probability of doing so is 1 - β In hypothesis testing, we want to keep α and β low, but there are inherent trade-offs.

confidence interval construction (z vs. t)

Previously, we said that in order to construct a CI for µ, we could use: x̄ ± z* x SE where z* has to do with how "confident" we'd like to be. When z*=1.96 our interval is 95% confident when z* = 1.645 our interval is 90% confident When sample sizes are small we can similarly construct a CI for µ using: x̄ ± t(df)* x SE where t(df)* has to do with how confident we'd like to be and our sample size (df=n-1): when n=10, t(9)* = 2.26 our interval is 95% confident when n=10, t(9)* =1.83, our interval is 90% confident

correlation pros and cons

Pros: it is independent of scale; it always takes values between -1 and 1, indicating strength of association; it does not require direction (it is bidirectional). Cons: requires the relationship to be linear; may want to be able to interpret this in a meaningful way; may want to specify direction; correlation only handles two variables. **Regression allows for all these generalizations

IQR

Q1: 25th percentile Q2: median, 50th percentile Q3: 75th percentile IQR = Q3 - Q1

relationship between regression and correlation

R^2 tells us what percent of the variability in the response is explained by the model. For a single predictor, it is calculated as the square of the correlation coefficient (R)

CLT for proportions what are the conditions?

Remember, the CLT has conditions: observations must be independent; there are at least 10 successes and 10 failures (i.e., np ≥ 10 and n(1 - p) ≥ 10). Note: if p is unknown, then we use either p̂ (the sample estimate) or p0 (the null value) in the calculation of the standard error.

Hypothesis testing example: "Do the data provide convincing evidence that the average amount of sleep college students get per night is different than the national average?", i.e., H0: μ = 7 vs HA: μ ≠ 7 Assume we use 𝛼 = 0.05 to indicate we reject H0 (i.e., we reject if p-value < 0.05). What values of the sample mean (based on n = 10 students) would be large enough to indicate that the null hypothesis is untrue? x̄ = 6.88 s = 0.94 n = 10

SE ≈ 0.94/√10 ≈ 0.30. We first need to determine the critical values of the t-distribution. If T ~ t(df = n - 1 = 9), what values correspond to 𝛼 = 0.05? Use R: qt(0.025, 9) = -2.26 and qt(0.975, 9) = 2.26. Graph! (On a t(df = 9) curve, mark -2.26 and 2.26 on the x-axis and shade everything below -2.26 and above 2.26.) Then back-transform to values of x̄: reject H0 if x̄ > μ + t(9)* x SE = 7 + 2.26 x 0.30 = 7.68 or if x̄ < μ - t(9)* x SE = 7 - 2.26 x 0.30 = 6.32
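
The back-transformation can be sketched as below. The critical value 2.26 is taken from the card (R's qt), not computed, and the variable names are illustrative. Because the SE is not rounded to 0.30 first, the cutoffs come out as roughly 6.33 and 7.67 instead of 6.32 and 7.68.

```python
from math import sqrt

# Values from the card.
mu0, s, n = 7, 0.94, 10
x_bar = 6.88

# Two-sided critical value for t(df = 9) at alpha = 0.05, as given on the card.
t_star = 2.26

se = s / sqrt(n)
lower_cut = mu0 - t_star * se   # reject H0 if x_bar falls below this
upper_cut = mu0 + t_star * se   # reject H0 if x_bar falls above this

reject_h0 = x_bar < lower_cut or x_bar > upper_cut
print(round(lower_cut, 2), round(upper_cut, 2), reject_h0)
```

With the observed x̄ = 6.88 lying between the two cutoffs, this sketch does not reject H0.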

descriptive statistics

Sample statistics: x̄, s^2

R^2 vs. Adjusted R^2

Select the model that explains the most variation in the response variable R^2 increases when any variable is added to the model Adjusted R^2 applies a penalty for the number of predictors included in the model

sampling distribution for proportion

Since p̂ (sample estimate) = k/n, we can show that p̂ is approximately N(mean = p, SE = √(p(1 - p)/n)) when the CLT conditions hold

point estimate

Single value that serves as an estimate of a population parameter

interpretation of slope and intercept

Slope: for each one-unit increase in x, y is expected to increase / decrease on average by the slope. Intercept: when x = 0, y is expected to equal the intercept.

test statistics

Statistics that can be used as indicators of what is going on in a population and can be used to evaluate results; also called inferential statistics. For inference on the difference of two means where σ1 and σ2 are unknown, the test statistic is the T statistic: T = (point estimate - null value) / SE, where SE = √(s1²/n1 + s2²/n2) and df = min(n1 - 1, n2 - 1)

CI vs HT for proportions

Success-failure condition: CI: at least 10 observed successes and failures; HT: at least 10 expected successes and failures, calculated using the null value. Standard error: CI: calculate using the observed sample proportion, SE = √(p̂(1 - p̂)/n); HT: calculate using the null value, SE = √(p0(1 - p0)/n)

What does a 95% confidence interval mean?

Suppose we took many samples and built a confidence interval from each sample using the equation point estimate ± 2 x SE. Then about 95% of those intervals would contain the true population mean (μ)

categorical data SE formula

The SE is not independent of the mean (even in the population). Mean = p; SE = √(p(1 - p)/n)

inference for comparing proportions

The details are the same as before... CI: point estimate ± margin of error. HT: use Z = (point estimate - null value) / SE to find the appropriate p-value. We just need the appropriate standard error of the point estimate, which is the only new concept. For the standard error of the difference between two sample proportions, add the second proportion's term under the square root: SE = √(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2)

When analyzing paired data, what should we look at? Example: Each student took a reading and writing test. Are the reading and writing scores of each student independent of each other?

The difference in outcomes of each pair of observations: diff = read - write. Parameter of interest: average difference between the reading and writing scores of all high school students. Point estimate: average difference between the reading and writing scores of sampled high school students. Precision of the point estimate: the standard error of our estimate comes from x̄diff ~ N(µdiff, sdiff/√n), and thus T = (x̄diff - µdiff) / (sdiff/√n) ~ t(df). If X isn't extremely skewed and the sample size is large, we could use N(0,1) instead of t(df), but there is no reason to do so since the t-distribution and the normal become almost indistinguishable as n becomes very large.
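
A minimal sketch of the paired analysis: reduce each pair to one difference, then treat the differences as a single sample. The scores below are invented for illustration.

```python
import math
import statistics

def paired_t(first, second, null_value=0.0):
    """Return (T, df) for paired data, working on the per-pair differences."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)            # point estimate
    se = statistics.stdev(diffs) / math.sqrt(n)    # s_diff / sqrt(n)
    return (mean_diff - null_value) / se, n - 1

# Hypothetical reading and writing scores for five students
read = [57, 68, 44, 63, 47]
write = [52, 59, 33, 44, 52]
t, df = paired_t(read, write)
print(t, df)
```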

central limit theorem

The distribution of the sample mean is well approximated by a normal model: x̄ ~ N(μ, σ/√n). This approximation holds if these assumptions are met: ● Independence: Sampled observations must be independent. ● Sample size / skew: Either the population distribution is normal, or if the population distribution is skewed, the sample size must be large. ○ For slight to moderate skew, n > 30 will work. ○ As skew becomes more extreme, n must be larger. When these conditions are met, the normal distribution can be used for: ● Hypothesis testing ● Confidence intervals

what is an intercept?

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through (x̄, ȳ): b0 = ȳ - b1 x̄

Conditions: nearly normal residuals

The residuals should be nearly normal. This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data. Check using a histogram or normal probability plot of residuals.

Relationship between regression & correlation

The slope of the regression can be related to the correlation: b1 = (sy / sx) R
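
The slope and intercept formulas on these cards, b1 = (sy / sx) R and b0 = ȳ - b1 x̄, can be sketched as follows. The data set is a small made-up example and the function name is hypothetical.

```python
import statistics

def least_squares_line(x, y):
    """Return (b0, b1, r) using b1 = (s_y / s_x) * R and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    # Sample correlation coefficient R
    r = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = (s_y / s_x) * r
    b0 = y_bar - b1 * x_bar  # the line passes through (x_bar, y_bar)
    return b0, b1, r

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, r = least_squares_line(x, y)
print(b0, b1, r)
print(b0 + b1 * 6)  # prediction: plug x = 6 into the fitted line
```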

Pooled Estimate of a Proportion

This simply means finding the proportion of total successes among the total number of observations.
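
A short sketch of the pooled proportion, together with one common way it is used: as the value of p in the standard error when testing H0: p1 = p2. That use and the counts below are assumptions for illustration, not something stated on this card.

```python
import math

def pooled_proportion(k1, n1, k2, n2):
    """Total successes divided by total observations."""
    return (k1 + k2) / (n1 + n2)

def pooled_z(k1, n1, k2, n2):
    """Z statistic for H0: p1 = p2, using the pooled proportion in the SE (assumed usage)."""
    p_pool = pooled_proportion(k1, n1, k2, n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (k1 / n1 - k2 / n2) / se

# Hypothetical counts: 30 of 100 successes vs. 50 of 100 successes
print(pooled_proportion(30, 100, 50, 100))
print(pooled_z(30, 100, 50, 100))
```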

Under what conditions can most other sample statistics be approximated?

Under the right conditions (e.g., n>30 & no extreme skew), most other sample statistics have sampling distributions that can be approximated by a N(0,1) distribution

Under what conditions can the sampling distribution of a proportion be approximated?

Under the right conditions (i.e., np>10 & n(1-p)>10) the sampling distribution of a proportion can be approximated using a N(0,1) distribution

what is a prediction? (in a linear model)

Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction, simply by plugging in the value of x in the linear model equation

Recap: sampling distribution

We can estimate the population mean, μ, using the sample mean in our data (of size n). The estimator on average gets the population mean right. But our estimator might be imprecise (have a large SE), especially when our sample size n is small.

A measure of model fit: R^2

We can summarize how well the model fits using R^2. It tells us what percent of variability in the response variable is explained by the model. Conveniently, in simple linear regression R^2 is calculated as the square of the correlation coefficient (R).
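
To see the two descriptions of R^2 agree, the sketch below computes it both as 1 - SSE/SST ("variability explained") and as the squared correlation, for a single-predictor least squares fit. The data and the function name are hypothetical.

```python
import math

def r_squared_two_ways(x, y):
    """Return (1 - SSE/SST, R**2) for a simple least squares fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)  # total variability (SST)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # unexplained part
    r = sxy / math.sqrt(sxx * syy)
    return 1 - sse / syy, r ** 2

# Hypothetical data
print(r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```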

Criterion for "best"

We want a line that has small residuals. Option 1: Minimize the sum of magnitudes (absolute values) of residuals: |e1| + |e2| + ... + |en|. Option 2: Minimize the sum of squared residuals (least squares): e1^2 + e2^2 + ... + en^2. Why least squares? It is mathematically easier to work with (for theory generation). In many applications, a residual twice as large as another is usually more than twice as bad. Importantly, for a given set of data it leads to only one estimate for the line.
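
The two criteria can be compared directly for any candidate line. In this sketch the data are made up; the first candidate (b0 = 2.2, b1 = 0.6) is the least squares line for those data and the second is an arbitrary alternative.

```python
def residuals(x, y, b0, b1):
    """Observed minus predicted values for the line y_hat = b0 + b1 * x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def sum_abs(res):
    """Option 1: sum of absolute residuals."""
    return sum(abs(e) for e in res)

def sum_sq(res):
    """Option 2: sum of squared residuals (least squares criterion)."""
    return sum(e ** 2 for e in res)

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
ls_line = residuals(x, y, b0=2.2, b1=0.6)   # least squares fit for this data
alt_line = residuals(x, y, b0=2.0, b1=0.6)  # an arbitrary alternative line
print(sum_sq(ls_line), sum_sq(alt_line))
print(sum_abs(ls_line), sum_abs(alt_line))
```

For these particular numbers the least squares line has the smaller sum of squares, while the alternative has the smaller sum of absolute residuals, so the two criteria need not prefer the same line.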

What happens if X is categorical?

When X is categorical, it needs to be recoded into something numeric: a dummy variable (coded 0 or 1).
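
A minimal sketch of dummy coding for a two-level categorical variable. The function name, the level labels, and the choice of reference level are hypothetical.

```python
def dummy_code(values, reference):
    """Recode a two-level categorical variable: reference level -> 0, other level -> 1."""
    return [0 if v == reference else 1 for v in values]

# Hypothetical group labels
groups = ["control", "treatment", "treatment", "control"]
print(dummy_code(groups, reference="control"))
```

With this coding, the slope in a linear model is the estimated difference in the response between the two groups, and the intercept is the estimate for the reference group.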

when n<30 what happens?

When n is small we have to use another formula. Because the denominator, s/√n, is estimated rather than known (𝜎/√n), we are more likely to see extreme sample means (x̄ values) simply by chance than the N(0,1) distribution indicates. EX: If you were to examine the sampling distribution of the sample mean (x̄ values) when n=5 and compare it to the N(0,1), you'd find heavier tails. *Heavy tails indicate extreme values are more likely.

When do you use the t distribution?

When the standard deviation is unknown and when sample sizes are smaller

sampling distribution of the mean

x̄ ~ N(μ, σ/√n), where: ● The sample mean is an unbiased estimator (in the average sample, we get the right answer). ● The precision of our estimate is summarized using the standard error (SE). This is the SD of the sampling distribution (be careful, this is confusing). The SE is a function of both the variability in the data (σ) and the sample size (n).

Conditions: influential points. How do outliers influence the least squares line?

Without the outliers the regression line would be steeper, and lie closer to the larger group of observations. With the outliers the line is pulled up and away from some of the observations in the larger group.

Conditions for using t distribution

independent observations; no extreme skew

bar plot

a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.

contingency table

a table that summarizes data for 2 categorical variables

sampling bias: convenience sample

individuals who are easily accessible are more likely to be included

stacked dot plot

higher bars represent areas where there are more observations, makes it a little easier to judge the center and the shape

standard error: proportions

if doing a hypothesis test, p (population parameter) comes from the null hypothesis; if constructing a confidence interval, use the point estimate instead

Types of graphs

dot plots, stacked dot plots, histograms, box plots, pie charts, bar plots, scatterplots, segmented bar plots, and mosaic plots

one sample inferences: continuous and categorical variables

Continuous: estimate the sample mean and make inferences about the population mean. Categorical: estimate the sample proportion and make inferences about the population proportion.

placebo

fake treatment, often used as the control group for medical studies

point estimate formula

first point - second point

When are statistical analyses conducted?

When making comparisons between groups. E.g., a treatment and control group are compared (a manipulated design leading to causal inferences). E.g., a treatment and control group are compared (where treatment is chosen, so an observational design leading to associations). E.g., females and males are compared (an observed design leading to associations). In each case, sample statistics (e.g., means) are calculated twice and compared, and inferences are made about whether these differ in the population. Use the t(df) distribution.

sampling bias: voluntary response

when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.

paired data definition

when two sets of observations are dependent

independent variables

when two variables are not associated

associated variables

when two variables show some connection with one another. Can also be called dependent variables. Note: this does not say how the relationship works.

confidence interval construction for t distribution

x̄ ± t* × s/√n, where t* depends on how "confident" we'd like to be and on our sample size (df = n - 1). When n=10, t*=2.26 gives a 95% confidence interval; when n=10, t*=1.83 gives a 90% confidence interval.
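
A minimal sketch of a t-based interval for a mean, assuming the usual form x̄ ± t* × s/√n. The t* values for n = 10 come from this card; the sample mean and standard deviation are invented.

```python
import math

def t_interval(x_bar, s, n, t_star):
    """Confidence interval for a mean: x_bar +/- t_star * s / sqrt(n)."""
    margin = t_star * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical sample summary with n = 10 (so df = 9)
x_bar, s, n = 20.0, 5.0, 10
print(t_interval(x_bar, s, n, t_star=2.26))  # 95% interval
print(t_interval(x_bar, s, n, t_star=1.83))  # 90% interval (narrower)
```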

whiskers and outliers

The whiskers of a box plot can extend up to 1.5 x IQR away from the quartiles. Max upper reach: Q3 + 1.5 x IQR. Max lower reach: Q1 - 1.5 x IQR. Potential outlier: an observation beyond the maximum reach of the whiskers.
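
The whisker rule can be sketched as below. The quartiles are passed in directly, since quartile conventions differ between tools, and the data values are hypothetical.

```python
def whisker_limits(q1, q3):
    """Maximum reach of the whiskers: 1.5 x IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(data, q1, q3):
    """Observations beyond the maximum reach of the whiskers."""
    lower, upper = whisker_limits(q1, q3)
    return [v for v in data if v < lower or v > upper]

# Hypothetical quartiles and data
print(whisker_limits(10, 20))
print(potential_outliers([-6, 3, 12, 18, 36], q1=10, q3=20))
```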

interpretation of the line

ŷ = β0 + β1x, where ŷ = predicted y, β0 = intercept, β1 = slope, x = explanatory variable. Intercept notation: parameter β0, point estimate b0. Slope notation: parameter β1, point estimate b1.

hypothesis t-test for slope

T = (b1 - null value) / SEb1. Here: the point estimate b1 is the estimated slope, and the null value is the value of β1 under H0. SEb1 is the standard error associated with the slope estimate. Degrees of freedom associated with the slope is df = n - 2, where n is the sample size. Note: we lose 1 degree of freedom for each parameter we estimate, and in simple linear regression we estimate 2 parameters, β0 and β1.
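
A small sketch of the slope test, assuming the statistic has the usual (point estimate - null value) / SE form. The slope, its standard error, and the sample size below are hypothetical; in practice they come from regression output.

```python
def slope_t(b1, se_b1, n, null_value=0.0):
    """Return (T, df) for a test on the slope in simple linear regression."""
    t = (b1 - null_value) / se_b1
    df = n - 2  # two estimated parameters: intercept and slope
    return t, df

# Hypothetical regression output
t, df = slope_t(b1=0.6, se_b1=0.2, n=27)
print(t, df)
```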

