STATS EXAM 3

RStudio: Hypothesis tests for a single population

# Test the hypothesis that, on average, U.S. states reported at least 1,000 rape cases in 2018
# Null hypothesis (H0): μ = 1,000; alternative hypothesis (HA): μ ≠ 1,000
# Step 1: Check the assumption of data normality
# Step 2: Convert to z-scores (?)
# Step 3: Calculate the t-statistic (?)
# Comparing before and after on a continuous variable, or asking whether two samples are significantly different, means comparing two samples: use a two-sample t-test
t.test(PROPERTY, mu=100000, alternative="greater", conf.level=0.95)
sd(PROPERTY)
summary(PROPERTY)

The Logic of Hypothesis Testing

- State the hypotheses / define the decision method / gather data / make a decision
- Two opposing hypotheses about the value of a population parameter: the null hypothesis and the alternative hypothesis

Regression Coefficient (690-6)

(describe differences) -Regression coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response. -In linear regression, coefficients are the values that multiply the predictor values. -Suppose you have the following regression equation: y = 3X + 5. -In this equation, +3 is the coefficient, X is the predictor, and +5 is the constant. -The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable. -A positive sign indicates that as the predictor variable increases, the response variable also increases. -A negative sign indicates that as the predictor variable increases, the response variable decreases. -The coefficient value represents the mean change in the response given a one unit change in the predictor. -For example, if a coefficient is +3, the mean response value increases by 3 for every one unit change in the predictor.

Correlation Coefficient (690-6)

(describe differences) Correlation coefficient: the numerical measure of the strength and direction of the linear association between the independent variable x and the dependent variable y. The correlation coefficient is calculated as: $r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{[N\sum x^2 - (\sum x)^2][N\sum y^2 - (\sum y)^2]}}$
-Where N = number of data points
-r is always between -1 and +1
-The size of r indicates the strength of the relationship
-If r = 0, there is no linear relationship
-If r = +1 or -1, there is a perfect positive or negative correlation

RStudio: Create and interpret a scatter plot with the line of best fit

(will use variables that work for scatter plot)
var(COVID)
hist(COVID$TOTALHEALTH)
scatter.smooth(x=COVID$TOTALHEALTH, y=COVID$Age)
scatter.smooth(x=MVT, y=MURDER)

Linear Regression: Define & Examples

- Scenarios where you would use one: finding the line of best fit or least-squares line (687), making predictions of what would happen in the future, r^2.
Line of best fit/least-squares line: the line that best fits the data values in a scatter plot.
-Recognizes that we will rarely find a true linear relationship; instead, we try to find a line that best fits the scattered data values.
-The least-squares line seeks to find the best-fitting line.
=Each data point is denoted (x, y); the matching point on the least-squares line is denoted $(x, \hat{y})$.
=$\hat{y}$ ("y-hat") refers to the "estimated" value of y.
=$y_0 - \hat{y}_0 = e_0$ is the "error" or "residual" term.
-The absolute value of a residual measures the vertical distance between the actual data point and the predicted point on the line.
=(+) errors mean observed data values lie above the line, suggesting the line underestimates the actual data value for y.
=(-) errors mean observed data values lie below the line, suggesting the line overestimates the actual data value for y.
=Calculate the residual term for each data point.

Describe the assumptions in testing the significance of the correlation coefficient

-Assumption #1: The test is performed on a sample of observed data values taken from a larger population. -Assumption #2: The regression line equation that we calculate is the best fit for the data, which is gauged by reviewing the scatterplot. -Assumption #3: There is a linear relationship in the population that models the average value of y for varying values of x. -Assumption #4: The y values for any x value are normally distributed about the line. -Assumption #5: The standard deviations of the population y values about the line are equal for each value of x. I.e., each of these normal distributions of y values has the same shape and spread about the line. -Assumption #6: The residual errors are mutually independent (no pattern). -Assumption #7: The data are produced from a well-designed, random sample or randomized experiment. -Assumption #8: The researcher has investigated and accounted for the effect of outliers and influential points on the best-fitting line.

Correlation vs Causation

-Correlation means there is a relationship or pattern between the values of two variables. A scatterplot displays data about two variables as a set of points in the xy-plane and is a useful tool for determining if there is a correlation between the variables.
-Causation means that one event causes another event to occur. Causation can only be determined from an appropriately designed experiment. In such experiments, similar groups receive different treatments, and the outcomes of each group are studied. We can only conclude that a treatment causes an effect if the groups have noticeably different outcomes.
-Correlation does not equal causation.

Coefficient of Determination (691)

-Coefficient of determination: $r^2$ is the square of the correlation coefficient, but is usually stated as a percentage rather than as the decimal it first appears as (i.e., you should convert it from a decimal to a percentage).
-When presented as a percentage, $r^2$ represents the percent of variation in the dependent variable (y) that can be explained by variation in the independent variable(s) (x) using the regression (best-fit) line.
-When presented as a percentage, $1 - r^2$ represents the percent of variation in the dependent variable (y) that is NOT explained by variation in the independent variable(s) (x) using the regression line.
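A minimal R sketch of the idea above, assuming the built-in mtcars data as a stand-in for the course data (the variables wt and mpg are not from the study guide). It squares r and checks that the result matches the R-squared reported by lm().

```r
# Assumption: mtcars (ships with R) stands in for the course data.
x <- mtcars$wt    # predictor: car weight
y <- mtcars$mpg   # response: miles per gallon

r <- cor(x, y)                          # sample correlation coefficient
r_squared <- r^2                        # coefficient of determination
fit <- lm(y ~ x)                        # simple linear regression
r_squared_lm <- summary(fit)$r.squared  # same quantity, taken from the model

# Convert from decimal to percentage, as the card recommends
pct_explained   <- round(100 * r_squared, 1)        # variation in y explained by x
pct_unexplained <- round(100 * (1 - r_squared), 1)  # variation in y left unexplained

cat("Explained:", pct_explained, "%  Unexplained:", pct_unexplained, "%\n")
```

In simple linear regression (one predictor) the two routes agree; with several predictors, use the R-squared from the model summary instead.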

Test the significance of a Correlation Coefficient

-Describe how one would test the significance of a correlation coefficient
-Describe the assumptions in testing the significance of the correlation coefficient
=We must consider the value of the correlation coefficient (i.e., the strength of the relationship) together with the size of the sample (the "n").
-The hypothesis test of the significance of the correlation coefficient determines whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.
-ρ (Greek "rho") = population correlation coefficient (typically unknown).
-r = sample correlation coefficient (known, and calculated from the sample data).
-The hypothesis test is used to determine whether the value of the population correlation coefficient ρ is "close to zero" or "significantly different from zero," which is determined by investigating the strength of the correlation coefficient (r) and the sample size (n).
-If the correlation coefficient is significantly different from zero, the correlation coefficient is "significant".
=Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between x and y.
=There is a significant linear relationship between x and y.
-If the correlation coefficient is not significantly different from zero, the correlation coefficient is "not significant".
=Conclusion: There is insufficient evidence to conclude that there is a significant linear relationship between x and y.
=There is not a significant linear relationship between x and y; thus, we cannot use the regression line to model a linear relationship in the population.
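The test described above can be sketched in R. This is an illustration under stated assumptions: mtcars stands in for the course data, and the test statistic used is the standard t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom, which the card itself does not spell out.

```r
# Assumption: mtcars stands in for the course data.
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

r <- cor(x, y)                                # sample correlation coefficient
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)     # test statistic for H0: rho = 0
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

builtin <- cor.test(x, y)                     # R's built-in version of the same test

alpha <- 0.05
significant <- p_value < alpha                # TRUE means r is "significant"
cat("r =", round(r, 3), " t =", round(t_stat, 2), " significant:", significant, "\n")
```

Both r and n enter the statistic, which is why a moderate r can be significant in a large sample and not in a small one.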

Linear Equation: Describe (680)

-Simple linear regression includes one independent variable (IV) and one dependent variable (DV). Simple linear regression is denoted: y = a + bx, where a and b are constant numbers.
x = independent variable
y = dependent variable
b = regression slope
a = intercept
-Because y = a + bx is a linear relationship, we can fit a straight line to the equation.

RStudio: Two-sample t-test in RStudio using a variable with two groups

TOTHealthttest <- t.test(TOTALHEALTH ~ Gender, data=COVID, alternative="two.sided", var.equal=TRUE, conf.level=0.95)
TOTHealthttest
-The output reports the t-statistic and the p-value; the t-value is smaller (in absolute value) when the p-value is larger. The significance level is 0.05.
-Confidence interval: if it runs from a negative number to a positive number, it contains 0 and the difference is not statistically significant.
-The output also reports the mean (average) of each group.

Least-squares Line (687)

-Why someone would want to use it:
-Least-squares line criteria for best fit
-Linear regression: the process of fitting the best line.
-Assumption = the best-fitting line conforms to data values on a scatterplot that approximate a straight line.
-Criteria for the best-fit line = the sum of the squared errors (SSE) is minimized (made as small as possible), meaning that any other line would have a higher SSE than the best-fit line. The best-fit line = least-squares regression line.
-Interpreting the slope = the slope of the best-fit line tells us how the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x).

Identify and Describe: Different components of Linear Regression (685)

-beta coefficient/regression line, slope
-line of best fit or least-squares line (687)
Linear regression analysis: a form of statistical analysis that seeks the equation for the straight line that best describes the relationship between two continuous variables. Regression line equation: Y' = a + b(X)

Hypothesis Testing: Definition (534)

-involves the collection and evaluation of data from some sample
-the statistician decides whether there is enough data/evidence to reject the null hypothesis
-a hypothesis test sets up (2) contradictory statements (the null and alternative hypotheses) and ends with 1) a decision based on the data and 2) a conclusion

Hypothesis Testing: Steps (506)

5 Steps) 1) set up two contradictory hypotheses, 2) collect sample data, 3) determine the correct distribution to perform the hypothesis test, 4) analyze sample data, 5) make a decision (to reject or not reject the null hypothesis) and report the conclusions

Independent Groups (579)

distinguish between: Independent groups—two groups that are independent, meaning values in one group are independent of values in a different group. ◦ Independent group comparisons use population means or population proportions.

Decision vs Conclusion (513)

Describe the differences between a decision and a conclusion, and how a statistician might evaluate both (reference table)
-To determine whether a null hypothesis should be rejected or not rejected, statisticians consider the p-value and alpha (α) associated with a hypothesis test.
• A preset α is the probability of committing a Type I error (rejecting the null hypothesis when the null hypothesis is true).
• When making a decision to reject or not reject the null hypothesis (H0), do the following:
• If α > p-value (alpha greater than the p-value), reject H0. Results of the test are significant. Evidence suggests that H0 is incorrect and the alternative hypothesis, Ha, may be correct.
• If α ≤ p-value (alpha less than or equal to the p-value), do not reject H0. Results of the test are not significant. There is not sufficient evidence to conclude that H0 is incorrect.
• When you "do not reject H0", it does not necessarily mean H0 is true; rather, the sample data failed to provide sufficient evidence to question the truthfulness of H0.
• Conclusion: write a conclusion, given the result of the hypothesis test.
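The decision rule on this card can be written as a tiny R helper. The function name decide() is made up for illustration; only the α versus p-value comparison comes from the card.

```r
# Hypothetical helper: turns a p-value and a preset alpha into a decision.
decide <- function(p_value, alpha = 0.05) {
  if (alpha > p_value) {
    "reject H0"          # results are significant
  } else {
    "do not reject H0"   # results are not significant
  }
}

decide(0.01)   # alpha > p-value
decide(0.20)   # alpha <= p-value
decide(0.05)   # p-value equal to alpha counts as "do not reject"
```

The decision is only the first half; the conclusion still has to be written in words about the original claim.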

Describe hypothesis testing with two populations with known standard deviations (574)

Describe the logic of/describe the steps: An uncommon situation, because we rarely know the population standard deviations. Assumptions:
◦ The sampling distribution for the difference between the means is normal, and both populations must be normal.
◦ The random variable is $\bar{X}_1 - \bar{X}_2$.
The normal distribution has the following format: $\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$

Hypothesis testing with two population means with unknown standard deviations (568)

Describe the logic of/describe the steps:
-The two independent samples are simple random samples from two distinct populations.
-For the two distinct populations:
◦ If the sample sizes are small, the distributions are important (should be normal).
◦ If the sample sizes are large, the distributions are not important (need not be normal).
-Aspin-Welch t-test: used when comparing two independent population means with unknown and possibly unequal population standard deviations.
-We estimate the population standard deviations using the two sample standard deviations from the independent samples.
-For the hypothesis test, we calculate the estimated standard deviation, or standard error, of the difference in sample means, $\bar{X}_1 - \bar{X}_2$.
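A short R sketch of the Aspin-Welch test, using simulated data (the group sizes, means, and standard deviations below are invented for illustration). It computes the standard error of the difference in sample means by hand and compares the resulting t-score with t.test().

```r
# Simulated samples; all numbers here are made up for illustration.
set.seed(1)
group1 <- rnorm(15, mean = 10, sd = 2)
group2 <- rnorm(20, mean = 12, sd = 4)

# Standard error of the difference in sample means,
# built from the two sample variances
se <- sqrt(var(group1) / length(group1) + var(group2) / length(group2))

# Test statistic for H0: mu1 - mu2 = 0
t_stat <- (mean(group1) - mean(group2)) / se

# var.equal = FALSE (the default) requests the Welch version
welch <- t.test(group1, group2, var.equal = FALSE)
welch
```

The degrees of freedom in the output are usually not a whole number; that is expected for the Welch test.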

RStudio: (conduct and interpret) Chi-square goodness-of-fit hypothesis test

Don't worry about how to calculate it by hand. Know why we use it: the goodness-of-fit test asks whether a model fits the data well overall. Know why we would run a chi-square test on variables that are categorical in nature: to find whether they are significantly related to each other.***

Chi-square Distribution: unique features (X^2)

Facts about the chi-square distribution
Most common uses of a chi-square distribution:
• The goodness-of-fit test tests whether the data fit a particular distribution.
◦ E.g., whether most students at ULV are Latino/a, Caucasian, Black, Asian, etc.
• The test of independence determines if events are independent.
◦ E.g., whether a particular major at ULV prefers one or more professions: Criminologists, law enforcement and social work; Child development majors, education; Biologists, medical doctors and scientists.
• The test of a single variance tests variability.
◦ E.g., whether a batter hits the ball a similar distance each time s/he swings the bat.
-------
The chi-square distribution is denoted as: $X \sim \chi^2_{df}$
Where df = degrees of freedom, which depends on how chi-square is being used. The degrees of freedom are calculated differently for each of the three common uses.
• Population mean is μ = df.
• Population standard deviation is $\sigma = \sqrt{2(df)}$.
• The random variable for a chi-square distribution with k degrees of freedom is the sum of k independent, squared standard normal variables.
---------
1. The curve is nonsymmetrical and skewed to the right.
2. There is a different chi-square curve for each df.
3. The test statistic for any test is always greater than or equal to zero.
4. When df > 90, the chi-square curve approximates the normal distribution. For $X \sim \chi^2_{1,000}$, the mean is μ = df = 1,000 and $\sigma = \sqrt{2(1,000)} = 44.7$. Thus, X ~ N(1,000, 44.7), approximately.
5. The mean, μ, is located just to the right of the peak.
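These facts can be checked numerically in R. The sketch below is illustrative only: the choice of k = 5, the 10,000 simulated draws, and the cutoff 1,050 are arbitrary.

```r
# Mean and standard deviation for df = 1,000
df <- 1000
mu <- df                 # population mean
sigma <- sqrt(2 * df)    # population standard deviation
round(sigma, 1)

# A chi-square variable with k df is a sum of k squared standard normals
set.seed(42)
k <- 5
draws <- replicate(10000, sum(rnorm(k)^2))
mean(draws)              # should land close to k
min(draws) >= 0          # a sum of squares is never negative

# For large df, the normal curve is a close approximation
p_exact  <- pchisq(1050, df = df)              # exact chi-square probability
p_normal <- pnorm(1050, mean = mu, sd = sigma) # normal approximation
c(exact = p_exact, normal = p_normal)
```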

RStudio: Outlier Analyses (compute and interpret)

Identify in scatter plot (beyond 3 standard deviations from the mean)

Matched Pairs (584)

distinguish between: Matched pairs—two groups whose values are dependent. ◦ Matching comparison uses the population mean.

Assumptions: Hypothesis testing with one sample

Hypothesis testing of a single population mean (μ) using a Student's t distribution has several assumptions that must be met:
• Data should be from a simple random sample,
• That comes from an approximately normal distribution, and
• The sample standard deviation is used to approximate the population standard deviation.
• When the single population mean hypothesis test instead uses a normal distribution (i.e., z-scores) to draw comparisons, the population standard deviation must be known.
To perform a hypothesis test for a single population proportion (p), a random sample is taken from the population that meets the conditions for a binomial distribution. These include:
• There are a certain number (n) of independent trials,
• The outcomes of any trial are success or failure, and
• Each trial has the same probability of a success p.
• The shape of the binomial distribution must approximate the normal distribution.

Correlation and Regression

If we wish to label the strength of the association, for absolute values of r: 1) 0-0.19 is regarded as very weak, 2) 0.2-0.39 as weak, 3) 0.40-0.59 as moderate, 4) 0.6-0.79 as strong, and 5) 0.8-1 as very strong correlation (but these are rather arbitrary limits, and the context of the results should be considered).
-----
[Between -0.3 and +0.3: weak correlation. Beyond -0.7 or +0.7: strong correlation. Between 0.3 and 0.7 in absolute value: moderate correlation. Strength of relationship, -1 to +1, slope for correlation analysis]

Error term in Linear Regression Equation (687)

If you add all the squared error terms together, thus producing the model's Sum of Squared Errors (SSE), the denotation is as follows: $SSE = e_1^2 + e_2^2 + \dots + e_{11}^2 = \sum_{i=1}^{11} e_i^2$
-Using this formula, you can calculate the values of a and b that make SSE a minimum, which identifies the line of best fit.
-The line of best fit is $\hat{y} = a + bx$, where $a = \bar{y} - b\bar{x}$ and $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$; $\bar{x}$ and $\bar{y}$ are the sample means of the x and y values, respectively.
-The line of best fit always passes through the point $(\bar{x}, \bar{y})$.
-The slope is $b = r\left(\frac{s_y}{s_x}\right)$, where $s_y$ = the standard deviation of the y values, $s_x$ = the standard deviation of the x values, and r = the correlation coefficient.
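The formulas for a, b, and SSE can be tried out in R. This is a sketch, assuming mtcars as stand-in data (it is not the 11-point data set the card refers to); it computes the line by hand and checks it against lm().

```r
# Assumption: mtcars stands in for the course data.
x <- mtcars$wt
y <- mtcars$mpg

# Slope and intercept from the least-squares formulas
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# Predicted values, residuals (errors), and the sum of squared errors
y_hat <- a + b * x
e <- y - y_hat
sse <- sum(e^2)

# The slope can also be written as r * (s_y / s_x)
b_from_r <- cor(x, y) * sd(y) / sd(x)

# R's built-in least-squares fit should give the same line
fit <- lm(y ~ x)
coef(fit)
c(intercept = a, slope = b, SSE = sse)
```

Because the line passes through the point of means, the residuals sum to zero (up to rounding), which is why the errors are squared before adding.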

Classify Hypothesis tests by type: Independent groups vs Matched pairs

Much research draws comparisons between multiple groups:
◦ Compare men vs. women,
◦ Compare a placebo/control group and an experimental group,
◦ Compare two cities with different policies, or
◦ Compare two universities.
Groups are classified as either "independent" or "matched" pairs.
◦ Independent groups: two groups that are independent, meaning values in one group are independent of values in a different group.
◦ Independent group comparisons use population means or population proportions.
◦ Matched pairs: two groups whose values are dependent.
◦ Matching comparison uses the population mean.

Type I Error (508)

distinguish between: α = probability of Type I error P(Type I error) = probability of rejecting the null hypothesis when the null hypothesis is true. -Ideally, both alpha and beta are small, though they are rarely = 0.

Test for Homogeneity (638)

Used to draw conclusions about whether two populations have the same distribution. Follow the same approach as testing for independence.
Hypotheses:
• $H_0$: The distributions of the two populations are the same.
• $H_a$: The distributions of the two populations are not the same.
Test statistic:
• Use a $\chi^2$ test statistic. The computation is the same as testing for independence.
Degrees of freedom (df):
• df = number of columns - 1
Requirements:
• All values in the table must be greater than or equal to 5.
Common uses:
• Comparing two populations. E.g., men vs. women, before vs. after, east vs. west, Republican vs. Democratic, etc.
• The variable must be categorical with more than 2 possible response values.

RStudio: Conduct and Interpret hypothesis tests for two population mean

TOTHealthttest <- t.test(TOTALHEALTH ~ Gender, data=COVID, alternative="two.sided", var.equal=TRUE, conf.level=0.95)
TOTHealthttest

Test of Independence (633)

Tests of independence involve using a contingency table of observed (data) values.
• The test statistic for a test of independence is similar to that of a goodness-of-fit test: $\chi^2 = \sum_{(i \cdot j)} \frac{(O - E)^2}{E}$
Where:
• O = observed values
• E = expected values
• i = the number of rows in the table
• j = the number of columns in the table
• There are i*j terms of the form $\frac{(O - E)^2}{E}$.
• A test of independence determines whether two factors are independent or not.
• Note: the expected value in each cell needs to be at least 5 for this test to work.
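A hand-worked version of the test of independence in R. The 2x2 counts below are invented for illustration, and the sketch assumes the usual rules that each expected count is (row total × column total) / grand total and that df = (rows − 1)(columns − 1), neither of which is spelled out on the card.

```r
# Hypothetical contingency table of observed counts (made-up numbers)
observed <- matrix(c(30, 20,
                     10, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("A", "B"),
                                   answer = c("yes", "no")))

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# One (O - E)^2 / E term per cell, added up
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)   # right-tailed

# Built-in test; correct = FALSE switches off the continuity
# correction that R applies to 2x2 tables by default
builtin <- chisq.test(observed, correct = FALSE)
builtin$expected   # check that every expected count is at least 5
```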

Goodness-of-fit test (623)

Used to determine whether the data "fit" a particular distribution.
• The null and alternative hypotheses may be written as sentences or equations.
• The test statistic can be written as: $\chi^2 = \sum_{k} \frac{(O - E)^2}{E}$
Where:
• O = observed values (data)
• E = expected values (from theory)
• k = the number of different data cells or categories.
• The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are k terms of the form $\frac{(O - E)^2}{E}$.
• The number of degrees of freedom is df = (number of categories - 1).
• The goodness-of-fit test is almost always right-tailed.
• If the observed values and the corresponding expected values are not close to each other, the test statistic can get very large and fall far out in the right tail of the chi-square distribution.
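A small R sketch of the goodness-of-fit test. The observed counts and the "all five categories equally likely" null hypothesis are invented for illustration.

```r
# Hypothetical observed counts in k = 5 categories (made-up numbers)
observed <- c(18, 22, 20, 25, 15)

# Null hypothesis (assumed here): all five categories are equally likely
expected_p <- rep(1 / 5, 5)
expected <- sum(observed) * expected_p     # expected count per category

# k terms of the form (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1                 # number of categories - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)   # right-tailed

# Built-in version: a vector of counts plus p runs a goodness-of-fit test
builtin <- chisq.test(observed, p = expected_p)
builtin
```

A small test statistic, as here, means the observed counts sit close to what the null hypothesis predicts.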

Describe at what degrees of freedom the chi-square distribution approximates a normal distribution

Where df = degrees of freedom, which depends on how chi-square is being used. The degrees of freedom are calculated differently for each of the three common uses.
• Population mean is μ = df.
• Population standard deviation is $\sigma = \sqrt{2(df)}$.
• The random variable for a chi-square distribution with k degrees of freedom is the sum of k independent, squared standard normal variables.
• When df > 90, the chi-square curve approximates the normal distribution.

Describe when Correlations are most useful

When we want to explore relationships between two variables that are continuous, ordinal, or a combination of continuous and ordinal.
-E.g., are property crime rates and unemployment rates related?
-Does one's level of education correlate with their income?
-Correlations are the first step in building linear regression models.
-Linear regression is used to model how a response variable changes with a predictor and to make predictions; on its own it does not prove cause and effect.
-Correlation is used to test relations.
REMEMBER: Correlation does not equal causation.

"Rare Events": Logic (511-13)

You make an assumption about a property of the population (i.e., the null hypothesis).
• When an outcome that would be highly unlikely (i.e., "rare") under that assumption actually happens, this causes the researcher to doubt the conditions of the test.
• E.g., if someone grabs a green apple from a barrel that has 1 green apple and 99 red apples, the researcher may believe that there is actually more than 1 green apple.

Cohen's D Statistic (574)

algorithm (for example what n stands for in the equation) Cohen's d: a measure of effect size based on the difference between two means. Measures the relative strength of the difference between the means of two populations based on sample data.
◦ Size of effect: 'small' = 0.2, 'medium' = 0.5, and 'large' = 0.8.
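A sketch of Cohen's d in R, assuming the common pooled-standard-deviation form d = (x̄1 − x̄2) / s_pooled, where n1 and n2 are the two sample sizes and s1², s2² the two sample variances. The card does not show the equation, so this form is an assumption; the function name cohens_d() and the use of mtcars are also illustrative.

```r
# Hypothetical helper implementing the pooled-SD form of Cohen's d
cohens_d <- function(x1, x2) {
  n1 <- length(x1)   # size of sample 1
  n2 <- length(x2)   # size of sample 2
  # Pooled standard deviation: weighted average of the two sample variances
  s_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / s_pooled
}

# Tiny worked case: means 4 and 3, both variances 4, so pooled SD = 2
cohens_d(c(2, 4, 6), c(1, 3, 5))

# Assumption: mtcars as stand-in data, manual (am == 1) vs automatic (am == 0)
d <- cohens_d(mtcars$mpg[mtcars$am == 1], mtcars$mpg[mtcars$am == 0])
d   # compare with the 0.2 / 0.5 / 0.8 benchmarks
```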

Identify the delineation of a chi-square distribution and its (population) standard deviation.(622)

algorithm: study guide**lecture

Type II Error (508)

distinguish between: β = probability of Type II error P(Type II error) = probability of not rejecting the null hypothesis when the null hypothesis is false. -Ideally, both alpha and beta are small, though they are rarely = 0.

RStudio: Correlation Coefficients (compute and interpret)

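cor(FBI2018)  # correlation matrix; every r falls in the -1 to +1 range (e.g., beyond +0.7 or -0.7 is strong)
cor.test(MVT, MURDER)  # cor.test() takes two numeric vectors rather than a whole data frame; a small p-value means the correlation is significant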

Null hypothesis (506/514)

distinguish between: (reference table) H0: The null hypothesis: a statement of no difference between sample means or proportions, or no difference between a sample mean or proportion and a population mean or proportion. I.e., the difference = 0.
-Reject H0 if the sample data favors the alternative hypothesis, or
-Do not reject H0 (or "decline" to reject H0) if the sample information is insufficient to reject the null hypothesis.

Alternative hypothesis (506/514)

distinguish between: (reference table) Ha: The alternative hypothesis: a claim about the population that is contradictory to H0 and what we conclude when we reject H0.
-Reject H0 if the sample data favors the alternative hypothesis, or
-Do not reject H0 (or "decline" to reject H0) if the sample information is insufficient to reject the null hypothesis.

Hypothesis testing for matched and paired samples (584-5)

identify when it is appropriate to use: 6 characteristics
When using a hypothesis test for matched or paired samples, the following characteristics should be present:
◦ Simple random sampling is used.
◦ Sample sizes are often small.
◦ Two measurements (samples) are drawn from the same pair of individuals or objects.
◦ Differences are calculated from the matched or paired samples.
◦ The differences form the sample that is used for the hypothesis test.
◦ Either the matched pairs have differences that come from a population that is normal, or the number of differences is sufficiently large so that the distribution of the sample mean of differences is approximately normal.
In a hypothesis test for matched or paired samples, subjects are matched in pairs and differences are calculated. The differences are the data. The population mean for the differences, $\mu_d$, is then tested using a Student's t-test for a single population mean with n - 1 degrees of freedom, where n is the number of differences. The test statistic (t-score) is: $t = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}}$
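"The differences are the data" can be seen directly in R. This sketch assumes R's built-in sleep data (the same 10 patients measured under two drugs) as a stand-in for course data, and tests H0: μ_d = 0.

```r
# Assumption: built-in sleep data; rows are ordered by patient within each group
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

d <- drug2 - drug1        # one difference per matched pair
n <- length(d)            # number of differences

# t-score for H0: mu_d = 0, with n - 1 degrees of freedom
t_stat <- (mean(d) - 0) / (sd(d) / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Two equivalent built-in routes
paired     <- t.test(drug2, drug1, paired = TRUE)  # paired test on both samples
one_sample <- t.test(d, mu = 0)                    # one-sample test on the differences
paired
```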

RStudio: (conduct and interpret) Chi-square single variance hypothesis test

p-value*

Scatter plot (682-5)

why someone would use a scatter plot and the information that can be gleaned by reviewing a scatter plot: -An effective way to visualize the relationship between 2 variables. -Scatter plots tell us the direction of a relationship between two variables, which is clear when: -High values of one variable occur alongside high values of another variable or low values of one variable occur alongside low values of another variable (i.e., positive correlation), -High values of one variable occur alongside low values of another variable (i.e., negative correlation). -To determine the strength of a relationship, identify where the data values are relative to the functional line (strong relationships have most data values clustering around the line). -This is true for linear, power, exponential, and other functions, though a horizontal linear line signifies no relationship. -Pay attention to the overall pattern and any deviations from the pattern.

RStudio: (conduct and interpret) Chi-square homogeneity hypothesis test

(only code)**
x2 <- chisq.test(table(COVID$Location, COVID$Living))
x2
x2$expected

Assumptions: Hypothesis Testing with a single population mean (511)

§ Hypothesis test of a single population mean using a Student's t distribution: take a simple random sample; the population is approximately normally distributed. Use the sample standard deviation to approximate the population standard deviation.
§ Hypothesis test of a single population mean using a normal distribution (z-test): take a simple random sample from the population. The population is normally distributed, or the sample is sufficiently large. You must know the value of the population standard deviation, which is rarely known.
§ Hypothesis test of a single population proportion: take a simple random sample from the population. It must meet the conditions for a binomial distribution: a certain number n of independent trials, the outcomes of any trial are success or failure, and each trial has the same probability of a success p. The shape of the binomial distribution needs to be similar to the shape of the normal distribution: np and nq must both be greater than five (np > 5 and nq > 5). Then the binomial distribution of a sample (estimated) proportion can be approximated by the normal distribution with μ = p and $\sigma = \sqrt{pq/n}$. Remember that q = 1 - p.

Test of Single Variance (642)

• A test of a single variance assumes that the underlying distribution is normal.
• The null and alternative hypotheses are stated in terms of the population variance (or population standard deviation).
• The test statistic is: $\chi^2 = \frac{(n - 1)s^2}{\sigma^2}$
Where:
• n = the total number of data.
• $s^2$ = sample variance.
• $\sigma^2$ = population variance.
• The degrees of freedom = n - 1.
• A test of a single variance may be right-tailed, left-tailed, or two-tailed.
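Base R has no single-variance chi-square test that takes a data vector directly, so the sketch below builds one from the test statistic above. The function name var_test_single(), the use of mtcars, and the hypothesized variance of 25 are all made up for illustration; the two-tailed p-value uses the common "double the smaller tail" convention.

```r
# Hypothetical helper for a chi-square test of a single variance
var_test_single <- function(x, sigma0_sq,
                            alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  n <- length(x)
  df <- n - 1
  stat <- (n - 1) * var(x) / sigma0_sq   # (n - 1) s^2 / sigma^2
  p_lower <- pchisq(stat, df = df)       # area to the left of the statistic
  p <- switch(alternative,
              less      = p_lower,                        # left-tailed
              greater   = 1 - p_lower,                    # right-tailed
              two.sided = 2 * min(p_lower, 1 - p_lower))  # two-tailed
  list(statistic = stat, df = df, p.value = p)
}

# Assumption: mtcars as stand-in data; H0: sigma^2 = 25 vs Ha: sigma^2 > 25
result <- var_test_single(mtcars$mpg, sigma0_sq = 25, alternative = "greater")
result
```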

