BY245 - Final Exam - Statistics Review

I ran an outlier test on my data and my p-value was significant (0.001, at an alpha level of 0.05). a. What does this mean? b. What should I do?

a. The extreme value is a statistically significant outlier, so it is likely affecting the overall mean of the data
b. Check the point for measurement or data-entry errors and, if removal is justified, remove the outlier from the data

Which graph is showing right-skewed data (positive)?

B

Normal Distribution

A probability distribution whose graph is a symmetrical, bell-shaped curve. Symmetric around the mean.

Correlation

A measure of the relationship between two variables: how much the 2 variables change together. Both variables are measured on the same set of subjects (people, countries, etc.). The relationship must be linear. The correlation coefficient alone does not tell you whether the relationship is significant.

Skewness

A measure of the shape of a data distribution. Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness.
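
The sign rule above can be checked numerically. The following is a minimal sketch using the moment-based skewness formula (mean cubed deviation divided by the standard deviation cubed); the data sets and the function name `skewness` are made up for illustration, and statistical packages may apply a sample-size adjustment that changes the value slightly but not the sign.

```python
from statistics import mean

def skewness(values):
    """Moment-based skewness: third central moment divided by SD cubed."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # population variance
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets (not from the course)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]   # long tail to the right
left_skewed = [-x for x in right_skewed]   # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3))
```

The right-skewed set gives a positive value, its mirror image a negative value, and the symmetric set zero.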

Linear Regression

A method of finding the best model for a linear relationship between the explanatory and response variables.
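
As an illustration of what "best model" usually means here, this sketch fits a line by ordinary least squares (the line that minimizes the sum of squared vertical distances). The data and the function name `fit_line` are hypothetical, not from the course.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: x is the explanatory variable, y the response variable
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope = fit_line(x, y)
print(f"y = {intercept:.1f} + {slope:.1f} * x")  # prints: y = 2.2 + 0.6 * x
```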

Boxplot

A plot of data that incorporates the maximum observation, the minimum observation, the first quartile, the second quartile (median), and the third quartile.
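
The five values a boxplot is built from can be computed directly. This sketch uses Python's standard `statistics.quantiles` with its default "exclusive" method; the data values are the same small set used in the measures of central tendency card. Quartile conventions differ between software packages, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles

data = [12, 15, 15, 17, 17, 18, 20]

# n=4 splits the data into quartiles and returns the 3 cut points
q1, median, q3 = quantiles(data, n=4)

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)
```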

Chi-square Test

A significance test used to determine whether the observed frequencies of categorical data differ from the frequencies expected by chance. Ex. I want to know if there was a difference between guilty and not-guilty verdicts in a trial. I poll the jurors and find that 60% voted guilty and 40% voted not guilty. Question: Is the difference due to chance or something more?
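
The juror example can be worked through by hand. The card gives only percentages, so this sketch assumes a hypothetical pool of 20 jurors (12 guilty, 8 not guilty) and an even split under the null hypothesis. For 1 degree of freedom the chi-square p-value has a closed form, which avoids needing a statistics package.

```python
from math import erfc, sqrt

# Assumption: 20 jurors in total, so 60% / 40% means 12 vs 8 votes
observed = {"guilty": 12, "not guilty": 8}
# Under the null hypothesis the votes split evenly
expected = {"guilty": 10, "not guilty": 10}

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

# 2 categories give 1 degree of freedom; for df = 1 the p-value
# equals erfc(sqrt(chi_square / 2))
p_value = erfc(sqrt(chi_square / 2))

print(round(chi_square, 2), round(p_value, 3))
```

With these assumed counts the p-value is well above 0.05, so the 60/40 split could plausibly be due to chance. A larger pool with the same percentages would give a larger statistic.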

Dotplot

A simple graph that shows each data value as a dot above its location on a number line. Used for smaller data sets (n < 50). Displays basic shape, center, spread of data. Single set of data. Ex. You go to the grocery store and get a carton of eggs; one is cracked inside. You wonder how many cartons of eggs contain cracked eggs. You select 30 cartons of eggs and record the number of cracked eggs in each carton.

Null hypothesis (H0)

A statement that the performance of treatment groups is so similar that the groups must belong to the same population; a way of saying that the experimental manipulation had no important effect. Similar to the presumption of innocence in the U.S. judicial system, the null hypothesis is assumed to be true until the evidence against it is strong enough to reject it.

Two Sample T-Test

A statistical method used to compare the means of 2 independent groups of subjects. If p < 0.05, the null hypothesis is rejected and we conclude that the means are different.
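
To show where the test statistic comes from, here is a minimal sketch of the equal-variance (pooled) two-sample t statistic. The two groups and the function name `pooled_t` are hypothetical. The p-value itself would normally come from statistical software or a t table.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, n_a + n_b - 2

# Hypothetical measurements for 2 groups
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]
t_stat, df = pooled_t(group_a, group_b)
print(round(t_stat, 3), df)
```

For 8 degrees of freedom the two-sided critical value at alpha = 0.05 is about 2.306, so a statistic of magnitude 2 would not reject the null hypothesis for these made-up data.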

Normality Test

A statistical process used to determine if a sample or any group of data fits a normal distribution. A normality test can be performed mathematically or graphically.

One-way ANOVA

A statistical test used to analyze data from an experimental design with one independent variable that has three or more groups (levels). Ex. The p-value for the paint hardness ANOVA is less than 0.05, so at least one group mean differs. Remember: ANOVA tells you that the groups differ overall, but it doesn't tell you which groups differ.

T-test

A statistical test used to evaluate the size and significance of the difference between two means. Used when the population standard deviation is not known.
Independent samples - compares means for two groups
Paired sample - compares means from the same group at different times
One sample - compares a mean against a known value
Assumes normal distribution and equal variances.

General Linear Model

A useful framework for comparing how several explanatory variables affect a continuous response variable. Takes regression and:
- Allows more explanatory variables
- Allows categorical variables (not just numerical)
General linear model: Stretching = Constant + Companion
H0: Companion treatment means are the same
Ha: Companion treatment means are not the same

Outlier

A value much greater or much less than the others in a data set.

What is indicated by the circled point in the following graph?

An outlier

Basic Idea Behind Analysis of Variance

Analysis of variance involves dividing the overall variability in observed data values so that we can draw conclusions about the equality, or lack thereof, of the means of the populations from which the data came. The overall (or "total") variability is divided into two components:
- the variability "between" groups
- the variability "within" groups
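
The split of total variability into "between" and "within" components can be verified with a small calculation. This is a sketch on 3 hypothetical groups (the names and values are made up); it also forms the F statistic as the ratio of the between-group mean square to the within-group mean square.

```python
from statistics import mean

# Hypothetical data: 3 groups with 3 observations each
groups = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [5, 6, 7]}
all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)

# Total variability: every value against the grand mean
ss_total = sum((v - grand_mean) ** 2 for v in all_values)
# Between groups: each group mean against the grand mean, weighted by group size
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Within groups: every value against its own group mean
ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(round(ss_total, 3), round(ss_between, 3), round(ss_within, 3), round(f_stat, 3))
```

The between and within sums of squares add up to the total, and a large F means the group means are spread out far more than the scatter inside each group would explain.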

Non-parametric Test

Any statistical test that is designed for ordinal or nominal data, or for data that are not normally distributed.

Hypotheses for Correlation Analysis

Asks the question: Does the observed correlation coefficient differ from 0?
H0: r = 0. There is no correlation between the two variables
HA: r ≠ 0. There is a significant correlation between the two variables

What are 2 assumptions, discussed for both the t-test and ANOVA, that must be met? (We have tests for these in Minitab.)

Assumes normal distribution and equal variances.

Types of Graphs

Dependent on data:
- Quantitative vs qualitative
- Size of data set
- Characteristics you want to display (spread, skewness)
Explore the data - check for unusual values, what it looks like
Presentation/Communication of results

Histogram

Displays the frequency of data points within a particular data set. Single set of data. Used for larger data sets (n > 50). Displays shape, center & spread of data. Ex. Ages at which US presidents began their first term.

Pearson Correlation

Examines the strength and direction of the linear relationship between two continuous variables. The Pearson correlation is the most common method for correlation.
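
A short sketch of how the Pearson coefficient is computed from paired data: the co-variation of the 2 variables divided by the product of their individual variations. The data and the function name `pearson_r` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements on the same 5 subjects
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
r = pearson_r(x, y)
print(round(r, 3))
```

The sign of r gives the direction of the relationship and its magnitude (between 0 and 1) gives the strength.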

Equal Variances

In statistics, an F-test of equality of variances is a test for the null hypothesis that two normal populations have the same variance.

R² value

In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable.
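
One way to see "proportion of variance that is predictable" is to compare the leftover (residual) variation around a fitted line with the total variation in the response. This sketch fits a least-squares line to hypothetical data and computes R² as 1 minus that ratio.

```python
# Hypothetical data: x is the independent variable, y the dependent variable
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum((a - mean_x) ** 2 for a in x)
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * a for a in x]

# Variation left unexplained by the line vs total variation in y
ss_residual = sum((b - p) ** 2 for b, p in zip(y, predicted))
ss_total = sum((b - mean_y) ** 2 for b in y)
r_squared = 1 - ss_residual / ss_total

print(round(r_squared, 3))
```

Here the line accounts for 60% of the variation in y; the remaining 40% is scatter around the line.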

Chi-Square Goodness of Fit

Tests whether the differences between the observed and expected values are larger than would be expected by chance.
H0: The hypothesized distribution is a good fit of the data
Ha: The hypothesized distribution is not a good fit of the data

What does a significance level of α = 0.05 mean?

It means that if H0 is actually true and the hypothesis test is repeated on different random samples of data from the same population, then we would expect H0 to be incorrectly rejected 5% of the time.

Tukey Post-Hoc Test

It uses the "Honestly Significant Difference," the smallest distance between two group means that counts as significant, to compare every mean with every other mean. This post-hoc test is run after a significant ANOVA to identify which means differ.

What is the main purpose of statistics?

Make inferences about the population
Understand if trends are meaningful

Types of Transformations

Many statistical tests are robust to deviations from normality; there isn't a hard and fast rule for this. If the data are really "bad" you can transform them to make them "normal":
Log or Ln - skewed right, data span several orders of magnitude, ratios
Arcsine - proportions
Square root - counts, variance lower
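
As a small illustration of the log case, this sketch takes hypothetical right-skewed data spanning several orders of magnitude and shows that a log10 transform removes the skew. The skewness function is the simple moment-based version and the values are made up.

```python
from math import log10
from statistics import mean

def skewness(values):
    """Moment-based skewness: third central moment divided by SD cubed."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data spanning 5 orders of magnitude
raw = [1, 10, 100, 1000, 10000]
logged = [log10(x) for x in raw]  # becomes evenly spaced: 0, 1, 2, 3, 4

print(round(skewness(raw), 3), round(skewness(logged), 3))
```

The raw values are strongly right-skewed, while the transformed values are symmetric. Real data rarely transform this cleanly, so a normality test on the transformed values is still needed.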

Kurtosis

Measure of the fatness of the tails of a probability distribution relative to that of a normal distribution. Indicates likelihood of extreme outcomes.

Measures of central tendency

Mode - most frequently occurring number. Ex. 12, 15, 15, 17, 17, 18, 20 - has 2 modes, 15 & 17
Median - middle number. Ex. 12, 15, 15, 17, 18, 20 - 2 central numbers, 15 & 17: (15 + 17)/2 = 16
Mean - average of the numbers (x̄). Ex. (12 + 15 + 15 + 17 + 17 + 18 + 20)/7 = 114/7 = 16.29

Scales of Measurement

Nominal - non-ordered categories. Ex. yes vs no, male vs female, Toyota vs Subaru vs Ford
Ordinal - ordered categories (e.g. by height, shortest to tallest). Ex. There is a better or worse (e.g. pizza > sandwiches > haggis)
Interval - in order with equal units, arbitrary zero. Ex. Rank food on a scale of 1-10, 1 being worst, 10 being best
Ratio - true zero point, equal intervals. Ex. height and weight
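
The three worked examples above can be reproduced with Python's standard `statistics` module. This is only a checking sketch; the variable names are arbitrary.

```python
from statistics import mean, median, multimode

seven_values = [12, 15, 15, 17, 17, 18, 20]
six_values = [12, 15, 15, 17, 18, 20]

modes = multimode(seven_values)          # every value tied for most frequent
middle = median(six_values)              # even count: average of the 2 central values
average = round(mean(seven_values), 2)   # 114 / 7, rounded to 2 decimals

print(modes, middle, average)
```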

My question is: Are there differences in the weights of my 3 tick species? What are my hypotheses?

Null (H0): There is no difference between the weights of the species, or means are all equal
Alternate (Ha): There is a difference (means are not equal), or more specifically at least one mean is different

I have 3 species of ticks whose weight I have measured. I want to know if the mean weight is different between the 3 species of ticks. What test should I use? If my test is significant, what should I do next to see which species are different?

One-way ANOVA; if significant, then Tukey's (or another post-hoc test)

Data Transformation

Process of changing the data from their original form to a format suitable for performing a data analysis addressing research objectives.

Measures of Variability (or Measure of Spread)

Range - the difference between the lowest and the highest score. Ex. 12, 15, 15, 17, 17, 18, 20 - range is 8 (20 - 12 = 8). Doesn't tell us a lot, and is subject to outlier issues
Variance - variation in the data set; measures how far a set of numbers is spread out from its average value
Standard Deviation - a measure of the amount of variation or dispersion of a set of values; the square root of the variance
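
A quick sketch computing all three measures for the example data, using Python's standard `statistics` module. Note that `variance` and `stdev` here are the sample versions (dividing by n - 1), which is an assumption about which version the course intends.

```python
from statistics import stdev, variance

data = [12, 15, 15, 17, 17, 18, 20]

data_range = max(data) - min(data)   # highest minus lowest
sample_variance = variance(data)     # sum of squared deviations / (n - 1)
sample_sd = stdev(data)              # square root of the sample variance

print(data_range, round(sample_variance, 2), round(sample_sd, 2))
```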

Graphing

Showing on a number line the relationship of a set of numbers.

Stem-and-Leaf Plots

Similar to a histogram but turned on its side; instead of bins, it displays digits from the actual data values to denote the frequency of each value. Medium-sized data sets (n ≈ 50). Displays shape, center, spread of data. Shows actual data values.

Descriptive Statistics

Statistics that summarize the data collected in a study:
- Scales of measurement
- Measures of central tendency
- Measures of variability (measures of spread)

I have 2 different species of ticks and I have measured their weight. I want to know if there is a difference in the mean weight of the ticks. Which statistical test should I choose?

T-Test

Anderson-Darling Normality Test

The hypotheses for the Anderson-Darling normality test are:
H0: Data are from a normally distributed population
H1: Data are not from a normally distributed population
Ex. With a p-value of 0.463, there is insufficient evidence to suggest that the data are not from a normal distribution.

Alternative Hypothesis (H1)

The hypothesis that states there is a difference between two or more sets of data. We reject the null hypothesis in favor of the alternative hypothesis only if there is convincing statistical evidence against H0.

I ran a correlation test on the data and my r-value was 0.912. What can we conclude about the relationship between the 2 variables?

The 2 variables have a strong, positive linear relationship

What are 2 things we can do if the data are not normal?

Transform the data
Perform a non-parametric test

One Sample T-Test

Used to determine if a single sample mean is different from a known population mean.

Graphing for Presentation

Visually illustrate your data - "a picture is worth a thousand words"
Allows your viewer to notice trends
Types of graphs:
- Pie Chart
- Bar Graph - allows us to view variation, standard deviation
- Line Graph
- Scatterplot - illustrates whether there is a relationship between the two variables

Spearman Correlation (also known as Spearman's rho)

Used when the relationship between variables is not linear. The Spearman correlation measures the monotonic relationship between two continuous or ordinal variables.

I have the mean weight of ticks of different species. I want to display the average weight and the variation (st dev) on a graph. Which type of graph should I choose?

Bar graph - allows us to view variation, standard deviation
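
The Spearman coefficient can be computed as the Pearson coefficient of the ranks. This sketch shows that on hypothetical data where y always increases with x but not in a straight line (y = x cubed): Spearman is 1, while Pearson is below 1. The function names are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranked data."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical monotonic but non-linear relationship
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```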

What are the 2 main reasons (categories) that we graph data?

Exploratory and presentation

Why are pie charts not always useful?

You can't show variation

Correlation Assumptions

- Data are normally distributed
- Relationship is linear
- Data are homoscedastic

ANOVA Assumptions

- Samples are random
- Independent observations
- Population of each group is normally distributed
- Population variances of all groups are equal
If normality or equal variances is violated, the non-parametric Kruskal-Wallis test can be used

I have data showing students high school GPA and their current GPA in college. What type of graph should I choose if I want to see if there is a relationship between high school GPA and college GPA?

Scatterplot - illustrates whether there is a relationship between the two variables

If we reject the null hypothesis,

we do not prove the alternative hypothesis is true. We merely state there is sufficient evidence to reject the null hypothesis.

If we do not reject (i.e. fail to reject) the null hypothesis,

we do not prove the null hypothesis is true. We merely state there is not sufficient evidence to reject the null hypothesis.

Linear Regression Assumptions

• Linear relationship
• Multivariate normality
• No/little multicollinearity
• No auto-correlation
• Homoscedasticity

Z Score

A measure of how many standard deviations a value is away from the norm (average or mean).
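
In formula form, the z score is the value minus the mean, divided by the standard deviation. A minimal sketch with hypothetical numbers (a mean of 100 and a standard deviation of 15):

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical population: mean 100, standard deviation 15
above = z_score(130, mean=100, sd=15)
below = z_score(85, mean=100, sd=15)
print(above, below)
```

A positive z score means the value is above the mean and a negative one means it is below.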

Paired T-Test

A statistical test comparing the means of two related samples (e.g. the same subjects measured twice) by testing whether the mean of the paired differences is zero.

ANOVA

Compares mean values of a continuous variable for multiple categories/groups.
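
A paired t-test works on the differences within each pair. This sketch computes the t statistic for hypothetical before/after measurements on the same 5 subjects; the p-value would come from statistical software or a t table with n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements on the same 5 subjects at 2 times
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 15]

# The test is run on the within-subject differences
differences = [a - b for a, b in zip(after, before)]
n = len(differences)

standard_error = stdev(differences) / sqrt(n)
t_stat = mean(differences) / standard_error   # H0: mean difference = 0
df = n - 1

print(round(t_stat, 3), df)
```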

Confidence Intervals (CI)

Confidence Intervals are used to describe the precision of the estimate:
- Narrow confidence intervals, more precise
- Wide confidence intervals, less precise
NOTE: your result (which is a point estimate) lies in the middle of the 95% CI when the interval is symmetric, as it is for a mean
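
To make "narrow vs wide" concrete, this sketch builds a 95% CI for a mean using the large-sample normal approximation (a z multiplier of about 1.96 rather than a t multiplier, which is a simplifying assumption). The sample mean, standard deviation, and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(sample_mean, sd, n, confidence=0.95):
    """Approximate CI for a mean: estimate +/- z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    half_width = z * sd / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Same hypothetical mean and SD, 2 different sample sizes
wide = mean_ci(50, sd=10, n=25)
narrow = mean_ci(50, sd=10, n=100)

print([round(v, 2) for v in wide], [round(v, 2) for v in narrow])
```

Quadrupling the sample size halves the width of the interval, and the point estimate sits at the center of both.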

Chi Square Assumptions

- Data are random
- Variables are categorical
- Expected counts not too small (most expected counts at least 5)
- Every expected frequency greater than 1

When to use GLM

- Data has one categorical factor
- Response variable should be continuous
- Observations are independent
- Data is random
- Multicollinearity (correlation among predictors) is low
- Data are normal
- Variance of the response variable is the same across groups

Rest in fruit flies, Drosophila melanogaster, has many features in common with mammalian sleep, and its study might lead to a better understanding of sleep in mammals, including humans. Hendricks et al. (2001) examined the role of the signaling molecule cyclic AMP (cAMP) by comparing the mean number of hours of resting in 6 different mutant or transgenic fly lines having different levels of cAMP expression. Measurements are hours per 24 hours divided by the mean of "wild-type" flies. Means (+/- SE) are shown for the different fly lines in the following graph:
a. Write the statement of the general linear model to be fit to these data to compare means between groups. Indicate what each term in the model represents.
b. Write the corresponding statement for the null hypothesis of no differences between mutant lines.
c. What test statistic should be used to test whether the null model should be rejected in favor of the alternative?

a. Hours = Constant + Mutant. Hours is hours of resting, Constant is the grand mean, Mutant is the effect of each line of mutant flies
b. Hours = Constant
c. F

A study of the Magellanic penguin (Spheniscus magellanicus) measured stress-induced levels of the hormone corticosterone in chicks living in either tourist-visited areas or undisturbed areas of a breeding colony in Argentina (Walker et al. 2005). Chicks at 3 stages of development were included in the study - namely, recently hatched, midway through growth, and close to fledging. Penguin chicks were stressed (captured) by the researchers and their corticosterone concentrations were measured 30 minutes later. The following graph diagrams the mean hormone concentrations for the 3 age groups of chicks from tourist-visited (filled circles) and undisturbed (open circles) areas of the colony.
a. What is the response variable?
b. What are the explanatory variables?
c. The line segments in this plot are not parallel. What does this suggest?
d. Is this an observational or experimental study? Explain.
e. Did the study use a factorial design? Explain.

a. Plasma corticosterone concentration
b. Chick age group and disturbance regime
c. That the 2 explanatory variables interact to affect corticosterone concentration
d. Observational. The penguins were not assigned by the researchers to the different groups
e. Yes, every combination of the 2 factors (age group and disturbance regime) is included in the design

I ran a test for normality and my p-value was significant (0.001, at an alpha level of 0.05). a. What does this mean? b. What should I do?

a. The data are not normal
b. Transform the data (and if the data still depart from normality, I can run a non-parametric test instead)

I ran a test for equal variances on my data and my p-value was significant (0.001, at an alpha level of 0.05). a. What does this mean? b. What should I do?

a. The variances between the groups are not equal
b. If this is an assumption for my test, and the test is not robust to it, I cannot move on. I can either start over and transform my data a different way (as appropriate), or I can perform a non-parametric test instead

I ran a one-way ANOVA or GLM test on my data and my p-value was significant (0.001, at an alpha level of 0.05). a. What does this mean? b. What should I do next?

a. There is a difference in at least one of the means compared to the others
b. Run a post-hoc test (ex: Tukey's) so I can see which group means are different

