Statistics & Data (Super Combined Set)

Filter Rule NOT Equals any of several "things"

!birthmonth %in% c(10, 11, 12)

NOT NA

!is.na(birthmonth)

Delta

"change in"

Another name for a type 2 error

-A false negative -Saying there is not a relationship when there actually is one -Fail to reject the null, falsely

Another name for a type 1 error

-A false positive -When we say there is a relationship, but there is not -Reject the null, falsely

When using Cohen's D, how do you know when you have a large effect size?

.2 is small, .5 is medium, and .8 or greater is large.

Two conditions of omitted variable bias

1) the omitted variable is correlated with the included regressor 2) the omitted variable is a determinant of the dependent variable.
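
A quick simulation sketch (data and coefficients made up for illustration) shows both conditions at work: x2 is correlated with the included regressor x1 and is also a determinant of y, so the "short" regression that omits it is biased.

set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)         # condition 1: x2 is correlated with x1
y  <- 2 * x1 + 3 * x2 + rnorm(n)  # condition 2: x2 is a determinant of y
coef(lm(y ~ x1))        # "short" regression: slope biased upward (about 2 + 3*0.8)
coef(lm(y ~ x1 + x2))   # "long" regression: slope close to the true 2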

What are the two requirements for this test?

1. Both samples must be normally distributed (n > 30 for each sample or parent populations are known to be normal or qq plots for both groups show normality). 2. A SRS was taken from each population

What are the five steps of the statistical process

1. Design The Study 2. Collect Data 3. Describe The Data 4. Make Inferences 5. Take Action

5 Requirements for linear regression tests

1. Linear Relationships (check with scatterplot or residual plot) 2. Normal error term (check QQ plot of residuals) 3. Constant variance (no megaphone shape in residual plot) 4. X's are known constants (can't check with plot) 5. Observations are independent (can't check with plot).
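
A minimal sketch of the three plot-based checks in R, assuming the mosaicData package (the source of the KidsFeet data used throughout this set) is installed; requirements 4 and 5 can't be checked with plots.

mylm <- lm(length ~ width, data = mosaicData::KidsFeet)
plot(length ~ width, data = mosaicData::KidsFeet)  # 1. linear relationship (scatterplot)
plot(mylm, which = 1)                              # 3. constant variance (no megaphone shape)
qqnorm(resid(mylm)); qqline(resid(mylm))           # 2. normal error term (QQ plot of residuals)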

3 Things to look at in inference testing

1. Significance 2. Magnitude 3. Direction

Three Rules of Probability

1. The probability of an event occurring is a number between 0 and 1 2. The probabilities of all possible outcomes must sum to 1 3. The probability that an event will occur is 1 minus the probability it won't occur.

In experiments, what are three sources of variability?

1. Variability in the conditions of interest (wanted) 2. Variability in the measurement process 3. Variability in the experimental material

When designing an experiment, what three decisions need to be made about the content?

1. What is the response? 2. What are the treatments? 3. What are the experimental units?

According to the Empirical Rule, how many standard deviations are there between the minimum and the maximum in a normal distribution

6

What are the 3 percentages in the Empirical Rule?

68, 95, and 99.7

Pareto Principle

80/20 rule - 80 percent of your problems come from 20 percent of your causes

The Help Command

?KidsFeet

Fixed Factor

A fixed factor is one that is set by the researchers, as opposed to a factor that is chosen using randomization.

Full Factorial

A full factorial is an ANOVA design where all possible combinations of factors are tested to see the interactions. For example if I have two factors, the first with levels 1 & 2, and the second with levels A and B, I would need to run tests with 1A, 1B, 2A, and 2B to get all of the possible combinations.
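
The full set of runs can be listed with base R's expand.grid(); the factor names here are placeholders.

expand.grid(factor1 = c(1, 2), factor2 = c("A", "B"))
# lists the four required runs: 1A, 2A, 1B, 2B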

Non-linear function

A function with a slope that is not constant for all values of X

index qualitative variation

A measure of variability for nominal variables. It is based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution. This index is scored from 0 (no diversity) to 1 (total diversity).

Linear Regression

A method of finding a relationship between two variables using a line

Statistic

A number that describes a characteristic of a sample

Margin of Error

A number used in statistics to indicate the range around a point estimate in which the true parameter value is likely to lie.

QQ Plot

A qq plot is used to judge the normality of a sample distribution. It plots the observed values on the x-axis and the expected values (for a normal distribution) on the y-axis. If the graph shows a relatively straight, slanted line, we assume the distribution is normal. If it has significant variations from the line, the distribution may not be normal.

social desirability

A tendency to give socially approved answers to questions about oneself.

Type 1 Error

A type one error occurs when researchers reject the null hypothesis when in fact the null hypothesis was true.

Type 2 Error

A type two error occurs when researchers fail to reject the null hypothesis when in fact they should have rejected it in favor of the alternative hypothesis.

Combine Function

Ages <- c(8, 9, 7, 8)

mutate()

Allows you to modify existing columns and create new ones KidsFeet %>% mutate(season = case_when(birthmonth %in% c(12,1,2) ~ "Winter"))
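
A fuller sketch of that snippet, mapping every month to a season (assuming the usual Dec-Feb, Mar-May, Jun-Aug, Sep-Nov groupings):

library(dplyr)
KidsFeet <- mosaicData::KidsFeet %>%
  mutate(season = case_when(
    birthmonth %in% c(12, 1, 2)  ~ "Winter",
    birthmonth %in% c(3, 4, 5)   ~ "Spring",
    birthmonth %in% c(6, 7, 8)   ~ "Summer",
    birthmonth %in% c(9, 10, 11) ~ "Fall"
  ))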

Hypothesis

An educated guess about the outcome of an experiment

Experimental Study

An experimental study is one where the researchers apply conditions to experimental units. In essence, they have control over which units get which treatment, and therefore they have more ability to test causal relationships.

Parameter

An unknown number which describes a characteristic of a population

Blinding

Blinding is a practice that can reduce bias in a study, especially when there is a placebo group involved. A study is blind when the participants don't know which treatment they are getting, and it becomes double-blind when both the researchers and the participants don't know who is getting the real treatment and who is getting the placebo.

Blocking

Blocking is a method used to account for unwanted variability in a study. When factors other than the ones you are interested in are suspected to influence the response variable, researchers organize the experimental units into blocks and test to see if some of the variability in the data can be explained by the blocks. When the block factors have a statistically significant effect on the outcome, variability is taken away from the true factor(s) of interest, which often raises the chances of having a significant p-value for that/those factor(s).

Coefficients in regression

B0 & B1, also called parameters. Coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response.

Comparison/Contrast

Contrasts are used to find differences within the levels of the ANOVA factors. A p-value in an ANOVA test will only tell you if at least one of the levels is significantly different. It does not tell you, however, which level(s) account for the difference. Running contrasts, or comparisons of levels, can tell you more of the actual differences within a factor.

The Assignment Operator

CoolCars <- (keyboard shortcut: Alt + -)

What is the acronym used in 221 to remember the 5 steps of the statistical process?

Daniel Can Discern More Truth

Bivariate Data

Data that analyzes two different variables

Decomposition

Decomposition is the breaking down of a statistical analysis to understand the effects of each part. ANOVA works by computing averages for the entire data set, each factor of interest, the interaction between factors, and each individual data point (residuals). Decomposition is best understood using the assembly-line metaphor, where each data point is visualized moving down an assembly line: each value starts with the grand average and has each factor's effect added or subtracted in turn, including block factors, the interaction factor, and the residuals.

What is a large R-squared?

Depends on the discipline. In the social sciences, low R-squared values are common.

Factor Levels

Each factor may be split up into two or more levels. The factor is the broad category, and the levels are the specific distinctions within each category. If my treatment factor is hair color, the levels might be blonde, brown, red, black, etc

Standard error of the regression (SER)

Estimator of the standard deviation of the regression error ui

When can you multiply the B1 by 100 to get the correct interpretation?

For small changes, where the magnitude of B1 < .1, multiply by 100.

For Log-levels, how do you multiply when the magnitude of B1 is >.1

Formula: %∆Y = 100 × (e^B1 - 1)

Distribution

Frequency with which a random variable takes any of the possible values

Tufte's Lie Factor

Graphics should accurately reflect what is going on in the data. The lie factor is calculated by dividing the change shown in the graph by the change shown in the data; an ideal lie factor is 1. Lie factor = (% increase shown in graph) / (% increase shown in data).

Cluster Sample

In a cluster sample, the population is broken up into blocks, or similar groups of items; then several blocks are randomly chosen and all items within those blocks are included in the sample.

Direction

In a regression line, the direction of the line, either positive or negative

"hats" in the model

Indicate that the number has been estimated from your sample

Linear vs Non-Linear Data

Linear data shows a relatively straight line when plotted. Some data may be correlated, but not linearly.

For independent samples, what symbols do we use for the mean, standard deviation, and sample size of groups 1 and 2?

Mean: X-bar1, X-bar2 SD: s1, s2 N: n1, n2

Five-Number Summary

Minimum, Q1, Q2 (median), Q3, & Maximum.

Nominal Data (Mean Median and Mode)

Mode: Yes Median: No Mean: No

Ordinal Data (Mean, Median, and Mode)

Mode: Yes Median: Yes Mean: No

Interval/Ratio (Mean, Median, Mode)

Mode: Yes Median: Yes Mean: Yes

In the normal distribution of sample means, what is the mean of random variable X-bar?

Mu

Nesting

Nesting occurs when one factor is completely inside another factor, and each level of the inside factor occurs only once within the levels of the outside factor. This happens in a SP/RM design when the blocks are nested within the between-blocks factor.

Nuisance Influence

Nuisance influences add bias and unreliability to experiments. Nuisance influences are often controlled by incorporating them into the experiment as blocks. For example, in psychological studies, the participants are often used as blocks because there is often a lot of variability between the individual participants

Null and alternative hypotheses for a chi-squared test

Null: All of the factors are independent Alternative: The factors are not independent

Null and alternative hypotheses for ANOVA tests

Null: Mu1 = Mu2 = Mu3, OR Alpha1 = Alpha2 = Alpha3 = 0 Alternative: Mu(i) is not equal for at least one level of i, OR Alpha(i) is not zero for at least one level of i

Bias upward (positive) Bias downward (negative)

Overstating the effect - the coefficient becomes less positive when the omitted factor is controlled for. Understating the effect - the coefficient becomes more positive when the omitted factor is controlled for.

Probability Notation

P(X) = .....

Greek rho (ρ)

Population parameter for r

Positive and Negative Association

Positive associations will trend upwards, and negative associations will trend downwards. For positive associations, when x increases, y also increases. In negative associations, when x increases, y decreases.

Synonyms for Magnitude of coefficients

Practical significance, substantive importance, economic importance, policy significance

R-Squared

Proportion of the variation in the dependent variable that can be explained by its relationship with the independent variable.

What is the standard deviation of the sample means?

Sigma/Sqrt(N)

In the normal distribution of sample means, what is the standard deviation of random variable X-bar?

Sigma/Square Root of N

Central Limit Theorem

States that as N gets larger, the distribution of the sample means becomes more normally distributed, regardless of the shape of the parent population.
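
A small simulation sketch: means of samples drawn from a right-skewed exponential population look increasingly normal as N grows.

set.seed(1)
means_n5  <- replicate(5000, mean(rexp(5)))   # N = 5: still right-skewed
means_n50 <- replicate(5000, mean(rexp(50)))  # N = 50: close to normal
hist(means_n5); hist(means_n50)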

r (Also called Greek rho)

Statistic that measures the strength of your linear relationship. A number between -1 and 1. 0 Means there is no linear relationship. -1 is perfect negative relationship, and 1 is a perfect positive relationship. This statistic is helpful because it indicates whether the relationship is negative or positive, but it doesn't tell you how much variance in y is explained by x like R2 does

Test Statistic

Statistic used in hypothesis testing to determine the p-value for the distribution. Common test statistics include z statistics, t statistics, f statistics, and chi squared statistics.

Causal Inference v. Statistical Inference

Statistical - infer something about one or more populations Causal - does X cause Y

Statistical Inference

Statistical techniques used to test hypotheses and make conclusions about the data.

Who was William Sealy Gosset?

Statistician who came up with the t-distribution. Gosset worked at a brewery at the time, and to protect his identity, he published his work under the pseudonym "Student".

Explanatory Statistics

Statistics that include explanatory relationships as to why certain things happen. Usually include dependent and independent variables.

interquartile range

The difference between the upper and lower quartiles - Q3-Q1 (or 75th percentile and 25th percentile).

What is meant by "the mean of the differences"?

The differences between every pair is calculated, then the mean is taken of those differences.

Treatment Group

The group in a study that receives certain conditions that we are interested in testing

The Law of Large Numbers

The larger the sample size, the closer x-bar will be to the true Mu

Chance Error

The observed value minus the sum of the effects for the partial fit. The chance error, or residual error, is made up of variability in the material and the measurement process. Similar items, measured under the same conditions, will have different values. The chance error is the difference between these observed values and what we would expect them to be based on averages. Chance errors usually follow a normal distribution--some are above the average and some are below, but all in all they even out around the average.

Multiple Comparisons Problem

The problem with running multiple comparisons in the same experiment is that your chances of committing a type one error (rejecting the null hypothesis falsely) go way up. If there is a .05 chance that the test statistic you observed was as extreme as it was, then each test you do carries that .05 chance, and after many tests there is a good possibility that a type one error was committed. * Family-wise error rate: the total chance of error across all of your tests. * Adjustments: to cut down on this error, researchers make several different adjustments. One common one is the Bonferroni method, which splits your error rate among all of your tests so that all tests collectively stay below the family-wise error rate.
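
Besides pairwise.t.test() (shown elsewhere in this set), base R's p.adjust() applies the Bonferroni adjustment to any vector of p-values; the values below are made up for illustration.

pvals <- c(0.01, 0.04, 0.03, 0.005)
p.adjust(pvals, method = "bonferroni")  # each p-value multiplied by the number of tests (capped at 1)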

Interactions

The product of different predictors (x1 * x2)

U^2-i

The residual variance. The more residual variance, the greater the standard error.

Response

The response variable, also called the dependent variable, is the variable we obtain by running an experiment and collecting data. In ANOVA designs, the response variable is quantitative.

Sampling Risk

The risk that a sample might not actually represent/resemble the characteristics of the parent population

p-hat

The sample proportion

Y-hat = B0-hat + B1-hat * X

The sample regression line as calculated by OLS. Without the hats, this would be the population regression line equation.

Population

The total collection of individuals or items that you are interested in studying

How do you interpret Bo in multiple regression

The value of Y when ALL X's are 0

Regressand

The variable to be explained in a regression or other statistical model; the dependent variable

R^2 in multiple regression Adjusted R^2?

The variation in Y explained by all of the X's. Adjusted R^2 accounts for the number of regressors in the model, penalizing the addition of X's that do not improve the fit.

How can you reduce sampling risk?

There are two ways to reduce sampling risk: 1. Take a random sample 2. Increase your sample size (N)

Control Group

This group typically doesn't receive treatments and is used to compare with the treatment group

Correlation Coefficient (r)

This number shows the strength of a linear relationship. r is always a number between -1 and 1. The closer it is to -1 and 1, the stronger the relationship (either positive or negative).

68 - 95 - 99.7% Rule

This rule states that in a normal distribution, 68% of the data lie within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
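
The three percentages can be checked with pnorm():

pnorm(1) - pnorm(-1)  # about 0.68
pnorm(2) - pnorm(-2)  # about 0.95
pnorm(3) - pnorm(-3)  # about 0.997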

What does it mean to include a quadratic?

To include the linear term and the squared term in your model

Conditions/Treatments

Treatments or conditions are experimental factors that we, as researchers, are interested in. In ANOVA designs, these factors are categorical (or treated as categorical if numerical values are involved).

What does "Signing the bias" mean?

Reasoning out what the direction of the bias might be when you don't have the data to check it yourself.

Tukey HSD Test

TukeyHSD(AnovaModel, "birthmonth")

Causal Inference

Using data to estimate how an independent variable (or many independent variables) directly impact(s) a dependent variable. How does X impact Y? The most common method of causal inference is linear regression.

Prediction

Using the observed value of some variable to predict the value of another variable

What is considered a large enough sample in econometrics?

Usually greater than 100. But check the assumptions for normality!

Estimand Estimator Estimate

Estimand - what the researcher hopes to estimate (the target). Estimator - the rule (algorithm) by which the estimate is calculated from the data. Estimate - the numerical value obtained by applying the estimator to a sample.

Conditional Expectation

What you would expect from the statistical prediction--in our case, the predicted regression line

log-log What is the interpretation? What is the model?

When both the predictor and the outcome are in logs. A one percent increase in X is associated with a B1 percent change in Y, on average and holding all other variables constant. You don't have to divide or multiply by 100. lnY = B0 + B1 * lnX1 + u
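
A minimal sketch of estimating a log-log model on simulated data with a known elasticity of 0.5 (all names and values made up):

set.seed(2)
X1 <- runif(200, 1, 10)
Y  <- exp(0.5 * log(X1) + rnorm(200, sd = 0.1))  # true B1 (elasticity) = 0.5
coef(lm(log(Y) ~ log(X1)))[2]  # recovers roughly 0.5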

When is it appropriate to interpret the Y-intercept?

When it makes sense practically. Does it make sense for x to be 0?

When is it good to use a systematic sample?

When items can be ordered numerically and the order does not have anything to do with the characteristics of the item.

Perfect multicollinearity

When one of the regressors is an exact linear function of the other regressors This happens when you use two variables that measure the same thing -- you can't "hold the other constant" because it's the same variable.
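
lm() makes this visible by dropping the redundant regressor (its coefficient comes back NA); a sketch using a distance recorded in both miles and kilometers:

set.seed(3)
x_miles <- rnorm(100, 50, 10)
x_km    <- 1.609 * x_miles          # exact linear function of x_miles
y       <- 2 * x_miles + rnorm(100)
coef(lm(y ~ x_miles + x_km))        # x_km is NA: it has been dropped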

In-sample prediction

When the observation for which the prediction is made was also used to estimate the regression coefficients

When is it okay to use a pie chart?

When the parts of the pie chart combine to make a whole

Out-of-sample prediction

When the prediction for observations is NOT in the estimation sample The goal of regression prediction is to provide out-of-sample predictions

Imperfect multicollinearity

When two or more regressors are very highly correlated High correlation between regressors will result in one (or both) coefficients being very imprecisely estimated (large standard errors) Whether or not you include or remove one of the variables that are imperfectly multicollinear is a decision you must make based on your best judgement.

Extrapolation in Regression

When you make predictions for values of X that fall outside the range of the data used to fit the regression.

Percentage Rule (of contingency tables)

With your independent variable on top (in the columns), you compute percentages/proportions within the columns, not the rows.

Cronbach's alpha

a correlation-based statistic that measures a scale's internal reliability

Histogram

a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

Bar Graph

a graph that uses vertical or horizontal bars to show comparisons among two or more items

Z-Score

a measure of how many standard deviations an observation is away from the mean

quadratic polynomial

a polynomial of degree 2 - creates a U shape when graphed

continuous variable

a quantitative variable that has an infinite number of possible values that are not countable

Frequency Distribution

an arrangement of data that indicates how often a particular score or observation occurs

Filter Rule NOT Equals one "thing"

birthmonth != 5

Filter Rule Equals any of several "things"

birthmonth %in% c(5, 6, 7, 8, 9)

Filter Rule Less Than Less Than Or Equal to Greater Than Greater Than Or Equal to

birthmonth < 5 birthmonth <= 5 birthmonth > 5 birthmonth >= 5

What does the * mean in SPSS titles?

by, as in Importance of Obeying * (by) Importance of Thinking For One's Self

What do you need to type before any code in SPSS?

compute name_of_variable=name_of_variable. execute.

Recoding

creating new categories or columns by combining or modifying existing data

Ordinal Level

data arranged in some order, but the differences between data values cannot be determined or are meaningless

Dummy Coefficient Function

dummy.coef(AnovaModel)

ui

error term (or residual) - the difference between Yi and its predicted value (Y-hat-i) using the population regression line (observed - predicted)

Anscombe's Quartet

four datasets that have the same simple descriptive statistics (mean, median, ... ), yet appear very different when graphed.

Imputation

giving one's best guess to fill in missing data

Bonferroni Test

pairwise.t.test(KidsFeet$length, KidsFeet$birthmonth, "bonferroni")

Fisher's LSD Test

pairwise.t.test(KidsFeet$length, KidsFeet$birthmonth, "none")

Anova QQ Plot and Constant Variance Plot

plot(AnovaModel, which = 1:2)

ANOVA Requirements

plot(myaov, which = 1:2)

Coefficient in regression

refers to the slope

oversampling

researcher intentionally over represents one or more groups

Sample Variance

s squared

Filter Rule Equals one "thing"

sex == "G"

b1

slope (when x increases by 1 unit, y will increase by...)

Randomization

the best defense against bias, in which each individual is given a fair, random chance of selection

level of measurement

the extent or degree to which the values of variables can be compared and mathematically manipulated

Ratio Level

the interval level with an inherent zero starting point. Differences and ratios are meaningful for this level of measurement and mathematical operations may be performed.

R-squared

the proportion of the total variation in a dependent variable (Y) explained by an independent variable (X) Measure of the strength of a relationship

standard error

the standard deviation of a sampling distribution

Omitted variable bias

when the correlation we see between X1 and Y is biased because we didn't control for X2 Mechanical definition - Alpha1 (coefficient from the "short" regression) - B1 (coefficient from the "long" regression)

Nonlinear formula for changes in y for a certain unit change in x; the "exact method"

∆Y = Y2 - Y1, where Y1 = B0 + B1*X + B2*X^2 is the model evaluated at the starting value X, and Y2 = B0 + B1*(X + ∆X) + B2*(X + ∆X)^2 is the model evaluated at X + ∆X
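
The exact method written as a small R function; the coefficient values in the example call are made up.

exact_change <- function(b0, b1, b2, x, dx) {
  y1 <- b0 + b1 * x + b2 * x^2                # fitted Y at the starting X
  y2 <- b0 + b1 * (x + dx) + b2 * (x + dx)^2  # fitted Y after the change
  y2 - y1
}
exact_change(b0 = 1, b1 = 0.5, b2 = -0.02, x = 10, dx = 1)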

Partially standardized regression

Converts the predictor to standard deviation units: a one-SD change in x is associated with a (B1 × SDx)-unit change in y. In this case, the predictor is changed to SD units.

How are the degrees of freedom calculated in a Chi-Squared test?

df = (number of rows - 1) * (number of columns - 1)

Chi-Squared degrees of freedom equation

df= (# of rows - 1) x (# of columns - 1)

Fav Stats (One Way ANOVA)

favstats(ivdata$Particles~ivdata$Companies)

bar chart

graphic that uses categorical variables on the x axis and quantitative variables on the y axis

Residual Plot

graphical representation of the residuals that can be used to determine whether the assumptions made about the regression model appear to be valid

Equals NA

is.na(birthmonth)

Sample Size Formula (with p*)

n = (z*/m)^2 • p*(1-p*)
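
Worked through in R for 95% confidence, a margin of error of 3 percentage points, and the conservative choice p* = 0.5:

z_star <- qnorm(0.975)  # 1.96 for 95% confidence
m      <- 0.03
p_star <- 0.5
ceiling((z_star / m)^2 * p_star * (1 - p_star))  # about 1068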

Normality Requirements to conclude that p-hat is normally distributed

n*p > 10 AND n(1 - p) > 10

Requirements for Confidence Interval

n*p-hat > 10 n*(1-p-hat) > 10

Percentage Change Formula

(new - old)/old

Degrees of Freedom

A number related to the sample size. As the sample size increases, so do the degrees of freedom. For simple distributions, df = n - 1. As the degrees of freedom increase, the distribution becomes more and more like a z-distribution.

Inferential Statistics

numerical data that allow one to generalize- to infer from sample data the probability of something being true of a population

One Proportion Confidence Interval Formula

p-hat +- z*sqrt(p-hat(1-p-hat)/n)
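
Computed by hand in R for a made-up sample of 120 successes out of 200, at 95% confidence:

x <- 120; n <- 200
p_hat <- x / n
p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)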

Z-Score Formula for One Proportion

(p-hat - p) / sqrt(p(1-p)/n)

p-hat formula

p-hat = x/n

Overall sample proportion formula ("pooled proportion")

p-hat = (x1 + x2) / (n1 + n2)

p*

p-star -- used to calculate the sample size needed for certain levels of confidence. p-star is a proportion from a previous test that can be used to estimate future sample sizes.

Y-hat

predicted value of y (the line we create)

qqPlot for paired samples

qqPlot(Feet1$length - Feet2$length)

qqPlot for independent samples t-test

qqPlot(length ~ sex, data = KidsFeet)

Nominal Level

qualitative data consisting of labels or names. Cannot be logically ordered

Standard Deviation of p-hat formula

sqrt(p(1 - p)/n)

Descriptive Statistics

statistics that summarize the data collected in a study in a clear form. Popular descriptive statistics are mean, median, mode, standard deviation, and the five number summary

Strength

strength refers to how accurate your model is. Strong regression models will have data points that are very close to the line, while weak models will have data points that are more spread out from the line

d

symbol for population difference in paired samples

Independent Samples T-Test

t-test where two independent groups are compared. The mean of each group is taken, and the difference of the means is calculated to determine the t-statistic and p-value.

One Sample t-Test

t.test(KidsFeet$length, mu = 25, alternative = "two.sided", conf.level = .95)

Independent Samples t-Test (r code)

t.test(length ~ sex, data = KidsFeet, mu = 0, alternative = "two.sided", conf.level = .95)

table function

table(KidsFeet$sex) OR table(KidsFeet$sex, KidsFeet$birthmonth)

p

true population proportion (parameter)

Clustered Bar Graph

useful for comparing two categorical variables and are often used in conjunction with crosstabulations

(Population) Z-Score Formula

(x - mean) / standard deviation

Confidence Interval Formula (Sigma Unknown)

x-bar ± t* × s/sqrt(n)
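
The same interval computed by hand on the KidsFeet lengths (t.test() reports it automatically); assumes the mosaicData package.

x <- mosaicData::KidsFeet$length
mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))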

t-score formula

(x-bar - mu) / (s/sqrt(n))

Logarithmic Function

y = ln(x)

Simple Linear Equation

y-hat = b0 + b1x

b0

y-intercept

Proportion Margin of Error Formula

z* × sqrt(p-hat(1 - p-hat)/n)

Approximate formula for the difference between the logarithm of x + ∆x and the logarithm of x

∆x/x

Variance formula

∑(x - x-bar)² / (n - 1)

How do you compare apples to oranges - a 28 on the ACT to 1200 on the SAT for example?

- Convert into percentiles - Convert into Z-scores

Effect size of a Predictor (standardization method)

- Converts the outcome to standard deviations - Bx/SDy - taking the regression coefficient and dividing by the SD of the outcome variable - In this case, the outcome (not the predictor) is changed to SD units

Regression Assumptions

- Errors have a mean of zero - There is no selection bias - X and Y are independently and identically distributed across observations (i.i.d.) (have a simple random sample) - Large outliers are unlikely

How do you choose whether to do a quadratic or not?

- Examine the data - Underlying theory of relationship - Estimate more flexible model and test whether you can reject a less flexible model

On a contingency table, where do you put the dependent and independent variables?

- Independent variable goes on top (columns) - Dependent variables goes on the side (rows) "The side depends on the top"

What is the purpose of multiple regression?

- It can be used to prove causal inference by testing the effect of a variable while holding all other variables constant Or - Used to test multiple regressors at once to make predictions (but not causal inferences)

3 types of log models

- Log-level - the outcome is a log - The predictor is a log - Both are logs

How do you know whether to use a log or a quadratic?

- Quadratics won't work as well if you don't think your data will curve downwards - Use logs if a percentage interpretation is helpful - You can't use logs if variables have values that are <= 0 - Cannot use either for dummy variables - In many cases, it does not matter

Ways to show practical significance for unintuitive predictor variables

- Use percentiles (moves from this percentile to this percentile) - Effect size (Cohen's "d"), which shows the effect as a percent of a standard deviation - Relate to something else - like poverty - "Living in a large district is like raising the poverty rate by 13%" - or grades - "An 8-point increase is like going from a C- to a B"

What are the 5 sampling methods discussed in this class?

1. Simple Random Sample 2. Systematic Sample 3. Cluster Sample 4. Stratified Sample 5. Convenience Sample

Broad categories of joint hypothesis tests

1. Testing differences from zero: Ho: B1 = 0 and B2 = 0; Ha: B1 neq 0 and/or B2 neq 0 2. Testing equality of coefficients: Ho: B1 = B2; Ha: B1 neq B2
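
For the first kind, base R can compare the restricted and unrestricted models with anova(); a sketch on simulated data (all names and values made up):

set.seed(4)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.3 * d$x1 + 0.4 * d$x2 + rnorm(100)
short <- lm(y ~ 1, data = d)        # restricted model: B1 = B2 = 0
long  <- lm(y ~ x1 + x2, data = d)  # unrestricted model
anova(short, long)                  # F statistic for the joint hypothesis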

Two Basic Properties of The Normal Distribution

1. The area inside the distribution is equal to 1 2. The curve lies on or above the x-axis

ANOVA Requirements

1. The data come from a simple random sample 2. The residuals must have constant variance, meaning that they are consistently spread out from the mean. A megaphone shape in a diagnostic plot means that there is not constant variance 3. The residuals must be normally distributed. This can be checked using a qqPlot of the residuals. A relatively straight line means that the residuals are normal.

Requirements for a chi-squared test

1. The data must come from a SRS 2. The expected counts for each box in the contingency table must be greater than or equal to 5

Chi-Square values and their corresponding p-values

1.64 -- .20
2.71 -- .10
3.84 -- .05
5.41 -- .02
6.64 -- .01
10.83 -- .001
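
These are the df = 1 critical values and can be reproduced with qchisq():

qchisq(1 - c(0.20, 0.10, 0.05, 0.02, 0.01, 0.001), df = 1)
# approximately 1.64, 2.71, 3.84, 5.41, 6.63, 10.83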

Z Scores for 90, 95, and 99 percent confidence

1.645 1.960 2.576

Percentile

100 divisions of the data. Each percentile tells you how much of the data is at or below the percentile. For example, if you scored in the 99th percentile on an exam, 99 percent of students scored as well as you or worse than you, and only 1 percent of students scored better than you.

In what year did Gosset publish his t-distribution solution?

1908

BF Basic Factorial Design

A Basic Factorial Design is an ANOVA design that tests the effect of one or more categorical factors on one numerical response variable. If the study is experimental, it is often designated as CR or completely random.

Balanced Design

A balanced design requires the same number of responses (or data points) for each level of each factor. For example, if my treatment variable is type of car, and I have 3 types--Honda, Toyota, and Ford--I would need the same number of cars in each group. In R, type 1 sums of squares is the base option, and you can use type 1 sums of squares for most experiments with balanced designs. If your design is unbalanced, you should use type 2 or type 3 sums of squares. In R, you specify this by adding type = "II" or type = "III".

Population Pyramid

A bar graph representing the distribution of population by age and sex. Population Pyramids work well when you have a ratio-level variable and a dichotomous variable (like age and sex)

Stacked Bar Graph

A bar graph that mimics a crosstab. It compares the same categories for different groups and shows category totals. The independent variables are the columns (or bars) and the dependent variables are split out within each bar.

CB Complete Block Design

A complete block design uses 1 nuisance factor as a block and one or more treatment factors. In CB designs, every level of the treatment factor is randomly applied within each block. Interaction terms are not present in CB designs unless there is more than 1 response for each treatment/block combination.

Census

A comprehensive list of all the individuals or items found in a population

Confidence Interval

A confidence interval gives a range where true values are likely to lie, given a certain percentage of confidence. In an interval with 95 percent confidence, researchers are 95 percent confident that the true mean lies within the given range. The range is calculated by using a point estimator plus and minus the margin of error.

Confidence Interval

A confidence interval is a range of values which are likely to contain the true value (usually the true average) within the range. A confidence interval is made up of a point estimator plus or minus the margin of error. Common confidence levels are 90%, 95%, and 99%. In a 95% confidence interval, for example, it can be assumed that 95% of confidence intervals will contain the true value. This does not mean, however, that there is a 95% probability that the true value is within the lower and upper limits.

Fractional Factorial

A fractional factorial is one that only runs a fraction of the possible tests that could be done by crossing all factors. Usually only 50 or 25 percent of the factor combinations are run. The limited number of factor combinations does not allow inference for interactions, but often the effects of the main factors are not incredibly far off from what would be the case if a full factorial was carried out. Fractional factorials are often used in industry to cut down on the costs of running many tests.

p-value

A p-value is the probability of obtaining a test statistic as extreme or more extreme than the one you observed, assuming that the null hypothesis is true. P-values are considered significant if they are below a certain level, typically .05.

Point Estimator

A point estimator is a value which sums up information in a data set. Examples of point estimators are x-bar (which estimates Mu) and s (which estimates Sigma).

Population

A population is the sum of all people or things that you are interested in. If you wanted to study American alternative rock bands, your population would be every single alternative band.

Sample

A sample is a limited number of individuals or items taken from the population

Sample

A sample is a number of units taken from a population that is hopefully representative of that population

Uniformity vs. Representativeness

A sample that is uniform will have subjects that are close to the same, whereas a representative sample will resemble the variability present in the actual population.

SP/RM Split Plot / Repeated Measures Design

A split plot or repeated measures design has two factors of interest and one blocking factor that is nested within the between blocks factor. Visually, the between blocks factor is split horizontally and has only one block level for each between blocks level. The within blocks factor is split vertically and each block is contained within each level of the within blocks factor. There is also an interaction term between the within blocks and the between blocks.

Statistic

A statistic is a fact, number, or piece of data taken from an experiment. In ANOVA, an F-statistic is calculated by dividing the between-groups variability by the within-group variability (the variability in the residuals).

Regression

A statistical tool used to quantify the association between two (or more) variables Not inherently causal

ANCOVA

ANCOVA stands for Analysis of Covariance. ANCOVA is a mix of ANOVA and linear regression. In its simplest form, it has one response variable, one treatment variable, and one covariate that is quantitative in nature. The covariate is a nuisance variable which researchers want to control. ANCOVA may not always be the best choice, even if the covariate is numerical. If the relationship between the response variable and the covariate is linear, ANCOVA may be a good choice. If it is not linear, or covariate values are known before running the study, it may be better to use blocking techniques rather than ANCOVA.

Finding the median with odd cases

Add one to N, then divide by 2

What is the probability that a flipped coin is heads?

After it is already flipped, the probability is either 1 or 0

Compute Function (In SPSS)

Allows you to combine variables together

group_by function

Allows you to organize your code according to a certain column in the dataset KidsFeet %>% group_by(sex) %>% summarise(KidsLength = mean(length))

summarise function

Allows you to perform statistical functions on your data KidsFeet %>% summarise(KidsLength = mean(length))

Joint hypothesis

Allows you to test multiple coefficient restrictions at once using an F statistic (just like ANOVA)

3-D Bar Graph

Allows you to visually compare 3 different variables in one graph.

Level of Significance

Also known as alpha, the level of significance is established in hypothesis testing to let researchers know when the observed test statistic is unusual (meaning not likely to occur in the distribution of sample means). The most common alpha level is .05, but other common levels are .1 and .01. For an alpha of .05, the null hypothesis is rejected when the p-value is less than .05. For this test, .05 is also the probability of committing a type 1 error.

Title Rule (of contingency tables)

Always word your title as the dependent variable by the independent variable--"dog ownership by sex".

Interaction

An interaction in statistics is the relationship between two or more factors of interest. Interactions with p-values under the significance level are considered significant. The significance of interactions should be judged both by looking at an interaction plot and by assessing the p-value.

interval level

Applies to data that can be arranged in order. In addition, differences between data values are meaningful. However, there is not a true zero. Example: temperature

(Generally) What does Pearson's r need to be to be considered significant in sociology?

Around .2 or .3

How many observations are needed for x-bar to be normal?

At least 30 for most samples.

Mean

Average calculated by adding up the sum of all the values and dividing by the number of values

Median

Average calculated by putting all the numbers in order from least to greatest, and choosing the middle number

How do you know which way the quadratic line goes by looking at the coefficients?

B1 > 0, B2 > 0: positive convex
B1 > 0, B2 < 0: positive concave
B1 < 0, B2 < 0: negative concave
B1 < 0, B2 > 0: negative convex

Bias technical definition

B2 × Gamma1 (the effect of the omitted variable on Y times the relationship between the included and omitted regressors); equivalently, Short minus Long -- the short-regression coefficient minus the long-regression coefficient.

B2 & Gamma1

B2 - the relationship between the omitted variable and Y. Gamma1 - the relationship between X1 and the omitted variable.

Why do you need to do a two sided test when you have a "not equal to" sign in your alternative hypothesis?

Because both positive and negative z-score values can be as extreme as or more extreme than the observed test statistic.

If you want to obtain the total effect of an intervention, why shouldn't you control for variables that occur after the treatment and/or reflect a pathway through which the treatment might operate?

Because you might be mixing up those post-program predictors with the program itself. For example, if you want to measure the impact of a job training program on earnings, you shouldn't use hours worked after the program as a control, because the program could have affected the number of hours worked.

Bias

Bias is any influence that causes your experiment to systematically deviate from the truth. Bias can occur at several stages of an experiment, including collecting the sample and assigning treatments to units. The best way to guard against bias is to use randomization whenever possible. On page 12 of the textbook, the author describes bias as drawing tickets from a box where some tickets are larger than others; there is bias in the study because some tickets are more likely to be chosen.

Fully standardized regression coefficient (aka "Beta" coefficient)

Both predictor and outcome are converted to SD units

Binary variable vs Dummy variable

A binary variable is a categorical variable with two categories; a dummy variable codes those two categories as 1 or 0 for use in regression.

Categorical Variable

Categorical variables are essentially non-numerical. Performing mathematical functions on these values would not make sense, even if the variable does include numbers (for example: zip codes, telephone numbers, I-numbers).

Formula for calculating expected frequencies in each cell

(cell's column total × cell's row total) / total N
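
chisq.test() computes the expected counts automatically; a sketch with a made-up 2x2 table, checked against the formula by hand:

tbl <- matrix(c(30, 20, 10, 40), nrow = 2)
chisq.test(tbl)$expected                      # expected frequencies
outer(rowSums(tbl), colSums(tbl)) / sum(tbl)  # (row total x column total) / total N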

B1 in multiple regression

Change in Y associated on average with one unit change in X1 holding X2 constant

Indexing

Combining two or more variables into one to create a comprehensive variable

Concave vs Convex

Concave - curves down and makes a cave shape Convex - curves up and makes a bowl shape

Confounding

Confounding occurs when a relationship between a condition and a response is actually explained by a third nuisance variable. In some scenarios, it may seem like the outcome is being caused by the factor of interest, but it's possible that that factor is being confounded with the true cause of the observed relationship.

Crosstabulation or crosstab

Crosses two variables over one another so that we can see if there might be a relationship between them. Sometimes called a contingency table

Crossing

Crossing is the process of testing all possible combinations of the levels of two or more factors. If factor 1 has levels A and B, and factor 2 has levels 1 and 2, I would want to test A1, A2, B1, and B2.

Normal Distribution

Density curve that has a symmetrical, bell-shape. Most observations accumulate near the center and get fewer as you get farther away from the center. Normal distributions have significant statistical properties that allow researchers to make inferences about data samples.

Dot Plot

Depicts the actual values of each data point. Best for small sample sizes or for datasets where there are lots of repeated values. Histograms or boxplots are better alternatives for large sample sizes when there are few repeated values. stripchart(length ~ sex, data = KidsFeet)

Scatterplots

Depicts the actual values of the data points, which are (x, y) pairs. Works well for small or large sample sizes. Visualizes well the correlation between the two variables. Should be used in linear regression contexts whenever possible plot(length ~ width, data = KidsFeet, pch = 8)

Bar Charts

Depicts the number of occurrences for each category, or level, of the qualitative variable. Similar to a histogram, but there is no natural way to order the bars. Thus the white-space between each bar. barplot(table(KidsFeet$sex))

Precision (in OLS)

Describes how small your standard error is - greater precision means smaller standard error or uncertainty

Survey Weight Formula

Desired Percentage/Current Percentage, then multiply your current percentage by the result.

Contingency Table

Displays the counts of categorical variables in a chi-square test. Often one factor makes up the rows of the table and one factor makes up the columns

Finding the median with even cases

Divide N by 2, find value, then find the value of the next highest number, add those two together and divide by 2

What should you do about imperfect multicollinearity?

- Do nothing, if you don't care about X2 and X3 and you're just holding them constant for X1 - Include only one, based on context - Combine them together

What are examples of hypotheses?

Drinking gatorade will result in better game performance Using a new fertilizer will result in higher crop yields Using a new study technique will increase students' grades

Cohen's D (Effect Size)

Effect/SD of outcome

Error Term (in the regression model)

Epsilon - represents the residuals for individual points

The null hypothesis will always be a statement involving--

Equality

Why is the mean sensitive to outliers while the median is not?

Extremely high or low numbers are calculated into the mean and can alter the average. In the median, however, extreme numbers do not affect the number that lies in the middle.

Characteristics of a F-distribution

F-distributions are right-skewed, and there are no negative values in the F-distribution. The p-value is the area of the distribution to the right of the F statistic.

Factor

Factors are the elements of the experiment that contribute to the final observed values. Factors are organized by grouping together data that has undergone similar conditions. In the assembly line metaphor, the factors are the stations in which each data point stops and gets refined. In every study there are universal factors and structural factors. Universal factors are those which every data point shares in common -- the grand mean and the residuals. Structural factors are those that are applied to some data points but not others. They include the treatment factors, interaction factor, and any blocking factors.

Paired Samples t-Test

Feet1 <- filter(KidsFeet, sex == "B") Feet2 <- filter(KidsFeet, sex == "G") t.test(Feet1$length, Feet2$length, paired = TRUE, mu = 0, alternative = "two.sided", conf.level = .95)

Where do you look up information on GSS Survey Information?

GSS Data Explorer Website https://gssdataexplorer.norc.org/variables/vfilter

Bias triangle

A diagram with X1, the omitted variable X2, and Y at its corners: the X1-X2 edge represents Gamma1 and the X2-Y edge represents B2, showing the two relationships that together produce omitted variable bias.

Histogram

Graph that has response values on the x axis and the frequencies in which those values occur on the y axis

Histogram

Graph that uses a single quantitative variable on the X axis and frequency on the Y axis hist(KidsFeet$length)

Scatterplot

Graphic that shows correlation by plotting a quantitative X variable and a quantitative Y variable

Pie Chart

Graphic that shows the proportion of different variables

Boxplot

Graphical depiction of the 5-number summary. boxplot(length ~ sex, data = KidsFeet)

Boxplot

Graphical representation of the 5 number summary. 25 percent of the data is between the minimum and Q1, 25 percent between Q1 and Q2, 25 percent between Q2 and Q3, and 25 percent between Q3 and the maximum.

Null and alternative hypotheses for linear regression

Ho: B1 = 0 Ha: B1 neq 0

One sided H0 in regression

Ho: B1 = B1,0 Ha: B1 > B1,0 or B1 < B1,0 (where B1,0 is the hypothesized value)

Synonyms for "holding variables constant"

Holding fixed Controlling for Blocking

ANOVA Test

Hypothesis test that tests the difference between several means. ANOVA tests have one or more qualitative factors and one quantitative response factor

Paired Samples T-Test

Hypothesis test which tests whether there is significant variation between the differences of two connected samples.

Interpreting negative and positive bias

If the bias is positive (upward), the estimate overstates the true effect; if the bias is negative (downward), it understates the true effect.

How can you tell if an Independent Samples T-Test should be used rather than a Paired Samples T-Test?

If the individual items that go in group 1 don't tell you which items should go in group 2.

What makes an unusual observation?

If the observation occurs less than 5% of the time in a distribution. Or, in other words, if the absolute value of the z-score for the observation is greater than 2.

When is it good to use a cluster sample?

If the within-block variation is greater than the between-block variation, it is okay to use a cluster sample.

Multicollinearity and the dummy variable trap

If you know the values of 3 out of 4 dummy variables, you automatically know the value of the last one. This is why, when you have several dummy variables for one categorical variable, you always leave one out as a base category.

Imputation

Imputation is when you make up for missing data by putting in, or imputing, data based on the general trend of your other data points.

Simple Random Sample (SRS)

In a SRS, all possible items in the population have an equal chance of being chosen

Stratified Sample

In a Stratified Sample, items are organized into strata (or similar groups), and then a SRS is taken of each group.

Systematic Sample

In a Systematic Sample, a sampling interval K is chosen, a random starting point is selected, and then every Kth item is included.

Convenience Sample

In a convenience sample, items are selected out of convenience and without any randomized method

Alternative Hypothesis

In contrast to the Null Hypothesis, the Alternative Hypothesis is the new hypothesis being tested in a given experiment.

Regressor

In regression analysis, a variable that is used to explain variation in the dependent variable. The X's

i.i.d.

Independent - characteristics of one observation are not systematically related to the characteristics of another observation Identically distributed - the statistical distribution of a variable is the same over time Cluster samples are not i.i.d. - your data are not independent

Median Regression

Just takes the absolute value of the errors instead of squaring them

as.numeric & as.factor functions

KidsFeet$birthmonth <- as.factor(KidsFeet$birthmonth)

The Selection Operator

KidsFeet$sex

Degrees of Freedom

The number of "free" numbers within each factor -- the number of values you need until you can fill in the rest using your knowledge of the factor average.

How do you interpret binary and continuous variables together in a regression?

Make sure you say "holding constant" for other variables when interpreting

pander()

Makes graphics and numerical summaries look nice

What is the mean of the sample means?

Mu

What are the null and alternative hypotheses for a paired sample t-test?

Null: Mu(d) = 0 Alternative: Mu(d) neq 0 (or < or > 0 for a one-sided test)

Null and alternative hypotheses for one proportion test

Null: P = some hypothesized value Alternative: P neq (or < or >) that value

Null and alternative hypotheses for a two proportion test

Null: P1 = P2 Alternative: P1 neq (or < or >) P2

What are the null and alternative hypotheses for an Indepedent Sample T-Test?

Null: Mu1 = Mu2 Alternative: Mu1 neq (or < or >) Mu2

Observational Study

Observational studies differ from experimental studies in that researchers don't apply treatments to units. Instead, they observe conditions that are already occurring. They don't manipulate the experiment.

In this example, how many restrictions do you have? South - West = 0

One restriction (even though there are two coefficients). You can let South be whatever you want, but you are restricting what West is

When do you interpret the Adjusted R^2 vs. the multiple R^2?

Only use the Adjusted R^2 if you haven't used robust standard errors

Parameter

Parameters are the unknown, true values that we want to find from the population. They are almost always unknown because it is usually impossible to sample an entire population

Quantitative Variable

Quantitative variables are essentially numeric in nature. You can realistically perform mathematical functions on them.

Quartile

Quartiles divide the data into fourths -- the 25th percentile, 50th percentile, and 75th percentile (Q1, Q2, & Q3)

R-squared formula

R-squared = ESS/TSS OR 1 - SSR/TSS
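
Both forms can be verified against lm()'s reported value (assumes the mosaicData package for the KidsFeet data):

fit <- lm(length ~ width, data = mosaicData::KidsFeet)
y   <- mosaicData::KidsFeet$length
1 - sum(resid(fit)^2) / sum((y - mean(y))^2)  # 1 - SSR/TSS
summary(fit)$r.squared                        # identical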

Reliability

Reliability is related to the idea of repeatability. A reliable study is one where, if run many times, the results would be close to the same.

Control

Researchers often have a control group, which is a group that does not receive the treatment of interest. Instead, the control group is either assigned no treatments, or they have a treatment that is the base-case scenario used for comparison. For example, if researchers wanted to test how two different changes to the Gatorade recipe affect flavor ratings, the control group would be the Gatorade recipe with no changes.

What plots are needed to satisfy the first 3 requirements for linear regression?

Residual plot, scatterplot, QQ plot of residuals

Marginals

Row and column totals in a contingency table, which are shown in its margins.

d-bar

Sample mean of the differences

s(d)

Sample standard deviation of the differences

Two Proportion Test

Similar to the independent samples t-test, the two proportion test is used to judge the differences between two proportions.

B1

Slope - the increase in y when x increases by 1 unit

Experiment

Study in which researchers assign which group(s) receive treatments and which receive no treatment (control group)

Observational Study

Study in which researchers do not assign treatments to groups. Instead, they simply observe patterns found in populations

Covariance Formula

Sxy = r * Sx * Sy
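
The identity is easy to check numerically on R's built-in mtcars data:

x <- mtcars$wt; y <- mtcars$mpg
cov(x, y)                  # covariance directly
cor(x, y) * sd(x) * sd(y)  # r * Sx * Sy gives the same value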

Mu(d)

Symbol for the mean of the differences

x-bar

Symbol for the sample mean

Collapsing

Taking an original variable and from it creating a new variable with fewer categories

Main effects

Terms that do not have interactions

How can you test to see if the data is linear or not?

Test whether B2 (the squared coefficient) is equal to zero Ho: B2 = 0 Ha: B2 neq 0

Hypothesis Test

Tests the alternative hypothesis against the null hypothesis. To run a hypothesis test, you need a null distribution, a test statistic, and a p-value.

What do you conclude if zero is included within your confidence interval?

That there is no significant difference between the two groups because there is a chance that the true difference is zero.

F-Ratio

The F-Ratio is a number from an F-distribution. It is used to calculate a p-value. The F-Ratio is calculated by dividing the mean of squares for the factor of interest by the mean of squares for the residuals -- that is, the between-group variability divided by the within-group variability.

LS Latin Square Design

The Latin Square Design is a blocking design that uses one factor of interest and two blocks. To imagine the structure of the design, you can think of one of the blocks being in the rows and one in the columns. Then, you distribute the levels of your factor throughout the rows and columns so that each level appears exactly once in each row and each column.

Null Hypothesis

The Null Hypothesis is the status quo or what is considered to be true by past experiments or conventional wisdom.

Standard Deviation Definition

The average distance data points are from the mean

What does B0 tell you with binary variables What does B1 tell you?

The average outcome for the omitted category The difference between the omitted category and the other category

Homoskedastic

The error term is homoskedastic if the variance is the same for all values of x Otherwise, the error term is heteroskedastic Most datasets/relationships have some degree of heteroskedasticity

Experimental Unit

The experimental units in a study are the things that are assigned treatments.

Factor Structure

The factor structure is the way that factors are organized in an experiment. Visually, factor structures are often represented in rows and columns to show the grand mean, factors of interest, interactions (if any), and the residuals. This structure can also help people to visualize which factors are inside of each other.

What happens to the margin of error when the sample size increases?

The margin of error decreases

Mean Squares

The mean squares gives you a measurement of the average amount of variability contained within a factor. It is calculated by taking the sum of squares for a factor and dividing it by the degrees of freedom for that factor.

What is meant by the differences of the means?

The means of both groups are calculated, and then the differences of the means are used to obtain a t statistic. In matched pairs, the mean of the differences is used.

Ordinary least squares (OLS)

The most common way of estimating the parameters β₀ and β₁. It finds the line with the "least squares," meaning the smallest sum of squared errors -- the best fit.

Expected Counts

The number of factor counts that you would expect to see assuming that the null hypothesis is true

Observed Counts

The number of times a certain factor outcome occurs

What is meant in a chi-squared test by "the two variables are independent of each other"?

The observed counts for one factor do not depend on the other factor; they would occur regardless of the values of the other factor.

In a SPSS output table, what is the difference between the percent and the valid percent columns?

The percent column includes missing data (often marked as DK NA -- don't know or not applicable) and the valid percent column does not.

Mu

The population mean

Sigma

The population standard deviation

P-Value

The probability of obtaining a test statistic as extreme or more extreme than the one observed, assuming that the null hypothesis is true.

s

The sample standard deviation

Econometrics

The science and art of using economic theory and statistical techniques to analyze economic data

Standard Deviation Formula

The square root of the sum of (x - mu)^2, divided by N. *Note: for a sample, use x-bar in place of mu and divide by n - 1.
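
A quick numeric check in R of the sample version of the formula, with made-up data:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # manual sample standard deviation
sd(x)  # built-in version; should match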

Sum of Squares

The sum of squares is the total variability attributed to a given factor. It is found by squaring each of the individual effects for each level of the factor and summing them. Squaring eliminates negative numbers and gives you the total amount of variability present.

Uniform, Unimodal, Bimodal, and Multimodal

Uniform: No noticeable peaks in the distribution
Unimodal: One peak in the distribution
Bimodal: Two peaks in the distribution
Multimodal: Multiple peaks in the distribution

What is the solution for testing joint hypotheses?

Use an F statistic!

Level-log (outcome, predictor): How do you interpret? What is the model?

Use when it makes sense to think about changes in the predictor variable in percent terms (income level, population). A 1% increase in X is associated with a B1/100 unit change in Y. Model: Y = B0 + B1*ln(X1) + u

filter function

Used to reduce a dataset to a smaller set of rows than the original dataset contained
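
A short sketch, assuming dplyr and the KidsFeet data are loaded:

library(dplyr)
filter(KidsFeet, sex == "B" & !is.na(birthmonth))  # keeps only rows meeting both conditions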

Select Function

Used to select out certain columns from a dataset select(KidsFeet, sex) OR select(KidsFeet, c(name, birthyear, birthmonth))

The Pipe Operator

Used to pass the result of one step as the input to the next function on the following line %>% Shortcut: Ctrl+Shift+M
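
A minimal pipeline sketch, assuming dplyr and the KidsFeet data are loaded:

KidsFeet %>%
  filter(sex == "G") %>%    # keep only girls
  select(name, length)      # then keep only these columns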

Validity

Validity is the level to which an experiment accurately measures and answers the research question. If researchers want to know what the average free throw percentage is for NBA players in NBA games, but they measure the percentage from a sample of players shooting free throws in practice, the study will have low validity because practice free throw percentages aren't necessarily the same thing as game free throw percentages.

Control variable

Variable that is included in the model to control for unwanted variance but is not of inherent interest to the study. Control variables are distinguished in the model equation by Ws instead of Xs.

Dichotomy Level

Variable that takes on only two values. For example: sex.

Dichotomous/dummy variables

Variable used in regression to compare the effect of a yes or no situation, coded as 1 or 0.

Discrete Random Variable

Variable with a random rather than a fixed outcome. The variable is discrete because all possible outcomes could be listed.

Effects

Viewed in the assembly line metaphor, effects are the added differences that data points accumulate as they move down the line. An effect is calculated by taking the average for a given factor level and subtracting the combined effects of all outside factors (e.g., the grand mean).

What is the weight variable in the GSS?

WTSSALL

Robust standard errors

Standard errors that remain valid even when the error term is heteroskedastic; a standard way to deal with heteroskedasticity in OLS.
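
One common approach in R (an assumption here, since the course may use different tooling) is the sandwich and lmtest packages:

library(sandwich)
library(lmtest)
fit <- lm(length ~ width, data = KidsFeet)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # heteroskedasticity-robust standard errors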

(Sample) Z-Score Formula

(X-bar - Mu) / (Sigma / SQRT(N))

Confidence Interval Formula (Sigma Known)

X-bar - Z*Sigma/SQRT(N), X-bar + Z*Sigma/SQRT(N)
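
A numeric sketch in R, assuming a hypothetical sample mean of 50 from n = 36 observations with known sigma = 12:

xbar <- 50; sigma <- 12; n <- 36
z <- qnorm(0.975)  # critical value for 95% confidence
xbar + c(-1, 1) * z * sigma / sqrt(n)  # lower and upper confidence bounds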

Chi-Square Test Statistic

X^2

Log-level (outcome, predictor): How do you interpret? What is the model?

Y is in logs, X is in "levels." A one-unit increase in X is associated with approximately a 100*B1 percent change in Y, holding everything else constant. Model: ln(Y) = B0 + B1*X1 + u
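
A short sketch of a log-level model in R, assuming the KidsFeet data:

fit <- lm(log(length) ~ width, data = KidsFeet)
coef(fit)["width"] * 100  # approximate percent change in length per one-unit increase in width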

B0

Y-intercept - value of y when x = 0

Quadratic regression equation

Yi = B0 + B1x + B2x^2 + ui
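
A minimal sketch of fitting a quadratic in R, assuming the KidsFeet data:

fit <- lm(length ~ width + I(width^2), data = KidsFeet)  # I() protects the squared term
summary(fit)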

Bivariate model equation

Yi = B0 + B1 * Xi + ui

How can you tell between matched pairs and independent samples?

You can tell that the samples are matched pairs if the data in group 1 tells you what data will be in group 2.

How do you find a percentile on the normal distribution applet?

You shade the sections to the left of the percentile and enter in the value (for example .75) into the area specifier at the top.
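
An equivalent in R (an alternative to the applet): qnorm() returns the value at a given percentile.

qnorm(0.75, mean = 0, sd = 1)  # the z-value at the 75th percentile of the standard normal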

Margin of Error Formula (Sigma Known)

Z * Sigma/SQRT(N)

sampling frame

a list of individuals from whom the sample is drawn. Good sampling frames include all individuals in the population.

Logarithm

a quantity representing the power to which a fixed number (the base) must be raised to produce a given number; the inverse of the exponential function. The log base 2 of 8 is 3 because 2 to the third power is 8.
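
A quick check in R:

log(8, base = 2)  # 3
log2(8)           # same thing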

Linear Regression

a statistical method used to fit a linear model to a given data set. Simple linear regression uses one quantitative explanatory variable and one quantitative response variable.

Two-way ANOVA

# Note: birthmonth is numeric in KidsFeet, so convert it to a factor first:
# KidsFeet$birthmonth <- factor(KidsFeet$birthmonth)
aov(length ~ birthmonth*sex, data = KidsFeet, contrasts = list(birthmonth = contr.sum, sex = contr.sum))

One-way ANOVA BF[1]

# As above, birthmonth must be converted to a factor first
aov(length ~ birthmonth, data = KidsFeet, contrasts = list(birthmonth = contr.sum))

pareto chart

bar chart with categories sorted from highest to lowest frequency

Three types of interactions

- Binary and continuous
- Binary and binary
- Continuous and continuous

Residual Formula (linear regression)

(Observed values) - (Expected values) or y-y-hat

What is the distribution of sample means?

(Usually theoretical) Situation where many samples are drawn from a parent population and the means of those samples form a distribution. That distribution is normal if the parent population is normal or if the sample size is large enough.

Explained sum of squares (ESS)

Σ(Y-hat-i - Y-bar)^2: the sum of squared deviations of the predicted value, Y-hat-i, from its average. Tells you how much variation is explained by your line.

Total sum of squares

Σ(Yi - Y-bar)^2: the sum of squared deviations of Yi from its average (the sample variance, up to division by n - 1).

Sum of Squared Residuals (SSR)

Σ(Yi - Y-hat-i)^2: the unexplained variation. Note that TSS = ESS + SSR.
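
A sketch verifying these quantities in R, assuming the KidsFeet data:

fit <- lm(length ~ width, data = KidsFeet)
y <- KidsFeet$length; yhat <- fitted(fit)
ESS <- sum((yhat - mean(y))^2)
TSS <- sum((y - mean(y))^2)
SSR <- sum((y - yhat)^2)
c(ESS + SSR, TSS)  # the two should match
ESS / TSS          # equals R-squared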

Percentile formula

(number of scores at or below a given score / total number of scores) x 100
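
A worked one-liner in R with hypothetical scores:

scores <- c(70, 80, 85, 90, 95)
(sum(scores <= 85) / length(scores)) * 100  # 60: a score of 85 sits at the 60th percentile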

Residuals in regression

(y-y-hat) (observed minus predicted)

Sample Size Formula (no p*)

n = (z*/(2m))^2, where m is the desired margin of error. (This uses p* = 0.5, the most conservative value, which is why no p* is needed.)

Fisher Assumptions

* Additive Model: all observed values are made up of a true value plus an error term.
* Constant Means: the averages for the factors are constant; different data gathered in another situation should produce the same constant average.
* Zero Mean Errors: the average of the errors should be zero.
* Constant Variance Errors: the error terms have similar standard deviations, meaning they are spread out from the mean in a roughly equal manner.
* Normal Errors: the error terms follow a normal distribution. Some are observed above the mean and some below, but most cluster close to the mean.
* Independent Errors: one chance error should not affect the likelihood of another chance error.
* Same Standard Deviation: the standard deviations for the different factors in your ANOVA test should be similar. A general rule is that the largest standard deviation shouldn't be 3 times larger than the smallest. According to the textbook (pg 483), when the standard deviations are too far apart, the factor(s) with the largest standard deviations dominate the grand mean.

Outliers

* Definition: Outliers are data points that lie far from the greater body of data. A common rule of thumb is that a point more than 3 standard deviations from the mean is an outlier.
* Remedies: Some researchers simply drop outliers from the data, but doing so requires good reason, such as knowledge that the particular observation was flawed or misrecorded in some way.

Sources of Variability

* Conditions: Variability within the conditions is to be expected. The reason we test different factors and levels is to see how the numbers differ by condition.
* Material: Variability within the material reflects the actual differences that exist between things. For example, if you want to compare tomato yield by variety, not all plants, even within the same variety, will produce the same amount of tomatoes. There is almost always variability within the things we want to measure.
* Measurement Process: Variability within the measurement process should be mitigated as much as possible. Even when researchers do their best to diminish variation in the measurements, some errors remain. In most cases, if all precautions are taken to mitigate measurement error, we can assume the remaining errors follow a chance-like pattern similar to the variability in the material.

Measurement Classification (Stevens's Four Types)

* Nominal: Nominal responses are essentially categorical and not set on any comparative scale. Colors or race are examples. There is no quantitative or measurable difference between red, black, orange, and blue, and taking an average of these responses wouldn't make sense.
* Ordinal: Ordinal responses are categorical in nature but set to a scale with noticeable differences. Likert scales are a good example. For instance: How satisfied are you with the job President Trump is doing? (A) Very satisfied (B) Satisfied (C) Indifferent (D) Unsatisfied (E) Very unsatisfied. Note that although there are noticeable differences between these responses, the gaps between gradations are not always uniform or measurable. How far is "satisfied" from "very satisfied"? It isn't something we can accurately measure.
* Interval: Intervals are quantitative in nature but have no true zero. Temperature is a good example of an interval response.
* Ratio: Ratios are quantitative in nature and have an absolute zero. For example: height, weight, volume, etc.

Experimental Content

* Response: This is the variable we want to measure; in ANOVA it is quantitative.
* Conditions: These are the different treatments that cause the response.
* Material: The material, or the units, are the things we apply treatments to.

Scheffe Test

# Make sure the agricolae package is loaded first
scheffe.test(AnovaModel, "birthmonth", console = TRUE)

What aspects should be included in a table?

- A title
- Sample size
- Column percentages
- Clear labels

Benefits of using logarithms

- Allows us to express terms in percentage changes
- Useful when making comparisons across different measures

Kinds of Variability

There are three types of variability present in experiments:
* Planned Systematic: variability we want in our studies. Examples include choosing a random sample that is representative of the population, or using a random number generator to assign treatments to units.
* Chancelike: chancelike variability, as stated in the textbook, is something we can deal with. There will always be variation in the things we study, whether due to measurement error or actual differences in the material. However, we can assume that chancelike variability acts in predictable ways: some data points fall above the true average and some below, but overall the observed average still gives an accurate picture of the true average.
* Unplanned Systematic: the type we don't want in our studies. It can result from human bias or bias from nuisance factors. For example, in my airplane experiment, I threw my planes outdoors and the wind picked up during the process. The wind introduced unplanned variation because it distorted the true distances the planes would have traveled, and it didn't affect them all equally -- sometimes the wind blew harder than others.

What are the Cumulative Frequencies and Cumulative Percentages in SPSS?

They tell us how many observations (or percentages) are accounted for up to and including that row in the table.

