Statistics & Data (Super Combined Set)
Filter Rule NOT equal any of several "things"
!birthmonth %in% c(10, 11, 12)
NOT NA
!is.na(birthmonth)
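The filter rules above can be combined. A minimal base-R sketch with a made-up data frame standing in for KidsFeet (the same logical expressions work inside dplyr's filter()):

```r
# Made-up data frame; birthmonth deliberately includes an NA
kids <- data.frame(name = c("Ann", "Ben", "Cal", "Dee"),
                   birthmonth = c(3, 10, NA, 12))

# Keep rows whose birthmonth is NOT 10, 11, or 12 and is NOT NA
kept <- subset(kids, !birthmonth %in% c(10, 11, 12) & !is.na(birthmonth))
```

Only Ann (birthmonth 3) survives both conditions.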
Delta
"change in"
Another name for a type 2 error
-A false negative -Saying there is not a relationship when there actually is one -Fail to reject the null, falsely
Another name for a type 1 error
-A false positive -When we say there is a relationship, but there is not -Reject the null, falsely
When using Cohen's D, how do you know when you have a large effect size?
.2 is small, .5 is medium, and .8 or greater is large (Cohen's conventional benchmarks)
Two conditions of omitted variable bias
1) the omitted variable is correlated with the included regressor 2) the omitted variable is a determinant of the dependent variable.
What are the two requirements for this test?
1. Both samples must be normally distributed (n > 30 for each sample or parent populations are known to be normal or qq plots for both groups show normality). 2. A SRS was taken from each population
What are the five steps of the statistical process
1. Design The Study 2. Collect Data 3. Describe The Data 4. Make Inferences 5. Take Action
5 Requirements for linear regression tests
1. Linear Relationships (check with scatterplot or residual plot) 2. Normal error term (check QQ plot of residuals) 3. Constant variance (no megaphone shape in residual plot) 4. X's are known constants (can't check with plot) 5. Observations are independent (can't check with plot).
3 Things to look at in inference testing
1. Significance 2. Magnitude 3. Direction
Three Rules of Probability
1. The probability of an event occurring is a number between 0 and 1 2. The sum of all possible outcomes must equal 1 3. The probability that an event will occur is 1 minus the probability it won't occur.
In experiments, what are three sources of variability?
1. Variability in the conditions of interest (wanted) 2. Variability in the measurement process 3. Variability in the experimental material
When designing an experiment, what three decisions need to be made about the content?
1. What is the response? 2. What are the treatments? 3. What are the experimental units?
According to the Empirical Rule, how many standard deviations are there between the minimum and the maximum in a normal distribution
6
What are the 3 percentages in the Empirical Rule?
68, 95, and 99.7
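As a check (not part of the card), the three percentages fall straight out of the standard normal CDF via pnorm():

```r
# Share of a normal distribution within 1, 2, and 3 SDs of the mean
within1 <- pnorm(1) - pnorm(-1)  # roughly 0.68
within2 <- pnorm(2) - pnorm(-2)  # roughly 0.95
within3 <- pnorm(3) - pnorm(-3)  # roughly 0.997
```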
Pareto Principle
80/20 rule - 80 percent of your problems come from 20 percent of your causes
The Help Command
?KidsFeet
Fixed Factor
A fixed factor is one that is set by the researchers, as opposed to a factor that is chosen using randomization.
Full Factorial
A full factorial is an ANOVA design where all possible combinations of factors are tested to see the interactions. For example if I have two factors, the first with levels 1 & 2, and the second with levels A and B, I would need to run tests with 1A, 1B, 2A, and 2B to get all of the possible combinations.
Non-linear function
A function with a slope that is not constant for all values of X
Index of qualitative variation
A measure of variability for nominal variables. It is based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution. This index is scored from 0 (no diversity) to 1 (total diversity).
Linear Regression
A method of finding a relationship between two variables using a line
Statistic
A number that describes a characteristic of a sample
Margin of Error
A number used in statistics to indicate the range around a point estimate in which the true parameter value is likely to lie.
QQ Plot
A qq plot is used to judge the normality of a sample distribution. It plots the expected (theoretical) quantiles for a normal distribution on the x-axis and the observed values on the y-axis. If the graph shows a relatively straight, slanted line, we assume the distribution is normal. If it has significant variations from the line, the distribution may not be normal.
social desirability
A tendency to give socially approved answers to questions about oneself.
Type 1 Error
A type one error occurs when researchers reject the null hypothesis when in fact the null hypothesis was true.
Type 2 Error
A type two error occurs when researchers fail to reject the null hypothesis when in fact they should have rejected it in favor of the alternative hypothesis.
Combine Function
Ages <- c(8, 9, 7, 8)
mutate()
Allows you to transform data and create new columns KidsFeet %>% mutate(season = case_when(birthmonth %in% c(12, 1, 2) ~ "Winter"))
Hypothesis
An educated guess about the outcome of an experiment
Experimental Study
An experimental study is one where the researchers apply conditions to experimental units. In essence, they have control over what units get what treatment, and therefore they have more ability to test causal relationships
Parameter
An unknown number which describes a characteristic of a population
Blinding
Blinding is a practice that can reduce bias in a study, especially when there is a placebo group involved. A study is blind when the participants don't know which treatment they are getting, and it becomes double-blind when both the researchers and the participants don't know who is getting the real treatment and who is getting the placebo.
Blocking
Blocking is a method used to account for unwanted variability in a study. When, for example, factors other than the ones you are interested are suspected to influence the response variable, researchers will organize the experimental units into blocks and test to see if some of the variability in the data can be explained by the blocks. When the block factors have a statistically significant effect on the outcome, variability is taken away from the true factor(s) of interest and often raises the chances of having a significant p-value for that/those factor(s).
Coefficients in regression
B0 & B1, also called parameters. Coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response
Comparison/Contrast
Contrasts are used to find differences within the levels of the ANOVA factors. A p-value in an ANOVA test will only tell you if at least one of the levels is significantly different. It does not tell you, however, which level(s) account for the difference. Running contrasts, or comparisons of levels, can tell you more of the actual differences within a factor.
The Assignment Operator
CoolCars <- Keyboard shortcut: Alt + -
What is the acronym used in 221 to remember the 5 steps of the statistical process?
Daniel Can Discern More Truth
Bivariate Data
Data that analyzes two different variables
Decomposition
Decomposition is the breaking down of a statistical analysis to understand the effects of each part. ANOVA works by computing averages for the entire data set, each factor of interest, the interaction between factors, and each individual data point (residuals). Decomposition is best understood by using the assembly-line metaphor, where each data point is visualized by imagining it moving down an assembly line. Each value starts out with the grand average and moves down and gets added on or subtracted by each factor in the study including block factors, the interaction factor, and the residuals
What is a large R-squared?
It depends on the discipline; in the social sciences, R-squared values are often low.
Factor Levels
Each factor may be split up into two or more levels. The factor is the broad category, and the levels are the specific distinctions within each category. If my treatment factor is hair color, the levels might be blonde, brown, red, black, etc
Standard error of the regression (SER)
Estimator of the standard deviation of the regression error ui
When can you multiply the B1 by 100 to get the correct interpretation?
For small changes, where the magnitude of B1 < .1, multiply by 100
For Log-levels, how do you multiply when the magnitude of B1 is >.1
Formula: %∆Y = 100*[e^Bk - 1]
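A quick sketch of that formula with a made-up coefficient (B1 = 0.25, large enough that the naive "multiply by 100" shortcut would be off):

```r
# Log-level model with a "large" coefficient: use the exact formula
b1 <- 0.25
pct_change_y <- 100 * (exp(b1) - 1)  # about 28.4%, not the naive 25%
```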
Distribution
Frequency with which a random variable takes any of the possible values
Tufte's Lie Factor
Graphics should accurately reflect what is going on in the data. The lie factor is calculated by dividing the change shown in the graph by the change shown in the data. An ideal lie factor is 1. % increase as shown in graph/ % increase shown in data
Cluster Sample
In a cluster sample, the population is broken up into blocks, or similar groups of items, then several blocks are randomly chosen and all items within those blocks are included in the sample.
Direction
In a regression line, the direction of the line, either positive or negative
"hats" in the model
Indicate that the number has been estimated from your sample
BREAK
LESSON 10
BREAK
LESSON 11
Linear vs Non-Linear Data
Linear data shows a relatively straight line when plotted. Some data may be correlated, but not linearly
For independent samples, what symbols do we use for the mean, standard deviation, and sample size of groups 1 and 2?
Mean: X-bar1, X-bar2 SD: s1, s2 N: n1, n2
Five-Number Summary
Minimum, Q1, Q2 (median), Q3, & Maximum.
Nominal Data (Mean Median and Mode)
Mode: Yes Median: No Mean: No
Ordinal Data (Mean, Median, and Mode)
Mode: Yes Median: Yes Mean: No
Interval/Ratio (Mean, Median, Mode)
Mode: Yes Median: Yes Mean: Yes
In the normal distribution of sample means, what is the mean of random variable X-bar?
Mu
Nesting
Nesting occurs when one factor is completely inside another factor, and each level of the inside factor occurs only once within the levels of the outside factor. This happens in a SP/RM design when the blocks are nested within the between-blocks factor.
Nuisance Influence
Nuisance influences add bias and unreliability to experiments. Nuisance influences are often controlled by incorporating them into the experiment as blocks. For example, in psychological studies, the participants are often used as blocks because there is often a lot of variability between the individual participants
Null and alternative hypotheses for a chi-squared test
Null: All of the factors are independent Alternative: The factors are not independent
Null and alternative hypotheses for ANOVA tests
Null: Mu1 = Mu2 = Mu3 OR Alpha1 = Alpha2 = Alpha3 = 0 Alternative: Mu-sub(i) is not equal for at least one level of i OR Alpha-sub(i) is not zero for at least one level of i
Bias upward (positive) Bias downward (negative)
Overstating the effect - the coefficient becomes less positive when the omitted factor is controlled for Understating the effect - the coefficient becomes more positive when the omitted factor is controlled for
Probability Notation
P(X) = .....
Greek Rho (ρ)
Population parameter for r
Positive and Negative Association
Positive associations will trend upwards , and negative associations will trend downwards. For positive associations, when x increases, y also increases. In negative associations, when x increases, y decreases.
Synonyms for Magnitude of coefficients
Practical significance, substantive importance, economic importance, policy significance
R-Squared
Proportion of the variation in the dependent variance that can be explained by its relationship with the independent variable.
What is the standard deviation of the sample means?
Sigma/Sqrt(N)
In the normal distribution of sample means, what is the standard deviation of random variable X-bar?
Sigma/Square Root of N
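A short simulation with made-up population values (mean 50, sigma 10) showing that the standard deviation of many sample means comes out close to Sigma/sqrt(n):

```r
set.seed(42)                  # for reproducibility
sigma <- 10; n <- 25
# Draw 5000 samples of size n and record each sample mean
xbars <- replicate(5000, mean(rnorm(n, mean = 50, sd = sigma)))
theory <- sigma / sqrt(n)     # 10 / 5 = 2
```

sd(xbars) lands very close to the theoretical value of 2.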
Central Limit Theorem
States that as N gets larger, the distribution of the sample mean (x-bar) becomes more normally distributed, regardless of the shape of the parent population.
r (the sample estimate of Greek rho, ρ)
Statistic that measures the strength of your linear relationship. A number between -1 and 1. 0 Means there is no linear relationship. -1 is perfect negative relationship, and 1 is a perfect positive relationship. This statistic is helpful because it indicates whether the relationship is negative or positive, but it doesn't tell you how much variance in y is explained by x like R2 does
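cor() computes r; perfectly linear made-up data returns exactly 1 or -1:

```r
x <- 1:5
r_pos <- cor(x, 2 * x + 3)   # perfect positive linear relationship
r_neg <- cor(x, -2 * x + 3)  # perfect negative linear relationship
```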
Test Statistic
Statistic used in hypothesis testing to determine the p-value for the distribution. Common test statistics include z statistics, t statistics, f statistics, and chi squared statistics.
Causal Inference v. Statistical Inference
Statistical - infer something about one or more population Causal - does x cause y
Statistical Inference
Statistical techniques used to test hypotheses and make conclusions about the data.
Who was William Sealy Gosset?
Statistician who came up with the t-distribution. Gosset worked at a brewery at the time, and to protect his identity, he published his work under the pseudonym "Student".
Explanatory Statistics
Statistics that include explanatory relationships as to why certain things happen. Usually include dependent and independent variables.
interquartile range
The difference between the upper and lower quartiles - Q3-Q1 (or 75th percentile and 25th percentile).
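A small check with made-up scores, computing the IQR both from the Q3 - Q1 definition and with the built-in IQR():

```r
scores <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
q <- quantile(scores, c(.25, .75))   # Q1 and Q3
iqr_manual <- unname(q[2] - q[1])
iqr_builtin <- IQR(scores)
```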
What is meant by "the mean of the differences"?
The differences between every pair is calculated, then the mean is taken of those differences.
Treatment Group
The group in a study that receives certain conditions that we are interested in testing
The Law of Large Numbers
The larger the sample size, the closer x-bar will be to the true Mu
Chance Error
The observed value minus the sum of the effects for the partial fit. The chance error, or residual error, is made up of variability in the material and the measurement process. Similar items, measured under the same conditions, will have different values. The chance error is the difference between these observed values and what we would expect them to be based on averages. Chance errors usually follow a normal distribution--some are above the average and some are below, but all in all they even out around the average.
Multiple Comparisons Problem
The problem with running multiple comparisons in the same experiment is that your chances of committing a type one error (or rejecting the null hypothesis, falsely) go way up. If each test has a .05 chance that the test statistic you observed was as extreme as it was, then after many tests there is a good possibility that a type one error was committed. * Family-wise error rate: the overall chance of committing at least one type one error across all of your tests. * Adjustments: To cut down on this error, there are several different adjustments researchers make. One common one is the Bonferroni method, which splits up your error rate among all of your tests so that all tests collectively stay below the family-wise error rate.
Interactions
The product of different predictors (x1 * x2)
U^2-i
The residual variance. The greater the residual variance, the greater the standard error.
Response
The response variable, also called the dependent variable, is the variable we obtain by running an experiment and collecting data. In ANOVA designs, the response variable is quantitative.
Sampling Risk
The risk that a sample might not actually represent/resemble the characteristics of the parent population
p-hat
The sample proportion
B0-hat + B1-hat*X
The sample regression line as calculated by OLS. Without the hats, this would be the population regression line equation.
Population
The total number of individuals or items in which you are interested in studying
How do you interpret Bo in multiple regression
The value of Y when ALL X's are 0
Regressand
The variable to be explained in a regression or other statistical model; the dependent variable
R^2 in multiple regression Adjusted R^2?
The variation in y explained by all X's Adjusted R^2 is a measure that accounts for the number of predictors in the model, penalizing R^2 when additional regressors add little explanatory power
How can you reduce sampling risk?
There are two ways to reduce sampling risk: 1. Take a random sample 2. Increase your sample size (N)
Control Group
This group typically doesn't receive treatments and is used to compare with the treatment group
Correlation Coefficient (r)
This number shows the strength of a linear relationship. r is always a number between -1 and 1. The closer it is to -1 and 1, the stronger the relationship (either positive or negative).
68 - 95 - 99.7% Rule
This rule states that in a normal distribution, 68% of the data will lie within 1 standard deviation, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
What does it mean to include a quadratic?
To include the linear term and the squared term in your model
Conditions/Treatments
Treatments or conditions are experimental factors that we, as researchers, are interested in. In ANOVA designs, these factors are categorical (or treated as categorical if numerical values are involved).
What does "Signing the bias" mean?
Trying to figure out what the bias might be if you don't have the data to check yourself
Tukey HSD Test
TukeyHSD(AnovaModel, "birthmonth")
Causal Inference
Using data to estimate how an independent variable (or many independent variables) directly impact(s) a dependent variable. How does X impact Y? The most common method of causal inference is linear regression.
Prediction
Using the observed value of some variable to predict the value of another variable
What is considered a large enough sample in econometrics?
Usually greater than 100. But check the assumptions for normality!
Estimand Estimator Estimate
Estimand: what the researcher hopes to estimate - the target. Estimator: the rule (algorithm) by which the estimate is to be calculated - the method or process. Estimate: the numerical value obtained by applying the estimator to the sample data.
Conditional Expectation
What you would expect from the statistical prediction--in our case, the predicted regression line
log-log What is the interpretation? What is the model?
When both the predictor and the outcome are in logs A one percent increase in X is associated with a B1 percent change in Y, on average and holding all other variables constant. You don't have to multiply or divide by 100 lnY = B0 + B1 * lnX1 + u
When is it appropriate to interpret the Y-intercept?
When it makes sense practically. Does it make sense for x to be 0?
When is it good to use a systematic sample?
When items can be ordered numerically and the order does not have anything to do with the characteristics of the item.
Perfect multicollinearity
When one of the regressors is an exact linear function of the other regressors This happens when you use two variables that measure the same thing -- you can't "hold the other constant" because it's the same variable.
In-sample prediction
When the observation for which the prediction is made was also used to estimate the regression coefficients
When is it okay to use a pie chart?
When the parts of the pie chart combine to make a whole
Out-of-sample prediction
When the prediction for observations is NOT in the estimation sample The goal of regression prediction is to provide out-of-sample predictions
Imperfect multicollinearity
When two or more regressors are very highly correlated High correlation between regressors will result in one (or both) coefficients being very imprecisely estimated (large standard errors) Whether or not you include or remove one of the variables that are imperfectly multicollinear is a decision you must make based on your best judgement.
Extrapolation in Regression
When you make assumptions or predictions about data that is outside of the regression line.
Percentage Rule (of contingency tables)
With your independent variable on top (in the columns), you determine percentages/proportions within the columns, not the rows
Cronbach's alpha
a correlation-based statistic that measures a scale's internal reliability
Histogram
a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
Bar Graph
a graph that uses vertical or horizontal bars to show comparisons among two or more items
Z-Score
a measure of how many standard deviations you are away from the norm (average or mean)
quadratic polynomial
a polynomial of degree 2 - creates a U shape when graphed
continuous variable
a quantitative variable that has an infinite number of possible values that are not countable
Frequency Distribution
an arrangement of data that indicates how often a particular score or observation occurs
Filter Rule NOT Equals one "thing"
birthmonth != 5
Filter Rule Equals any of several "things"
birthmonth %in% c(5, 6, 7, 8, 9)
Filter Rule Less Than Less Than Or Equal to Greater Than Greater Than Or Equal to
birthmonth < 5 birthmonth <= 5 birthmonth > 5 birthmonth >= 5
What does the * mean in SPSS titles?
by, as in Importance of Obeying * (by) Importance of Thinking For One's Self
What do you need to type before any code in SPSS?
compute name_of_variable=name_of_variable. execute.
Recoding
creating new categories or columns by combining or modifying existing data
Ordinal Level
data arranged in some order, but the differences between data values cannot be determined or are meaningless
Dummy Coefficient Function
dummy.coef(Your Test)
ui
error term (or residuals) difference between Yi and its predicted value (Y-hat-i) using the population regression line (observed - predicted)
Anscombe's Quartet
four datasets that have the same simple descriptive statistics (mean, median, ... ), yet appear very different when graphed.
Imputation
giving one's best guess to fill in missing data
Bonferroni Test
pairwise.t.test(KidsFeet$length, KidsFeet$birthmonth, "bonferroni")
Fisher's LSD Test
pairwise.t.test(KidsFeet$length, KidsFeet$birthmonth, "none")
Anova QQ Plot and Constant Variance Plot
plot(AnovaModel, which = 1:2)
ANOVA Requirements
plot(myaov, which = 1:2)
Coefficient in regression
refers to the slope
oversampling
researcher intentionally over represents one or more groups
Sample Variance
s squared
Filter Rule Equals one "thing"
sex == "G"
b1
slope (when x increases by 1 unit, y will increase by...)
Randomization
the best defense against bias, in which each individual is given a fair, random chance of selection
level of measurement
the extent or degree to which the values of variables can be compared and mathematically manipulated
Ratio Level
the interval level with an inherent zero starting point. Differences and ratios are meaningful for this level of measurement and mathematical operations may be performed.
R-squared
the proportion of the total variation in a dependent variable (Y) explained by an independent variable (X) Measure of the strength of a relationship
standard error
the standard deviation of a sampling distribution
Omitted variable bias
when the correlation we see between X1 and Y is biased, because we didn't control for X2 Mechanical definition - Alpha1 (coefficient from the "short" regression) - B1 (coefficient from the "long" regression)
Nonlinear formula for changes in y for a certain unit change in x; the "exact method"
∆Y = Y2 - Y1, where Y1 = B0 + B1*X + B2*X² (the model evaluated at the starting value of X) and Y2 = B0 + B1*(X + ∆X) + B2*(X + ∆X)² (the model evaluated after the change in X)
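A worked example of the exact method with made-up coefficients (B0 = 2, B1 = 3, B2 = 0.5), moving X from 4 to 5:

```r
b0 <- 2; b1 <- 3; b2 <- 0.5
y1 <- b0 + b1 * 4 + b2 * 4^2   # predicted Y at the starting X
y2 <- b0 + b1 * 5 + b2 * 5^2   # predicted Y after the change in X
delta_y <- y2 - y1
```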
Partially standardized regression
Converts the predictor metric to standard deviations A one-SD change in x is associated with a B1-unit change in y In this case, the predictor is changed to SD units
How are the degrees of freedom calculated in a Chi-Squared test?
df = (number of rows - 1) * (number of columns - 1)
Chi-Squared degrees of freedom equation
df= (# of rows - 1) x (# of columns - 1)
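For example, with a hypothetical 3-row by 4-column contingency table:

```r
tbl <- matrix(1:12, nrow = 3, ncol = 4)  # made-up 3x4 table of counts
df_chisq <- (nrow(tbl) - 1) * (ncol(tbl) - 1)  # (3-1)*(4-1) = 6
```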
Fav Stats (One Way ANOVA)
favstats(Particles ~ Companies, data = ivdata)
bar chart
graphic that uses categorical variables on the x axis and quantitative variables on the y axis
Residual Plot
graphical representation of the residuals that can be used to determine whether the assumptions made about the regression model appear to be valid
Equals NA
is.na(birthmonth)
Sample Size Formula (with p*)
n = (z*/m)^2 • p*(1-p*)
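A worked example with made-up inputs: 95% confidence (z* = 1.96), margin of error m = .03, and the conservative p* = .5; round up to the next whole person:

```r
z_star <- 1.96; m <- 0.03; p_star <- 0.5
n_needed <- ceiling((z_star / m)^2 * p_star * (1 - p_star))  # 1068
```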
Normality Requirements to conclude that p-hat is normally distributed
n*p > 10 AND n*(1 - p) > 10
Requirements for Confidence Interval
n*p-hat > 10 n*(1-p-hat) > 10
Percentage Change Formula
(new - old)/old
Degrees of Freedom
A number related to the sample size. As the sample size increases, so do the degrees of freedom. For simple distributions, df = n - 1. As the degrees of freedom increase, the t-distribution becomes more and more like a z-distribution.
Inferential Statistics
numerical data that allow one to generalize- to infer from sample data the probability of something being true of a population
One Proportion Confidence Interval Formula
p-hat +- z*sqrt(p-hat(1-p-hat)/n)
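A worked example with made-up numbers (40 successes out of 100, 95% confidence):

```r
p_hat <- 40 / 100; n <- 100; z_star <- 1.96
margin <- z_star * sqrt(p_hat * (1 - p_hat) / n)
ci <- c(p_hat - margin, p_hat + margin)  # roughly (0.304, 0.496)
```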
Z-Score Formula for One Proportion
(p-hat - p)/sqrt(p(1 - p)/n)
p-hat formula
p-hat = x/n
Overall sample proportion formula ("pooled proportion")
p-hat = (x1 + x2)/(n1 + n2)
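For example, with made-up counts x1 = 30 of n1 = 100 and x2 = 45 of n2 = 150:

```r
x1 <- 30; n1 <- 100; x2 <- 45; n2 <- 150
# Parentheses around both sums matter: pool the counts, then divide
p_pooled <- (x1 + x2) / (n1 + n2)  # 75/250 = 0.3
```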
p*
p-star -- used to calculate the sample size needed to use certain levels of confidence. p-star is a proportion from a previous study that can be used to estimate future sample sizes
Y-hat
predicted value of y (the line we create)
qqPlot for paired samples
qqPlot(Feet1$length - Feet2$length)
qqPlot for independent samples t-test
qqPlot(length ~ sex, data = KidsFeet)
Nominal Level
qualitative data consisting of labels or names. Cannot be logically ordered
Standard Deviation of p-hat formula
sqrt(p(1 - p)/n)
Descriptive Statistics
statistics that summarize the data collected in a study in a clear form. Popular descriptive statistics are mean, median, mode, standard deviation, and the five number summary
Strength
strength refers to how accurate your model is. Strong regression models will have data points that are very close to the line, while weak models will have data points that are more spread out from the line
d
symbol for population difference in paired samples
Independent Samples T-Test
t-test where two independent groups are compared. The mean of each group is taken, and the difference of the means is calculated to determine the t-statistic and p-value.
One Sample t-Test
t.test(KidsFeet$length, mu = 25, alternative = "two.sided", conf.level = .95)
Independent Samples t-Test (r code)
t.test(length ~ sex, data = KidsFeet, mu = 0, alternative = "two.sided", conf.level = .95)
table function
table(KidsFeet$sex) OR table(KidsFeet$sex, KidsFeet$birthmonth)
p
true population proportion (parameter)
Clustered Bar Graph
useful for comparing two categorical variables and are often used in conjunction with crosstabulations
(Population) Z-Score Formula
(x - mean)/standard deviation
Confidence Interval Formula (Sigma Unknown)
x-bar + or - t* • s/sqrt(n)
t-score formula
(x-bar - mu)/(s/sqrt(n))
Logarithmic Function
y = ln(x)
Simple Linear Equation
y-hat = bo + b1x
b0
y-intercept
Proportion Margin of Error Formula
z* • sqrt(p-hat(1-p-hat)/n)
Approximate formula for the difference between the logarithm of x + ∆x and the logarithm of x
∆x/x
Variance formula
∑(x - x-bar)²/(n - 1)
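A check with made-up data that the formula matches R's built-in var():

```r
x <- c(2, 4, 6, 8)
# Sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (length(x) - 1)  # 20/3
```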
How do you compare apples to oranges - a 28 on the ACT to 1200 on the SAT for example?
- Convert into percentiles - Convert into Z-scores
Effect size of a Predictor (standardization method)
- Converts the outcome to standard deviations - Bx/SDy - Taking the regression coefficient and dividing by the SD of the outcome variable In this case, the outcome is changed to SD units
Regression Assumptions
- Errors have a mean of zero - There is no selection bias - X and Y are independently and identically distributed across observations (i.i.d.) (have a simple random sample) - Large outliers are unlikely
How do you choose whether to do a quadratic or not?
- Examine the data - Underlying theory of relationship - Estimate more flexible model and test whether you can reject a less flexible model
On a contingency table, where do you put the dependent and independent variables?
- Independent variable goes on top (columns) - Dependent variables goes on the side (rows) "The side depends on the top"
What is the purpose of multiple regression?
- It can be used to prove causal inference by testing the effect of a variable while holding all other variables constant Or - Used to test multiple regressors at once to make predictions (but not causal inferences)
3 types of log models
- Log-level - the outcome is a log - The predictor is a log - Both are logs
How do you know whether to use a log or a quadratic?
- Quadratics won't work as well if you don't think your data will curve downwards - Use logs if a percentage interpretation is helpful - You can't use logs if variables have values that are <= 0 - Cannot use either for dummy variables - In many cases, it does not matter
Ways to show practical significance for unintuitive predictor variables
- Use percentiles (moves from this percentile to this percentile) - Effect size (Cohen's "d") - which shows the percent of a standard deviation - Relate to something else - like poverty - "Living in a large district is like raising the poverty rate by 13%" or grades - "An 8-point increase is like going from a C- to a B"
What are the 5 sampling methods discussed in this class?
1. Simple Random Sample 2. Systematic Sample 3. Cluster Sample 4. Stratified Sample 5. Convenience Sample
Broad categories of joint hypothesis tests
1. Testing differences from zero Ho: B1=0 and B2=0 Ha: B1 neq 0 and/or B2 neq 0 2. Testing equality of coefficients Ho: B1=B2 Ha: B1 neq B2
Two Basic Properties of The Normal Distribution
1. The area inside the distribution is equal to 1 2. The curve lies on or above the x-axis
ANOVA Requirements
1. The data come from a simple random sample 2. The residuals must have constant variance, meaning that they are consistently spread out from the mean. A megaphone shape in a diagnostic plot means that there is not constant variance 3. The residuals must be normally distributed. This can be checked using a qqPlot of the residuals. A relatively straight line means that the residuals are normal.
Requirements for a chi-squared test
1. The data must come from a SRS 2. The expected counts for each box in the contingency table must be greater than or equal to 5
Chi-Square values and their corresponding p-values
1.64 -- .20 2.71 -- .10 3.84 -- .05 5.41 -- .02 6.64 -- .01 10.83 -- .001
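These pairs match the chi-squared distribution with 1 degree of freedom (an assumption; the card does not state df), which pchisq() confirms:

```r
# Upper-tail probabilities for df = 1
p_384 <- 1 - pchisq(3.84, df = 1)  # about .05
p_664 <- 1 - pchisq(6.64, df = 1)  # about .01
```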
Z Scores for 90, 95, and 99 percent confidence
1.645 1.960 2.576
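These come straight from qnorm() (a check, not part of the card); each confidence level leaves half of the remaining area in each tail:

```r
z90 <- qnorm(0.95)   # 90% confidence: 5% in each tail
z95 <- qnorm(0.975)  # 95% confidence: 2.5% in each tail
z99 <- qnorm(0.995)  # 99% confidence: 0.5% in each tail
```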
Percentile
100 divisions of the data. Each percentile tells you how much of the data is at or below the percentile. For example, if you scored in the 99th percentile on an exam, 99 percent of students scored as well as you or worse than you, and only 1 percent of students scored better than you.
In what year did Gosset publish his t-distribution solution?
1908
BF Basic Factorial Design
A Basic Factorial Design is an ANOVA design that tests the effect of one or more categorical factors on one numerical response variable. If the study is experimental, it is often designated as CR or completely random.
Balanced Design
A balanced design requires the same number of responses (or data points) for each level of each factor. For example, if my treatment variable is type of car, and I have 3 types--Honda, Toyota, and Ford--I would need the same number of cars in each group. In R, type 1 sums of squares is the base option, and you can use type 1 sums of squares for most experiments with balanced designs. If your design is unbalanced, you should use type 2 or type 3 sums of squares. In R, you specify this by adding type = "II" or type = "III".
Population Pyramid
A bar graph representing the distribution of population by age and sex. Population Pyramids work well when you have a ratio-level variable and a dichotomous variable (like age and sex)
Stacked Bar Graph
A bar graph that mimics a crosstab. It compares the same categories for different groups and shows category totals. The independent variables are the columns (or bars) and the dependent variables are split out within each bar.
CB Complete Block Design
A complete block design uses 1 nuisance factor as a block and one or more treatment factors. In CB designs, every level of the treatment factor is randomly applied within each block. Interaction terms are not present in CB designs unless there is more than 1 response for each treatment/block combination.
Census
A comprehensive list of all the individuals or items found in a population
Confidence Interval
A confidence interval gives a range where true values are likely to lie, given a certain percentage of confidence. In an interval with 95 percent confidence, researchers are 95 percent confident that the true mean lies within the given range. The range is calculated by using a point estimator plus and minus the margin of error.
Confidence Interval
A confidence interval is a range of values which are likely to contain the true value (usually the true average) within the range. A confidence interval is made up of a point estimator plus or minus the margin of error. Common confidence levels are 90%, 95%, and 99%. In a 95% confidence interval, for example, it can be assumed that 95% of confidence intervals will contain the true value. This does not mean, however, that there is a 95% probability that the true value is within the lower and upper limits.
Fractional Factorial
A fractional factorial is one that only runs a fraction of the possible tests that could be done by crossing all factors. Usually only 50 or 25 percent of the factor combinations are run. The limited number of factor combinations does not allow inference for interactions, but often the effects of the main factors are not incredibly far off from what would be the case if a full factorial was carried out. Fractional factorials are often used in industry to cut down on the costs of running many tests.
p-value
A p-value is the probability of obtaining a test statistic as extreme or more extreme than the one you observed, assuming that the null-hypothesis is true. P-values are considered significant if they are below a certain level, typically .05.
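As a minimal sketch of the definition, here is a two-sided p-value computed from a hypothetical observed z statistic:

```r
# Two-sided p-value: the probability of a value as extreme or more
# extreme than the observed statistic, in either tail of the standard normal.
z <- 1.96                      # hypothetical observed test statistic
p_value <- 2 * pnorm(-abs(z))  # area in both tails
# p_value is just under .05, right at the usual significance cutoff.
```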
Point Estimator
A point estimator is a single value computed from sample data that estimates an unknown population parameter. Examples of point estimators are x-bar (which estimates Mu) and s (which estimates Sigma).
Population
A population is the sum of all people or things that you are interested in. If you wanted to study American alternative rock bands, your population would be every single alternative band.
Sample
A sample is a limited number of individuals or items taken from the population
Sample
A sample is a number of units taken from a population that is hopefully representative of that population
Uniformity vs. Representativeness
A sample that is uniform will have subjects that are close to the same, whereas a representative sample will resemble the variability present in the actual population.
SP/RM Split Plot / Repeated Measures Design
A split plot or repeated measures design has two factors of interest and one blocking factor that is nested within the between blocks factor. Visually, the between blocks factor is split horizontally and has only one block level for each between blocks level. The within blocks factor is split vertically and each block is contained within each level of the within blocks factor. There is also an interaction term between the within blocks and the between blocks.
Statistic
A statistic is a fact, number, or piece of data taken from an experiment. In ANOVA, an F-statistic is a statistic calculated by dividing the between-groups variability by the within-group variability (the variability in the residuals).
Regression
A statistical tool used to quantify the association between two (or more) variables Not inherently causal
ANCOVA
ANCOVA stands for Analysis of Co-variance. ANCOVA is a mix of ANOVA and linear regression. In its simplest form, it has one response variable, one treatment variable, and one co-variate variable that is quantitative in nature. The co-variate is a nuisance variable which researchers want to control. ANCOVA may not always be the best choice, even if the co-variate is numerical. If the relationship between the response variable and the covariate is linear, ANCOVA may be a good choice. If it is not linear or co-variate values are known before running the study, it may be better to use blocking techniques rather than ANCOVA.
Finding the median with odd cases
Add one to N, then divide by 2 to find the position of the median
What is the probability that a flipped coin is heads?
After it is already flipped, the probability is either 1 or 0
Compute Function (In SPSS)
Allows you to combine variables together
group_by function
Allows you to organize your code according to a certain column in the dataset KidsFeet %>% group_by(sex) %>% summarise(KidsLength = mean(length))
summarise function
Allows you to perform statistical functions on your data KidsFeet %>% summarise(KidsLength = mean(length))
Joint hypothesis
Allows you to test restrictions on multiple coefficients at once using an F statistic (whatup ANOVA)
3-D Bar Graph
Allows you to visually compare 3 different variables in one graph.
Level of Significance
Also known as alpha, the level of significance is established in hypothesis testing to let researchers know when the observed test statistic is unusual (meaning not likely to occur in the distribution of sample means). The most common alpha level is .05, but other common levels are .1 and .01. For an alpha of .05, the null hypothesis is rejected when the p-value is less than .05. For this test, .05 is also the probability of committing a type 1 error.
Title Rule (of contingency tables)
Always word your title as the dependent variable by the independent variable--"dog ownership by sex".
Interaction
An interaction in statistics is the relationship between two or more factors of interest, where the effect of one factor depends on the level of another. Interactions with p-values under the significance level are considered significant. The significance of interactions should be determined both by looking at an interaction plot and by assessing the significance of the p-value.
interval level
Applies to data that can be arranged in order. In addition, differences between data values are meaningful. However, there is not a true zero. Example: temperature
(Generally) What does Pearson's r need to be to be considered significant in sociology?
Around .2 or .3
How many observations are needed for x-bar to be normal?
At least 30 for most samples.
Mean
Average calculated by adding up the sum of all the values and dividing by the number of values
Median
Average calculated by putting all the numbers in order from least to greatest, and choosing the middle number
How do you know which way the quadratic line goes by looking at the coefficients?
B1 > 0, B2 > 0: positive convex; B1 > 0, B2 < 0: positive concave; B1 < 0, B2 < 0: negative concave; B1 < 0, B2 > 0: negative convex
Bias technical definition
Bias = B2 * Gamma1: the effect of the omitted variable on Y (B2) times the relationship between the omitted variable and the included regressor (Gamma1). The bias is zero only if one of those two pieces is zero.
B2 & Y1
B2 - relationship between the Y variable and the omitted variable Y1 (Gamma1) - relationship between the included regressor and the omitted variable
BASIC EXPERIMENTAL ELEMENTS
BREAK
Why do you need to do a two sided test when you have a "not equal to" sign in your alternative hypothesis?
Because both positive and negative z-score values can be as extreme or more extreme than the observed test statistic.
If you want to obtain the total effect of an intervention, why shouldn't you control for variables that occur after the treatment and/or reflect a pathway through which the treatment might operate?
Because you might be mixing up those post-program predictors with the program itself. For example, if you want to measure the impact of a job training program on earnings, you shouldn't use hours worked after the program as a control, because the program could have affected the number of hours worked.
Bias
Bias is any influence that causes your experiment to systematically deviate from the truth. Bias can occur at several stages of an experiment, including collecting the sample and assigning treatments to units. The best way to guard against bias is to use randomization whenever possible. On page 12 of the textbook, the author describes bias as drawing tickets from a box, but some tickets are larger than others; therefore, there is bias in the study because some tickets are more likely to be chosen.
Fully standardized regression coefficient (aka "Beta" coefficient)
Both predictor and outcome are converted to SD units
Binary variable vs Dummy variable
A binary variable is a categorical variable with two categories. A dummy variable is a binary variable coded as 1 or 0 so that it can be used in regression.
Categorical Variable
Categorical variables are essentially non-numerical. Performing mathematical functions on these values would not make sense, even if the variable does include numbers (For example: zip codes, telephone numbers, I-numbers).
Formula for calculating expected frequencies in each cell
Cell column N x cell row N/total N
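The same expected counts can be computed by hand and checked against what chisq.test() produces. The 2x2 table below is hypothetical:

```r
# Expected frequency for each cell = (row total * column total) / grand total.
tbl <- matrix(c(20, 30, 25, 25), nrow = 2)  # hypothetical counts
expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)
# chisq.test() computes the same expected counts internally:
same <- all.equal(as.vector(expected), as.vector(chisq.test(tbl)$expected))
```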
B1 in multiple regression
Change in Y associated on average with one unit change in X1 holding X2 constant
Indexing
Combining two or more variables into one to create a comprehensive variable
Concave vs Convex
Concave - curves down and makes a cave shape Convex - curves up and makes a bowl shape
Confounding
Confounding occurs when a relationship between a condition and a response is actually explained by a third nuisance variable. In some scenarios, it may seem like the outcome is being caused by the factor of interest, but it's possible that that factor is being confounded with the true cause of the observed relationship.
Crosstabulation or crosstab
Crosses two variables over one another so that we can see if there might be a relationship between them. Sometimes called a contingency table
Crossing
Crossing is the process of testing all possible combinations of the levels of two or more factors. If factor A has levels 1 and 2, and factor B has levels X and Y, crossing them tests 1X, 1Y, 2X, and 2Y.
Normal Distribution
Density curve that has a symmetrical, bell-shape. Most observations accumulate near the center and get fewer as you get farther away from the center. Normal distributions have significant statistical properties that allow researchers to make inferences about data samples.
Dot Plot
Depicts the actual values of each data point. Best for small sample sizes or for datasets where there are lots of repeated values. Histograms or boxplots are better alternatives for large sample sizes when there are few repeated values. stripchart(length ~ sex, data = KidsFeet)
Scatterplots
Depicts the actual values of the data points, which are (x, y) pairs. Works well for small or large sample sizes. Visualizes well the correlation between the two variables. Should be used in linear regression contexts whenever possible plot(length ~ width, data = KidsFeet, pch = 8)
Bar Charts
Depicts the number of occurrences for each category, or level, of the qualitative variable. Similar to a histogram, but there is no natural way to order the bars. Thus the white-space between each bar. barplot(table(KidsFeet$sex))
Precision (in OLS)
Describes how small your standard error is - greater precision means smaller standard error or uncertainty
Survey Weight Formula
Desired Percentage/Current Percentage gives the weight; multiplying the current percentage by the weight recovers the desired percentage. Each case in that group is weighted by this factor.
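A minimal sketch with hypothetical numbers: a group that is 52% of the population but only 40% of the sample:

```r
desired <- 0.52                 # population share
current <- 0.40                 # sample share
weight <- desired / current     # 1.3
adjusted <- current * weight    # multiplying by the weight recovers 52%
```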
Contingency Table
Displays the counts of categorical variables in a chi-square test. Often one factor makes up the rows of the table and one factor makes up the columns
Finding the median with even cases
Divide N by 2 to find the position, take the value at that position and the next highest value, add those two together, and divide by 2
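Done by hand and checked against R's built-in median():

```r
vals <- c(7, 1, 5, 3)                               # even number of cases
sorted <- sort(vals)                                # 1 3 5 7
n <- length(sorted)
by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2  # average the middle pair
```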
What should you do about imperfect multicollinearity?
Do nothing, if you don't care about X2 and X3 and you're just holding them constant for X1; include only one, based on context; or combine them together.
What are examples of hypotheses?
Drinking gatorade will result in better game performance Using a new fertilizer will result in higher crop yields Using a new study technique will increase students' grades
Cohen's D (Effect Size)
Effect/SD of outcome
Error Term (in the regression model)
Epsilon - represents the residuals for individual points
The null hypothesis will always be a statement involving--
Equality
Why is the mean sensitive to outliers while the median is not?
Extremely high or low numbers are calculated into the mean and can alter the average. In the median, however, extreme numbers do not affect the number that lies in the middle.
Characteristics of a F-distribution
F-distributions are right skewed. There are no negative values in the F-distribution. For an F statistic, the p-value is the area to the right of the test statistic on the distribution.
Factor
Factors are the elements of the experiment that contribute to the final observed values. Factors are organized by grouping together data that has undergone similar conditions. In the assembly line metaphor, the factors are the stations in which each data point stops and gets refined. In every study there are universal factors and structural factors. Universal factors are those which every data point shares in common -- the grand mean and the residuals. Structural factors are those that are applied to some data points but not others. They include the treatment factors, interaction factor, and any blocking factors.
Paired Samples t-Test
Feet1 <- filter(KidsFeet, sex == "B") Feet2 <- filter(KidsFeet, sex == "G") t.test(Feet1$length, Feet2$length, paired = TRUE, mu = 0, alternative = "two.sided", conf.level = .95) (Note: paired = TRUE requires the two groups to be matched observations of equal length.)
Where do you look up information on GSS Survey Information?
GSS Data Explorer Website https://gssdataexplorer.norc.org/variables/vfilter
Bias triangle
Go back and draw
Histogram
Graph that has response values on the x axis and the frequencies in which those values occur on the y axis
Histogram
Graph that uses a single quantitative variable on the X axis and frequency on the Y axis hist(KidsFeet$length)
Scatterplot
Graphic that shows correlation by plotting a quantitative X variable and a quantitative Y variable
Pie Chart
Graphic that shows the proportion of different variables
Boxplot
Graphical depiction of the 5-number summary. boxplot(length ~ sex, data = KidsFeet)
Boxplot
Graphical representation of the 5 number summary. 25 percent of the data is between the minimum and Q1, 25 percent between Q1 and Q2, 25 percent between Q2 and Q3, and 25 percent between Q3 and the maximum.
Null and alternative hypotheses for linear regression
Ho: B1 = 0 Ha: B1 neq 0
One sided H0 in regression
Ho: B1 = B1,0 Ha: B1 > B1,0 (or B1 < B1,0)
Synonyms for "holding variables constant"
Holding fixed Controlling for Blocking
ANOVA Test
Hypothesis test that tests the difference between several means. ANOVA tests have one or more qualitative factors and one quantitative response factor
Paired Samples T-Test
Hypothesis test which tests whether there is significant variation between the differences of two connected samples.
Interpreting negative and positive bias
If the bias is positive, the estimate is on average too large (it overstates the true effect); if the bias is negative, the estimate is on average too small. With omitted variable bias, the sign of the bias is the sign of B2 * Gamma1.
How can you tell if an Independent Samples T-Test should be used rather than a Paired Samples T-Test?
If the individual items that go in group 1 don't tell you which items should go in group 2.
What makes an unusual observation?
If the observation occurs less than 5% of the time in a distribution. Or, in other words, if the absolute value of the z-score for the observation is greater than 2.
When is it good to use a cluster sample?
If the within-block variation is greater than the between-block variation, it is okay to use a cluster sample.
Multicollinearity and the dummy variable trap
If you know the values of 3 out of 4 dummy variables, you automatically know the value of the last one. This is why, when a categorical variable is split into several dummies, you always leave one out as a base category.
Imputation
Imputation is when you make up for missing data by putting in, or imputing, data based on the general trend of your other data points.
Simple Random Sample (SRS)
In a SRS, all possible items in the population have an equal chance of being chosen
Stratified Sample
In a Stratified Sample, items are organized into strata (or similar groups), and then a SRS is taken of each group.
Systematic Sample
In a Systematic Sample, a random starting point is chosen (K), and then every Kth item is chosen.
Convenience Sample
In a convenience sample, items are selected out of convenience and without any randomized method
Alternative Hypothesis
In contrast to the Null Hypothesis, the Alternative Hypothesis is the new hypothesis being tested in a given experiment.
Regressor
In regression analysis, a variable that is used to explain variation in the dependent variable. The X's
i.i.d.
Independent - characteristics of one observation are not systematically related to the characteristics of another observation Identically distributed - statistical distribution of a variable is the same over time Cluster samples are not i.i.d. - your data are not independent
Median Regression
Minimizes the sum of the absolute values of the errors instead of the sum of squared errors
as.numeric & as.factor functions
KidsFeet$birthmonth <- as.factor(KidsFeet$birthmonth) converts the column to a factor; as.numeric() converts back in the other direction
The Selection Operator
KidsFeet$sex
Degrees of Freedom
Degrees of freedom are the number of "free" numbers within each factor: the number of values you can fill in freely before the rest are determined by your knowledge of the factor average. For a factor with k levels, the degrees of freedom are k - 1.
How do you interpret binary and continuous variables together in a regression?
Make sure you say "holding constant" for other variables when interpreting
pander()
Makes graphics and numerical summaries look nice
What is the mean of the sample means?
Mu
What are the null and alternative hypotheses for a paired sample t-test?
Null: Mu(d) = 0 Alternative: Mu(d) neq 0 (or < or > for a one-sided test)
Null and alternative hypotheses for one proportion test
Null: P = p0 (some hypothesized value) Alternative: P neq p0 (or < or > for a one-sided test)
Null and alternative hypotheses for a two proportion test
Null: P1 = P2 Alternative: P1 neq P2 (or < or > for a one-sided test)
What are the null and alternative hypotheses for an Indepedent Sample T-Test?
Null: Mu1 = Mu2 Alternative: Mu1 neq Mu2 (or < or > for a one-sided test)
Observational Study
Observational studies differ from experimental studies in that researchers don't apply treatments to units. Instead, they observe conditions that are already occurring. They don't manipulate the experiment.
In this example, how many restrictions do you have? South - West = 0
One restriction (even though there are two coefficients). You can let South be whatever you want, but you are restricting what West is
When do you interpret the Adjusted R^2 vs. the multiple R^2?
Only use the Adjusted R^2 if you haven't used robust standard errors
Parameter
Parameters are the unknown, true values that we want to find from the population. They are almost always unknown because it is usually impossible to sample an entire population
Quantitative Variable
Quantitative variables are essentially numeric in nature. You can realistically perform mathematical functions on them
Quartile
Quartiles divide the data into fourths -- the 25th percentile, 50th percentile, and 75th percentile (Q1, Q2, & Q3)
R-squared formula
R-squared = ESS/TSS OR 1 - SSR/TSS
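Both versions of the formula give the same number. A sketch using the built-in mtcars dataset as a stand-in, checked against lm()'s own R-squared:

```r
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
ess <- sum((fitted(fit) - mean(y))^2)  # explained sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares
ssr <- sum(resid(fit)^2)               # sum of squared residuals
r2_v1 <- ess / tss
r2_v2 <- 1 - ssr / tss
```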
Reliability
Reliability is related to the idea of repeatability. A reliable study is one where, if run many times, the results would be close to the same.
Control
Researchers often have a control group, which is a group that does not receive the treatment of interest. Instead, the control group is either assigned no treatments, or they have a treatment that is the base-case scenario used for comparison. For example, if researchers wanted to test how two different changes to the Gatorade recipe affect flavor ratings, the control group would be the Gatorade recipe with no changes.
What plots are needed to satisfy the first 3 requirements for linear regression?
Residual plot, scatterplot, QQ plot of residuals
Marginals
Row and column totals in a contingency table, which are shown in its margins.
d-bar
Sample mean of the differences
s(d)
Sample standard deviation of the differences
Two Proportion Test
Similar to the independent samples t-test, the two proportion test is used to judge the differences between two proportions.
B1
Slope: the increase in y when x increases by 1 unit
Experiment
Study in which researchers assign which group(s) receive treatments and which receive no treatment (control group)
Observational Study
Study in which researchers do not assign treatments to groups. Instead, they simply observe patterns found in populations
Covariance Formula
Sxy = r * Sx * Sy
Mu(d)
Symbol for the mean of the differences
x-bar
Symbol for the sample mean
Collapsing
Taking an original variable and from it creating a new variable with fewer categories
Main effects
Terms that do not have interactions
How can you test to see if the data is linear or not?
Test whether B2 (the squared coefficient) is equal to zero Ho: B2 = 0 Ha: B2 neq 0
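A sketch of this test using simulated (hypothetical) data, where the true relationship is deliberately curved:

```r
# Fit a quadratic and inspect the p-value on the squared term
# to test Ho: B2 = 0 against Ha: B2 neq 0.
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + 0.3 * x^2 + rnorm(100)  # curved by construction
fit <- lm(y ~ x + I(x^2))
p_b2 <- summary(fit)$coefficients["I(x^2)", "Pr(>|t|)"]
# A p-value below .05 rejects Ho: B2 = 0, suggesting the data are not linear.
```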
Hypothesis Test
Tests the alternative hypothesis against the null hypothesis. To run a hypothesis test, you need a sampling distribution, a test statistic, and a p-value.
What do you conclude if zero is included within your confidence interval?
That there is no significant difference between the two groups because there is a chance that the true difference is zero.
F-Ratio
The F-Ratio is a number from an F-distribution. It is used to calculate a p-value. The F-Ratio is calculated by dividing the mean of squares for the factor of interest by the mean of squares for the residuals -- this is also expressed as the between-group variability divided by the within-group variability.
LS Latin Square Design
The Latin Square Design is a blocking design that uses one factor of interest and two blocks. To imagine the structure of the design, you can think of one of the blocks being in the rows and one in the columns. Then, you distribute the levels of your factor throughout the rows and columns so that each level appears exactly once in each row and each column.
Null Hypothesis
The Null Hypothesis is the status quo or what is considered to be true by past experiments or conventional wisdom.
Standard Deviation Definition
The average distance data points are from the mean
What does B0 tell you with binary variables What does B1 tell you?
The average outcome for the omitted category The difference between the omitted category and the other category
Homoskedastic
The error term is homoskedastic if the variance is the same for all values of x Otherwise, the error term is heteroskedastic Most datasets/relationships have some degree of heteroskedasticity
Experimental Unit
The experimental units in a study are the things that are assigned treatments.
Factor Structure
The factor structure is the way that factors are organized in an experiment. Visually, factor structures are often represented in rows and columns to show the grand mean, factors of interest, interactions (if any), and the residuals. This structure can also help people to visualize which factors are inside of each other.
What happens to the margin of error when the sample size increases?
The margin of error decreases
Mean Squares
The mean squares gives you a measurement of the average amount of variability contained within a factor. It is calculated by taking the sum of squares for a factor and dividing it by the degrees of freedom for that factor.
What is meant by the differences of the means?
The means of both groups are calculated, and then the differences of the means are used to obtain a t statistic. In matched pairs, the mean of the differences is used.
Ordinary least squares (OLS)
The most common way of estimating the parameters β₀ and β₁. It finds the line with the "least squares," meaning the smallest sum of squared errors, or the best fit
Expected Counts
The number of factor counts that you would expect to see assuming that the null hypothesis is true
Observed Counts
The number of times a certain factor outcome occurs
What is meant in a chi-squared test by "the two variables are independent of each other"?
The observed counts for one factor do not depend on the other factor; they would occur regardless of the values in the other factor.
In a SPSS output table, what is the difference between the percent and the valid percent columns?
The percent column includes missing data (often marked as DK NA -- don't know or not applicable) and the valid percent column does not.
Mu
The population mean
Sigma
The population standard deviation
P-Value
The probability of obtaining a test statistic as extreme or more extreme than the one observed, assuming that the null-hypothesis is true.
s
The sample standard deviation
Econometrics
The science and art of using economic theory and statistical techniques to analyze economic data
Standard Deviation Formula
The square root of the sum of (x - mu)^2 divided by N *Note: for a sample, divide by n - 1 and use x-bar in place of mu
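Both versions on a small vector, with the sample version checked against R's built-in sd() (which uses n - 1):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
pop_sd  <- sqrt(sum((x - mean(x))^2) / length(x))        # divide by N
samp_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # divide by n - 1
```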
Sum of Squares
The sum of squares is the sum of variability for a given factor. It is found by squaring the individual effect for each observation within the factor and summing them. This squaring gets rid of negative numbers and gives you the total amount of variability present.
Uniform, Unimodal, Bimodal, and Multimodal
Uniform: No noticeable peaks in the distribution Unimodal: One peak in the distribution Bimodal: Two peaks in the distribution Multimodal: Multiple peaks in the distribution
What is the solution for testing joint hypotheses?
Use an F statistic!
level log (outcome, predictor) How do you interpret? What is the model?
Use when it makes sense to think about changes in the predictor variable in percent terms (income level, population). A one percent increase in X is associated with a B1/100 unit change in Y Y = B0 + B1 * lnX1 + u
filter function
Used to reduce a dataset to a smaller set of rows than the original dataset contained
Select Function
Used to select out certain columns from a dataset select(KidsFeet, sex) OR select(KidsFeet, c(name, birthyear, birthmonth))
The Pipe Operator
Used to send functions down to the next line %>% Shortcut: Ctrl Shift M
Validity
Validity is the level to which an experiment accurately measures and answers the research question. If researchers want to know what the average free throw percentage is for NBA players in NBA games, but they measure the percentage from a sample of players shooting free throws in practice, the study will have low validity because practice free throw percentages aren't necessarily the same thing as game free throw percentages.
Control variable
Variable that is included in the model to control for unwanted variance, but is not of inherent interest to the study Control variables are distinguished in the model equation by Ws instead of Xs
Dichotomy Level
Variable that takes on only two values. For example: sex.
Dichotomous/dummy variables
Variable used in regression to compare the effect of a yes or no situation, coded as 1 or 0.
Discrete Random Variable
Variable with a random rather than a fixed outcome. The variable is discrete because all possible outcomes could be listed.
Effects
Viewed in the assembly line metaphor, effects are the added differences that data points get as they move down the line. These differences are calculated by taking the average of a given factor and subtracting the sum of all other outside factors.
What is the weight variable in the GSS
WTSSALL
Robust standard errors
Standard errors calculated so that they remain valid when the error term is heteroskedastic; using them is the standard way to get around heteroskedasticity.
(Sample) Z-Score Formula
(X-bar - Mu) / (Sigma / SQRT(N))
Confidence Interval Formula (Sigma Known)
X-bar - Z*Sigma/SQRT(N), X-bar + Z*Sigma/SQRT(N)
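A sketch with hypothetical numbers for a 95% interval:

```r
# 95% confidence interval with sigma known.
xbar <- 50; sigma <- 10; n <- 25   # hypothetical sample values
z <- qnorm(0.975)                  # about 1.96 for 95% confidence
me <- z * sigma / sqrt(n)          # margin of error
ci <- c(xbar - me, xbar + me)      # roughly (46.08, 53.92)
```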
Chi-Square Test Statistic
X^2
Log-level (outcome, predictor) How do you interpret? What is the model?
Y is in logs, X is in "levels" B1: a one-unit change in X is associated with a (100 * B1) percent change in Y, holding everything else constant lnY = B0 + B1X1 + u
Bo
Y-intercept - value of y when x = 0
Quadratic regression equation
Yi = B0 + B1x + B2x^2 + ui
Bivariate model equation
Yi = Bo + B1 * Xi + ui
How can you tell between matched pairs and independent samples?
You can tell that the samples are matched pairs if the data in group 1 tells you what data will be in group 2.
How do you find a percentile on the normal distribution applet?
You shade the sections to the left of the percentile and enter in the value (for example .75) into the area specifier at the top.
Margin of Error Formula (Sigma Known)
Z * Sigma/SQRT(N)
sampling frame
a list of individuals from whom the sample is drawn. Good sampling frames include all individuals in the population.
Logarithm
a quantity representing the power to which a fixed number (the base) must be raised to produce a given number. The inverse of the exponential function. Log base 2 of 8 is 3 because 2 to the third is 8
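The inverse relationship is easy to see in R:

```r
# Logs undo exponentiation, and vice versa.
log2(8)        # 3, because 2^3 = 8
log(exp(2.5))  # 2.5 -- the natural log is the inverse of exp()
exp(log(7))    # 7
```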
Linear Regression
a statistical method used to fit a linear model to a given data set. Simple linear regression uses one quantitative explanatory variable and one quantitative response variable.
Two-way ANOVA
aov(length ~ birthmonth*sex, data = KidsFeet, contrasts = list(birthmonth = contr.sum, sex = contr.sum))
One-way ANOVA BF[1]
aov(length ~ birthmonth, data = KidsFeet, contrasts = list(birthmonth = contr.sum))
pareto chart
bar chart with categories sorted graphically from highest to lowest
Three types of interactions
- Binary and continuous - Binary and binary - Continuous and continuous
Residual Formula (linear regression)
(Observed values) - (Expected values) or y-y-hat
What is the distribution of sample means?
(Usually theoretical) Situation where many samples are drawn from a parent population and the mean of each sample forms a normal distribution. This happens if the parent population is normal or if the sample size is large enough.
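A small simulation sketch of the idea, using a deliberately skewed (exponential) parent population with mean 1:

```r
# Draw 1000 samples of size 40 and record each sample's mean.
set.seed(1)
sample_means <- replicate(1000, mean(rexp(40, rate = 1)))
# The sample means pile up around the parent mean of 1, and a histogram
# of sample_means looks roughly normal even though the parent is skewed.
```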
Explained sum of squares (ESS)
Σ(Y-hat-i - Y-bar)^2 -- the sum of squared deviations of the predicted value, Y-hat-i, from its average. Tells you how much variation is explained by your line
Total sum of squares
Σ(Yi - Y-bar)^2 -- the sum of squared deviations of Yi from its average (the basis of the sample variance)
Sum of Squared Residuals (SSR)
Σ(Yi - Y-hat-i)^2 -- the unexplained variation
Percentile formula
(number of scores at or below a given score / total number of scores) x 100
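A one-line sketch of the formula on ten hypothetical scores:

```r
# Percentile of a score of 80 among ten scores.
scores <- c(55, 60, 65, 70, 75, 80, 85, 90, 95, 100)
pct <- sum(scores <= 80) / length(scores) * 100  # 6 of 10 scores -> 60
```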
Residuals in regression
(y-y-hat) (observed minus predicted)
Sample Size Formula (no p*)
(z*/(2m))^2
Fisher Assumptions
* Additive Model: All observed values are made up of a true value plus an error term * Constant Means: The averages for the factors are constant. Different data gathered in another situation should produce the same constant average. * Zero Mean Errors: The average of the errors should be zero * Constant Variance Errors: The error terms have similar standard deviation, meaning they are spread out from the mean in a roughly equal manner. * Normal Errors: The error terms follow a normal distribution. Some are observed above the mean and some below, but most cluster closer to the mean. * Independent Errors: One chance error should not affect the likelihood of another chance error. * Same Standard Deviation: The standard deviations for the different factors in your ANOVA test should be similar. A general rule is that the largest standard deviation shouldn't be 3 times larger than the smallest According to the textbook (pg 483), when the standard deviations are too far apart, the factor(s) with the largest standard deviations dominate the grand mean.
Outliers
* Definition: Outliers are data points which are far away from the greater body of data. You can usually tell if a data point is an outlier if it is more than 3 standard deviations away from the mean. * Remedies: Some researchers will just drop outliers from the data, but to do this you have to have good reason, such as knowledge that that particular outlier was flawed or misrepresented in some way.
Sources of Variability
* Conditions: Variability within the conditions is to be expected. The reason why we test on different factors and levels is to see how the numbers differ by condition. * Material: Variability within the material is the actual differences that exist between things. For example, if you want to compare the yield of tomatoes by tomato variety, not all plants, even within the same variety, are going to produce the same amount of tomatoes. There is almost always variability within the things that we want to measure * Measurement Process: Variability within the measurement process should be mitigated as much as possible. However, even when researchers try their best to diminish variation in the measurements, there are still going to be some errors. In most cases, if all precautions are taken to mitigate measurement error, we can assume that the remaining errors will follow a chancelike pattern similar to the variability in the material.
Measurement Classification (Steven's Four Types)
* Nominal: Nominal responses are essentially categorical and not set on any comparative scale. Colors or race are examples of nominal responses. There is no quantitative or measurable difference between the colors red, black, orange, and blue, and taking an average of these responses wouldn't make sense. * Ordinal: Ordinal responses are categorical in nature but set to a scale with noticeable differences. Likert scales are a good example of ordinal variables. For example: How satisfied are you with the job President Trump is doing? (A) Very satisfied, (B) Satisfied, (C) Indifferent, (D) Unsatisfied, (E) Very unsatisfied. Note that although there are noticeable differences between these responses, the differences between gradations are not always uniform or measurable. For example, how far is satisfied from very satisfied? It isn't something we can accurately measure. * Interval: Interval responses are quantitative in nature but do not have a true zero. A good example of an interval response is temperature in degrees Fahrenheit or Celsius. * Ratio: Ratio responses are quantitative in nature and have an absolute zero, for example height, weight, and volume.
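In R, the nominal/ordinal distinction maps onto unordered versus ordered factors; a small sketch (the category labels are made up for illustration):

```r
# Nominal: unordered categories -- no meaningful order or average
color <- factor(c("red", "black", "orange", "blue"))

# Ordinal: ordered categories (a Likert-style scale)
satisfaction <- factor(c("Satisfied", "Very Unsatisfied", "Indifferent"),
                       levels = c("Very Unsatisfied", "Unsatisfied", "Indifferent",
                                  "Satisfied", "Very Satisfied"),
                       ordered = TRUE)
satisfaction[1] > satisfaction[3]  # TRUE: order comparisons make sense here
```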
Experimental Content
* Response: This is the variable we want to measure. In ANOVA it is quantitative * Conditions: These are the different treatments that are causing the response. * Material: The material or the units are the things that we apply treatments to.
Scheffé Test
* Make sure the agricolae package is loaded (library(agricolae)), then run: scheffe.test(AnovaModel, "birthmonth", console = TRUE)
What aspects should be included in a table?
- A title - Sample Size - Column percentages - Clearly labeled
Benefits of using logarithms
- Allows us to express terms in percentage changes - Useful when making comparisons across different measures
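For example, in R the difference in logs approximates the proportional (percentage) change, which is what makes log-scale comparisons convenient:

```r
# Difference in logs approximates the percentage change for small changes
old <- 100
new <- 105
log(new) - log(old)   # ~0.0488, close to the exact 5% change
(new - old) / old     # exact proportional change: 0.05
```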
Kinds of Variability
There are three types of variability present in experiments: * Planned Systematic: This is variability we want in our studies. Examples include choosing a random sample that is representative of the population or using a random number generator to assign treatments to units. * Chancelike: Chancelike variability, as stated in the textbook, is something we can deal with. There is always going to be variation in the things we want to study, whether from measurement error or actual differences in the material. However, we can assume that chancelike variability will act in predictable ways: some data points will fall above the true average and some below, but overall the observed average will still give an accurate picture of the true average. * Unplanned Systematic: Unplanned systematic variability is the kind we don't want in our studies. It can result from human bias or from nuisance factors. For example, in my airplane experiment, I threw my planes outdoors and the wind picked up during the process. The wind introduced unplanned variation because it distorted the true distances the planes would have traveled, and it didn't affect them all equally -- sometimes the wind blew harder than others.
What are the Cumulative Frequencies and Cumulative Percentages in SPSS?
They tell us how many observations (or percentages) are accounted for up to and including that row in the table.
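The same quantities are easy to compute in R with cumsum() (the data vector here is made up for the example):

```r
counts <- table(c(1, 1, 2, 2, 2, 3))  # frequencies by category: 2, 3, 1
cumsum(counts)                        # cumulative frequencies: 2, 5, 6
100 * cumsum(counts) / sum(counts)    # cumulative percentages: ~33.3, ~83.3, 100
```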