Experimental Analysis and Techniques Final PLSC 351
Logarithmic (transform data)
The logarithmic transformation is particularly useful when the data has a wide range of values and is positively skewed. - Used when factor effects are multiplicative rather than additive - Add a constant so zeros can be transformed - Excel: =LOG(value+1) - Removes big gaps --> better distribution
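A minimal sketch of the log transform on hypothetical skewed counts, assuming NumPy is available:

```python
import numpy as np

counts = np.array([0, 3, 12, 45, 160, 900], dtype=float)  # hypothetical right-skewed counts
log_t = np.log10(counts + 1)  # +1 lets zeros be transformed (same idea as Excel =LOG(value+1))
print(log_t)
```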
Square Root (transform data)
The square root transformation is particularly useful when the data has a skewed distribution - Used when variances ≈ means (e.g., count data) - Add a small constant - Excel: =SQRT(value+0.5)
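A matching sketch for the square-root transform, again on hypothetical count data:

```python
import numpy as np

counts = np.array([0, 2, 5, 9, 14, 30], dtype=float)  # hypothetical counts (variance roughly equal to mean)
sqrt_t = np.sqrt(counts + 0.5)  # small constant added before the square root (Excel: =SQRT(value+0.5))
print(sqrt_t)
```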
Post Hoc test for ANOVA
Tukey HSD test *best*
- Controls risk of alpha error
- Works even when sample sizes are not equal
Student-Newman-Keuls Test
- Called a multiple range test
- Controls risk of alpha error (a little less than Tukey)
Least Significant Difference (LSD)
- Most likely to commit an alpha error
- Doesn't pool risk, so error inflates
- Looked down upon
Scheffé's Multiple Contrasts
- More conservative
- Frequently used w/ uneven sample sizes
Dunnett's Test
- Tests for differences between a control & other treatments
- Raises risk of making a beta error
Two-Factor ANOVA
We want to determine whether two different factors, soil type and water availability, have an effect on the yield of tomato plants. Example hypotheses (nitrogen x cultivar): Ho: There was no interaction between nitrogen and cultivar on plant tissue weight. Ha: There was an interaction between nitrogen and cultivar on plant tissue weight. General form: Ho: There is no interaction between independent variable A and independent variable B on the dependent variable.
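A minimal two-factor ANOVA sketch with an interaction term, using invented tomato data and assuming pandas/statsmodels:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# hypothetical data: one row per plant, balanced 2x2 design with 3 replicates
df = pd.DataFrame({
    "soil":     ["clay", "clay", "sand", "sand"] * 3,
    "water":    ["low", "high"] * 6,
    "yield_kg": [2.1, 3.0, 1.8, 2.6, 2.3, 3.2, 1.7, 2.8, 2.0, 3.1, 1.9, 2.7],
})

# model with both main effects and the soil x water interaction
model = smf.ols("yield_kg ~ C(soil) * C(water)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # interaction row tests Ho: no interaction
```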
ANOVA Model 2: Random-Effects Model
A random-effects ANOVA model is used when the groups being compared are not predetermined, but rather are randomly selected from a larger population - Levels of a factor are random - not common in agriculture - multiple comparisons are complicated Ex. Year, Block
Precision
closeness of repeated measurements
Accuracy
nearness to the actual value being measured
Distribution
the arrangement & Frequency of data points - often represented as a curve
ANOVA Model 3: Mixed Model
Used when there are both fixed and random factors that are expected to influence the outcome variable. - Both fixed & random effects present - Blocks are common in agriculture - formulas change *use the correct formula for F to get the proper P* Ex. Fertilizer rate & Year
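A minimal mixed-model sketch (fixed fertilizer rate, random year), with invented numbers and assuming statsmodels' MixedLM:

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: fertilizer rate is a fixed effect, year acts as a random (blocking) factor
df = pd.DataFrame({
    "rate":     [0, 50, 100] * 4,
    "year":     [2020] * 3 + [2021] * 3 + [2022] * 3 + [2023] * 3,
    "yield_kg": [2.0, 2.6, 3.1, 1.8, 2.4, 2.9, 2.2, 2.7, 3.3, 2.1, 2.5, 3.0],
})

# random intercept for each year; fixed slope for fertilizer rate
mixed = smf.mixedlm("yield_kg ~ rate", data=df, groups=df["year"]).fit()
print(mixed.summary())
```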
Sample = s
- A subset of a population used for analysis - needs to represent the population - random sampling is important to avoid bias
Descriptive stats
- Central Tendency - Dispersion - Standard Deviation - Coefficient of Variation *Stats lacking dispersion descriptions are dangerous (naked stats)*
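A quick sketch of these summary measures on made-up fruit weights, assuming NumPy:

```python
import numpy as np

weights = np.array([4.2, 4.8, 5.1, 4.5, 5.6, 4.9])  # hypothetical fruit weights (kg)

mean = weights.mean()
sd = weights.std(ddof=1)   # sample standard deviation
cv = 100 * sd / mean       # coefficient of variation, in percent
print(mean, sd, cv)        # reporting the mean alone would be a "naked stat"
```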
Data Collection
- Collect proper data - does data measure what I want to observe? - is data collection feasible? - is sample large enough to be meaningful? - Make sure the values are legible (no ink) - Check accuracy of data entry
Nonparametric ANOVA
Kruskal-Wallis Test
- If you have normal data, an ANOVA is more powerful than a Kruskal-Wallis test; if you have non-normal data, a Kruskal-Wallis test may be more powerful than an ANOVA
- Randomized and independent treatments and samples
- Samples have the same dispersion and shape
Nemenyi Test
- If you reject the null hypothesis you can move on to a nonparametric multiple comparisons test
- Uses rank data to calculate differences between treatments in a manner similar to the Tukey test
1. Find the sum of rankings for each group
2. Set up comparisons (start with the largest difference)
3. Calculate differences
4. Compute a standard error (SE) for the experiment
5. Calculate the q-value for each difference
6. Look up the critical value for q
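A minimal Kruskal-Wallis sketch on hypothetical non-normal yields (SciPy); the Nemenyi follow-up is typically done with a separate post-hoc package or by hand using the steps above:

```python
from scipy import stats

# hypothetical skewed yield data for three treatments
a = [4, 5, 7, 9, 30]
b = [2, 3, 3, 6, 8]
c = [10, 12, 15, 21, 40]

h, p = stats.kruskal(a, b, c)  # rank-based test of whether the groups differ
print(h, p)                    # if p < alpha, follow up with a nonparametric comparison (e.g., Nemenyi)
```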
Block design (RCBD w/o replication)
- Randomized Complete Block Design without replication - Controls for error - Blocks are set up perpendicular to the gradient of variation - if conditions are uniform, using blocks lowers the power of statistical tests - every block contains every treatment
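A small sketch of laying out an RCBD without replication: every block gets every treatment, and each block is randomized independently (treatment names are hypothetical):

```python
import random

treatments = ["A", "B", "C", "D"]  # hypothetical treatments
n_blocks = 4

for block in range(1, n_blocks + 1):
    order = treatments[:]          # copy so the master list is untouched
    random.shuffle(order)          # independent randomization within each block
    print(f"Block {block}: {order}")
```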
Replication
- Required for meaningful conclusions - Within-experiment repetition - experimental unit must be identified - experimental units must be independent - use equal sample sizes
Software
- Specialized programs for data handling and analysis - ability to handle complex designs - can run a wide range of tests
Pseudoreplication
- The illusion of replication Ex. Repeatedly measuring mpg on 1 car
Completely Randomized Design (CRD)
- Units can be anywhere (needs uniform conditions) - Very efficient when low variation exists in experimental units or the environment - Susceptible to deceptive outcomes when conditions are not uniform
Chi-Square Test of Independence
- Used w/ 2 variables - does 1 variable influence another variable? - Ho: no relationship between variables - Ha: there is a relationship between the variables - df = (# of categories in the first variable - 1) * (# of categories in the second variable - 1) - Chi-Squared is best suited for larger sample sizes (1000+)
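A minimal sketch of the test of independence, using the drink-size vs. mood example as a hypothetical 2x2 table (SciPy):

```python
from scipy import stats

# rows: large / small drink; columns: happy / sad (hypothetical counts)
observed = [[240, 160],
            [180, 220]]

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)  # dof = (2 - 1) * (2 - 1) = 1
```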
Chi-Square Goodness of Fit Test
- Used w/ one variable - Ho: sample fits a specified distribution - df = # of categories - 1 Ex. Apple trees sold on central coast (t-bud, cleft)
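A minimal goodness-of-fit sketch for the apple-tree example, with invented counts and an assumed 50/50 expected split (SciPy):

```python
from scipy import stats

observed = [62, 38]   # hypothetical trees sold: t-bud, cleft
expected = [50, 50]   # Ho: the two propagation methods sell in equal proportions

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)        # df = number of categories - 1 = 1
```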
Hidden Variables
- Variables not tested - could be meaningless or very impactful
Nonparametric alternatives to T-tests
- Wilcoxon single-sample test (for a one-sample t-test) - Wilcoxon matched-pairs test (for a dependent t-test) - Mann-Whitney U test (for an independent t-test)
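A quick sketch of the paired and independent rank tests on made-up measurements (SciPy):

```python
from scipy import stats

before = [10.2, 9.8, 11.1, 10.5, 9.9]   # hypothetical paired measurements
after  = [11.0, 10.4, 11.9, 10.9, 10.6]

print(stats.wilcoxon(before, after))      # matched-pairs test (dependent samples)
print(stats.mannwhitneyu(before, after))  # Mann-Whitney U test (independent samples)
```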
Population = p
- an exclusive group for which conclusions are desired - small or large Ex. Lizards in CA, Martin Guitars
Non-nominal data (transform data)
- data that is not categorical - common transformations: 1. Logarithmic 2. Square root 3. Arcsine
Randomized Complete Block Design (RCBD w/ replication)
- each block has multiple replications of every treatment
Data Handling
- look for patterns in data sets - outliers - trends - scatter plots are a good way to see trends
Analyzing Nominal data
- nominal data can be thought of as count data - each experimental unit is counted (placed in a category) - Chi-square is one key method for the analysis of counts *nominal = counts of experimental units that fit into categories*
Correlation coefficient (r)
- how closely the 2 variables relate to each other - scale of (-1 to 1) (0 = no correlation) - can be calculated from values or ranks - looking to see how well the data fits a line
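A short sketch computing r from values and from ranks on hypothetical paired data (SciPy):

```python
from scipy import stats

x = [3, 6, 9, 12, 15]               # hypothetical paired measurements
y = [1.5, 2.2, 3.1, 3.9, 4.7]

r, p = stats.pearsonr(x, y)         # correlation of the raw values
rho, p_rho = stats.spearmanr(x, y)  # correlation of the ranks
print(r, rho)                       # both fall on the -1 to 1 scale
```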
Sample Size
- sample size required to observe a difference can be calculated before or after the test - uses calculated or estimated parameters - G*Power can predict sample sizes - choosing: previous experience, calculations - 4 is very common *more is not always better*
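A sketch of estimating sample size ahead of time, analogous to G*Power, assuming statsmodels and a hypothetical medium effect size:

```python
from statsmodels.stats.power import TTestIndPower

# hypothetical inputs: effect size 0.5, alpha 0.05, desired power 0.8
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # samples needed in each group
```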
Arcsine (transform data)
- useful when the data represents proportions or percentages, Ex. The proportion of people who prefer one brand over another or the percentage of a population that has a certain characteristic.
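A minimal arcsine square-root sketch on hypothetical proportions (NumPy):

```python
import numpy as np

proportions = np.array([0.05, 0.22, 0.50, 0.78, 0.95])  # hypothetical proportions (0-1)
arcsine_t = np.arcsin(np.sqrt(proportions))             # arcsine square-root transformation
print(arcsine_t)
```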
Correlation & regression
- very similar - statistical overlap in the tests - but the goals are different *we look at simple linear correlation/regression* - 2 factors w/ ratio & interval data - data pairs of ratio & interval measurements - examine relationships between factors
Randomized complete block design
- w/ no replication - can't evaluate an interaction - only 2 sets of hypotheses Ex. independent factor 1: fertilizer; independent factor 2: block
ANOVA Model 1: Fixed-effects model
- Classic - Levels of a factor are specifically chosen - Uses formulas we have discussed in class - Can be done again Ex. Fertilizer rates, Cultivars, Irrigation rates, Pesticide application
Analysis of Variance (ANOVA)
- R.A. Fisher - Tests if means of samples are the same - Power of ANOVA is higher when sample sizes are equal - Variances of samples are homogeneous - error is randomly and independently distributed - main effects are additive (a dramatic change alters the test)
Chi-Squared (relationship or not)
Large / small drinks vs. happy / sad
Independent T-Test
One-Sample: Assesses how well a sample represents a hypothetical population Two-Sample: Assesses the likelihood 2 samples came from the same population - Ratio or Interval Data - Normal distribution - Samples are independent of each other - Samples have equal variances
ANOVA & Tukey
Strawberry harvest (weight) for 2 cultivars w/ 3 irrigation methods
2 sample independent t test
Strawberry harvest (weight) for 2 cultivars
Ways experiments get ruined
1. Lack of randomization 2. Lack of replication 3. Wrong variables 4. Poor data collection procedures/techniques 5. Disturbance 6. Environmental/input variation that is not evenly distributed
Central Tendency
1. Mean 2. Median 3. Mode 4. Quantiles (percentile/decile/quartiles)
Scientific Method
1. Observation 2. Conjecture 3. Testing 4. Analysis 5. Retesting
Dispersion
1. range (high & low, high - low) 2. variance (mean sum of squares)
Nonparametric ANOVA (If the data is not normal)
Options: 1. transform data 2. use nonparametric tests - nonparametric tests don't assume a normal distribution - no estimation of parameters - based on ranks
2-Factor ANOVA w/ RCBD with replication
6 blocks w/ 5 radish cultivars
Alternate hypothesis (Ha)
A statement expressing opposition to the null hypothesis - results differ
Null Hypothesis (Ho)
A statement expressing that no difference / relationship was observed Options: reject or fail to reject
T-Distribution
Allows samples to be tested when the population mean and standard deviation are unknown - works well w/ small samples - if t-score increases the p-value decreases
Correlation
Artichoke sales & lawsuits in SLO
Stats
Branch of math dealing with the collection and presentation of masses of numerical data
Two-Sample Independent T-Test
Ex. Compare the average height of two different varieties of tomato plants, variety A and variety B. (Ho) is that there is no difference in the average height between the two varieties of tomato plants. - samples are not related to each other
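A minimal sketch of the two-sample test on invented tomato heights (SciPy):

```python
from scipy import stats

variety_a = [52.1, 49.8, 55.0, 51.2, 53.4]  # hypothetical plant heights (cm)
variety_b = [47.9, 50.2, 46.5, 48.8, 49.1]

t, p = stats.ttest_ind(variety_a, variety_b)  # assumes equal variances by default
print(t, p)
```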
Single-Factor/One-Way ANOVA
Ex. Compare the growth rates of three different varieties of corn plants, labeled A, B, and C. (Ho) is that there is no difference in the mean height of the three varieties of corn plants. - Samples must be independent - Equal variances
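A minimal one-way ANOVA sketch on invented corn growth rates (SciPy):

```python
from scipy import stats

corn_a = [10.1, 11.2, 9.8, 10.5]  # hypothetical growth rates
corn_b = [12.0, 12.4, 11.7, 12.2]
corn_c = [9.0, 9.4, 8.8, 9.2]

f, p = stats.f_oneway(corn_a, corn_b, corn_c)  # Ho: all three means are equal
print(f, p)
```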
Two-Sample Dependent T-test
Ex. We want to determine whether a new fertilizer has an effect on the growth rate of tomato plants. The measurements are paired because each plant has two measurements: one before applying the fertilizer and one after applying the fertilizer. - Used to test the difference between 2 dependent sample means - Ratio or Interval Data - Normally distributed population - Samples must be correlated to each other
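A minimal paired-test sketch with invented before/after growth measurements (SciPy):

```python
from scipy import stats

before = [3.1, 2.8, 3.4, 3.0, 2.9]  # hypothetical growth before fertilizer (cm/week)
after  = [3.6, 3.1, 3.9, 3.4, 3.3]  # same plants after fertilizer

t, p = stats.ttest_rel(before, after)  # pairs each plant with itself
print(t, p)
```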
Tukey HSD Test
HSD = Honest Significant difference We have conducted a Single-Factor ANOVA on the yield of three different varieties of potato plants, labeled A, B, and C. We have found that there is a statistically significant difference in the mean yield of the three varieties. To determine which specific varieties are different from each other, we would conduct Tukey's HSD test as a follow-up analysis
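A follow-up sketch using invented potato yields; scipy.stats.tukey_hsd (SciPy 1.8+) handles the pairwise comparisons:

```python
from scipy import stats

potato_a = [31.2, 29.8, 33.0, 30.5]  # hypothetical yields
potato_b = [36.1, 35.4, 37.2, 36.8]
potato_c = [30.9, 31.5, 29.7, 30.2]

result = stats.tukey_hsd(potato_a, potato_b, potato_c)  # all pairwise mean comparisons
print(result)
```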
Sample Validity
If population data is known, samples can be compared
Fields in ag:
- Designing in orchards/ fields/ plots - be aware of error - remember importance of randomization & independence
Random sampling
- Fruit harvest - What is meaningful? - What is practical? - How can error be controlled? - Sampling - Boxes, crates, stems, sorting machines - be aware of bias - sampling Techniques - Machine selected - assign #'s (random selection) - blind draw - develop a sampling procedure
Randomization
- Helps to avoid bias - Essential for validity of the test - Should not replace common sense - use random # generators - consider set-up, data collection & harvesting
Continuous
- Interval and Ratio Data - Decimals used - Readings possible between the points Ex. Height, Weight
Discrete
- Interval and Ratio Data - Whole #'s used - Fixed points Ex. Leaves
T-tests
- Interval or Ratio data - Samples come from a population w/ normal distribution
independent variable
- Known Variables - Input Variable - treatment " " - predictor " " - explaining factors Ex. Cultivars, Academic Standing, Fertilizer ppm
Regression
- Measures how two variables relate to each other - Shows the change in the Y variable that occurs for one unit of the X variable - Range is unlimited Ex. an avocado is worth two bucks - graphed, this shows perfect predictive power - β=2 - Not that predictable in real experimental data, but we try to find the closest fit Ex. Sparrow age and wing length - (β) for every day the sparrow lives, it gains 0.27 cm; β=0.27
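A minimal regression sketch using made-up sparrow data shaped to give a slope near 0.27 (SciPy):

```python
from scipy import stats

age_days = [3, 6, 9, 12, 15, 18]           # hypothetical sparrow ages
wing_cm  = [1.4, 2.2, 3.1, 3.8, 4.7, 5.4]  # hypothetical wing lengths

fit = stats.linregress(age_days, wing_cm)
print(fit.slope)        # beta: cm of wing gained per day (~0.27 here)
print(fit.rvalue ** 2)  # r^2: proportion of variation explained
```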
Coefficient of Determination (r^2)
- Measures the proportion of the variation in the dependent variable that is associated with the variation of the independent variable (scale 0-1, nothing-100%) - lowercase r^2 = simple - Adjusted r^2 is more conservative - Determines how much of the variation you can explain
Correlation
- Measuring a linear relationship between two INDEPENDENT variables, no dependency implied - 2 variables - pos. correlation: one increases = other increases - neg. correlation: one increases = other decreases
Error
- Natural variation - Should be minimized/controlled - Use a control
Ratio/Proportions
- Nominal data (categorical) - categories based on quality NOT a measurement - relationship between data points is NOT numerical - an attribute (ex. color, sex, location) - status (dead/alive, right/left) - Experimental design - think about observation & experimental units - data handling - confirm proper design - check the test's assumptions (normality, variances, effects) - transformation is an option (arcsine transformation) - if assumptions are met, general linear models (GLM) - consulting a statistician is recommended - data can be handled as counts & chi-square tests
Deceptive Claims
- Only positive results recorded - Overreaching conclusions (out of context) - Summary data misleading - No replication - pseudoreplication - experimental malpractice
Dependent variables
- Outcome Variable - output " " - trying to explain / predict / describe
Inductive/Inferential function
- Predicting - Past results might not match future results
ordinal data (categorical data)
- Qualitative - Categories based on logical order Ex. (1st, 2nd, 3rd, 4th year) Likert Scale: (strongly dislike -> strongly like)
nominal data (categorical data)
- Qualitative - categories based on quality - Not a measurement Ex. Sex, Color, Location, Alive/Dead, Right/Left
ratio data (numerical data)
- Quantitative - Constant interval size - True 0 Ex. Height, Weight, Petal Count
Interval Data (Numerical Data)
- Quantitative - #'s that have meaning - Consistent interval size - No true zero Ex. °C, °F
