PSYC 301 Research Methods and Data Analysis in Psychology Midterm 1
Repeated Measures t-Test - explain, how to measure change, assumptions
"matched design" - dependency between scores of two groups - scores either belong to the same people (within subjects) - or scores belong to matched pairs of individuals (assumed their qualities match enough to be valid; twins etc) Measuring change: Di = Xtreatment - Xcontrol (Di = index of difference. Xtreatment: posttest. Xcontrol: pretest.) -> is average difference significantly different from zero? - Dbar (mean difference) = 0 => treatment had no significant effect; changes could be from sampling error - goal is Dbar =/= 0 Assumptions: Advantages: - more power because less error due to between subjects differences; focused on change due to manipulation Disadvantages: - other confounding variables: drug carryover effects, practice/memory effects, fatigue effects, general annoyance
The Console Window
- R syntax is executed here - code is NOT saved if you type directly into it
arrange() what it does, what's the default, how to do the opposite of default
- To put data in a certain order, by a specific variable - by default arranges from smallest to largest - so arrange(data, variable3) gives us the table (tibble) with rows ordered from the smallest to the largest value of variable3 - if we want descending order: arrange(data, desc(variable3))
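A minimal runnable sketch (the tibble dat and the column variable3 are made up for illustration):
library(dplyr)
dat <- tibble(variable3 = c(5, 2, 9))   # hypothetical data
arrange(dat, variable3)                 # rows ordered 2, 5, 9 (smallest to largest)
arrange(dat, desc(variable3))           # rows ordered 9, 5, 2 (largest to smallest)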
lists
- a step up from tibbles/dfs: instead of just storing data, can also store outputs. dfs have a fixed number of columns/rows; lists don't have that restriction - think of it as a set of drawers - as.list(), list() - [ ] returns a smaller list, [[ ]] returns the actual value, $ accesses an element by name, [[ ]][ , ] <- go into a matrix stored in the list and give me that row/column
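A small sketch of list indexing (the names scores and mat are made up):
drawer <- list(scores = c(10, 20, 30),
               mat    = matrix(1:6, nrow = 2))
drawer["scores"]        # [ ] returns a one-element list
drawer[["scores"]]      # [[ ]] returns the actual numeric vector
drawer$scores           # same as [[ ]], by name
drawer[["mat"]][2, 3]   # go into the matrix stored in the list: row 2, column 3 -> 6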
Matrix/Matrices - format, binds, how to extract a value from a row/column, labels
- as.matrix(), matrix() - a set of vectors, all of the same type - so logical matrix, character matrix, etc... - matrix(data, nrow, ncol, byrow = FALSE) is the default; byrow = FALSE fills by column, so we have to set byrow = TRUE for the matrix to fill our data by row - has rbind() and cbind() - use [ , ] to extract a value: [3, ] means third row, all columns - labels: rownames() and colnames() <- c("names")
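A minimal sketch tying these pieces together (values are arbitrary):
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # fill by row instead of the default by-column
m[2, ]                        # second row, all columns -> 4 5 6
m[ , 3]                       # all rows, third column -> 3 6
rownames(m) <- c("r1", "r2")
colnames(m) <- c("a", "b", "c")
rbind(m, c(7, 8, 9))          # add a row; cbind() would add a column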
dbinom()
- calculates the density of the distribution: the probability of getting exactly that number - so dbinom(x = 7 (exactly 7 tails), size = 20 (flip 20 times), prob = 0.5) returns about 0.07, so the odds of getting exactly 7 tails are fairly low
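The same call, runnable as written:
dbinom(x = 7, size = 20, prob = 0.5)   # P(exactly 7 tails in 20 fair flips) ~= 0.074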
select() - what does it do, what are its arguments
- choose certain columns of data to work with - useful when the data has many columns but we only care about some of them
- the columns are usually the VARIABLE NAMES: say the questionnaire asked for Age, studentID, Q1, Q2, Q3... and we only care about the Q's, so we select out the Q columns and don't uselessly keep Age, studentID, etc.
- SO: select(data, starts_with("Q")) gives us only the question columns; select(data, -starts_with("Q")) gives us all columns except the question ones
- if we run names(data) after select we can see which columns we have now
- other arguments in select(): numerically: select(data, 1:3) OR select(data, c(1, 2, 3)) # columns 1-3; numeric range: select(data, num_range(prefix = "Q", 1:3)); strings: starts_with(), ends_with(), contains()
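A minimal sketch with a made-up questionnaire tibble (the column names are hypothetical):
library(dplyr)
dat <- tibble(Age = c(20, 31), studentID = c(1, 2),
              Q1 = c(4, 5), Q2 = c(3, 2), Q3 = c(5, 1))
select(dat, starts_with("Q"))               # only the question columns
select(dat, -starts_with("Q"))              # everything except the question columns
select(dat, num_range(prefix = "Q", 1:3))   # Q1, Q2, Q3 by numeric range
names(select(dat, 1:3))                     # columns 1-3 by position -> "Age" "studentID" "Q1"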
filter()
- choose certain rows of data to work with, based on logical tests: >, <, ==, <=, >=, !=, is.na(), !is.na(), %in%
- *** a single = won't work; testing equality needs ==
- | means "or"; & (or a comma between tests) means "and"
- filter(data, Variable == x)
- can't collapse multiple tests into one: filter(data, 20 < Age < 30) won't work; it has to be two separate tests: filter(data, 20 < Age, Age < 30)
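A minimal sketch (the data and cutoffs are made up):
library(dplyr)
dat <- tibble(Age = c(18, 25, 42), Score = c(10, NA, 7))
filter(dat, Age >= 20, Age < 30)    # two separate tests, combined as "and"
filter(dat, Age < 20 | Age > 40)    # "or" with |
filter(dat, !is.na(Score))          # keep only rows with a non-missing Score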
Factors
- code the levels of a categorical variable - use it to turn anything categorical into a factor - really useful when running t-tests, ANOVAs, etc. - we can create a new factor or convert an existing variable into one: factored <- factor(variable, levels = c(1, 2, 3), labels = c("group1", "group2", "group3")) - factor(), as.factor()
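A minimal sketch (the numeric codes and group labels are hypothetical):
condition <- c(1, 2, 3, 1, 2)
condition_f <- factor(condition,
                      levels = c(1, 2, 3),
                      labels = c("group1", "group2", "group3"))
levels(condition_f)   # "group1" "group2" "group3"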
mutate()
- create new columns of data based on existing ones - mutate(data, newvariablename = rowSums(data[1:7])) # sum columns 1 to 7 within each row, add a new column titled newvariablename, and have its value be that row's sum
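A minimal sketch, assuming the summed columns are numeric question scores (the tibble and column names are made up):
library(dplyr)
dat <- tibble(Q1 = c(1, 2), Q2 = c(3, 4), Q3 = c(5, 6))
mutate(dat, total = rowSums(dat[1:3]))    # new column = sum of columns 1-3 for each row
dat %>% mutate(total = rowSums(.[1:3]))   # same thing using the . placeholder inside a %>% pipe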
%in% what is it, how to use
- group membership - uses c() - looks for particular groups in a categorical variable - so we could do either filter(ToothGrowth, dose == 1) or filter(ToothGrowth, dose %in% c(1)) to keep the dose = 1 condition - another example from the slides: instead of filter(data, Browser == "Chrome" | Browser == "IE" | Browser == "Safari") we can do filter(data, Browser %in% c("Chrome", "IE", "Safari"))
Parse_
- helps diagnose issues with reading files - "what kind of vector do you think this should be?" -> parse_character(), parse_integer(), parse_logical() - if there are no errors it just returns the parsed vector; if there are errors it tells us, e.g. "2 parsing failures, at rows 3 and 4, this is the issue"
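A sketch of the kind of message readr gives (the input vector is made up):
library(readr)
parse_integer(c("1", "2", "abc", "3.5"))
# returns NA for the bad entries, plus a warning like "2 parsing failures" pointing at rows 3 and 4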
what are strings, how do we use them, what can we do
- library(stringr) - strings are anything in ' ' or " " - say we create my_string <- "hi im a string" - can ask for the length of the string with str_length(my_string) - can add to it: my_string <- c(my_string, "wow same") gives "hi im a string" "wow same" - take part of a string by position: str_sub() - add/remove whitespace: str_pad(), str_trim()
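The same ideas as runnable code:
library(stringr)
my_string <- "hi im a string"
str_length(my_string)                   # 14
my_string <- c(my_string, "wow same")   # now a character vector of length 2
str_sub(my_string[1], 1, 2)             # "hi" -- part of a string by position
str_pad("hi", width = 5)                # "   hi" -- add whitespace
str_trim("   hi   ")                    # "hi"   -- remove surrounding whitespace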
File Manager - where is it, what's 1 important thing in it that relates to datafiles
- the Files pane on the bottom right of RStudio - Working Directory: there's a "Set As Working Directory" command -> need to make sure we're in the same directory as our data files
str_subset()
- only keeps the strings that match a pattern - returns a character vector containing the whole elements that match (not just the matching part)
str_remove()
- removes matching patterns from a string - returns character, everything except the pattern we've removed
the median
- the middle value based on counts - the 50th percentile - doesn't care about the magnitude of values
The Environment - where is it, functions & what it shows
- top right pane of RStudio - live updating of everything - shows Data (matrices & data files), Values (everything we've created & defined), & Functions (functions we've created) - tells us the type of each variable (numeric, character, logical, etc.), its length (1:3 etc.), and its first 10 elements
A Function
- uses parentheses (); inside are arguments, and order matters - examples: c(), round(), mean() ... anything - input -> output - functions can be nested inside each other's parentheses
The Source Window
- where our code is saved
str_detect()
- detects/finds patterns in a string - returns logical - str_detect(data, pattern = "__")
str_extract()
- extracts the matching pattern from a string - returns character: just the part of the string that matched
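One sketch covering the four pattern functions above (the browser strings are made up):
library(stringr)
browsers <- c("Chrome 101", "Safari 15", "IE 11")
str_detect(browsers, pattern = "Chrome")    # TRUE FALSE FALSE        (logical)
str_subset(browsers, pattern = "i")         # "Safari 15"             (whole matching strings)
str_remove(browsers, pattern = " [0-9]+")   # "Chrome" "Safari" "IE"  (pattern deleted)
str_extract(browsers, pattern = "[0-9]+")   # "101" "15" "11"         (only the matching part)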
Repeated Measures t-Test - Steps
1- Hypotheses: H0: muDbar = 0; H1: muDbar =/= 0
2- Determine the critical region, tcrit: - alpha = 0.05? - one tail/two tail? - df = n - 1 (number of subjects/pairs minus 1) => tcrit(df) = #
3- Gather data
4- Compute the t-stat: t = (Dbar - muDbar) / SE-Dbar (muDbar = 0 because of the null hypothesis); SE-Dbar = SE-D / sqrt(n); SE-D (the standard deviation of the difference scores) = sqrt(sum of all (Di - Dbar)^2 / df) (refer to the week 8 slides if this step doesn't make sense)
5- Statistical decision: compare the t-stat to tcrit: if |t-stat| < tcrit, fail to reject the null
6- Confidence interval: CI95 = Dbar +- (tcrit(df) * SE-Dbar)
7- Effect size
properties of t distribution
1. The t-distribution is a family of curves, each determined by a parameter called the degrees of freedom 2. When you use a t-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size 3. As degrees of freedom increase, the t-distribution approaches the normal distribution 4. After 29 df (n=30), the t distribution is very close to the standard normal z-distribution
ggplot()
A function from the ggplot2 package that returns a graphics object which may be modified by adding layers to it. Provides automatic axis labels and legends. Basically a pretty plot.
How would an individual determine if the difference in scores between variables is large enough to lead us to a conclusion in an experiment with n > 30?
If the sample size is equal or greater than 30, the differences between the t-distribution and the z-distribution are extremely small. Thus, we can use the standard normal curve (z-distribution) even if we don't know the population standard deviation (σ)
Power
Probability of rejecting a false H0, i.e. the probability that you'll find a difference that's really there. Power = 1 - beta, where beta is the probability of a Type II error.
length()
Returns the length, i.e. the number of elements - can be a set of numbers or words
typeof() / class()
Returns the type of a variable; class() is broader than typeof()
why use Markdown (.Rmd) instead of Script (.R)
Script files are just the code & comments; no way to combine formatted text, code, and output in one document - way more limited. In Markdown: - can make text pretty - can include lists, links, and images - can have R chunks/code and R output in the same document - can output the file as HTML/PDF
Logical Vectors - what, and arguments
TRUE/FALSE vectors - inputs: ==, >, <, <=, >=, !=, ! (not) - returns TRUE/FALSE - example: x != 5
Level of confidence (c)
The probability that the interval estimate contains the population parameter. "c" is the area under the standard normal curve between the critical values
central limit theorem
The theory that, as sample size increases, the distribution of sample means of size n (randomly selected) approaches a normal distribution, centered at mu with width equal to the standard error (sigma over the sqrt of n). So with at least n = 30 we'll get an approximately normal distribution. This means sampling distributions have a width (the standard error) = sigma/sqrt(N): the average distance between the sample statistic (xbar) and the population parameter (mu).
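A small simulation sketch of the idea (the population and sample size are arbitrary choices): draw many samples from a skewed population and look at the distribution of the sample means.
set.seed(1)                                                 # reproducible 'random' draws
n <- 30
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))   # skewed population with mu = 1, sigma = 1
mean(sample_means)   # close to mu = 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(30) ~= 0.18
hist(sample_means)   # roughly bell-shaped even though the population is skewed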
What are the two different approaches that should be taken to determine if the difference in scores between variables is large enough to lead us to a conclusion in an experiment with n < 30?
When σ is known - use the normal distribution and z scores. When σ is not known - use the t distribution and t scores
What does it mean to say that the level of confidence is 90%
It means that we are 90% confident that the interval contains the population mean μ
What negative effect might arise from raising our alpha level?
Raising the alpha level increases our possibility of a Type I error
names()
a function that shows all the names in our dataset - so, the variable/column names
head(data, n)
a function used to look at the first n rows (or elements) of a dataset/dataframe
tail(data, n)
a function used to look at the last n rows (or elements) of a dataset/dataframe
Vectors - what are the three
are atomic/homogeneous - can only hold one type at a time: logicals, characters, numbers
dataframe
as.data.frame() or data.frame(x=c(), y=c(),z=c()) - a matrix that can have different types of vectors in it. each column will be one type of vector. same length - so fixed # of columns and rows
tibbles
as_tibble() (convert a dataframe into a tibble) or tibble() (create new) - use read_csv() to get a tibble (read.csv() gives a df) - in the tidyverse - better defaults: reads data better; makes sure each variable is properly configured for its type / gives better warnings (dfs don't have this) - nicer printing for large datasets - dataframes will print EVERYTHING when asked to print, but tibbles automatically print only the first ten rows - can still use $, [ ], [[ ]] to access elements - better suited to use with a pipe - instead of data$variable, we can do data %>% .$variable (don't memorize this, we never used it, but if you see it on the test know it's a tibble and a pipe)
why is the standard error always smaller than SD
because SE = SD / sqrt(N); dividing the SD by sqrt(N) (which is at least 1) makes the SE smaller than the SD for any sample bigger than 1
IQR
cares about the middle 50% of the data - kinda like the trimmed mean. quantile(x = data$variable, probs = c(.25, .75))
ggplot2
contains two functions, ggplot() & qplot(). qplot() is just for a quick exploratory plot in a single function call; ggplot() builds layered plots and is prettier.
library(lubridate)
a package to control and structure dates and times - three forms of time/date: just date, just time, date+time - today() (date), now() (date+time)
what does this do: (x <- seq(from=1, to=10, by=2))
creates a vector from 1 to 10 that goes up by 2 (i.e. 1 3 5 7 9); the outer parentheses also make R print the result
pbinom()
cumulative probability pbinom(q=7 (7 or fewer tails), size=20, prob=0.5)
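Runnable version of the example:
pbinom(q = 7, size = 20, prob = 0.5)   # P(7 or fewer tails in 20 flips) ~= 0.13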
what is a . a placeholder for
data - inside a %>% pipe, . stands for the data coming from the left-hand side, so instead of rowSums(data[1:7]) we can do rowSums(.[1:7])
augmented vectors are... (3 kinds)
factors, dates, times augmented vectors build on normal vectors
summary(data)
for numerical data: gives summary statistics about the data (min, max, etc.); for categorical data: gives the number of responses in each level of that variable. For example, if the variable is Biological Sex with responses Female, Male, and Nonbinary, it'll show something like Female: 780, Male: 640, Nonbinary: 200
rnorm()
generate random Normal variates with a given mean and standard deviation
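A minimal sketch (the mean and sd values are arbitrary):
set.seed(42)                          # so the 'random' numbers are reproducible
rnorm(n = 5, mean = 100, sd = 15)     # five random values from a Normal(100, 15)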
fct_collapse()
group multiple things into one label. so As = c("A+", "A", "A-") will collapse all three into As
read_csv(file)
how to read a data file where csv = comma-separated values (read_csv() from readr returns a tibble)
How to install & use a package - two steps! (what do you need to do after installing for it to work?)
install it using install.packages() and then run it using library()
Numeric Vectors
integers (whole #) and double (decimals)
is.na()
returns TRUE where a value is missing - use it to filter out the people who have missing data
!is.na()
is not missing
fct_relevel()
just change the order of the levels by writing them down in the order you want
str(data)
look at the structure of the data
fct_lump
lump levels with small counts together - so if you ask where people are from and there's like 1 person from australia 1 person from canada 3 people from iran but then like 30 people from new zealand and 50 people from pakistan - then its gonna lump australia, canada, and iran together into an "other" group
four moments of a distribution
mean, sd, skew, kurtosis
R Chunk Flag: message & warning
message = FALSE: prevents messages that are generated by code from appearing in the finished file. warning = FALSE: prevents warnings that are generated by code from appearing in the finished file.
Degrees of freedom
n-1 - used to adjust the shape of distribution based on sample size
why use dplyr
provides us with tools to deal with our data / wrangle/clean/just do something to our data
R function for Critical region
qt(p = 0.975, df = 24)   # two-tailed test with alpha = 0.05 for a sample of n = 25 (df = 24); returns tcrit ~= 2.06
qbinom()
quantiles of distribution qbinom(p=c(.25, .5, .75) (what are the 25th-75th percentiles?) , size =20, prob = .5)
other ways to read files that arent csv
read_tsv(): tab-separated - read_delim(file, delim): you specify the delimiter, e.g. ";" "-" "/"
fct_recode()
recode the levels of a factor to something else entirely. relabel them, reduce them by combining two into one category, etc
fct_infreq()
reorders the levels by frequency - most common responses first
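One sketch covering the forcats functions from the cards above (the grade and country data are made up):
library(forcats)
grades  <- factor(c("A+", "A", "B", "A-", "B"))
country <- factor(c("nz", "nz", "pk", "pk", "pk", "ir", "ca", "au"))
fct_collapse(grades, As = c("A+", "A", "A-"))   # A+, A, A- all become "As"
fct_relevel(grades, "B")                        # move "B" to be the first level
fct_recode(grades, Excellent = "A+")            # relabel a level (new name = "old name")
fct_infreq(country)                             # reorder levels by frequency, most common first
fct_lump(country, n = 2)                        # keep the 2 most common levels, lump the rest into "Other"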
Sample Standard deviation
s
dim(data)
say the dimensions of the dataset response is like "400 4" means 400 rows 4 columns
verbs of dplyr
select, filter, arrange, mutate, summarize/group
set.seed()
give it a numerical value -> makes random output reproducible - the same 'random' samples each time
two stats that we can calculate to investigate shape of a distribution
skewness: asymmetry kurtosis: pointiness - platykurtic: flat like a platypus tail <3 - leptokurtic: too pointy like a leopard leaping>3 - mesokurtic: mesmerizingly balanced ~3
Character Vectors
strings, created using c(), defined by type & length
summarise/group()
summarise():
- condense data - gives a table of summary stats
- summarise(data, mean_of_variable = mean(variablename), min_of_variable = min(variablename), max_of_variable = max(variablename)) # this creates a 1x3 tibble: one row of stats, with columns mean_of_variable, min_of_variable, max_of_variable
- commands: mean, median, sd, IQR, mad, min, max, quantile, first, last, nth, n, n_distinct, any, all
group_by():
- adds the group that these values belong to, so you know which group each result is for
- data %>% group_by(TheCategoryName) %>% summarise(... all the things above ...) # this gives a tibble: one row per category, with a column for each summary stat plus one for the category name
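A minimal sketch (the tibble, category, and variable names are hypothetical):
library(dplyr)
dat <- tibble(TheCategoryName = c("a", "a", "b", "b"),
              variablename    = c(10, 12, 20, 26))
summarise(dat,
          mean_of_variable = mean(variablename),
          min_of_variable  = min(variablename),
          max_of_variable  = max(variablename))    # a 1x3 tibble of overall stats
dat %>%
  group_by(TheCategoryName) %>%
  summarise(mean_of_variable = mean(variablename),
            n = n())                               # one row per category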
Repeated Measures t-Test: In R
t.test(postscores, prescores, paired = TRUE)
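A runnable sketch with made-up pre/post scores for the same five people:
prescores  <- c(10, 12, 9, 14, 11)
postscores <- c(13, 15, 10, 16, 12)
t.test(postscores, prescores, paired = TRUE)   # tests whether the mean difference Dbar differs from 0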
how to make a z-score/standardize a score in R
take each score, subtract the mean of all scores, divide by the standard deviation: zscore = (sumofallrows - mean(sumofallrows)) / sd(sumofallrows). General formula: z = (X - mu) / sigma. So class average 65, sd 10, xi = 83 -> (83 - 65) / 10 = 1.8, so it sits 1.8 standard deviations above the mean.
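A small sketch (the score values are made up):
scores <- c(55, 65, 83, 70)
(scores - mean(scores)) / sd(scores)   # standardize every score by hand
scale(scores)                          # base R shortcut that does the same standardizing
(83 - 65) / 10                         # single score with known mean 65 and sd 10 -> 1.8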
variance
the average amount of spread - the average squared deviation from the mean
Sampling error
the difference between a sample statistic and the population parameter it estimates - higher n -> less sampling error (the law of large numbers). Longer explanation: z scores assume that we know mu & sigma (the population parameters). However, we usually don't; we're making an inference from a sample. Even if we knew the population parameters, a sample taken from our population probably won't match it exactly. Our sample stats not matching the population parameters exactly is called sampling error. Higher sample size -> a better prediction of our population and lower sampling error.
rbinom()
use for binomial distributions, when two discrete outcomes are possible: success or failure - RANDOMLY generates results - rbinom(n = how many replications, size = sample size in each replication, prob = 0.5 probability of success)
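Runnable version (the seed is arbitrary):
set.seed(7)                             # so the 'random' flips are reproducible
rbinom(n = 5, size = 20, prob = 0.5)    # 5 replications of 20 coin flips: number of successes in each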
R Chunk Flag: include
when FALSE: show neither the code nor its output - don't INCLUDE anything - it still runs though, so if you define a variable here, the variable can be used later on in the code
R Chunk Flag: echo
when FALSE: show the output of the analysis (like the plot/graph) but don't show the code itself / don't ECHO/repeat the code
R Chunk Flag: eval
when FALSE: show the code, but don't run the analysis / don't EVALUATE the code (in the outputted file)
Sampling Distribution
when we sample the population over and over again and get different xbars. Say we have a sample of n=10, another sample of n=10, and another sample of n=10: sample 1 has mean 4, sample 2 has mean 5, sample 3 has mean 6. (4 + 5 + 6)/3 = 5, which is the overall mean, and the sample means also have an SD, etc., so we get a normal curve. This is the distribution of sample stats for all possible random samples from a population. Building the full sampling distribution isn't realistic in practice -> central limit theorem.
what does it mean if you see plus signs (+) on lines in R when you press enter
R is waiting for you to finish an incomplete command - e.g. you started a string but didn't finish it: "foo + + + - you need to close the quote (or complete the command) before R will run it
population standard deviation
σ