PSYC 301 Research Methods and Data Analysis in Psychology Midterm 1

Repeated Measures t-Test - explain, how to measure change, assumptions

"Matched design": there is a dependency between the scores of the two groups.
- The scores either belong to the same people (within subjects),
- or they belong to matched pairs of individuals (assumed similar enough to be valid, e.g. twins).
Measuring change: Di = Xtreatment - Xcontrol (Di = difference score; Xtreatment = posttest; Xcontrol = pretest).
- Question: is the average difference significantly different from zero?
- Dbar (mean difference) not different from 0 => the treatment had no significant effect; any changes could be from sampling error.
- The goal is Dbar =/= 0.
Assumptions:
Advantages:
- More power, because there is less error from between-subjects differences; the focus is on change due to the manipulation.
Disadvantages:
- Other confounding variables: drug carryover effects, practice/memory effects, fatigue effects, general annoyance.

The Console Window

- R syntax is executed here - code is NOT saved if you type it directly into the console.

arrange() what it does, what's the default, how to do the opposite of default

- Puts data in a certain order, by a specific variable - by default, sorts from smallest to largest (ascending) - so data %>% arrange(variable3) gives us the table (tibble) with the rows ordered by variable3 from smallest to largest - for descending order: arrange(desc(variable3))

lists

- a step up from tibbles/df: instead of just storing data, a list can also store outputs. df have a fixed set of columns/rows; lists don't have that restriction - think of it as a set of drawers - create with as.list() or list() - [] returns a smaller list (the "list obj"), [[]] returns the actual value, $ returns a named element's value, and [[]][ , ] goes to a matrix stored in the list and gives the row/column
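The difference between the indexing operators is easier to see on a small example. This is a minimal sketch; the list `drawers` and its contents are made up for illustration.

```r
# A hypothetical list holding three different kinds of objects
drawers <- list(
  scores = c(90, 85, 70),
  label  = "midterm",
  m      = matrix(1:6, nrow = 2)
)

sub  <- drawers["scores"]      # single brackets: still a list (of length 1)
vals <- drawers[["scores"]]    # double brackets: the numeric vector itself
same <- drawers$scores         # $ is shorthand for [[ ]] with a name
cell <- drawers[["m"]][2, 3]   # go to the matrix, then row 2, column 3

class(sub)   # "list"
class(vals)  # "numeric"
```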

Matrix/Matrices - format, binds, how to extract a value from a row/column, labels

- as.matrix(), matrix() - a set of vectors that are all the same type - so logical matrix, character matrix, etc. - matrix(data, nrow, ncol, byrow = FALSE) is the default. byrow = FALSE means the matrix is filled by column, so we have to set byrow = TRUE to fill it by row. - has rbind() and cbind() - use [ , ] to extract a value. [3, ] means third row, all columns. - labels: rownames() and colnames() <- c("names")
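A short sketch of the points above, using made-up numbers: fill order, labels, extraction, and binding.

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: filled by column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled by row

# Labels (hypothetical names)
rownames(by_row) <- c("r1", "r2")
colnames(by_row) <- c("a", "b", "c")

second_row <- by_row[2, ]    # second row, all columns
one_value  <- by_row[1, 2]   # first row, second column

stacked <- rbind(by_col, by_row)  # bind as rows: 4 x 3
side    <- cbind(by_col, by_row)  # bind as columns: 2 x 6
```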

dbinom()

- calculates the density of the distribution: the probability of getting exactly that number - so dbinom(x = 7, size = 20, prob = 0.5) means "exactly 7 tails in 20 flips"; it returns about 0.07, so the odds of getting exactly 7 tails are small
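As a quick check of the coin-flip example, the value from dbinom() can be reproduced from the binomial formula by hand. A minimal sketch:

```r
# P(exactly 7 tails in 20 fair flips)
p7 <- dbinom(x = 7, size = 20, prob = 0.5)

# Same probability from the binomial formula: C(20, 7) * p^7 * (1 - p)^13
by_hand <- choose(20, 7) * 0.5^7 * 0.5^13

round(p7, 2)  # 0.07
```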

select() - what does it do, what are its arguments

- chooses certain columns of data to work with, for when the data has a huge number of columns and we don't care about all of them
- the columns are usually the VARIABLE NAMES. Say the questionnaire asked for Age, studentID, Q1, Q2, Q3... and we only care about Q1, Q2, etc., so we select the Q's and don't uselessly carry Age, studentID, etc.
- select(data, starts_with("Q")) gives us only the question columns; select(data, -starts_with("Q")) gives us all columns except the question ones
- running names() on the result of select() shows which columns we have now
Other arguments in select():
- by position: select(data, 1:3) OR select(data, c(1,2,3)) # columns 1-3
- numeric range: select(data, num_range(prefix = "Q", 1:3))
- strings: starts_with(), ends_with(), contains()

filter()

- chooses certain rows of data to work with - based on logical tests - so > , < , == , <= , >= , != , is.na() , !is.na() , %in% *** a single = won't work; it needs == - | means or; & (or a comma between tests) means and - filter(data, Variable == x) - can't collapse multiple tests into one: filter(data, 20<Age<30) doesn't work. It has to be two separate tests: filter(data, 20<Age, Age<30)
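The rules above can be tried on a tiny made-up data frame. This is a sketch that assumes the dplyr package is installed; `people` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: five respondents, one with a missing age
people <- data.frame(
  Age     = c(18, 22, 25, 31, NA),
  Browser = c("Chrome", "IE", "Firefox", "Safari", "Chrome")
)

twenties  <- filter(people, 20 < Age, Age < 30)   # comma means "and"
same      <- filter(people, Age > 20 & Age < 30)  # & also means "and"
either    <- filter(people, Age < 20 | Browser == "Safari")  # | means "or"
known_age <- filter(people, !is.na(Age))          # drop rows with a missing Age
```

Rows where the test evaluates to NA (here, the respondent with a missing age) are dropped by filter(), just like rows where it is FALSE.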

Factors

- codes the levels of a categorical variable - use it to turn anything into a categorical variable - really useful when running a t test, anova, etc. - we can create a new factor or convert existing things into a factor: variablefactored <- factor(variable, levels = c(1,2,3), labels = c("group1", "group2", "group3")) - factor(), as.factor()
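A minimal sketch of converting numeric codes into a labelled factor; the vector `condition` is made up. The labels must be quoted strings.

```r
# Hypothetical condition codes for five participants
condition <- c(1, 3, 2, 1, 2)

condition_f <- factor(
  condition,
  levels = c(1, 2, 3),
  labels = c("group1", "group2", "group3")
)

levels(condition_f)  # the three labels, in the order given
table(condition_f)   # counts per level
```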

mutate()

- creates new columns of data based on existing ones - mutate(data, newvariablename = rowSums(data[1:7])) # for each row, sum columns 1 to 7, then add a column at the end titled newvariablename whose value is that row's sum
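A small sketch of the row-sum idea, assuming dplyr is installed; the data frame `survey` and its three question columns are made up. The second form uses the pipe, where `.` stands for the data on the left of `%>%`.

```r
library(dplyr)

# Hypothetical questionnaire with three items and two respondents
survey <- data.frame(Q1 = c(1, 2), Q2 = c(3, 4), Q3 = c(5, 6))

# Direct form: sum columns 1 to 3 within each row
direct <- mutate(survey, total = rowSums(survey[1:3]))

# Piped form: "." is a placeholder for the piped-in data
piped <- survey %>% mutate(total = rowSums(.[1:3]))
```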

%in% what is it, how to use

- group membership - uses c() - looks for particular groups in a categorical variable - so we could either do filter(ToothGrowth, dose == 1) or filter(ToothGrowth, dose %in% c(1)) to keep only the dose 1 condition - another example from slides: instead of filter(data, Browser == "Chrome" | Browser == "IE" | Browser == "Safari") we can do: filter(data, Browser %in% c("Chrome", "IE", "Safari"))
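The ToothGrowth example can be run as is, since that dataset ships with R. A sketch assuming dplyr is installed:

```r
library(dplyr)

one_dose  <- filter(ToothGrowth, dose == 1)
same_dose <- filter(ToothGrowth, dose %in% c(1))       # same rows as above
two_doses <- filter(ToothGrowth, dose %in% c(0.5, 2))  # replaces dose == 0.5 | dose == 2

nrow(one_dose)   # number of rows in the dose 1 condition
nrow(two_doses)  # rows in the other two conditions combined
```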

Parse_

- helps diagnose issues with reading files - asks: what kind of vector do you think this vector should be? -> parse_character(), parse_integer(), parse_logical() - if there are no errors, it just returns the parsed vector. If there are errors, it tells us, e.g. "2 parsing failures, at row 3 and 4, this is the issue"

what are strings, how do we use them, what can we do

- library(stringr) - strings are anything inside ' ' or " " - let's say we create my_string <- "hi im a string" - can ask for the length of the string with str_length(my_string) - can add to the string: my_string <- c(my_string, "wow same"), output: "hi im a string" "wow same" - extract (or replace) a substring: str_sub() - add/remove whitespace: str_pad(), str_trim()

File Manager - where is it, what's 1 important thing in it that relates to datafiles

- on the bottom right side of RStudio - Working Directory - there's a "set as working directory" command -> need to make sure we're in the same directory as our data files

str_subset()

- only keeps strings that match a pattern - returns a character vector: the elements of the data that contain the pattern

str_remove()

- removes matching patterns from a string - returns character, everything except the pattern we've removed

the median

- the middle value based on counts - 50th percentile - doesn't care about the magnitude of values

The Environment - where is it, functions & what it shows

- top right side of RStudio - live updating of everything - shows Data (matrices & datafiles), Values (everything we've created & defined), & Functions (created functions) - tells us the type of each variable (numeric, character, logical, etc.), its length (1:3 etc.), and its first 10 elements

A Function

- uses brackets () , inside are arguments, order matters - examples: c(), round(), mean() .... anything - input -> output - can be nested inside brackets

The Source Window

- where our code is saved

str_detect()

- detects/finds patterns in a string - returns logical (TRUE/FALSE) - str_detect(data, pattern = "__")

str_extract()

-extracts matching patterns from a string
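The four pattern functions (str_detect, str_subset, str_remove, str_extract) return different things for the same input. A minimal sketch, assuming the stringr package is installed; the vector `x` is made up.

```r
library(stringr)

x <- c("apple pie", "banana", "cherry pie")

detected  <- str_detect(x, pattern = "pie")  # logical: TRUE FALSE TRUE
kept      <- str_subset(x, "pie")            # only the strings that match
removed   <- str_remove(x, " pie")           # strings with the match taken out
extracted <- str_extract(x, "pie")           # the matched part, NA if no match
```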

Repeated Measures t-Test - Steps

1- Hypothesis: H0: muDbar = 0; H1: muDbar =/= 0
2- Determine the critical region, tcrit:
- alpha = 0.05?
- one tail/two tail?
- df = n - 1, where n = # of subjects (pairs) => tcrit(df) = #
3- Gather data
4- Compute the t-stat:
- t = (Dbar - muDbar) / SE-Dbar (muDbar = 0 because of the null hypothesis)
- SE-Dbar = s_D / sqrt(n)
- s_D = sqrt(sum of all (Di - Dbar)^2 / df), the standard deviation of the difference scores
(refer to slides for week 8 if this step doesn't make sense)
5- Statistical decision: compare the t-stat to tcrit: if |t-stat| < tcrit, fail to reject the null.
6- Confidence interval: CI95 = Dbar +- (SE-Dbar * tcrit(df))
7- Effect size
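The steps above can be sketched by hand in base R and cross-checked against t.test(). The pre/post scores are made-up numbers, and the variable names are only for illustration.

```r
# Hypothetical pretest/posttest scores for 5 participants
pre  <- c(10, 12, 9, 14, 11)
post <- c(13, 14, 10, 17, 12)

D    <- post - pre                     # difference scores Di
n    <- length(D)
df   <- n - 1
Dbar <- mean(D)                        # mean difference
s_D  <- sqrt(sum((D - Dbar)^2) / df)   # SD of the differences, same as sd(D)
se   <- s_D / sqrt(n)                  # standard error of Dbar

t_stat <- (Dbar - 0) / se              # muDbar = 0 under H0
t_crit <- qt(p = 0.975, df = df)       # two-tailed, alpha = 0.05
reject <- abs(t_stat) > t_crit         # statistical decision
ci95   <- Dbar + c(-1, 1) * t_crit * se

# Cross-check with the built-in paired test
fit <- t.test(post, pre, paired = TRUE)
```

The t statistic and confidence interval stored in `fit` should agree with `t_stat` and `ci95`.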

properties of t distribution

1. The t-distribution is a family of curves, each determined by a parameter called the degrees of freedom 2. When you use a t-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size 3. As degrees of freedom increase, the t-distribution approaches the normal distribution 4. After 29 df (n=30), the t distribution is very close to the standard normal z-distribution

ggplot()

A function from the ggplot2 package that returns a graphics object that may be modified by adding layers to it. Provides automatic axes labeling and legends. basically a pretty plot.

How would an individual determine if the difference in scores between variables is large enough to lead us to a conclusion in an experiment with n > 30?

If the sample size is equal or greater than 30, the differences between the t-distribution and the z-distribution are extremely small. Thus, we can use the standard normal curve (z-distribution) even if we don't know the population standard deviation (σ)

Power

Probability of rejecting a false H0, i.e. the probability that you'll find a difference that's really there. Power = 1 - beta, where beta is the probability of a Type 2 error.

length()

Returns the length of a vector - can be a set of numbers or words

typeof() / class()

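typeof() returns the low-level storage type of an object (e.g. "double", "integer", "character"); class() returns its higher-level class (e.g. "numeric", "factor", "data.frame")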

why use Markdown (.Rmd) instead of Script (.R)

Script files are just the code & comments; no options for outputs/plots. way more limited. in markdown: - can make text pretty - can include lists and links and images - can have r chunks/codes and r outputs in the same document - can output file as html/pdf

Logical Vectors - what, and arguments

TRUE/FALSE vectors. operators: ==, >, <, <=, >=, != and ! (not); a comparison returns TRUE/FALSE. example: x != 5

Level of confidence (c)

The probability that the interval estimate contains the population parameter. "c" is the area under the standard normal curve between the critical values

central limit theorem

The theory that, as sample size increases, the distribution of sample means of size n, randomly selected, approaches a normal distribution - centered at mu, with a width equal to the standard error (sigma over sqrt of n) - so with at least n = 30, the distribution of sample means is approximately normal - this means sampling distributions have a width (standard error) of sigma/sqrt(n): the average distance between the sample stat (xbar) and the pop parameter (mu)
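A small simulation can make this concrete. The sketch below assumes a skewed (exponential) population with mu = 1 and sigma = 1; the seed, the number of replications, and the helper `pop_draw` are arbitrary choices for illustration.

```r
set.seed(301)

# Hypothetical skewed population: exponential with mu = 1, sigma = 1
pop_draw <- function(n) rexp(n, rate = 1)

n     <- 30
xbars <- replicate(5000, mean(pop_draw(n)))  # 5000 sample means of size n

center <- mean(xbars)  # should be close to mu = 1
width  <- sd(xbars)    # should be close to sigma / sqrt(n)
theory <- 1 / sqrt(n)

# hist(xbars) would show a roughly bell-shaped curve despite the skewed population
```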

What are the two different approaches that should be taken to determine if the difference in scores between variables is large enough to lead us to a conclusion in an experiment with n < 30?

When σ is known: use the normal distribution and z scores. When σ is not known: use the t distribution and t scores.

What does it mean to say that the level of confidence is 90%?

It means that we are 90% confident that the interval contains the population mean μ.

What negative effect might arise from raising our alpha level?

Raising the alpha level increases our probability of a Type 1 error.

names()

a function that shows all the names in our dataset so like variable names

head(data, n)

a function used to look at the first n rows of a dataset/dataframe

tail(data, n)

a function used to look at the last n rows of a dataset/dataframe

Vectors - what are the three

atomic/homogeneous: a vector can only hold one kind of data at once. the three kinds: logicals, characters, numbers

dataframe

as.data.frame() or data.frame(x=c(), y=c(),z=c()) - a matrix that can have different types of vectors in it. each column will be one type of vector. same length - so fixed # of columns and rows

tibbles

as_tibble() (convert a dataframe into a tibble) or tibble() (create a new one)
- use read_csv() to get a tibble (read.csv() gives a df)
- part of the tidyverse
- better defaults: reads data better, makes sure each variable is properly configured for its type, and gives better warnings (df don't have this)
- nicer printing for large datasets: dataframes try to print everything when asked to print, but tibbles automatically print only the first ten rows
- can still use $, [], [[]] to access elements
- better suited to use with a pipe: instead of data$variable, we can do data %>% .$variable (don't memorize this, we never used it; but if you see it on the test, know that it's a tibble and that %>% is the pipe)

why is the standard error always smaller than SD

because SE = SD / sqrt(n): the SD is divided by the square root of the sample size, so for any n > 1 the SE is smaller than the SD

IQR

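cares about the middle 50% of the data (the range between the 25th and 75th percentiles) - kinda like the trimmed mean. quantile(x = data$variable, probs = c(.25, .75)) gives the two quartiles; IQR(data$variable) gives their difference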

ggplot2

contains two main plotting functions, ggplot() & qplot(). qplot() is just for a quick plot with a single function call, e.g. for exploratory data. ggplot() is prettier and is built up in layers.

library(lubridate)

a package to control and structure dates and times. three forms of time/date: just date, just time, date+time. today() gives the date; now() gives date+time

what does this do: (x <- seq(from=1, to=10, by=2))

creates a vector from 1 to 10 that goes up by 2 (1, 3, 5, 7, 9) and stores it in x; the outer parentheses also print the result

pbinom()

cumulative probability pbinom(q=7 (7 or fewer tails), size=20, prob=0.5)
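Because pbinom() is cumulative, it should equal the sum of the dbinom() values from 0 up to q. A minimal sketch of that check:

```r
# P(7 or fewer tails in 20 fair flips)
p_le7 <- pbinom(q = 7, size = 20, prob = 0.5)

# Same thing as adding up the exact probabilities for 0, 1, ..., 7 tails
by_hand <- sum(dbinom(0:7, size = 20, prob = 0.5))

# Upper tail: P(more than 7 tails)
p_gt7 <- pbinom(q = 7, size = 20, prob = 0.5, lower.tail = FALSE)
```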

what is a . a placeholder for

the data: inside a pipe, . stands for whatever is on the left of %>%. so instead of rowSums(data[1:7]), we can do data %>% mutate(newvariablename = rowSums(.[1:7]))

augmented vectors are... (3 kinds)

factors, dates, times augmented vectors build on normal vectors

summary(data)

for numerical data: gives summary statistics about the data (min, max, etc.). for categorical data: gives the number of responses in each category of that variable. for example, if the variable is Biological Sex with responses Female, Male, and Nonbinary, it'll be like Female: 780, Male: 640, Nonbinary: 200, etc.

rnorm()

generate random Normal variates with a given mean and standard deviation

fct_collapse()

group multiple things into one label. so As = c("A+", "A", "A-") will collapse all three into As

read_csv(file)

reads a csv file into a tibble; csv = comma separated values

How to install & use a package - two steps! (what do you need to do after installing for it to work?)

install it once using install.packages() and then load it using library() in every session where you need it

Numeric Vectors

integers (whole #) and double (decimals)

is.na()

"is missing": TRUE where a value is NA. used inside filter() to find the people who have missing data (or, with !, to filter them out)

!is.na()

is not missing

fct_relevel()

just change the order of the levels by writing them down in the order you want

str(data)

look at the structure of the data

fct_lump

lump levels with small counts together - so if you ask where people are from and there's like 1 person from australia 1 person from canada 3 people from iran but then like 30 people from new zealand and 50 people from pakistan - then its gonna lump australia, canada, and iran together into an "other" group

four moments of a distribution

mean, variance (the SD is its square root), skew, kurtosis

R Chunk Flag: message & warning

message = FALSE: prevents messages that are generated by code from appearing in the finished file. warning = FALSE: prevents warnings that are generated by code from appearing in the finished file.

Degrees of freedom

n-1 - used to adjust the shape of distribution based on sample size

why use dplyr

provides us with tools to deal with our data / wrangle/clean/just do something to our data

R function for Critical region

qt(p = 0.975, df = 24) for a sample of n=25, 2-tail test with alpha = 0.05
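The p argument depends on alpha and on the number of tails. A minimal sketch for n = 25, with the z critical value included for comparison:

```r
n     <- 25
alpha <- 0.05

t_crit_two <- qt(p = 1 - alpha / 2, df = n - 1)  # two-tailed: about 2.06
t_crit_one <- qt(p = 1 - alpha,     df = n - 1)  # one-tailed (upper): about 1.71
z_crit     <- qnorm(p = 1 - alpha / 2)           # normal curve: about 1.96
```

The t critical value is larger than the z critical value because the t-distribution has heavier tails at small df.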

qbinom()

quantiles of distribution qbinom(p=c(.25, .5, .75) (what are the 25th-75th percentiles?) , size =20, prob = .5)

other ways to read files that arent csv

read_tsv(): tab separated. read_delim(file, delim): for any other separator, given in delim, such as ; - /

fct_recode()

recode the levels of a factor to something else entirely. relabel them, reduce them by combining two into one category, etc

fct_infreq()

reorders the levels by frequency - from the most common response to the least common

Sample Standard deviation

s

dim(data)

gives the dimensions of the dataset; a response like "400 4" means 400 rows, 4 columns

verbs of dplyr

select, filter, arrange, mutate, summarise (with group_by)

set.seed()

give it a numerical value -> makes 'random' output reproducible: the same 'random' samples each time

two stats that we can calculate to investigate shape of a distribution

skewness: asymmetry. kurtosis: pointiness - platykurtic: flat like a platypus tail (kurtosis < 3) - leptokurtic: too pointy, like a leopard leaping (kurtosis > 3) - mesokurtic: mesmerizingly balanced (kurtosis ~ 3)

Character Vectors

strings, created using c(), defined by type & length

summarise() / group_by()

summarise():
- condenses data; gives a table of summary stats
- summarise(data, mean_of_variable = mean(variablename), min_of_variable = min(variablename), max_of_variable = max(variablename)) # creates a 1x3 tibble: one row of values, with 3 columns for mean_of_variable, min_of_variable, max_of_variable
- commands: mean, median, sd, IQR, mad, min, max, quantile, first, last, nth, n, n_distinct, any, all
group_by():
- adds the group that the numerical values belong to, so that you know which group each result is for
- data %>% group_by(TheCategoryName) %>% summarise(...all the things above...) # gives a tibble with one row per category; with the 3 summary stats above it has 4 columns (3 stats + 1 category name)
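The shapes described above can be checked on the built-in ToothGrowth data. A sketch assuming dplyr is installed; the column names on the left of each = are arbitrary.

```r
library(dplyr)

# One row, three columns of summary stats
overall <- summarise(
  ToothGrowth,
  mean_len = mean(len),
  min_len  = min(len),
  max_len  = max(len)
)

# One row per supplement type, plus a column naming the group
by_supp <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(
    mean_len = mean(len),
    min_len  = min(len),
    max_len  = max(len)
  )

dim(overall)  # 1 row, 3 columns
dim(by_supp)  # 2 rows (OJ, VC), 4 columns
```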

Repeated Measures t-Test: In R

t.test(postscores, prescores, paired = TRUE)

how to make a z-score/standardize a score in R

take each score, subtract the mean of all scores, divide by the standard deviation. zscore = (sumofallrows - mean(sumofallrows)) / sd(sumofallrows). general formula: z = (X - mu) / sigma. so class avg 65, sd 10, xi 83 -> (83 - 65) / 10 = 1.8, so the score sits 1.8 standard deviations above the mean
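A minimal sketch of standardizing a whole vector at once; the scores are made up. The built-in scale() function is shown only as a cross-check.

```r
# Hypothetical exam scores
scores <- c(55, 65, 75, 83, 47)

z         <- (scores - mean(scores)) / sd(scores)  # z-score for every score
z_builtin <- as.numeric(scale(scores))             # same result from scale()

# The single-score example from the card: class avg 65, sd 10, score 83
z_single <- (83 - 65) / 10   # 1.8
```

After standardizing, the z-scores have a mean of 0 and a standard deviation of 1.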

variance

the average squared distance of scores from the mean; a measure of spread (the SD is its square root)

Sampling error

the discrepancy between a sample statistic and the population parameter it estimates; it also shows up as differences between the results of random samples taken at the same time. higher n -> less sampling error (the law of large numbers). longer explanation: z scores assume that we know mu & sigma (the population parameters). however, we usually don't; we're making an inference from a sample. even if we knew the pop parameters, a sample taken from our population probably won't match the population exactly. our sample stats not matching the pop parameters exactly is called sampling error. a higher sample size gives a better prediction of our population and a lower sampling error.

rbinom()

used with binomial distributions, when two discrete outcomes are possible (success or failure) - RANDOMLY generates results. rbinom(n = how many replications, size = number of trials in each replication, prob = probability of success, e.g. 0.5)
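A short sketch of drawing random binomial results, with set.seed() used so the 'random' draws repeat; the seed value 42 is an arbitrary choice.

```r
# 5 replications of "flip a fair coin 20 times and count the successes"
set.seed(42)
first <- rbinom(n = 5, size = 20, prob = 0.5)

# Resetting the same seed reproduces exactly the same draws
set.seed(42)
second <- rbinom(n = 5, size = 20, prob = 0.5)

identical(first, second)  # TRUE
```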

R Chunk Flag: include

when FALSE: show neither the code nor its output in the finished file - don't INCLUDE anything - the chunk still runs though, so a variable defined here can be used later on in the code

R Chunk Flag: echo

when FALSE: show the output of the analysis (like the plot/graph) but don't show the code / don't ECHO (repeat) the code

R Chunk Flag: eval

when FALSE: show the code, but don't run the analysis / don't EVALUATE the code (in the outputted file)

Sampling Distribution

what we get when we sample the population over and over again and get different xbars. so we have a sample of n=10, and another sample of n=10, and another sample of n=10. sample 1 has mean 4, sample 2 has mean 5, sample 3 has mean 6. (4+5+6)/3 = 5, which is the overall mean; the sample means also have an SD, etc., so we get a normal curve. this is the distribution of a sample stat over all possible random samples from a population. BUILDING A SAMPLING DISTRIBUTION ISN'T REALISTIC -> use the central limit theorem

what does it mean if you see plus signs (+) on lines in R when you press enter

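you started a command but didn't finish it, e.g. a string with no closing quote. so typing "foo without the closing quote gives + + + on each new line; you need to finish your string (or press Esc to cancel)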

population standard deviation

σ

