EXAM 1 TOPICS (Ch. 1, 4-9; Labs 1-4)
shapes of frequency distributions
- J-shaped distribution - positively skewed distribution (e.g. a harder test would be positively skewed because there aren't as many scores at the high end) - negatively skewed distribution - rectangular distribution (even) - bimodal distribution (e.g. a lot made A's, not many made B's, a lot made C's and D's) - bell-shaped distribution (normal)
scatterplot with a fitted line in SPSS
- X and Y variables matter - Y is the predicted variable - X is the predictor variable - Graphs: Legacy dialogs --> Scatter/Dot -->Select Simple Scatter --> Click Define --> Move the DV (Y) and IV (X) to correct place - AFTER GENERATING PLOT: Double click on plot and select "Add Fit Line at Total"
linear regression equation
- Y' = bX + a - b is the slope and a is the y-intercept (both "regression coefficients") - predicted value of Y: best prediction at every corresponding X value - Based on relationship summarized by the regression line - SPSS calls it a "FIT" measurement
correlation
- a measure of a linear relationship between two variables - we need pairs of scores for each participant to calculate the correlation coefficient (r) - values of r range between +1.00 and -1.00 - the correlation coefficient is attributed to Pearson and called Pearson's r/index - can construct a scatterplot - e.g.) Stress is the X and Eating Difficulties is the Y - Positive correlation: both variables move together, as X increases so does Y and vice versa, more spread out distributions will be like +.68 rather than +1.00 - Negative correlation: the variables move in the opposite directions, as X increases Y decreases and vice versa - No correlation: basically a circle, something like +.06, no linear relationship between X and Y, random scatter
score transformations of measures of variability
- adding a constant number to each score in a distribution does not affect any measure of variability (variance or standard deviation) - if we multiply or divide each score in a distribution by a constant: 1) the standard deviation will change by multiplying (or dividing) by the absolute value of the constant 2) the variance will change by the square of the constant (e.g. so if Y the raw score is being multiplied by 4, then the variance will be multiplied by 16)
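Both rules can be checked quickly in Python (a sketch with made-up scores, using the standard library's statistics module):

```python
import statistics

scores = [2, 4, 6, 8, 10]  # toy data, not from class

# adding a constant: variance and standard deviation are unchanged
shifted = [y + 5 for y in scores]
print(statistics.variance(shifted) == statistics.variance(scores))   # True

# multiplying by a constant c: SD is multiplied by |c|, variance by c^2
scaled = [y * 4 for y in scores]
print(statistics.variance(scaled) == 16 * statistics.variance(scores))  # True
print(statistics.stdev(scaled) == 4 * statistics.stdev(scores))         # True
```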
mean
- arithmetic mean - the sum of the scores divided by the total number of scores - know formula - balance point of a distribution (center of gravity) - picture a seesaw with the scores of a distribution spread along the board like bricks, one brick per score; where you place the fulcrum so that the seesaw is in perfect balance is the mean - deviation scores (how different each score is from the mean) will sum to 0, and only the mean can do this - has very good mathematical properties, generally quite stable from sample to sample - sensitive to extreme scores (outliers), which is why we don't normally report annual salary as a mean: there are some super rich people like NFL players, Dabo, Warren Buffett
variance
- average squared deviation from the mean - denoted s^2 - you take each deviation score, square it, sum the squared deviations, and then divide by n-1 - know the formula
normal curve refers to a family of curves
- bell-shaped - symmetric - unimodal - continuous - area under the normal curve equals to 1.0 - we will focus on the standard normal curve
area under a normal curve
- between 0 and 1 = 34.13% - between 1 and 2 = 13.59% - between 2 and positive infinity = 2.28% - from -1 to 1 = 68% of all cases - from -2 to 2 = 95% of all cases - from -3 to 3 = 99.7% of all cases
Example 3 of z scores: 3000 students given a math entrance exam. Normal distribution of scores with the mean Y = 100 and standard deviation s = 20. We want to place all scores above 120 to be in honors. So how many students will be placed into the honors class?
- change the raw score into a z score ((120-100)/20 = 1.0) - what proportion of all scores fall above z = 1.0? go to the third column "AREA BEYOND Z", and for a z of 1.0, it's .1587 - multiply .1587 by 3000 = 476 freshmen placed in honors math
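Table A lookups like this can be mimicked with Python's stdlib NormalDist; a sketch of Example 3 using the numbers above:

```python
from statistics import NormalDist

exam = NormalDist(mu=100, sigma=20)   # distribution from Example 3
prop_above = 1 - exam.cdf(120)        # area beyond z = (120-100)/20 = 1.0
print(round(prop_above, 4))           # 0.1587
print(round(prop_above * 3000))       # 476 students placed in honors
```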
SPSS: correlation coefficient
- click analyze --> correlate --> bivariate - move X and Y to the variables box - Pearson, two-tailed (normal), and flag significant correlations should already be checked but just make sure - note: don't ever say p = 0 because psychology is not an absolute, hard science, so if it's that little and significant, just say p < .001 (and always stick to 3 decimal places with p)
SPSS: Scatterplots
- click graphs --> legacy dialogs --> scatter/dot - select simple scatter --> define - move your x and y variables to the x and y axes --> OK - when you describe your relationship, use this format: "there appears to be a ___ (+/-) linear relationship between ___ (variable one) and ___ (variable two) (r = ___). As ___ (variable one) ___ (increases/decreases), ___ (variable two) ___ (increases/decreases).
proportion of explained variance: coefficient of determination
- coefficient of determination = the proportion of variance in Y that can be explained by (or is attributed to) differences on X - coefficient of determination = r^2 - r^2 = SS sub Y'/SS sub Y - r^2 = the sum of squares of predicted scores/ sum of squares of actual scores - if r = .4, then r^2 = .16 - r^2 of .16 means that 16% of the variance in depression is explained by whatever X is (e.g. self-esteem) - what would help me in explaining the other 84% (other factors such as being adopted, trauma, etc.) - coefficient of determination = the squared version of the correlation between actual Y and predicted Y (Y') - if the actual and predicted Y are correlated 1.0 (having a S.E.E. of 0), then prediction is perfect. So the proportion of variance/variability in Y can all be explained by using X - 16% of the variance in Y is attributed to X - 1 - r^2 = .84 so 84% of the variance in Y remains unexplained (due to error/noise)
Example 4 of z scores: students who score lower than 85 will be placed in remedial math. How many scored between 85 and 120 and will not be placed into either remedial math or honors math?
- convert both raw scores into z scores ((120-100)/20 = 1.0 and (85-100)/20 = -.75) - determine area for both, so the area between z = -.75 and the mean is .2734, and the area between the mean and z = 1.0 is .3413 - add both of the areas (.2734 + .3413) = .6147, then multiply .6147 by 3000 to find that 1,844 students scored between 85 and 120
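The same stdlib NormalDist sketch handles the between-two-scores case by subtracting CDFs:

```python
from statistics import NormalDist

exam = NormalDist(mu=100, sigma=20)
prop_between = exam.cdf(120) - exam.cdf(85)   # area between z = -.75 and z = 1.0
print(round(prop_between, 4))                 # 0.6147
print(round(prop_between * 3000))             # 1844 students
```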
Example 5 of z scores: A standardized math achievement test for 6th graders is administered in SC. A score below 40 makes a student eligible for remedial math tutoring. It is known that μ = 50 & σ = 8. Assuming that the distribution of scores in SC is approximately normally distributed, what percentage of 6th graders will be eligible for tutoring?
- convert the raw score to a z score ((40-50)/8 = -1.25) but remember that there are no negative values in Table A - go to 1.25 in Table A in the third column "AREA BEYOND Z" = .1056, so 10.56% of students will be eligible for tutoring
Example 1 of z scores: it's known that cognitive ability scores follow a normal curve with mu = 100 and sigma = 15. We want to know the proportion of individuals with scores greater than 130.
- convert the raw score, 130, into a z score ((130-100)/15 = 2.0) - find z = 2.0 in Table A (Appendix D) - look under the third column labeled "AREA BEYOND Z" = .0228 - so 2.28% of individuals score higher than 130
range restriction
- correlations are affected by range restriction/talent - as range restriction increases/gets worse/restricts variability, this tends to lower (attenuate) correlation coefficients - they tend to UNDERESTIMATE the actual relationship e.g.) instead of studying weight vs. age of people 20-90 years old, range restriction would be studying only people aged 20-30 years old, and the correlation would be weak in absolute magnitude - happens in educational settings (don't admit people with lower SAT scores so can't predict what their freshman GPA would've been) - happens in work settings (people that are at the upper end of some score like test score/interview/work experience don't get hired, so we don't know how they would've performed)
summary on correlations and regressions
- correlations tell us the strength and direction of the linear relationship between 2 variables - once we've described a correlation, it can only tell us what the data tend to do in general; we can't make specific predictions - (-0.9) is stronger than 0.6 - (-0.5) and (+0.5) are equal in strength - given 3.0 or 0.3, 3.0 cannot be a correlation because a correlation can't be greater than 1 - linear regression is used to make predictions (unknown) from existing relationships (known) in the data - we "model" the relationship, and use that model to make predictions
Pearson's r
- describes the linear relationship between 2 continuous variables - formula: r = sum (x -xbar)(y - ybar) / square root of (SSx)(SSy) - r = correlation coefficient - X and Y = units, scores, etc. - xbar and ybar = means of x and y - SSx = sum of squares of x - SSy = sum of squares of y - when solving by hand, the numerator will become the sum of XY - [(sum of X)(sum of Y)/n]
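The deviation-score formula above can be sketched as a small Python function (the quiz/exam numbers here are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson's r = sum((x - xbar)(y - ybar)) / sqrt(SSx * SSy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    scp = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # sum of cross products
    ss_x = sum((x - xbar) ** 2 for x in xs)                     # SSx
    ss_y = sum((y - ybar) ** 2 for y in ys)                     # SSy
    return scp / math.sqrt(ss_x * ss_y)

print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 4))   # 0.7746
```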
range
- difference between the highest (max) and lowest (min) score in the distribution (largest number - smallest number) - crude measure of variability and it depends on only 2 scores - important for screening your data at first (e.g. make sure the range of the wonderlic isn't higher than 50 because that's impossible) - a single outlier substantially influences this measure of variability
region of discontinuity
- discontinuities in the distributions tend to result in OVERESTIMATES of the actual relationship - if you only have scores at extreme ends e.g.) school-wide correlation between IQ and reading achievement is 0.5, but with a group of mentally disabled 10 year olds at the school it's 0.25 - this is because there is range restriction on both variables within the mentally disabled group - IQ is not at full range; it's only available at the lower end/lower left - this means that the correlation with the subset of data from the mentally disabled group will be weaker, while the true relationship across the full range of the whole school is greater
variability
- do the scores in a distribution cluster around a central point or do the scores spread around it? - measures of variability: range, semi-interquartile range, variance, standard deviation
regression in SPSS
- does time spent studying (X) predict exam scores (Y)? - Analyze -->Regression -->Linear -->Choose your Variables --> Save -->Check "Unstandardized" under Predicted Values on the top left -->Ok
correlation does not prove causation
- e.g.) the number of people who drowned by falling into a pool correlates with the number of films Nicolas Cage was in per year, but they don't cause each other - e.g.) common example: ice cream sales (X) and aggravated assaults (Y) are positively correlated, but the THIRD/CONFOUNDING variable that's probably the cause is temperature - e.g.) job satisfaction and job performance are positively correlated, but we DON'T KNOW THE DIRECTION OF CAUSATION
standard error of estimate
- estimates the error around the model - formula: take Y' = bX + a and plug in all the X's you have to get the Y' values, subtract each Y' from its actual Y, square those differences, add them all up, divide by n-2, and take the square root
IV can also be called
- factor (in experimental studies) - predictor/explanatory variable (in non-experimental studies without random assignment)
Table A (Appendix D) of z scores
- first column: z score - second column: area between the mean (0) and the z score - third column: area in the tail (beyond z) - the table only shows positive values, but you can use it for negative values you just have to keep in mind that they're negative
statistical reasoning
- for a distribution of scores 1) can the variance be a negative number? no because its numerator is a bunch of squared numbers so they're all positive 2) can the standard deviation be a negative number? no because it's just the square root of the variance which is positive - think of what the distribution of scores would look like if all of the scores clustered around the mean versus spread around the mean (e.g. if everyone got a 95 on the test, then the variance would be 0)
stem-and-leaf plot
- for quantitative data - retains the original data - shows all original data and gives an overall idea of the shape/distribution/trends - need to make a key: 6|8 can mean 68 or 6.8 or 68,000 - leaves are the last significant digit - stems are the remaining digits - we do not lose any information - combines/summarizes quantitative data and adds some nominal data (can have 2 stems/classes)
deviation scores
- for two other measures of variability, we need deviation scores - for a given distribution of scores, calculate the mean. Then, subtract the mean from each score - e.g.) if the mean is 4, then a score of 1 would have a deviation score of -3 - the sum of deviation scores will always equal 0 - now we have scores that show how far a given score is from a central point - it might seem reasonable to take the average of the deviation scores, but the problem is that the sum of the deviation scores is 0, so the average will always be 0 - a way to work around this is to square the deviation scores
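A quick Python check of the claims above (toy scores):

```python
scores = [1, 3, 5, 7]                      # toy distribution, mean = 4
mean = sum(scores) / len(scores)
devs = [y - mean for y in scores]          # deviation scores: -3, -1, 1, 3
print(sum(devs))                           # 0.0 — deviations always sum to 0
print(sum(d ** 2 for d in devs) / (len(scores) - 1))  # squaring first gives the sample variance
```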
standard scores and a normal curve
- given a z score we can convert a set of scores to have any mean or standard deviation we would like - e.g.) if you're trying to see what the equivalent ACT score would be to an SAT score, if you're trying to compare grades on a test in your history class and your calculus class, if you're trying to compare your GPA to an ACT score, etc.
semi-interquartile range
- half of the middle 50% of the scores - Q = (Q3 - Q1) / 2 - less sensitive to extreme scores - (75th percentile - 25th percentile) / 2
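In Python this can be sketched with statistics.quantiles; note that quartile conventions differ across textbooks and software, so a hand answer may differ slightly from the value below:

```python
import statistics

scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]       # toy data
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartiles (default "exclusive" method)
semi_iqr = (q3 - q1) / 2                        # Q = (Q3 - Q1) / 2
print(semi_iqr)
```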
ways to deal with confounding variables
- hold variables constant (e.g. age, course, gender) - matching = identical people in each condition (e.g. two 18-year-olds driving, but this only accounts for age, so matching becomes complicated when trying to match for multiple variables) - random assignment to a condition (most elegant)
standard error of estimate
- how much do the actual Y values vary around the predicted Y scores (Y')? - we want an index of the amount of error in our prediction (want to see if the regression line is doing well) - desirable if this index: 1) equaled 0 when there was NO ERROR in prediction (you're on the line) 2) was positive (a number bigger than 0) if there was prediction error
score transformations of the mean
- if we add a constant number to each score in a distribution, the distribution shifts by the amount of the constant - the mean will increase or decrease by the same amount (e.g. mean is 75, we add 4, the new mean is 79) - if we multiply (or divide) by that same constant, the mean will be multiplied (or divided) by that same constant (e.g. mean is 6.5, multiply by 4, new mean is 26)
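A quick check of both rules (made-up scores, mean = 75):

```python
import statistics

scores = [70, 75, 80]                            # toy scores, mean = 75
print(statistics.mean([y + 4 for y in scores]))  # mean shifts by the constant: 79
print(statistics.mean([y * 4 for y in scores]))  # mean is multiplied by the constant: 300
```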
parameter
- index used to describe some characteristic of a population (this is what we really want to know) - greek symbols like mu for mean, sigma for standard deviation, and beta for slope
statistic
- index used to describe some characteristic of a sample - roman letters like y bar for mean, s for standard deviation, and b for slope
Spearman's rank order correlation
- it's the regular Pearson correlation r but applied to data that have been "properly" ranked - you don't have the original data, just the ranks - X has to be ranked. Y has to be ranked. separately - instead of pairs of scores, we have pairs of ranks for each participant - apply the usual formula for Spearman's rank (r sub s) using columns Rx and Ry - if there's a tie then you have to get the mean of the ranks for all numbers in that tie (e.g. if there are two X scores of 19 and one was ranked 6 and another was ranked 7, then 6.5 would go in the Rx column for both scores of 19)
prediction does NOT prove correlation
- just because I can predict job performance (Y) from job satisfaction (X), this doesn't mean that satisfaction causes better performance - it may be because performing my job (Y) well led to reinforcers like pay increase and flexible work schedule which, in turn, cause me to be more satisfied (X) with my job
nominal measurement
- mutually exclusive and collectively exhaustive (MECE) labels - a label just represents which category you belong to, not which category is better - the person, object, or event should be assigned to a unique label - e.g.) 1 = female, 2 = male - e.g.) democrat, republican, independent, other
ordinal measurement
- labels still MECE, and there's an order of magnitude (more or less of a characteristic) - e.g.) grades A,B,C,D,F - e.g.) freshman, sophomore...
ratio measurement
- labels still MECE, order of magnitude, and equal intervals, but there is also an absolute 0 point - the ratio between measurements has meaning - e.g.) Kelvin scale, weight, height - different levels of measurement let us do different analyses - grades A,B,C,D,F are ordinal while GPA is quantitative
interval measurement
- labels still MECE, order of magnitude, but also equal intervals - equal differences between numbers reflect equal magnitude differences in corresponding classes - e.g.) personality, attitude, Fahrenheit and Celsius scales because they don't have a true 0 point - 0 degrees F or C doesn't mean the absence of temperature, but still equal differences between 80 degrees F and 40 degrees F as between 20 degrees F and 60 degrees F
Lab 3: Correlation Coefficients
- measure the degree of the relationship between X and Y from -1 to +1 - tells us: direction and strength - direction: 1) positive: > 0; as X increases, Y tends to increase; as X decreases, Y tends to decrease 2) negative: < 0; as X increases, Y tends to decrease; as X decreases, Y tends to increase (say tends to because it's just a trend) - or it can be zero correlation - strength: 1) weak: r = 0 - +/- 0.2 2) moderate: r = +/- 0.2 - +/- 0.6 3) strong: r = +/- 0.7 - +/- 1.0
levels of measurement
- measurement = assigning numbers to observations 1) nominal 2) ordinal 3) interval 4) ratio
central tendency
- measures of central tendency are indices which represent the center value of a set of observations - mode (Mo), median (Mdn), mean
sample vs. population
- measuring personality of football players - sample = Clemson University football players - population = every football player at every college in the country - you can make inferences about the population you're trying to measure based on calculations taken from the sample, but there's uncertainty because you didn't measure everyone
Lab 1: Summation, Means, Stem and Leaf Plots
- n is the total number of cases in a sample - capital sigma is the summation sign - notation: n is the stopping point (upper limit of the summation), capital sigma is the summation sign, i is the index of summation, the number after i= is the starting point (lower limit of the summation), and x sub i is the typical element - with the normal notation you just add all the values - if the typical element is x^2 sub i, then you square each value and then add the squares together - if the whole expression is in parentheses and squared, then you square the sum (sum first, then square) - if the typical element is x sub i times y sub i, then you multiply the first x value by the first y value, then add that to the product of the second x value and second y value, and so on - the mean is the sum divided by n - frequency tables: e.g.) class on the left (to find the midpoint of a class 29-39 just do (29+39)/2) and frequency on the right (number of students in that class) - stem and leaf plots: 8|0 means 80
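The summation rules above, sketched in Python with toy values:

```python
xs = [1, 2, 3]
ys = [4, 5, 6]

print(sum(xs))                              # sum of x = 6
print(sum(x ** 2 for x in xs))              # sum of x^2 (square, then sum) = 14
print(sum(xs) ** 2)                         # (sum of x)^2 (sum, then square) = 36
print(sum(x * y for x, y in zip(xs, ys)))   # sum of xy = 1*4 + 2*5 + 3*6 = 32
print(sum(xs) / len(xs))                    # mean = sum of x / n = 2.0
```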
residuals
- observed value - predicted value = residual - residual = error - each data point has one residual - our model minimizes the sum of the squared residuals sigma(Y- Y')^2 - regression model often called the "least squares" model
prediction using (simple linear) regression
- one outcome (Y) such as house sales as a realtor - one predictor (X) such as extraversion - we will assume a linear relation between X and Y (can use it for nonlinear relationships but we won't do that in this class) - asymmetry: we want to make predictions about Y using X (IMPORTANT) - there's an infinite number of lines that could be drawn through a scatterplot, but we want to find the "best fitting" line (a regression line) using an equation called a regression equation - measure the vertical distance between the actual value of Y and the predicted value of Y (Y' or Y hat) on the line - d sub y is a discrepancy or a residual (how far off the actual value is from the value predicted from the line)
Example 2 of z scores: the SAT follows a normal distribution with a mean of 500 and standard deviation of 100. We want to know above what raw score (Y) would a student need to have in order to be in the top 15% of the SAT distribution.
- partition the standard normal curve so that 15% of the distribution is to the right of a particular z and 85% is to the left of the same z - in Table A, look in the third column "AREA BEYOND Z" to get as close as possible to 15% (.15), it's about 1.04 (between 1.03 and 1.04 so you can average it to 1.035 which rounds to 1.04) - then use Y = 500 + (1.04)(100) = 604 which is raw score = mean plus (z score)(standard deviation)
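Working backwards from a percentage to a raw score is the inverse-CDF direction; a sketch with the stdlib NormalDist:

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)
cutoff = sat.inv_cdf(0.85)   # raw score with 85% below it (top 15% above)
print(round(cutoff))         # 604
```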
median
- point in the ordered distribution of scores that divides the data into 2 groups having equal frequency (50th percentile) - if n is an odd number, the median is the middle-ranked value - if n is an even number, the median is the average of the two middle ranked values - it's sensitive only to the number of scores above and below it, not the values of the actual scores - tends to be used to represent the center for skewed distributions (e.g. salary because it's positively skewed because most people have low salaries, but if we used the mean, then the billionaires would raise the average too much)
mode
- score that occurs most frequently - it's possible to have more than one mode (e.g. bimodal, multimodal) e.g.) 7, 12, 6, 2, 9, 7, 5, 2 has modes of 7 and 2 - can also be used with qualitative (nominal) data like eye color, blood type, and race
scatterplots
- show the positions of all cases in an x-y coordinate system - dots are intersections of data on the x and y axes - represents association (not causation) between two variables
summary of simple linear regression
- simple linear regression = predict a quantitative outcome (Y) using a quantitative predictor (X) - e.g.) does time spent in dialectical behavior therapy (X) predict emotion regulation (Y) - expect a positive slope - standard error of estimate = measures how much variability there is in residuals (we would like it to be a small number) - heteroscedasticity is bad and influences the S.E.E. - we want the strength of the relationship indexed by r^2 (range between 0-1) to be large to explain as much of the variability as possible
strength of correlations
- strong relationships (r's) create strong models (small residuals/errors) - weak relationships (r's) create weak models (large residuals/errors) - strength of models --> confidence in predictions
r^2
- the amount of variance in the DV that is explained by the IV (can be converted into the "percentage of explained variance" from the regression line) e.g.) 73.5% of the variance in exam score is explained by hours studying - we want this to be high
standard deviation
- roughly, the average deviation from the mean (strictly, the square root of the average squared deviation) - it's the square root of the variance - e.g.) if the variance s^2 = 9, then the standard deviation s = 3 - know the formula
when we convert a set of raw scores to z scores
- the mean of the z scores will equal 0 (because it's nothing but a deviation score minus a constant divided by a standard deviation constant) - the standard deviation of the z scores will equal 1 (and by necessity, what is the variance of the z scores? 1 because 1 squared is 1) - the shape of the new distribution of scores will not differ from the shape of the original distribution of scores
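A check of the first two properties (toy raw scores; the shape claim you can see by plotting the distribution before and after):

```python
import statistics

scores = [50, 60, 70, 80, 90]   # toy raw scores
m, s = statistics.mean(scores), statistics.stdev(scores)
zs = [(y - m) / s for y in scores]

print(sum(zs) / len(zs))                 # mean of z scores: 0.0
print(round(statistics.stdev(zs), 10))   # SD of z scores: 1.0
```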
know the formula for the correlation coefficient (r)
- the numerator is the sum of the deviation scores for X multiplied by the deviation scores for Y - the denominator is the square root of the sum of squares for X times the sum of squares for Y - a sum of squares is the same as the numerator of the variance
if a distribution of scores approximately follows a normal distribution, we can use z scores to find
- the percent (proportion) of individuals above or below a score - the percent (proportion) of individuals between a pair of scores - a score above or below which a certain percentage (proportion) of the individuals will fall
homoscedasticity/ homogeneity of variance
- the spread of the actual scores (Y) around the regression line (predicted scores, Y') is about the same across all values of X - in an oval-shaped scatter the vertical spread (the arrows from the line to the points) is about the same everywhere, while in an ice-cream-cone shape the spread gets bigger or smaller at different X values - e.g.) heteroscedasticity: predicting visual acuity (Y) when you know age (X), because there's more variability in visual acuity as you get older (some 90 year olds have the vision of a hawk, others don't) - this is an IMPORTANT assumption in regression because when it comes time to calculate the S.E.E., we get a better S.E.E. with homoscedastic data - if the data are heteroscedastic, then the S.E.E. will be wrong and generally too high - heteroscedasticity doesn't affect predicted values (the line still is what it is), but it does affect the S.E.E.
d sub y = actual/observed Y - predicted Y
- the sum of all discrepancies would equal 0 - so we square the residuals and add them up - this is called the sum of squares residuals/errors or SSresid or SSerror
continuous
- there's an infinite number of possible values - you can't count all the values in between (e.g. the number of possible GPAs between 1.5 and 1.75 is infinite)
advantages to z scores
- they're standardized scores - mean always 0, standard deviation always 1 - you can compare scores from different samples - assuming a normal distribution, you can calculate the proportion of scores that fall between any given range of scores
Y' values in SPSS
- to obtain predicted values (y' values) in SPSS - Analyze -->Regression -->Linear -->Choose your Variables -->Save -->Check "Unstandardized" under Predicted Values on the top left -->Ok
regression equation (standard score form)
- use this because some tests/inventories might use different scales like 1-10, 1-50, etc. - pairs of scores on X and Y - convert all X scores to z scores - convert all Y scores to z scores - reminder: the mean of z scores for X and the mean of z scores for Y will both be 0 - this takes the scatter of data and centers it right on top of the origin (0,0) - now calculate a regression equation using these z scores (very simplified): z' sub Y = r(z sub X) - r = slope: correlation between X and Y - there's no y-intercept (a) because it equals 0 (the mean of z scores for Y is 0 and the mean of z scores for X is 0, and the regression line passes through this centroid, (0,0)) - the standard deviation of z scores is 1
Cohen's conventions
- used by some researchers - r = 0.1 is small - r = 0.3 is medium - r = 0.5 is large
discrete
- variable where its observed values are countable and finite (e.g. the number of eggs laid by a hen) - you can count all of the values in between, there's no 1.5 eggs or 1.5993402345 eggs...
regression line
- way to summarize the linear relationship that is present in the data set - maps out all Y primes in a line - X variable = predictor variable (IV) - Y variable = criterion variable (outcome, DV) - Y' = predicted Y (predicted outcome) - regression line will be flat if no relationship (slope of 0) - b = r(Sy/Sx) - a = Ybar - b(Xbar) - you have to find the slope (b) first before you can plug it in to find the y-intercept - with our MODEL (which describes the Y') we can predict Y values by just plugging X in - predictions always fall on the regression line
regression toward the mean
- when a variable that is EXTREME on its first measurement will tend to be closer to the center of the distribution when measured a second time e.g.) health: when individuals with high blood pressure are asked to return for a second measurement, on average, the second measure will be less than the first e.g.) genetics: if you're tall (or short), it's likely that your offspring will be tall (or short), but not as tall (or short) as you e.g.) if you score a 99 on the first exam, you will maybe make a 95 or 97 on the second exam e.g.) if you score a 30 on the first exam, you will maybe make a 50 on the second exam
least squares criterion
- when we sum all the squared discrepancies (square residuals), we prefer the smallest possible sum - line that accomplishes this is the "best fitting" for our data
standard scores (Z scores)
- you need a frame of reference when told that your raw score on a test was 346 because raw scores are not very informative by themselves - state the position of the raw score relative to the mean in standard deviation units - a deviation score of 0 is the same as the mean - a positive deviation score is above the mean - a negative deviation score is below the mean - know different formulas for the z score in a population versus in a sample
making predictions
- you only need 2 points to plot a line - for every 1 unit increase on X, Y will increase by .46 units - or if a different metric makes more sense, if X increased by 2 units, how would Y change? Y would go up by (.46 x 2) = .92 units
the relationship between two variables is weaker as r gets closer to
0
steps to solving z score problems
1) begin by drawing the distribution - remember to put the number on the correct side of the mean - decide: - if they're looking for a proportion, just leave it - if they're looking for a percent, change to percent 2) copy down the entire decimal you get in the z score table
Example 1 of correlation coefficient: I administer a quiz (X) 10 days before Exam 1. Then I obtain Exam 1 scores (Y). Is there a linear relationship between quiz scores and exam scores?
1) calculate the means for X and Y 2) find the deviation scores on X 3) find the deviation scores on Y 4) square the deviation scores on X (and sum) to find SSx 5) square the deviation scores on Y (and sum) to find SSy 6) take the cross products of X and Y's deviation scores (and sum) to find the SCP (sum of cross products) 7) do SCP / square root of (SSx x SSy) to get r
z scores in SPSS
1) do the same things as before up to moving the variable to the variable column 2) click save standardized values as variables --> ok 3) z scores will now create a column right beside your original data
things I missed on Exam 1
1) regression equation of using cognitive ability (X) to predict job performance (Y): Y' = 3X + 4.5 - if cognitive ability scores increased by 2 units, exactly how are job performance scores expected to change? - answer: 3(2) = increase by 6 units - can check by picking two random X's so if X=0 then Y'= 3(0) + 4.5 = 4.5 and if X=2 then Y'= 3(2) + 4.5 = 10.5 which is a 6 unit increase 2) make sure to put the numbers in order before finding the median - basically just make sure to go over the slides (examples) and what we cover in class
process of statistical study
1) research question: "does texting while driving create unsafe roads?" 2) statistical question: Have 40 participants each drive in a driving simulator while text messaging and a different 40 participants each drive in the same simulator but with no cell phone. For each driver, record the number of times that the driver deviated from her/his lane. Compute average lane deviation for participants in: Text Messaging Group and No Cell Phone Group "Was the average number of lane deviations greater for those in the Text Messaging Group than for those in the No Cell Phone Group?" 3) data collection and analysis: driving simulator, analyze with a 2-sample t test to compare means of samples 4) statistical conclusion: "the average number of lane deviations is greater for those in the Text Messaging Group than for those in the No Cell Phone Group" 5) research conclusion: "Text messaging while driving tends to create unsafe roads"
how to approach correlations
1) think about the relationship (look at the data and see what it's measuring) 2) scatterplot (estimate correlation coefficient based on the plot) 3) compute correlation by hand or with SPSS 4) interpret (based on the variables, strength, and direction, and DON'T say prove/caused/variable X made variable Y... because it's not causation)
process of SPSS giving us descriptive statistics
1) variable view --> name variable 2) data view --> enter data manually going down (each row is a person, so row 1 is participant 1, row 2 is participant 2...) 3) analyze --> descriptive statistics --> descriptives 4) hit arrow to send variable over --> select ok
formula for standard error of estimate
S sub YX = square root of [sum of (Y - Y')^2 / (n - 2)] - the denominator can change with more complicated regressions with more predictors - it's n - 2 because we're estimating both the y-intercept and the slope, so we lose one degree of freedom for each - n - 2 is the one we use for simple linear regression - if the S.E.E. is exactly 0, then the predicted Y scores are the same as the actual Y scores (ideal)
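A minimal sketch of the S.E.E. formula, using made-up actual and predicted Y scores:

```python
import math

# S_YX = sqrt( sum((Y - Y')^2) / (n - 2) ) -- hypothetical data
Y_actual    = [3.0, 5.0, 4.0, 8.0]
Y_predicted = [2.5, 5.5, 4.5, 7.5]

n = len(Y_actual)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(Y_actual, Y_predicted))
see = math.sqrt(ss_residual / (n - 2))   # n - 2 for simple linear regression
print(round(see, 4))   # 0.7071
```

If every predicted score matched its actual score, ss_residual and therefore the S.E.E. would be exactly 0.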
standard deviation of a sample
Sx = square root of [SSx / (n - 1)] SSx = sum of x^2 - (sum of x)^2 / n
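The computational formula above, sketched with made-up scores and checked against the standard library's statistics.stdev:

```python
import math
import statistics

# SSx = sum(x^2) - (sum x)^2 / n;  Sx = sqrt(SSx / (n - 1))
scores = [2.0, 4.0, 6.0, 8.0]
n = len(scores)
ss_x = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n
s_x = math.sqrt(ss_x / (n - 1))
print(round(s_x, 3))                        # 2.582
print(round(statistics.stdev(scores), 3))   # 2.582 (same answer)
```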
z score to standard score
X = Xbar + ZS
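A one-line check of the conversion, using the mean-100, SD-10 example that appears elsewhere in these notes:

```python
# X = Xbar + z * S: convert a z score back to a raw score
mean, sd = 100, 10
z = 2.5
raw = mean + z * sd
print(raw)   # 125.0
```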
notation for simple linear regression
X = actual scores on X Y = actual scores on Y Y' = predicted scores on Y d sub y = (Y - Y') = discrepancy/residual - note: each subject in your sample will have a discrepancy - if you added up all the discrepancies, they would equal 0 (the positive and negative residuals cancel out)
regression equation (raw score)
Y' = bX + a - we say that Y is "regressed on" X, or that Y is being predicted by X - use this equation to make predictions about people by plugging in values for X and calculating the predicted values of Y - BUT NO EXTRAPOLATION: only use the X values in the range of X values of the original data b (slope) = r (Sy/Sx) - correlation coefficient times (standard deviation of Y/standard deviation of X) a (y-intercept) = Ybar - b(Xbar) - mean of Y - (slope)(mean of X) - the regression line will always pass through the intersection of the mean of X and the mean of Y (the centroid - center of mass of data)
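The formulas for b and a above can be sketched in Python (hypothetical data; the final assert confirms the line passes through the centroid):

```python
import math

# Fit Y' = bX + a with b = r * (Sy / Sx) and a = Ybar - b * Xbar
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
scp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

r = scp / math.sqrt(ss_x * ss_y)     # correlation coefficient
s_x = math.sqrt(ss_x / (n - 1))      # standard deviation of X
s_y = math.sqrt(ss_y / (n - 1))      # standard deviation of Y
b = r * (s_y / s_x)                  # slope
a = mean_y - b * mean_x              # y-intercept
print(round(b, 2), round(a, 2))      # 0.8 1.8

# the regression line always passes through (Xbar, Ybar), the centroid
assert abs((b * mean_x + a) - mean_y) < 1e-9
```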
sample
a subset of the population (e.g. something small like 100 lung cancer patients in South Carolina)
population
all of the observations about which an investigator wishes to draw conclusions (e.g. something large like lung cancer patients)
confounding variables
anything varying with the IV (e.g. if you put older drivers in the text messaging simulation and younger drivers in the non text messaging condition) (e.g. if one group drives on a winding road course and the other drives on a straight road course)
variable
can vary and take on different quantities and values (e.g. hair color, height)
the independent variable _____ the dependent variable
causes changes in
frequency distributions of quantitative data
e.g.) Wonderlic scores: a 50 item test of mental ability/logical reasoning, a spiraling test so the first question is the easiest and they get harder after that, used by the NFL (quarterbacks tend to get the best scores), chart of how many people got each score 1-50 - can be grouped or ungrouped frequency distributions - if it's grouped, each group has a class midpoint - you make histograms out of this data
substantive examples of using simple linear regression
e.g.) for a salesperson, we can predict monthly sales (Y) based on extraversion scores (X) e.g.) based on a person's mood (+/-) (X), can we predict whether he or she will be altruistic (Y)? e.g.) can we predict job performance (Y) using cognitive ability (X)? e.g.) as a counselor, using self-esteem (X) to predict depression (Y) e.g.) using spatial visualization (X) to predict simulator performance (Y)
frequency distributions of qualitative/nominal/categorical data
e.g.) number of people per political affiliation - if the frequencies are expressed as percentages, it can be called a relative frequency distribution
we want _____, ______ samples to better approximate the characteristics of the population instead of other types of samples (e.g. small, nonrandom samples)
large, random
independent variable (IV)
manipulated variable (e.g. manipulated text messaging)
in a negatively skewed frequency distribution,
mean < median < mode
in a normal distribution,
mean, median, and mode are all the same
qualitative
membership of a certain characteristic (e.g. blood type, hair color, freshman/sophomore/...)
in a positively skewed frequency distribution,
mode < median < mean
quantitative
numbers (e.g. GPA)
dependent variable (DV)
outcome (e.g. number of lane deviations)
DV is also called
outcome, response, criterion
population vs. sample notation
population mean = mu sample mean = x bar population size = N sample size = n population standard deviation = sigma sub x sample standard deviation = S sub x
the sign of the z score tells us
whether the score is above the mean (positive z) or below the mean (negative z)
standard deviation vs variance
st.dev = the square root of variance variance = st. dev^2
r= -.85 has a _______ degree of linear relationship than r= .79
stronger
the sign of r indicates
the direction of the linear relationship (positive: the variables move together; negative: they move in opposite directions) - it's the absolute value of r, not the sign, that indicates the strength
the absolute value of the z score tells us
the distance between the score and the mean in standard deviation units e.g.) If your population had a mean of 100 and a standard deviation of 10, and you got a 125, then the z score would be 2.5 which is 2.5 standard deviations from the mean
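The worked example above as code:

```python
# z = (X - mean) / SD: sign gives direction from the mean,
# absolute value gives distance in standard deviation units
mean, sd = 100, 10
score = 125
z = (score - mean) / sd
print(z)   # 2.5
```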
regression line vs. standard error of estimate
think of the regression line as - the regression model - all predicted values connected together - the 2-dimensional center of the data think of the S.E.E. as - a 2-dimensional measure of spread - "average error" of model
descriptive statistics
used to organize and summarize data/observations (e.g. finding mean, slope, standard deviation)
simple linear regression
using 1 predictor to predict an outcome
inferential statistics
using statistics calculated from a sample to draw conclusions about the population
constant
will not take on other values (e.g. only study Homo sapiens)