HEAL709 Advanced Research in Psychology Ass2 (covering all quant material wks 1 and 5-10)
lab 6
negative sign correlations - as emotional coping goes down, age goes down. Put the correlation matrix in the assignment (table in appendix) - there are significant ones, which justifies the purpose of doing a regression. Pick out the larger and significant values to discuss, e.g. QOL total and FCV total, RelCope and QOL, age and QOL (slight positive correlation) - keep it brief in the assignment. Moderate correlation = above .4 - all correlations in the table are weak, but you can still comment on these in one or two sentences. Don't depend on the correlation matrix unless there are no other significant correlations. Comment e.g. "weak significant negative correlation between...". Mention things that stand out and look at other literature, e.g. weak positive correlation between FCVtotal and EmoCope. Cover the correlation matrix only very briefly.
linear regression table: "In our case, it was significant. This means that the ability of the set of variables that we had entered (Age, Income, Sex, MaritalSt, Region, and Employment_2) to predict QOL was significant. In other words, the information provided by these variables allows us to predict QOL to some degree. But how well does this so-called Model 1 (with this set of predictors) predict QOL? Here we can look at R² (also often written in lower case, r²). As you can see in the table above, this value is 0.06. That means that the set of variables can predict only 6% of the variation in people's QOL levels. So, this isn't much at all. But do remember this was only a small set of variables and included only demographics." - so roughly 95% of the variance is not explained by the variables. In my table the R² was .05, around 5%.
model coefficients table, standardized estimates: the outcome goes up by the beta value in standard deviations if the predictor increases by 1 SD (when positive), e.g. quality of life goes up by .009 standard deviations if income increases by 1 standard deviation - however, this beta coefficient is quite small. Stronger relationships have estimates closer to 1. FCVtotal, EmoCope and RelCope all have significant p-values. For model-specific results, only write R² and p-values. Estimate = magnitude of effect, small or big - closer to 1 is larger. The negative sign of the FCVtotal coefficient indicates that the more fear of COVID-19 one has, the lower one's QOL. The coping variables, in contrast, are positively associated with QOL. In other words, the more coping one engages in, the higher one's QOL is.
assumption checks: "There are no rigid rules on how to interpret these scores, with some researchers advising that only cases of VIF > 5 are an issue (sometimes even up to 10 is accepted). In our case, there isn't anything of concern. If there was, you could run a bivariate correlation with these two variables to see how much they are correlated. If they are highly correlated (say, > 0.70), you might want to take one of them out and re-run the analysis. Cook's distance gives you an indication of whether there are data from some participants that heavily influence the result of the model. These are usually outliers that may then be excluded. Any value of Cook's distance > 1 could be of concern, and as you can see, there weren't any such cases, as the maximum value was 0.0299 (0.03 if rounded to two decimal places). So, we can be confident that no outlier strongly influenced the results of our model."
write-up sentences about Cook's values, e.g.: "A hierarchical multiple linear regression was conducted to look at predictors of quality of life. The assumptions for regression were inspected by observing Cook's distance for the effect of outliers. The Cook's distance values did not exceed 1 (max = .035), which tells us that we can be confident that no outlier strongly influenced the results of the model." "Multicollinearity was tested for by observing variance inflation factors (VIF), which tell us whether any of the predictor variables are highly correlated with one another. The maximum VIF value was 1.65. The higher the VIF score, the less reliable the regression estimates are going to be. There is no firm cut-off, though some researchers advise that only cases of VIF > 5 are an issue (sometimes even up to 10 is accepted). In this case, there isn't anything of concern." Normality of the residuals was also checked visually with a Q-Q plot; there is some deviation, but it does not appear severe. You don't need the plots - only one or two sentences for each assumption check. Don't put these as the main results in the assignment.
The inclusion of the moderator variables was significant to the model: the ΔR² was 1.4%, a small yet significant addition to the overall outcome. Before adding the moderator, the R² was 10%. The standardized estimate for increased fear together with emotional coping shows a significant additional effect on quality of life - it appears to be a buffering effect. This is similar to the finding of Frazier and colleagues... (put in discussion) - see the model coefficients QOL table. Income also significantly affects quality of life, but its standardized estimate is small (.10).
Frazier, P. A., Tix, A. P., & Barron, K. E. (2004). Testing moderator and mediator effects in counseling psychology research. Journal of Counseling Psychology, 51(1), 115-134.
The inclusion of demographics, fear of COVID, coping strategies and the moderator analysis in the linear regression explained only 16.6% of the variance, so there is still a lot of unexplained variance in the dataset - WRITE IN DISCUSSION. Look into other research on quality of life for other variables that could predict QoL.
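The lab analysis was run in SPSS/jamovi, but here is a minimal Python sketch of the same hierarchical-regression workflow (R², ΔR², Cook's distance, VIF). The variable names follow the lab dataset; the file name covid_qol.csv and the exact predictor set in each step are assumptions for illustration.

```python
# Sketch only: the lab used SPSS/jamovi; this mirrors the workflow in Python.
# "covid_qol.csv" and the predictors per step are assumptions.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("covid_qol.csv")

# Step 1: demographics only
m1 = smf.ols("QOL ~ Age + Income + Sex", data=df).fit()
# Step 2: add fear of COVID and the coping variables
m2 = smf.ols("QOL ~ Age + Income + Sex + FCVtotal + EmoCope + RelCope",
             data=df).fit()
print(f"R2 step 1 = {m1.rsquared:.3f}, R2 step 2 = {m2.rsquared:.3f}, "
      f"delta R2 = {m2.rsquared - m1.rsquared:.3f}")

# Cook's distance: values > 1 would flag influential outliers
cooks_d = m2.get_influence().cooks_distance[0]
print("max Cook's distance:", cooks_d.max())

# VIF per predictor: values > 5 (or 10) would flag multicollinearity
X = m2.model.exog  # design matrix, including the intercept column
for i, name in enumerate(m2.model.exog_names):
    if name != "Intercept":
        print(name, variance_inflation_factor(X, i))
```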
alternative hypothesis (H1)
The hypothesis that the means were drawn from different populations (µ1 ≠ µ2)
Limitations of Chi-Square
A problem arises if any of your expected cell frequencies is less than five. In such cases the value of chi-square may be artificially inflated
within-groups variability
Error variability or the variability in DV scores that is due to factors other than the IV—individual differences, measurement error, and extraneous variation. Error can arise from either or both of two sources: individual differences between subjects treated alike within groups and experimental error. Take note that variability caused by your treatment effects is unique to the between-groups variability.
"Scale"
Understanding levels of measurement will be very important later for the SPSS assignment. SPSS lumps the interval and ratio scales together as "scale", because they both behave identically in statistical analyses.
paired samples t-test
before and after scores - comparing the two scores of the same person
Cozby et al.
• Chapters 8 and 10 • (or equivalent chapters in other research methods textbooks)
Even weaker designs
•Sometimes we have to make even more concessions and use a One-group pretest-posttest design: O X O •Or even no baseline measure, as in the One-shot case study design (or One-group posttest-only): X O •Such designs are usually conducted for pilot studies or proof of principle, which then need to be followed up with more robust studies. •Can also compare scores with established norms
Null hypothesis
If the probability of "Mean (Sample 1) - Mean (Sample 2)" is above 5%, we accept the null hypothesis: µ1 = µ2
Alternative hypothesis
If the probability of "Mean (Sample 1) - Mean (Sample 2)" is below 5%, we reject the null hypothesis: µ1 ≠ µ2
correlation lecture
correlation: cannot make causal claims as it is a cross-sectional study. linear regression: include some statement of which is the predictor and which is the outcome variable - a bit more of an attempt to see what is causal, but still correlational. Say what this could mean, but mention it is not causal. add 3 to the participants
independent samples t-test
each group is independent and different, not related, and a participant can only be in one or the other, e.g. males and females, age groups etc.
nocebo effect
expecting something bad to happen can, by itself, cause something bad to happen
Statistical power increases when:
• standard deviation is lower
• the effect (group difference) is larger
• sample size is larger
Can you see how each of them affects statistical power?
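A hedged sketch of how these factors play out, using statsmodels' power calculator for an independent-samples t-test; the effect sizes (Cohen's d) and per-group ns below are made-up examples:

```python
# Sketch: power of an independent-samples t-test for assumed effect sizes
# and group sizes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):       # small, medium, large effect
    for n in (20, 50, 100):     # participants per group
        power = analysis.power(effect_size=d, nobs1=n, alpha=0.05)
        print(f"d={d}, n={n} per group: power = {power:.2f}")
# Larger effects and larger samples push power up; a noisier DV (larger SD)
# shrinks d, which pushes power down.
```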
active control group
they are doing the same things as the experimental group, just without the active ingredient of the treatment (the IV), e.g. having the same room, socialising in groups etc.
History effect
• Any external event that occurred between your pretest and posttest that caused the change in the measure, and not the intervention.
Questions to consider
• Define your research objectives • Why is the new survey needed? • What exactly are we trying to measure? 1. Attitudes and beliefs 2. Facts and demographics 3. Behaviours • What is the purpose of this test, and for whom is it intended? • Level of expertise necessary to administer, score, interpret • special training required to administer the test? • What is the best format for testing? • Self‐report • Pencil & paper/online
Beck Depression Inventory
• General Factor - 2 specific factors • Cognitive • Somatic/Affective
Why do research?
• Intellectual curiosity • Increased understanding of the world • Educate work force • Developing new technologies • Improving standard of living • Getting a PG qualification
Regression towards the mean
• Measurements always vary in a zig-zag fashion. • When you identify your sample by a measure that happened to be on a zig, a zag will follow even when your intervention had no effect at all!
(interrupted) time series
• More extensive data collection than in case reports - a series of observations: O1O2O3O4O5O6O7 X O8O9O10O11O12O13O14, with the treatment (X) either naturally occurring or administered experimentally. •Or slightly more sophisticated in the control series design, which adds a comparison (control) series that does not get the treatment: O1O2O3O4O5O6O7 X O8O9O10O11O12O13O14 (treatment) O1O2O3O4O5O6O7 O8O9O10O11O12O13O14 (comparison/control)
Measurement error
• We assume that everybody has a true score on something that we measure, but there is also measurement error. • An unreliable test is one that is strongly affected by measurement error.
Question wording
• Wording of the question has an impact on its comprehension and the kind of responses you get • What is wrong with the question below? • E.g. "Did your mother, father, full‐blooded sisters, full‐blooded brothers, daughters, or sons ever have a heart attack or myocardial infarction?" - double-barreled (asking more than one thing) and contains jargon
Control over extraneous variables
• You have to make sure that other important factors are the same in both the experimental and the control group! • E.g. if you want to test whether energy drinks help in test performance, you need to make sure that participants in both groups do not drink any caffeine drinks before the test. Otherwise the presence of a third variable may have confounded the results! e.g. testing the effect of multivitamins, but participants could be getting vitamins from a healthy diet - fruit and veg
FA - procedures Stage 1
• the initial factor extraction •Reduces a large number of correlations to a small number of factors •Several different methods •Includes Principal Component Analysis •PCA asks if we can summarise a lot of variables in just a few linear variables or components i.e. data reduction
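As an illustration of the data-reduction idea (not from the lecture), a small sketch with scikit-learn's PCA on simulated questionnaire data, where 10 items are generated from 2 underlying factors:

```python
# Illustration only: PCA on simulated data with a built-in 2-factor structure.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))            # 200 respondents, 2 factors
loadings = rng.normal(size=(2, 10))           # how the 10 items load on them
items = latent @ loadings + 0.5 * rng.normal(size=(200, 10))

pca = PCA().fit(items)
# The first two components should summarise most of the variance,
# mirroring the 2-factor structure we built in.
print(pca.explained_variance_ratio_.round(2))
print(pca.explained_variance_.round(2))       # eigenvalues
```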
Three assumptions underlie parametric inferential tests
(1) the scores have been sampled randomly from the population; (2) the sampling distribution of the mean is normal; and (3) the within-groups variances are homogeneous. Assumption 3 means the variances of the different groups are highly similar. Serious violation of one or more of these assumptions may bias the statistical test. Such bias may lead you to commit a Type I error.
Testing for differences between groups: Analysis of variance (ANOVA)
A t-test involves two groups (e.g., male/female) and a single measure of a dependent variable (e.g., test score). But what if we have more than two groups (or levels)? Rather than do multiple t-tests (and thus inflate the rate of Type-1 error), we have to do an ANOVA.
The t-test
Applied when there is a single dependent variable e.g., IQ, reaction time... The dependent variable is measured from two groups: the independent samples t-test Used when a researcher wishes to know if the scores of two independent groups are significantly different or not. e.g., the differences in pain perception between males and females at the dentist.
Unplanned Comparisons
Comparison between means that is not directed by your hypothesis and is made after finding statistical significance with an overall statistical test (such as ANOVA). If you do not have a specific preexperimental hypothesis concerning your results, you must conduct unplanned comparisons, looking for any differences that might emerge. In experiments with many levels of an independent variable, you may be required to perform a fairly large number of these comparisons.
Reading for this topic: para stats
Cozby et al. •Chapter 13 and elsewhere
If an effect in a particular direction is expected:
E.g., if we have a good a priori reason to believe that an effect is in a particular direction, we can use a one-tailed test.
Latin Square ANOVA
Used to counterbalance the order in which subjects receive treatments in within-subjects experiments (see Chapter 8). The carryover effects contained in the Latin square design tend to inflate the error term used to calculate your F ratio. Consequently, they must be removed before you calculate F.
Critical z-score cut-off points
Usually to be looked up in statistical tables:
        One-tailed   Two-tailed
α=.05   1.65         1.96
α=.01   2.33         2.58
α=.01 means you are using a more conservative test (decreasing the probability of Type-1 error but increasing the probability of Type-2 error). Choose either .05 or .01 consistently throughout the essay and say why you chose that cut-off - mention it in the methods section.
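These cut-offs can also be reproduced with scipy instead of tables - a small sketch (the 1.65 in the table is norm.ppf's 1.645 rounded up):

```python
# Reproducing the tabled critical z-scores with scipy.
from scipy.stats import norm

for alpha in (0.05, 0.01):
    one_tailed = norm.ppf(1 - alpha)       # all of alpha in one tail
    two_tailed = norm.ppf(1 - alpha / 2)   # alpha split across both tails
    print(f"alpha={alpha}: one-tailed {one_tailed:.2f}, "
          f"two-tailed {two_tailed:.2f}")
# alpha=0.05: one-tailed 1.64, two-tailed 1.96
# alpha=0.01: one-tailed 2.33, two-tailed 2.58
```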
between-groups variability
Variability in DV scores that is due solely to the effects of the IV. may be caused by the variation in your independent variable, individual differences among the different subjects in your groups, experimental error, or a combination of these (Gravetter & Wallnau, 1996).
Measurement in experimental-type research
We need to obtain an objective measure of both our IVs and DVs e.g., amount of sleep and mood Measurement is the assignment of numbers or categories to objects or events. Direct measurement • Measurement of concrete factors (e.g. height, weight, speed, temperature, etc.) • Indirect measurement • Measurement of abstract factors (e.g. coping strategies, stress levels, depression, etc.)
Necessary steps for conducting t-tests and ANOVAs 3. For ANOVAs:
a) Test for a main effect (your fixed factor or grouping variable). If it is not significant, you stop here and report that there is no significant group effect. If there is a significant effect, explore which groups are different. You do that by using a post-hoc test; go to b). b) Look at the post-hoc test results to find out which groups are significantly different from each other.
Necessary steps for conducting t-tests and ANOVAs 2. Check assumptions:
c) Homogeneity of regression slopes: the effect of the covariate needs to be the same for each group (no interaction). d) For ANCOVAs and MANCOVAs, the relationship between the DV and the covariate needs to be linear. Technically, covariates need to be scale, but often ordinal covariates are fine. e) Homogeneity of variance: the variance is the same in the groups that you are testing (Levene's test).
Correlation versus causation
A causal relationship is unidirectional. In dynamic systems, things are often more complicated, but it typically helps to start with simple assumptions and then become increasingly more sophisticated. Correlation can still inform us, but it limits our causal claims.
t-test for independent samples
compares 2 means derived from unrelated samples, when you have data from two groups of participants and those participants were assigned at random to the two groups. The test comes in two versions depending on the error term selected. - The unpooled version computes an error term based on the standard error of the mean provided separately by each sample. - The pooled version computes an error term based on the two samples combined, under the assumption that both samples come from populations having the same variance. The pooled version may be more sensitive to any effect of the independent variable, but should be avoided if there are large differences in sample sizes and standard errors. Under these conditions the probability estimates provided by the pooled version may be misleading.
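In scipy, this pooled/unpooled choice is the equal_var argument - a sketch with invented data:

```python
# Sketch: equal_var=True is the pooled version, equal_var=False the
# unpooled (Welch) version.
from scipy import stats

group_a = [5, 7, 6, 9, 8, 7]
group_b = [4, 3, 6, 5, 4, 2]

t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"pooled: t = {t_pooled:.2f}, p = {p_pooled:.3f}")
print(f"Welch:  t = {t_welch:.2f}, p = {p_welch:.3f}")
# With similar ns and variances the two agree closely; they diverge when
# sample sizes and standard errors differ a lot.
```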
stages of survey development
conceptualisation
questionnaire construction - can ask colleagues for advice
pilot test - what works and what doesn't
item analysis - make judgments on individual items: what needs to be discarded, ambiguous questions, two similar questions
questionnaire revision - then can go back to pilot testing
True experimental designs
experimental designs that randomly assign participants to both experimental and control groups
Belsky and Rovine (Nonmaternal Care in the First Year of Life and the Security of Infant-Parent Attachment)
extensive nonmaternal care in the first year is associated with heightened risk of insecure infant-mother attachment and, in the case of sons, insecure infant-father attachment. Analysis of data obtained during Strange Situation assessments conducted when infants were 12 and 13 months of age revealed that infants exposed to 20 or more hours of care per week displayed more avoidance of mother on reunion and were more likely to be classified as insecurely attached to her than infants with less than 20 hours of care per week. Sons whose mothers were employed on a full- time basis (> 35 hours per week) were more likely to be classified as insecure in their attachments to their fathers than all other boys, and, as a result, sons with 20 or more hours of nonmaternal care per week were more likely to be insecurely attached to both parents and less likely to be securely attached to both parents than other boys.
Familywise error rate
the probability that a family of comparisons contains at least one Type I error. As the number of comparisons increases, so does the familywise error rate (that is, probability pyramiding). Familywise error can be computed with the following formula: αFW = 1 − (1 − α)^c, where c is the number of comparisons made and α is your per-comparison error rate.
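A quick worked example of the formula (α = .05 assumed):

```python
# Worked example of alpha_FW = 1 - (1 - alpha)^c.
alpha = 0.05
for c in (1, 3, 5, 10):
    print(f"{c} comparisons: familywise error = {1 - (1 - alpha) ** c:.3f}")
# With 10 comparisons there is already a ~40% chance of at least one
# Type I error.
```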
Why Use Factor Analysis?
An example with BDI • To identify relationships among items or questions in a measure • Construct validity • Beck Depression Inventory has 21 items tapping different symptoms • 21 x 21 correlation matrix • 210 separate correlations to consider!
Hypothesis testing: four possible outcomes
Decisions are based on uncertain information. There are four possible outcomes, each expressed as a conditional probability:
- Type 1 error - false alarm or false positive: incorrectly rejecting the null, finding a difference when there isn't one (the difference was due to chance). Common in psychology: when people repeat the study they cannot find the same results, but this often goes unpublished because it is not interesting enough. How can we reduce our false alarm rate? By reducing the p-value cut-off to .01, a conservative approach - you can make this decision in your assignment or keep it at .05.
- Type 2 error - miss or false negative: accepting the null when it is false and there was a difference.
- Correct decision - accepting the null hypothesis when there is no difference in reality.
- Correct decision - rejecting the null when there is a difference.
Discrete variables
Discrete quantitative variables Values usually restricted to being whole numbers. Variables that are being counted as full numbers are discrete: frequency data. Discrete qualitative variables Some variables consist of categories that cannot be assigned a numerical value in any meaningful way. e.g., diagnoses of mental illnesses
Science and interpretation
Even with the best intentions and the most thorough research design, science does not immediately or necessarily provide the most accurate description of the world. Remember, you are only collecting data (which never lie!), but mistakes can creep in when you analyse and /or interpret the data • Gallup et al. (2002) - anti-depressant function of semen??
Causes
Everything has a cause - nothing happens willy-nilly. At least, that's the assumption in science. A causal relationship means, for example, that: - Event B only happens because Event A happened. - Variable B only increased in value because Variable A increased in value first - Variable B only increased in value because Variable A decreased in value first
Testing for differences between groups: ANOVA
H0 is defined as µ1 = µ2 = ... = µk, where k is the number of groups (or levels of the IV) and k > 2. The alternative hypothesis is that the means are not all equal. Actually, there only needs to be one "≠" sign somewhere for the ANOVA to be significant. In other words, at least one group is different from at least one other group. The one-way ANOVA works by partitioning the variance in the data into two separate sources: A) Between-groups variation B) Within-groups variation. If A is large, we can detect a difference easily (of course, since the scores vary a lot across the different groups). If B is large, we may not detect an underlying group difference, because scores within each group vary so much that they overshadow the group difference.
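To make the partitioning concrete, a hand-rolled sketch with invented data for k = 3 groups:

```python
# Hand-rolled variance partitioning for a one-way ANOVA (invented data).
import numpy as np

groups = [np.array([4, 5, 6, 5]),
          np.array([7, 8, 6, 7]),
          np.array([10, 9, 11, 10])]
grand_mean = np.concatenate(groups).mean()

# A) between-groups variation: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# B) within-groups variation: how scores spread around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1                            # k - 1
df_within = sum(len(g) for g in groups) - len(groups)   # N - k
F = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {F:.2f}")
```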
Necessary steps for conducting t-tests and ANOVAs 1. Determine which test to use:
IV is nominal and DV scale. If the IV has two groups, use a t-test; if more than two, use an ANOVA. 2. Check assumptions: a) As with the t-test, it is assumed that the groups are mutually exclusive (each participant is only a member of one group). b) The DV is scale-level and normally distributed. The Kolmogorov-Smirnov test or the Shapiro-Wilk test can tell you whether that is the case.
Ratio scale
Identical to the interval scale except there is an absolute and non-arbitrary zero point. Equality of spacing between intervals and fixed ordering of variables Has the highest degree of specificity of all types of measurement • Income ($0; $1,000; $10,000; etc.) • Height • Weight For most psychological variables the ratio scale is rarely achieved. Instead psychological data are usually interval level or less.
Weighted Means Analysis
If the inequality in sample sizes was planned or reflects actual differences in the population, each group mean is weighted according to the number of subjects in the group. As a result, means with higher weightings (those from larger groups) contribute more to the analysis than do means with lower weights.
Sometimes the table of the critical values of F does not list the exact degrees of freedom for your denominator
If this happens you can approximate the critical value of F by choosing the next lowest degrees of freedom for the denominator in the table. Choosing this lower value provides a more conservative test of your F ratio.
Measurements
If we want to measure a concept (e.g. personality), we need to define it clearly beforehand. When we measure a variable we define beforehand exactly how the variable is going to be measured for our purposes (operationalisation): Measured using categories: Qualitative variables are always discrete Measured using numbers: Quantitative variables can be discrete or continuous
z Test for the Difference Between Two Proportions
In some research, you may have to determine whether two proportions are significantly different. In a jury simulation where participants return verdicts of guilty or not guilty, for example, your dependent variable might be expressed as the proportion of participants who voted guilty. A relatively easy way to analyze data of this type is to use a z test for the difference between two proportions. The logic behind this test is essentially the same as for the t tests. The difference between the two proportions is evaluated against an estimate of error variance.
Interval
Indicates degree to which categories differ from each other (e.g.: difference between scores of 50 and 70 is equivalent to difference between scores of 20 and 40) Interval scales are continuous variables in which the 0 point is arbitrary (e.g., a zero score does not mean the absence of the trait). Temperature scales (0°, 15°, 30°, etc.) Intelligence scales (IQ=85, 100, 120, etc.) Interval measures can be added and subtracted. It is not meaningful to multiply or divide them
LEVELS OF SPECIFICITY
Nominal - Lowest specificity Ordinal Interval Ratio - Highest specificity *The increase in specificity from nominal to ratio does not imply that continuous measures are better than less discrete or categorical measures. No level of measurement is better or worse than another: they are simply different. The research design, through operational definitions, determines the scale of measurement, the kinds of data analysis that can be used, and therefore the scope of possible conclusions.*
Effect size
Note that a statistically significant difference may not necessarily be meaningful (e.g., an intervention leads to only a tiny improvement on an outcome, such that it is hardly worth the effort of implementing the intervention). For that reason, the effect size is sometimes calculated to show the magnitude of an effect, such as Cohen's d: d > 0.20 means small effect; d > 0.50 means medium; and d > 0.80 means large effect size. Effect size calculators can be found online. When you have a huge sample size on something like eating a large amount of food, a tiny improvement in health can still be statistically significant in a t-test - however, is this clinically significant? Probably not worth the effort.
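A minimal sketch of computing Cohen's d for two independent groups with a pooled SD (invented data):

```python
# Cohen's d for two independent groups, pooled SD, invented data.
import numpy as np

a = np.array([22, 25, 27, 24, 26])
b = np.array([20, 21, 23, 22, 19])
pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
             / (len(a) + len(b) - 2)
d = (a.mean() - b.mean()) / pooled_var ** 0.5
print(f"Cohen's d = {d:.2f}")  # compare against the 0.2 / 0.5 / 0.8 benchmarks
```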
Relationships between variables
Scientists try to study the relationship between variables. Variables are things that can vary, thus change, e.g., - intensity of light - temperature - performance on a test - sleepiness What sort of relationships between the above-mentioned variables might a scientist study?
Measurement levels
To measure a variable we must employ a measurement scale: Nominal Ordinal Interval Ratio The distinctions between the four levels are of fundamental importance. This is because statistical techniques can be used on data collected on some scales but not others. You need to be absolutely familiar with this before you can move on to the SPSS labs.
Continuous variables
When there are an infinite number of possible values that the measurement of a variable might take, e.g., height (in cm) and reaction time (in milliseconds) are continuous variables. AS A RULE: in everyday speak, discrete quantitative variables involve counting and continuous variables involve measuring.
Analysis of Variance (ANOVA)
analysis of variance test used for designs with three or more sample means based on the concept of analyzing the variance that appears in the data. For this analysis, the variation in scores is divided, or partitioned, according to the factors assumed to be responsible for producing that variation. These factors are referred to as sources of variance. The next sections describe how variation is partitioned into sources and how the resulting source variations are used to calculate a statistic called the F ratio. The F ratio is ultimately checked to determine whether the variation among means is statistically significant.
Intervening/confounding variables
What looks like an IV → DV relationship may, in reality, be driven by another variable (sometimes called the "third variable") that causes changes in the variable that you are measuring. Example of confounding variables: let's say you observe that crime rates are higher in suburbs with a higher population density. Does that mean that population density causes crime? Well, not necessarily. People of lower socioeconomic status tend to live in densely populated suburbs. So, perhaps it's poverty, rather than population density, that is the underlying cause of crime. http://gis2.esri.com/library/userconf/proc00/professional/papers/PAP508/p508.htm Confounding variables are often also called the "third-variable problem". Those problems always have to be considered as alternative explanations when doing observational research with subject (or participant) variables (demographics, such as gender, age, ethnicity, etc. that are permanent). E.g., Belsky & Rovine (1988) found that children who were given to nonmaternal care (early-childhood centres) have less secure attachment with their parents - parenting style/career focus could be an alternative explanation.
When reporting the results of a t-test, mention: -
means, standard deviations, degrees of freedom (df), α-level - the α-level used is usually mentioned in a general data analysis section - t-obtained, and "sig.", which is the probability of a Type 1 error (p-value), e.g., (t(df)=t-obtained, p-value): if t=1.26, df=9, p=.24, write (t(9)=1.26, p>.05). - Note that p is written with no 0 in front
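A tiny helper (my own sketch, not from the lecture) that produces this reporting format, including dropping the 0 in front of p:

```python
# Sketch of a formatting helper for the "t(df)=..., p=..." style.
def report_t(t, df, p):
    if p < .001:
        return f"t({df})={t:.2f}, p<.001"
    p_str = f"{p:.2f}".lstrip("0")   # APA style: no 0 in front of p
    return f"t({df})={t:.2f}, p={p_str}"

print(report_t(1.26, 9, 0.24))       # t(9)=1.26, p=.24
```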
Null Hypothesis (H0)
the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error. The hypothesis that the means were drawn from the same population (that is, µ1 = µ2 )
why is distinguishing between science and pseudoscience important?
the public are increasingly presented with so-called fake news. •http://www.lse.ac.uk/GranthamInstitute/news/the-mail-onsunday-admits-publishing-more-fake-news-about-climatechange/ • And it isn't just the reports per se but also how findings are interpreted: •https://www.scientificamerican.com/article/climateresearchers-rsquo-work-is-turned-into-fake-news/ It is important to understand that not all articles are peer reviewed (pseudoscience) and they might not be accurate, but they are still used. Therefore, it is important to use critical thinking.
Factorial ANOVA A single DV:
two or more IVs A factorial design involves all possible combinations of the levels belonging to two or more IVs. e.g., testing the effects of vitamin supplements and exercise on health. Four groups: - Vitamin+Exercise - Vitamin only - Exercise only - No vitamin and no exercise. ANOVA not only looks at the separate effects of each factor but also for an "interaction effect". An interaction is when the effect of one independent variable differs depending on the level of a second independent variable. e.g., vitamins much more effective when a person also exercises
Instrument decay
• Gradual loss of accuracy of the measurement. Usually, a problem when using frequency data (e.g. recording number of cigarettes smoked). • Effect not due to intervention but the fact that fewer events were recorded.
Threats to external validity
• Self‐selection bias • Selecting people who are willing and not considering those who declined • Systematic differences between respondents and non‐ respondents • Interaction between setting and treatment • E.g., hospitals that welcome research versus those that do not
Validity
• The extent to which the test measures what it is designed to measure. • Face validity • Construct validity • Predictive validity • Convergent validity • Discriminant validity • Internal validity • External validity
Interactions
• We cannot always assume that the effects of two variables are additive: Effect (Variable A + Variable B) ≠ Effect (Variable A) + Effect (Variable B) • Often, the effect of one variable is affected by the presence of another one. eg- an apple a day keeps the dr away - other variables could be an overall healthy diet • Sometimes, this can be a potential confounding variable. Other times, it may be a specific topic of investigation (e.g., stress-buffering hypotheses)
Factor Analysis types
•Inductive -Exploratory FA for trying to discern patterns, relationships or structure in data •Deductive -Confirmatory FA for testing a model or hypothesis of these relationships in data
What exactly is science?
"a set of methods used to collect information about phenomena in a particular area of interest and build a reliable base of knowledge about them. This knowledge is acquired via research, which involves a scientist identifying a phenomenon to study, developing hypotheses, conducting a study to collect data, analyzing the data, and disseminating the results. Science also involves developing theories to help better describe, explain, and organize scientific information that is collected... Nothing in science is taken as absolute truth. All scientific observations, conclusions, and theories are open to modification and perhaps even abandonment as new evidence is discovered" (Bordens & Abbott, 2008, p.2)
The value of any particular score obtained in a between subjects experiment is determined by three factors:
(1) characteristics of the subject at the time the score was measured; (2) measurement or recording errors (together called experimental error); and (3) the value of the independent variable (assuming the independent variable is effective). Because subjects differ from one another (Factor 1), and because measurement error fluctuates (Factor 2), scores will vary from one another even when all subjects are exposed to the same treatment conditions. Scores will vary even more if subjects are exposed to different treatment conditions and the independent variable is effective.
Davis & Meade (Both young and older adults discount suggestions from older adults on a social memory test)
age differences in susceptibility to socially introduced misinformation have been studied less. The effects of age on a social memory task were examined. Young and older adult participants were equally susceptible to other people's misleading suggestions. However, the age of one's collaborator influenced the magnitude of the effect: both young and older adult participants were less likely to immediately incorporate suggestions from older adult confederates than from young adult confederates.
mean square
an estimate of either variance between groups or variance within groups another name for variance
statistically significant
an observed effect so large that it would rarely occur by chance. Inferential statistics use the characteristics of the two samples to evaluate the validity of the null hypothesis. Put another way, they assess the probability that the means of the two samples would differ by the observed amount or more, if they had been drawn from the same population of scores. If this probability is sufficiently small (that is, if it is very unlikely that two samples this different would be drawn by chance from the same population), then the difference between the sample means is said to be statistically significant.
Baselines
•Baseline measures control for time as a factor: •e.g. the DV might change simply with time, but if it changes more in the experimental group, you can still claim that the IV had an effect! •Also, sometimes measures increase with repeated measures, so again, you would need to demonstrate that the score has gone up more for the experimental than the control group in order to be able to claim an effect. Baselines • Measuring a baseline also means we can check whether the participants in the experimental and control groups are similar or not.
Which design?
•Choosing the wrong design has bad consequences. But often you don't have much choice! •The design you end up with is often determined by the research question and even practical reasons •The main goal is to pre-empt alternative explanations: • history effects • maturation effects • testing effects • instrument decay • regression towards the mean
Factor Analysis
•Data reduction method •Operates on an item correlation matrix •Asks how many dimensions/factors underpin a questionnaire or rating scale • Paul Kline (1994) An easy guide to factor analysis (p.36) •Defines a factor as 'a linear combination of variables ... so weighted as to account for the variance in the correlations'
Criticisms of Factor Analysis
•Many different methods of factor extraction and rotation => different patterns of results •Different rules for the 'correct' number of factors give different advice on how many factors to extract
Testing for differences between groups
ANOVA Three steps of ANOVA: 1) Calculate the F-statistic. 2) Obtain the critical F at α=.05. 3) Compare the F-statistic and the critical F. If the F-statistic is greater than the critical F (i.e., F-statistic > F-critical), then H0 can be rejected. This is how you present it: e.g., (F(2, 119)=93.23, p<0.001). SPSS does this for you.
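The same three steps with scipy and invented data; scipy's f_oneway also returns the exact p-value, so in practice you rarely look up critical F:

```python
# The three ANOVA steps with scipy (invented data).
from scipy import stats

g1, g2, g3 = [3, 4, 5, 4], [6, 7, 5, 6], [9, 8, 10, 9]
F, p = stats.f_oneway(g1, g2, g3)                   # step 1: the F-statistic

df_between, df_within = 2, 9                        # k - 1 and N - k
F_crit = stats.f.ppf(0.95, df_between, df_within)   # step 2: critical F, alpha=.05
print(f"F({df_between}, {df_within}) = {F:.2f}, "
      f"critical F = {F_crit:.2f}, p = {p:.4f}")    # step 3: compare
```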
To increase statistical power: Repeated-measures
ANOVA/ANCOVA etc. If you take a measure only once, it is possible that this measure deviates from the true score to some extent. If you take it a second time, it might still deviate, but you can start to estimate how large the random error is, and you have much more statistical power to detect a small difference between groups. So, the more data the better.
When we use inferential statistics
Because events in this world are always subject to random variability (e.g., random factors in sampling) we require some sort of formal test to determine whether it is reasonable to claim that an effect has really occurred or whether your result was simply a chance occurrence. When this situation arises, we can use inferential statistics to help us to make a decision.
inferential statistic
Calculate an observed value. This observed value is compared to a critical value of that statistic. Ultimately, you will make your decision about rejecting the null hypothesis based on whether or not the observed value of the statistic meets or exceeds the critical value.
The general structure of the RCT
Control Group Experimental Group We compare the DV of two groups • We randomly allocate participants to either: • experimental group • who get the IV (e.g. guarana) • control group • who don't get the IV (e.g. placebo)
Cross-sectional vs longitudinal designs
Cross-sectional studies collect data in one go and often try to infer or confirm relationships between variables. Often this requires making assumptions about causal relationships. It is still useful if considered in the wider context of converging evidence. However, in order to make stronger claims about causal relationships, longitudinal designs are necessary. Here, predictions of change as a result of changes in variables can be observed over time. Longitudinal designs require a lot of resources - e.g., AUT's Pacific Island Families Studies.
Chi-Square for Contingency Tables/independence
Designed for frequency data in which the relationship, or contingency, between two variables is to be determined. You may want to know whether the two variables are related or independent. The test compares your observed cell frequencies (those you obtained in your study) with the expected cell frequencies (those you would expect to find if chance alone were operating).
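A sketch of this test in scipy with a made-up 2×2 contingency table; the expected frequencies are returned as well, so you can check the "expected cell frequency < 5" problem noted earlier:

```python
# Two-variable chi-square with a made-up contingency table
# (e.g. rows = gender, columns = verdict).
from scipy.stats import chi2_contingency

observed = [[30, 10],
            [20, 20]]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
print("expected frequencies:\n", expected)   # check none are below 5
```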
Research designs
Designing an experiment is a bit like baking a cake: With experience, you get better. Another strategy is to copy recipes from others. Good design enables you to control for alternative explanations (one good experiment is usually much better than several bad ones). Bad design means you cannot make a particular conclusion, and you might not even be able to publish it. So, before you start, plan carefully!!!!
Quasi-experimental designs
Developed to provide alternative means for examining causality in situations not conducive to true-experimental designs, such as ethical issues (e.g. withholding treatment to control group) • Allows some examination of causality in situations where complete control is not possible • At least one of the following three elements of true-experimental research is missing: - Random sampling (most common) - Control groups - Manipulation of the treatment
Factor Analysis vs. PCA
Distinction between factor analysis (FA) and Principal Components Analysis (PCA) Terms often used interchangeably, but: Factor analysis = Theory‐driven How variables are affected by factors Principal Components Analysis = Descriptive Explore the underlying structure of constructs (components) A factor "causes" changes in the variables. The variables are components of a factor.
Variability in the sample means
Due to random error, the sample mean is different every time you collect a sample. If you plot these means on a frequency polygon, you will also get a normal distribution. Below is a "sampling distribution" of IQ scores obtained from the general population. A sampling distribution is a distribution of sample means, as opposed to a distribution of individual scores:
Pseudoexplanations
Explanations that might look plausible, but don't actually explain much e.g., references to "instincts" e.g., "mentalistic" or other circular explanations
Controlling for other factors: ANCOVA
If you are comparing groups that differ along a third variable that might also be related to the main effect that you are testing, you will need to control for it. e.g.: testing whether the overall average temperature of a country is related to life expectancy - you will need to control for GDP, infant mortality, social services, etc.
Unweighted Means Analysis
If you end up with unequal sample sizes for reasons not related to the effects of your treatments, one solution is to equalize the groups by randomly discarding the excess data from the larger groups. Even then, discarding data may not be a good idea, especially if the sample sizes are small to begin with. The loss of data inevitably reduces the power of your statistical tests. Rather than dropping data, you could use an unweighted means analysis that involves a minor correction to the ANOVA.
Things to consider for FA...
Items per factor • Minimum = 3 • Maximum = unlimited • Typically = 4 to 10 • More items = Greater Reliability SAMPLE SIZE: • 50 very poor • 100 poor • 200 fair • 300 good • 500 very good • 1000+ excellent • Gorsuch (1983): Minimum of 200 • Thumb rule is 10 per variable (question item)
Observation and manipulation
Let's say we hypothesise that Variable A (e.g. brightness of light at work) affects Variable B (e.g. workers' productivity). Observation We observe whether Variable B changes with naturally occurring changes in Variable A. We don't interfere. Experimental manipulation We manipulate Variable A and observe whether it produces the predicted changes in Variable B.
The p-value
Looking up the tables in a statistics book, you can find the probability of obtaining each z-score. This is the p-value or the probability of obtaining the result given that your null hypothesis is true. If this value is lower than your pre-determined cut-off point (e.g. .05), you can say that your results were "statistically significant at the α-level of .05". It all comes down to statistical power. A tiny effect can be statistically significant if your test has much power, and a large one might not be if it doesn't (more about that later).
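Instead of the tables, scipy can give the p-value for a z-score directly - a small sketch with z = 1.96 as an example value:

```python
# p-value for a z-score from scipy instead of statistical tables.
from scipy.stats import norm

z = 1.96
p_one_tailed = norm.sf(abs(z))        # area beyond z in one tail
p_two_tailed = 2 * norm.sf(abs(z))    # area in both tails
print(f"one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
```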
Characteristics of Experimental Research
Remember from the first week that one of the hallmarks of experimental research is the manipulation of the IV (unlike in observational research), either as: • presence versus absence of a variable • different degrees of a variable • Variable A versus Variable B. The other one is control of extraneous variables - extraneous variables or confounds that could occur. For example, if you are doing a study on the effect of not having caffeine on exam performance, you need to make sure that not just coffee is removed, but also other sources of caffeine such as green tea or coke. The fact that we are comparing the results from one group with those of another is to make sure the change was not due to another reason (e.g., simply due to time). If all else was equal, the effect was due to the IV.
Why Rasch Analysis?
Provides a template to convert ordinal-level data to interval level, which provides precision of measurement, provided a measure is unidimensional (Rasch, 1961).
Relationships between latent variables
Psychology is a lot about variables that are not directly observable - so-called latent variables. This includes variables such as self-efficacy, motivation, competitiveness, perceived stress. A lot of psychological research develops models that are meant to describe the relationship between the various variables to explain a mechanism of action. Sophisticated versions of these are called mediation models and are tested using techniques such as structural equation modelling (a bit more about that later in the course)
Psychology and Evidence
Psychology is aligned with the scientist-practitioner model, according to which treatment needs to be informed by available evidence. There is still quite a bit of misunderstanding about evidence-based practice. Science is a slow process. Studies all have limitations. But those shouldn't mean we should abandon the whole approach altogether. It is unethical to do something in therapy that is not scientifically proven to work. Reliability and validity are important to remember.
If you have more than one DV: MANOVA
Rather than doing separate ANOVAs, you can do a MANOVA that tests for an effect on more than one DV. Note: DVs should not be highly correlated. And you might end up with a smaller overall sample size because of missing values. Therefore the results do not always exactly match those conducted using separate ANOVAs. Combining as you please: the same principle as in ANCOVA - you can keep adding more and more variables as you please, as long as your sample size can cope with it; otherwise you will end up with funny results or a total lack of significant effects.
Pretests and posttests
Rather than measuring a phenomenon just once, these designs measure the outcome at least twice: a baseline measure of the DV for both groups BEFORE we introduce the IV, and again AFTER. Control group: baseline --- continue baseline. Experimental group: baseline --- intervention. Measure the DV twice in both groups, then compare the control and experimental groups.
Nominal
Represent identity relations: numbers arbitrarily identify items or classes. Numbers are used to represent names for the different values of a variable. e.g., measuring gender as 1=male and 2=female. Nominal variables are usually manifested as "frequency" data. Nominal data can only be analysed using nonparametric techniques
Ordinal
Represent order relations: numbers reflect the rank ordering of objects or events. Numbers have a fixed order so the researcher can rank one category higher or lower than another. However, the intervals between the rankings are not necessarily equal: 1=poverty, 2=lower SES, 3=middle SES, 4=upper SES 1=strongly agree, 2=agree, 3=disagree, 4=strongly disagree Subtraction, multiplication, or division are not permitted on either nominal or ordinal level data.
Michalak et al 2014 vegetarian diet and MH
Results Vegetarians displayed elevated prevalence rates for depressive disorders, anxiety disorders and somatoform disorders. Due to the matching procedure, the findings cannot be explained by socio-demographic characteristics of vegetarians (e.g. higher rates of females, predominant residency in urban areas, high proportion of singles). The analysis of the respective ages at adoption of a vegetarian diet and onset of a mental disorder showed that the adoption of the vegetarian diet tends to follow the onset of mental disorders. Conclusions In Western cultures vegetarian diet is associated with an elevated risk of mental disorders. However, there was no evidence for a causal role of vegetarian diet in the etiology of mental disorders. attributable to several possible causal mechanisms: (a) the biological effects of diet have an influence on brain processes that increases the chance for the onset of mental disorders, in which case it could be expected that adopting a vegetarian diet would precede the onset of mental disorders; (b) relatively stable psychological characteristics independently influence the probability of choosing a vegetarian diet pattern, and developing a mental disorder, in which case the adoption of the diet and the onset of a mental disorder would be unrelated; or (c) developing a mental disorder increases the likelihood of choosing a vegetarian diet, in which case the onset of the mental disorder would precede the vegetarian diet. Although published findings on that type of relationship are missing, it is conceivable that individuals with mental disorders are more aware of suffering of animals or may show more health-oriented behaviors (e.g. adopting a vegetarian diet) in order to positively influence the course of their mental disorder.
Degrees of freedom
The number of individual scores that can vary without changing the sample mean. Statistically written as 'N-1' where N represents the number of subjects. come into play when you use any inferential statistic. You can extend this logic to the analysis of an experiment. If you have three groups in your experiment with means of 2, 5, and 10, the grand mean (the sum of all the scores divided by n) is then 5.7. If you know the grand mean and you know the means from two of your groups, the final mean is set. Hence, the degrees of freedom for a three-group experiment are A - 1 (where A is the number of levels of the independent variable). The degrees of freedom are then used to find the appropriate tabled value of a statistic against which the computed value is compared.
Sample Size
The number of individuals in a sample. The ss determines how well the sample represents the population, not the fraction of the population sampled. Just as with a one-factor ANOVA, you can compute a multifactor ANOVA with unequal ss. The unweighted means analysis can be conducted on a design with two or more factors (the logic is the same).
significance level
The particular level of alpha you adopt A difference between means yielding an observed value of a statistic meeting or exceeding the critical value of your inferential statistic is said to be statistically significant. If the obtained p is less than or equal to alpha, your comparison is statistically significant.
p-value
The probability level which forms the basis for deciding if results are statistically significant (not due to chance); the probability of making a Type I error given that the null hypothesis is true. Hence, for this example you would report that your finding was significant at p < .05 or p < .01.
Hypothesis testing
The probability of a Type-1 error (α) occurring is in fact controlled by the researcher. The alpha-level represents the probability of a Type-1 error and is known as the "level of significance". The accepted level in the social sciences and psychology is 5% (α=.05). A level of significance of α=.05 states that the researcher is willing to accept a 5% chance of making a Type-1 error.
Independent and dependent variables
The researcher observes/manipulates the independent variable (IV) (what is on the x-axis of a graph) and measures the dependent variable (DV) (what is on the y-axis of a graph). The difference between observation and manipulation is that, in manipulation, the quality/intensity/direction of the IV change is controlled by the researcher. This means we can test the effect (switching the effect on and off), as opposed to waiting for naturally occurring changes. Sometimes we don't know which variable is which, but we can assume or hypothesise.
Reasoning behind hypothesis testing
The shaded areas indicate the top 2.5% and bottom 2.5% of the standard normal distribution (5% in total). This area is known as the region of rejection. If our z-score falls within this region, we can reject H0. Two-tailed test - when your hypothesis is not directional (the groups could differ in either direction). One-tailed test - when you expect one group to have a higher mean (directional). Either way, you are still rejecting the null.
Testing for differences between groups: t-test
The t-test involves a formula that returns a t-statistic that is often called t-obtained. The t-obtained, for independent samples t-tests: Undertaking a t-test t-tests can be undertaken by following three steps: • Calculation of t-obtained The t-obtained tells us how far away the difference between our sample means is from the mean difference between our population means, which, according to the null hypothesis, is zero. • Working out the critical-t The critical-t marks the boundary of the region of rejection. • Interpretation Does the difference between the sample means we have collected fall into the critical region? If the t-obtained is greater than critical-t, the difference between the two sample means lies in the region of rejection. Thus we reject the null hypothesis, else accept it.
Deductive versus inductive reasoning
These are just the extremes of the continuum. In practice, research uses some blend of both inductive and deductive thinking. Quantitative research (focus on numbers) tends to favour deductive reasoning (testing preformulated hypotheses about the world) Qualitative research (focus on verbal behaviour) tends to favour inductive reasoning (using data to help formulate theories) However, this is really a black-and-white distinction and not always the case. A lot of quantitative work (e.g., single-subject design, pilot and feasibility studies) do not necessarily test an a priori defined hypothesis but is used to generate some.
Testing for group differences
We assume that there are no differences between the two groups (null hypothesis). We assume that there are no differences between the mean score of the two samples from the two populations. But due to random error in sampling, we will still expect differences in the means, even if the populations are the same. If we keep collecting a sample from each population and compare them (and the populations are really the same), we would expect to find, most of the time, the means to be very similar, but sometimes to be quite different. The differences in the sample means can also be described by a normal distribution. Even when the populations are different, the sample means taken from both populations will vary. Most of the time, the sample mean will be larger from the population that has higher values:
Statistical Errors
When making a comparison between two sample means, there are two possible states of affairs (the null hypothesis is true or it is false) and two possible decisions you can make (to not reject the null hypothesis or to reject it). In combination, these conditions lead to four possible outcomes: 1) the situation where the null hypothesis is not true (the independent variable had an effect), and you correctly decide to reject the null hypothesis. 2) the situation where the null hypothesis is true (the independent variable had no effect), and you correctly decide not to reject the null hypothesis. This is a disappointing outcome, but at least you made the right decision 3) a more disturbing outcome. Here the null hypothesis is again true, but you have incorrectly decided to reject the null hypothesis. In other words, you decided that your independent variable had an effect when, in fact, it did not. In statistics this mistake is called a Type I error. In signal detection experiments, the same kind of mistake is called a "false alarm" (saying a stimulus was present when actually it was not). 4) a second kind of error. In this case, the null hypothesis is false (the independent variable did have an effect) but you have incorrectly decided not to reject the null hypothesis. This is called a Type II error and represents the case where you concluded your independent variable had no effect when it really did have one. In signal detection experiments, such an outcome is called a "miss" (not detecting a stimulus that was present).
paired t-test
When the dependent variable is measured twice in each individual: paired t-test The repeated measures t-test is used to see if two measurements of a single dependent variable made on a single group differ significantly. e.g., measure the hunger levels of subjects both before and after exposure to fast-food commercials. Often used with pretest-posttest designs. It is important that the dependent variable is measured at least at the interval level. And that the scores in the data set can be represented by the normal distribution.
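A paired t-test sketch in scipy, using invented before/after hunger scores for the fast-food example:

```python
# Paired (repeated-measures) t-test with invented before/after ratings.
from scipy import stats

hunger_before = [3, 4, 2, 5, 4, 3]
hunger_after = [5, 6, 4, 6, 5, 5]
t, p = stats.ttest_rel(hunger_before, hunger_after)
print(f"t({len(hunger_before) - 1}) = {t:.2f}, p = {p:.4f}")
```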
nonparametric statistics
When your data do not meet the assumptions of a parametric test, or when your dependent variable was scaled on a nominal or ordinal scale, consider a nonparametric test. This section discusses two nonparametric tests: chi-square and the Mann-Whitney U test.
Chi-Square
When your dependent variable is a dichotomous decision (such as yes-no or guilty-not guilty) or a frequency count (such as how many people voted for Candidate A and how many for Candidate B), the statistic of choice is ... Versions of ... exist for studies with one and two variables. This discussion is limited to the two-variable case. For further information on the one variable analysis, see either Siegel and Castellan (1988) or Roscoe (1975).
Unequal sample sizes
You can still use an ANOVA if your groups contain this, but you must use adjusted computational formulas. The adjustments can take one of two forms, depending on the reasons ... may simply be a by-product of the way you conducted your experiment. If you conducted your experiment by randomly distributing your materials to a large group, for example, you would not be able to keep the sample sizes equal. In such cases, unequal sample sizes do not result from the properties of your treatment conditions. ... may also result from the effects of your treatments. If one of your treatments is painful or stressful, participants may drop out of your experiment because of the aversive nature of that treatment. Death of animals in a group receiving highly stressful conditions is another example of subject loss related to the experimental manipulations that result in ...
t Test
a statistic that compares two means to see whether they could have come from the same population. It is used when your experiment includes only two levels of the independent variable.
Randomised controlled trial (RCT)
• The classical experimental design
• Investigator randomly assigns subjects to an intervention group or a control group (which gets nothing or an alternative intervention)
• Considered as very convincing evidence of a cause-and-effect relationship
Administering the survey - 2. Interviews
1. Face-to-face - expensive, time-consuming, best when the sample is small or when certain tests can only be done in person
2. Telephone - useful for large-scale surveys, less expensive than the in-person format, can access participants from a wide geographical area
3. Focus group - about 6-10 individuals in a group, provided with open-ended questions by an interviewer who is also present. But it can be difficult to arrange an agreed set time within a group, there are travel costs for individuals, and it is time-consuming.
• Interviews can ensure completion, and respondents can ask the researcher to clarify in person
• But interviewer effect - the interviewer subconsciously expects answers, leading the participant to answer according to what the interviewer wants by how they ask questions. Social desirability effect - when the interviewer is present and questions are sensitive, people want to answer desirably according to social norms
Sampling
1. Population - all individuals with common attributes of interest to the researcher
2. Sampling frame - actual population of individuals from which a sample can be drawn, i.e. the accessible population (not always your target population) - external validity
3. Sample - a subset of individuals who represent the target population well. If the sample is representative of the target population, then findings from the sample can be generalised to the wider population.
• Larger sample sizes are more likely to produce data that accurately reflect our population of interest.
Question wording • Think about:
1. Simplicity - relatively simple to understand, devoid of jargon and technical terms
2. Avoid double-barreled questions - these ask more than one thing in the sentence
• E.g. "Should senior citizens be given more money for recreation centres and food assistance programs?"
3. Loaded questions - a leading question that induces people to respond in a particular way. Questions can contain emotionally charged words e.g. rape, waste, favour, punish, illegal, dangerous
• E.g. "Is drink driving dangerous?" vs. "Do you think there is enough awareness being raised in schools of the consequences of drink driving?" -- the 2nd statement is more neutral
Question wording that can affect responses
1. Negative wording
• E.g. "Do you feel that euthanasia should not be legalised?" vs. "Do you feel that euthanasia should be legalised?"
2. Yea-saying and nay-saying - the tendency to respond to a set of questions by agreeing or disagreeing with all the questions regardless of content. The solution is reverse-coded questions - rephrasing questions, e.g. "I dislike parties and I like being alone"
3. Number of questions - the respondent burden - with a long questionnaire, people are going to feel mentally fatigued - important to find the right balance
One-Tailed Versus Two-Tailed Tests
A one-tailed test is conducted if you are interested only in whether the obtained value of the statistic falls in one tail of the sampling distribution for that statistic. This is usually the case when your research hypotheses are directional. For example, you may want to know whether a new therapy is measurably better than the standard one. However, if the new therapy is not better, then you really do not care whether it is simply as good as the standard method or is actually worse. You would not use it in either case. In contrast, you would conduct a two-tailed test if you wanted to know whether the new therapy was either better or worse than the standard method. In that case, you need to check whether your obtained statistic falls into either tail of the distribution. The major implication of all this is that for a given alpha level, you must obtain a greater difference between the means of your two treatment groups to reach statistical significance if you use a two-tailed test. The one-tailed test is, therefore, more likely to detect a real difference if one is present (that is, it is more powerful). However, using the one-tailed test means giving up any information about the reliability of a difference in the other, untested direction. Strictly speaking, you must choose which version you will use before you see the data. You must base your decision on such factors as practical considerations (as in the therapy example), your hypothesis, or previous knowledge. If you wait until after you have seen the data and then base your decision on the direction of the obtained outcome, your actual probability of falsely rejecting the null hypothesis will be greater than the stated alpha value. If you conduct a two-tailed test and then fail to obtain a statistically significant result, the temptation is to find some excuse why you "should have done" a one-tailed test. You can avoid this temptation if you adopt the following rule of thumb: Always use a two-tailed test unless there are compelling a priori reasons not to.
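In code, the choice shows up as the alternative hypothesis you specify; a sketch with made-up therapy scores (assumes a recent scipy, where ttest_ind accepts an alternative argument):

    from scipy import stats

    new = [72, 75, 70, 78, 74, 77]       # hypothetical outcomes, new therapy
    standard = [68, 71, 69, 73, 70, 72]  # standard therapy

    # Two-tailed: is the new therapy either better or worse?
    t, p_two = stats.ttest_ind(new, standard, alternative='two-sided')
    # One-tailed: is it better? (directional hypothesis, chosen before seeing the data)
    t, p_one = stats.ttest_ind(new, standard, alternative='greater')
    print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")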
t-test for correlated samples
A parametric inferential statistic used to compare the means of two samples in a matched-pairs or a within-subjects design in order to assess the probability that the two samples came from populations having the same mean. The t test for correlated samples produces a larger t value than the t test for independent samples when applied to the same data if the scores from the two samples are at least moderately correlated, and this tends to make the correlated-samples test more sensitive to any effect of the independent variable. However, this advantage can be offset by the correlated-samples t test's smaller degrees of freedom. If the scores are only weakly correlated, the correlated-samples t test, with its reduced degrees of freedom, will be less able than the independent-samples t test to detect any effect of the independent variable.
Post-Hoc tests
A significant ANOVA tells us only that a difference exists between at least two of the group means. It doesn't say exactly where the differences are. To find that out, we do post-hoc tests. Bonferroni is the most common one: the level of significance (e.g., α = .05) is divided by the number of t-tests to be done. The Bonferroni is the most conservative post-hoc test, which means that it is less likely to yield a significant result. Other commonly used post-hoc tests are:
- Tukey
- LSD (the least conservative, as the name might allude to)
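The Bonferroni adjustment itself is just arithmetic; a minimal sketch in Python (three hypothetical pairwise comparisons assumed):

    # Bonferroni: divide alpha by the number of post-hoc comparisons
    alpha = 0.05
    n_comparisons = 3                      # e.g., three pairwise t-tests after an ANOVA
    adjusted_alpha = alpha / n_comparisons
    print(adjusted_alpha)                  # 0.0167 - each comparison must beat this cutoff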
Analysis of Covariance (ANCOVA)
A statistical procedure used to test mean differences among groups on an outcome variable, while controlling for one or more covariates. Used when you have included a continuous correlational variable in your experiment (such as age).
Quasi-experimental design issue example
A study is designed to investigate the effects of dog obedience training. The researcher collects data about dog owners that have never taken their dogs to such a training course, and about owners that have just completed a 2-week course. The researchers are surprised to find that the people who have just completed the course report worse behaviour in their dogs than owners that haven't been on the course. Is the course making behaviour worse?
Distribution of differences between sample means
Actually, when the sample sizes are not so large, the differences in sample means are described by the t-distribution. Like z, the t-distribution is symmetrical about zero and has the familiar bell-shaped look. But for this course, let's say it's a normal distribution. Remember, we are still assuming that the two populations are the same. So, given this assumption, we can look at the normal distribution to determine the probability of finding various differences in the sample means of two different populations. In 19 of 20 comparisons, the mean difference lies within the central 95% region of this distribution. Anything outside this region is, by convention, considered unlikely and therefore taken as evidence that the two populations are different.
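The conventional boundaries of that central region can be read off the standard normal distribution; a one-line sketch in Python (scipy assumed):

    from scipy.stats import norm

    # The central 95% of the standard normal lies between these two z-values (about +/-1.96)
    print(norm.ppf([0.025, 0.975]))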
ANOVA for a Two-Factor Between-Subjects Design: An Example
An experiment conducted by Donnerstein and Donnerstein (1973) provided an excellent example of the application of the ANOVA to the analysis of data from a two-factor experiment. Donnerstein and Donnerstein were interested in studying some of the variables mediating interracial aggression. Participants (all Whites) were told that they would be participating in an experiment on learning. They were told that they would have to administer a mild reward each time the "learner" made a correct response and administer punishment (electric shock) each time the learner made a mistake. In the first of two experiments, Donnerstein and Donnerstein manipulated the race of the learner (Black or White). They also manipulated the extent to which participants believed that their behavior would be censured. In a high-censure condition, participants were told that their responses were being recorded on videotape. In a low-censure condition, no mention was made of videotaping the responses. The dependent variable analyzed with a 2 x 2 ANOVA was a composite of shock intensity, shock duration, and the sum of the high-shock intensities. The results of the ANOVA revealed a main effect of potential censure, F(1,32) = 10.49, p < .01, and a significant interaction between the race of the learner and potential censure, F(1,32) = 6.81, p < .05.
Interpreting the Results: This example shows how to interpret the results from a two-factor ANOVA. First, consider the two main effects. There was a significant effect of potential censure on aggression. Participants in the potential-censure condition were less aggressive than those in the no-censure condition. If this were the only significant effect, you could then conclude that race of the learner had no effect on aggression because the main effect of race was not statistically significant. However, this conclusion is not warranted, because of the presence of a significant interaction between race of learner and censure. The presence of a significant interaction suggests that the relationship between your independent variables and your dependent variable is complex. Figure 12-6 shows the data contributing to the significant interaction in the Donnerstein and Donnerstein experiment. Analyzing a significant interaction like this one involves making comparisons among the means involved. Because Donnerstein and Donnerstein predicted the interaction, they used planned comparisons (t tests) to contrast the relevant means. The results showed a Black learner received significantly less punishment than a White learner when the potential for censure existed, t(32) = 5.92, p < .01. Conversely, when no potential for censure existed, the Black learner received more punishment than the White learner, t(32) = 2.38, p < .05. The conclusion that race did not affect aggression must be discarded. In fact, race does affect aggression, but only when the other independent variable is considered.
reporting stats for ass
An independent-samples t-test indicated that scores were significantly higher for women (M = 27.0, SD = 7.21) than for men (M = 24.2, SD = 7.69), t(734) = 4.30, p < .001, d = 0.35.
Italicise t, p etc. - see the reporting stats document.
If the normality assumption is in doubt (e.g., the normality test p-value is too low, i.e., the data do not look normally distributed), you can still do a t-test - look at the Q-Q plot. In lab 3 there is some evidence against normality - some level of skewness in the histograms.
Skewness of the distribution: if it is greater than +1 or less than -1, the distribution is highly skewed; if it is between .5 and 1 (positive or negative), there is a moderate level of skewness. If it is beyond 1 you should do a non-parametric test. The same goes for kurtosis.
There was some skewness in tut 3, but you can still do a t-test as it is not above 1.
The ANOVA F test is similar to the t-test; there is a sig diff in QOL across marital status. Post hoc: single vs married = sig diff.
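A quick way to check those skewness and kurtosis numbers outside jamovi is a sketch like this (hypothetical scores; assumes scipy):

    from scipy.stats import skew, kurtosis

    scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]  # hypothetical QOL totals
    print(skew(scores))      # beyond +/-1 suggests high skew -> consider a nonparametric test
    print(kurtosis(scores))  # excess kurtosis; 0 is what a normal distribution would give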
Main Effects and Interactions
If you find both significant main effects and interactions in your experiment, you must be careful about interpreting the main effects. When you interpret a main effect, you are suggesting that your independent variable has an effect on the dependent variable, regardless of the level of your other independent variable. The presence of an interaction provides evidence to the contrary. The interaction shows that neither of your independent variables has a simple, independent effect. Consequently, you should avoid interpreting main effects when an interaction is present. Another fact to be aware of when dealing with interactions is that certain kinds of interaction can cancel out the main effects. If you have a significant interaction, ignore the main effects. The factors involved in the interaction are reliable whether or not the main effects are statistically significant. Interactions tend to be inherently more interesting than main effects: they show how changes in one variable alter the effects of other variables on behavior.
One-Factor Within Subjects ANOVA
If you used a multilevel within-subjects design in your experiment, use this. As in a between-subjects analysis, the between-treatments sum of squares can be affected by the level of the independent variable and by experimental error. However, unlike the between-subjects case, individual differences no longer contribute to the between-treatments sum of squares because the same subjects are in each experimental treatment group. The within-subjects source of variance can also be partitioned into two factors: variability within a particular treatment (that is, different subjects reacting differently to the same treatment) and experimental error. The contribution of individual differences is estimated by treating subjects as a factor in the analysis (S). You then subtract S from the usual within-groups variance. This subtraction reduces the amount of error in the denominator of the F ratio, thus making the F ratio more sensitive to the effects of the independent variable - a major advantage.
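A minimal sketch of a one-factor within-subjects ANOVA in Python (invented long-format data; assumes pandas and statsmodels, whose AnovaRM handles the subjects-as-a-factor bookkeeping described above):

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Hypothetical long-format data: every subject is measured at all three levels
    df = pd.DataFrame({
        'subject':   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        'condition': ['a', 'b', 'c'] * 4,
        'score':     [5, 7, 9, 4, 6, 8, 6, 7, 10, 5, 8, 9],
    })

    res = AnovaRM(df, depvar='score', subject='subject', within=['condition']).fit()
    print(res)  # F ratio with individual-differences variance removed from the error term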
Collecting data from two groups
If you want to compare two groups, and you cannot collect data from the entire population, you are relying on sampling. You are collecting a sample from each population and comparing the two samples. The comparison is only warranted if the samples are both representative of their respective populations. *Assuming there is no systematic bias* and the samples are representative, we still have random error in data collection. How do we know whether the two samples are different because the populations are different, or because of random error? A t-test is conducted to find that out. As sample size increases, random error decreases.
If we have previous evidence of a difference:
If, for example, previous research showed that OZ males are 2cm taller, your null hypothesis would be: "OZ males are 2cm taller than NZ males" We would then be looking for the likelihood that we got the result that we did (OZ males 1cm taller), given that OZ are on average 2cm taller.
Two factor between subjects ANOVA
In this design, you include two independent variables and randomly assign different subjects to each condition. In addition, you combine independent variables across groups so that you can extract the independent effect of each factor (the main effects) and the combined effect of the two factors (interaction) on the dependent variable.
Rotation Methods
Orthogonal rotation keeps factors uncorrelated, making the factors easier to interpret. It minimises factor covariation.
• Varimax (mostly used), Quartimax, Equamax
Oblique rotation allows the factors to correlate, leading to a conceptually clearer picture but a nightmare for the explanation (overlapping factors).
• Direct oblimin (mostly used), Promax
https://stats.idre.ucla.edu/spss/seminars/introduction‐to‐factor‐analysis/a‐practical‐
It does not matter which one you use; maybe look at which one looks better and at what others have done, for the ass.
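A sketch of what rotation looks like in code, with synthetic stand-in item data (assumes the factor_analyzer package; jamovi's EFA module offers the same rotation options through its GUI):

    import numpy as np
    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    # Synthetic stand-in for item-level data: seven items driven by one latent factor
    rng = np.random.default_rng(1)
    latent = rng.normal(size=(200, 1))
    df = pd.DataFrame(latent + rng.normal(scale=0.7, size=(200, 7)),
                      columns=[f'item{i}' for i in range(1, 8)])

    fa = FactorAnalyzer(n_factors=2, rotation='varimax')  # orthogonal; 'oblimin' = oblique
    fa.fit(df)
    print(fa.loadings_)  # each item's correlation with each rotated factor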
semen and depression Gallup et al. (2002)
Main point is that they seem to have interpreted the data incorrectly. They did mention the alternative explanation, but they dismissed it too quickly. There may have been some statistical interactions here. Females who had sex without condoms, and therefore would be more likely to have semen in their reproductive tract, evidenced significantly fewer depressive symptoms than those who used condoms. Females who were having sex without condoms also showed lower depression scores than those who were abstaining from sex altogether. The fact that depression scores among females who were not having sex did not differ from those who were using condoms demonstrates that it is not sexual activity per se that antagonizes depression. In terms of the relationship between condom use and depressive symptoms, it is also important to comment on the differences in suicide attempts. Sexually active females who usually or always used condoms were more likely to report having attempted suicide than those who never or only sometimes used condoms. Likewise, in much the same way that was true of depression scores, those who abstained from having sex were equivalent to those who typically used condoms in terms of the proportion of respondents who admitted a prior suicide attempt. Confounding variables may be oral contraception, amount of sex (more sex for those not using condoms), risk-taking behaviour, and committed relationships going without condoms -- all of these were not correlated with the BDI.
Planned comparisons
are used when you have specific preexperimental hypotheses. These are hypothesized statistical comparisons used with ANOVA to compare scores on the dependent variable according to the groups or categories of the independent variable. These comparisons are made using information from your overall ANOVA. Separate F ratios (each having one degree of freedom) or t tests are computed for each pair of means. The resulting F ratios are then compared to the critical values. You can conduct as many of these planned comparisons as necessary. However, only a limited number of such comparisons yield unique information. Those comparisons that yield new information are known as orthogonal comparisons. Any set of means has (a - 1) orthogonal comparisons, where a is the number of treatments. Planned comparisons can be used in lieu of an overall ANOVA if you have highly specific preexperimental hypotheses. In this case you would not have the information required to use the given formula for planned comparisons; a simple alternative is to conduct multiple t tests. You should not perform too many of these comparisons, even if the relationships were predicted before you conducted your experiment. Performing multiple tests on the same data increases the probability of making a Type I error across comparisons through a process called probability pyramiding.
Lab 7 Nonparametric tests
Chi-square looks at whether the observed values are different from the expected values in each cell. Report the value (similar to a t or F stat), df and p.
Independent-samples t-test: use Mann-Whitney when the normality assumption of an ANOVA or t-test is violated.
Strong relationship between sex and employment - significantly different across employment categories. Don't just say "not significant" but be specific, e.g. "no sig diff between...".
Better to use one overall test than tests of individual questions, as Type I error risk builds up if you test each question individually.
Don't need Kruskal-Wallis in the ass.
Skewness should ideally be between -1 and 1 as a general rule; however, there are no firm rules. Put skewness and kurtosis in the method when discussing normality, along with visual inspection of the histogram and Q-Q plot to check for normal dist. Interval QOL made it less skewed; QOL total is negatively skewed - only one sentence and no graph of these.
Shapiro-Wilk should have a high p value, above .05, to indicate a normal dist. The Shapiro-Wilk p was less than .001, which indicates a non-normal dist; however, looking at the Q-Q plot and histogram, the dist is normally distributed.
Factor anal lab - refer to notes - health checks important.
Chi-square tests may be relevant when you want to describe the demographics of your sample at the beginning of the results section of your assignment. Better to present in a table - don't use both a table and a write-up. Use the table, then explain the group differences.
Read articles for how to set out the regression.
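A small sketch of these two checks in Python (invented scores; assumes scipy - jamovi reports the same statistics):

    from scipy.stats import shapiro, mannwhitneyu

    group_a = [12, 15, 11, 14, 13, 18, 12]  # hypothetical scores
    group_b = [20, 22, 19, 25, 21, 23, 24]

    print(shapiro(group_a))                # p > .05 is consistent with normality
    print(mannwhitneyu(group_a, group_b))  # nonparametric stand-in for the independent t-test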
Strengths of RCT
Considered the gold standard.
• Having a control group means you are able to find out exactly what effect the IV has on the DV.
• If designed well, RCTs are robust against alternative explanations.
• Results are easy to interpret.
• Its design minimises researcher subjectivity.
TUT 4 reliability of scale notes
Do not use total scores for qol etc.
Cronbach's alpha: .7 - if it is below this, reliability is questionable. FCV is .866 - scale reliability is very good - can mention in methods or the first part of results, just one sentence. QOL = .87.
If we were to drop the qol12 item, the Cronbach's alpha reliability would be better (.47 - close to .5).
Coping scale - reliability is quite low but not something to be too concerned about; it can be mentioned - look at other literature using this scale. CA = .58 is quite low - should be closer to 1, ideally .7 and above.
Assumptions: Bartlett should be less than .05, which it is. KMO of .86 is high, which is good. Both assumptions have been met - this goes in the methods section, not results!!!
Only one factor shown. Eigenvalues differ from the lab handout, but the tut had 1 - unidimensional: a single factor explains FCV. High values on eigenvalues - don't worry about originality. fcv1 can belong to factor 1 or factor 2 - a 2-factor structure is also plausible for the covid scale (fixed no. of 2 components).
Scree plot - shows 1 factor explains the data; 2 factors explain less of the variance, and a 3rd factor does not add much information. The elbow of the curve tells you whether a factor is worthwhile including in the solution. Should I also include the scree plot in the ass? I think yes. 1 factor explains the most variation - other factors do not explain much of the variation. Write briefly how you came up with this solution, and also look at other studies using this scale (conceptual) - mention in the methods section. FA IS ONLY A VERY SMALL PART OF THE ASS.
Factor summary explains how much that factor accounts for variability - it explains 49% of the variance - mention in ass.
Parametric Versus Nonparametric Statistics
A parametric statistic estimates the value of a population parameter from the characteristics of a sample. When you use a parametric statistic, you are making certain assumptions about the population from which your sample was drawn. A key assumption of a parametric test is that your sample was drawn from a normally distributed population - you can look at this by making a histogram in jamovi and seeing whether it is normally distributed or skewed. In contrast to parametric statistics, nonparametric statistics make no assumptions about the distribution of scores underlying your sample. Nonparametric statistics are used if your data do not meet the assumptions of a parametric test.
FA - procedures Stage 2
Factor rotation
• Different rotation types: orthogonal and oblique
Assumptions to be checked for FA:
• Bartlett's test of sphericity: tests whether the variables are unrelated and therefore not suitable for factor extraction; it has to be significant
• Kaiser-Meyer-Olkin (KMO) sampling adequacy: a test for sampling adequacy - indicates whether the sample size is large enough to undertake FA; KMO values should be > 0.5
Only need one or two sentences on this in the method section of the assignment.
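Both checks can also be run in code; a sketch with synthetic stand-in data (assumes the factor_analyzer package, which provides these two helper functions):

    import numpy as np
    import pandas as pd
    from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

    # Synthetic stand-in for item-level data (one latent factor, seven items)
    rng = np.random.default_rng(2)
    latent = rng.normal(size=(200, 1))
    df = pd.DataFrame(latent + rng.normal(scale=0.7, size=(200, 7)))

    chi2, p = calculate_bartlett_sphericity(df)  # want p < .05
    kmo_per_item, kmo_total = calculate_kmo(df)  # want overall KMO > 0.5
    print(f"Bartlett p = {p:.4f}, KMO = {kmo_total:.2f}")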
Lab 5
No sig diff - p value too high - no sig diff in FCV between the employment groups (NOT "between FCV and employment").
F stat - within- and between-group variation. Leave out Cohen's d in t-test reporting; report both between-groups and within-groups. The ANOVA merged categories 3 and 4 - report the last one: employment and residuals.
Employment * sex indicates an interaction - not sig - can write "did not find a sig diff" in the ass.
To do well in your assignment, you don't need to include every analysis and test that you have been exposed to in the labs. Instead, you need to select the ones that make most sense to answer the particular questions that you are asking. You can still mention in passing why you haven't chosen certain alternatives - that way, you can still demonstrate your knowledge of the variety of tests that are available in jamovi. Verbalize what you are doing - why have you chosen ANOVA, ANCOVA etc.? Why is it the most suitable? You don't need to include all types of ANOVAs, and you don't need tables - just report a one-sentence ANOVA.
parametric statistics
Normal distribution assumption - *only applies to the DV*, e.g. fear of covid.
We all have some sense of the likelihood of events - statistics is just a more formal and systematic way of implementing that kind of reasoning. An example:
Extrasensory perception experiment
• Your friend presents you with 5 cards, and you are asked to focus your thoughts on one of them.
• Your friend claims that they can sense your thoughts. Even though they might not get it right every single time, their performance should be better than chance.
• What is the chance of getting it right?
• If performance was purely random, how many times would they get it right on 10 trials?
What probability do we take as the cutoff for rejecting the null hypothesis - for saying the result is not due to chance? .05. In some cases the p-value needs to be lower, e.g. medical treatment or forensics - a 1 in 20 chance is too high when the stakes are high; someone might be falsely accused.
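The card example can be worked through with the binomial distribution; a sketch in Python (scipy assumed; a 1-in-5 chance per trial):

    from scipy.stats import binom

    # 10 trials, probability of guessing the right card by chance = 1/5
    print(binom.pmf(2, 10, 0.2))  # most likely chance outcome: exactly 2 correct (~0.30)
    print(binom.sf(5, 10, 0.2))   # P(6 or more correct) by luck alone (~0.006) - this
                                  # would be strong evidence against pure guessing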
Random & systematic variation
o Random variation (occurs by chance) o Expected difference in values that occur when one examines different subjects from the same sample o An individual subject's values will vary from the value of the sample mean. Therefore a small sample will increase random variation. o Systematic variation/bias o Variation in values due to sampling not being random but affected by biases in selection ‐ therefore consistently selecting subjects who are different or vary from the population in some specific way o Worse if unaware of it
Alpha
refers to the cutoff point you adopt
Per-comparison error
the alpha for each comparison between means
alpha level
the probability level used by researchers to indicate the cutoff probability level (highest value) that allows them to reject the null hypothesis. The probability that a difference at least as large as the observed difference between your sample means could have occurred purely through sampling error. The alpha level you adopt (along with the degrees of freedom) also determines the critical value of the statistic you are using. The smaller the value of alpha, the larger the critical value. Alpha represents the probability of a Type I error. Thus, the smaller you make alpha, the less likely you are to make a Type I error. In theory you can reduce the probability of making a Type I error to any desired level. For example, you could average less than one Type I error in 1 million experiments by choosing an alpha value of .000001. There are good reasons, discussed later, why you do not ordinarily adopt such a conservative alpha level. By convention, alpha has been set at .05 (5 chances in 100 that sampling error alone could have produced a difference at least as large as the one observed).
F ratio
the ratio of between-groups variance to within-groups variance. The statistic used in ANOVA to determine statistical significance. A significant ratio tells you that at least some of the differences among your means are probably not caused by chance, but rather by variation in your independent variable but, as usual, it does not tell you where these significant differences occur.
One-factor between-subjects ANOVA
used when your experiment includes only one factor (with two or more levels) and has different subjects in each experimental condition. As an example, imagine you have conducted an experiment on how well participants can detect a signal against a background of noise. Participants were exposed to different levels of background noise (no noise, 20 decibels, or 40 decibels) and asked to indicate whether or not they heard a tone. The number of times the participant correctly stated that a tone was present represented your dependent variable. You found participants in the no-noise group detected more of the tones (36.4) than participants in either the 20-decibel (23.8) or 40-decibel (16.0) groups. Table 12-3 shows the distributions for the three groups.
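A sketch of this analysis in Python, using made-up detection counts in place of the textbook's Table 12-3 (scipy assumed):

    from scipy.stats import f_oneway

    # Hypothetical numbers of correctly detected tones per participant, by noise level
    no_noise = [38, 35, 36, 37, 36]
    db20 = [25, 23, 24, 22, 25]
    db40 = [17, 15, 16, 16, 15]

    F, p = f_oneway(no_noise, db20, db40)
    print(f"F = {F:.2f}, p = {p:.4f}")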
Factorial design
• An example of a 2x2 design: continuous group design
• Let's say you want to investigate the effects of environmental enrichment on learning, but you want to compare its effects under different amounts of exercise: enrichment low (ER1) vs high (ER2), exercise little (EX1) vs a lot (EX2). Note that these are separate groups, so any subject can only be in one of the 4 groups.
• Apart from "EX1+ER1" versus "EX2+ER2" and "EX2+ER1" versus "EX1+ER2", all comparisons are valid. Which group we call the control here depends on what we are looking at. We can only directly compare groups that differ by no more than one factor.
• But we can't assume that the effects of each factor are always directly additive: the positive effects of exercise might be larger when there is little environmental enrichment: (effect of EX2 with ER1) > (effect of EX2 with ER2)
• Factorial designs are when participants are assigned to groups. However, one can also look at interactions based on observation (not having assigned participants to groups).
Internal consistency (reliability): Cronbach's alpha
• Are all the items/questions measuring the same thing? Are they all correlated with each other?
• α = average of all possible split-halves correlations, adjusted by the number of items (Spearman-Brown formula)
• Ranges from 0 to 1:
α < 0.7 - below acceptable/poor reliability
α > 0.7 - acceptable reliability
α > 0.8 - very good reliability
α > 0.9 - excellent reliability
• If a scale has sub-scales, you need to calculate Cronbach's alpha separately for each sub-scale
• Cronbach's alpha does not provide a measure of "unidimensionality" - for that, we need to conduct a factor analysis
• If the scale does not have good reliability, especially if it was made by someone else, there is not much you can do about it. Just acknowledge it in your discussion, e.g. "the scale reliability was poor, therefore results need to be interpreted with caution".
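Alpha can also be computed directly from the usual variance formula; a minimal sketch in Python (simulated item scores; numpy assumed):

    import numpy as np

    def cronbach_alpha(items):
        """items: array with rows = respondents, columns = scale items."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(100, 1))
    items = latent + rng.normal(scale=0.8, size=(100, 5))  # five correlated items
    print(round(cronbach_alpha(items), 2))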
Randomisation
• As the sizes of two groups increase, so does the chance that they are similar (e.g. in terms of average height).
• Sometimes, when the effects of other variables are suspected (e.g., gender), we can make sure that: a) they don't affect the groups differentially, by spreading them across the groups equally (matching); b) we measure the potential confounding variable and then account for its effect statistically. This is what you will learn in the SPSS labs.
• But still, all sorts of other biases can sneak in during the sampling process. On average, the groups may be similar to the population.
Matched-pairs designs
• Ideally, randomisation ensures that all the groups are as similar as possible - obviously not always possible (random and systematic error). - Especially when there is a small sample size and you suspect that there is an aspect you need to control for eg- gender differences may be important for things such as depression. - if you have 40 people you don't want to randomize as it could be skewed and biased. • If you suspect that a particular variable might be important and needs to be spread evenly across all groups, you can use a matched-pair design. • Most commonly done with gender or age • Sometimes also called "mechanical matching"
Psychological constructs
• In psychology, it's hard to show what we are really measuring. We often infer psychological constructs from direct and indirect observations. • Self-esteem, for example, would be a construct. • There is no single measure of self-esteem. • Other examples of constructs:
The reasoning behind inferential statistics:
• Assume there was no effect (null hypothesis or H0). Our groups are not different.
• The research hypothesis (H1) is that there is an effect.
For example:
- Australian males are on average as tall as New Zealand males.
- People who drink coffee in the morning can concentrate as well as people who drink tea in the morning.
- People who take Vitamin C supplements are equally likely to get a cold as people who don't take them.
We are then trying to find out the likelihood of getting the result that we got, given that the null hypothesis is true. For example: suppose we sample 100 OZ and 100 NZ males, and we find that the OZ males are on average 1cm taller than the NZ males. Assuming that the average height of all Australian and New Zealand males is the same, what is the probability that we obtained our result? It is better to start from the assumption that the null hypothesis is true.
If the result was pretty likely to have occurred by chance, then we have no reason to believe that there was an effect. On the other hand, if the result was pretty unlikely (like a 1 in 20 chance of occurring), then we are prepared to believe that there was an effect. A .045 (about 4%) chance of a fluke result is unlikely - the null needs to be discarded.
Questionnaires
• Basic assumption: variables that significantly correlate with each other do so because they are measuring the same "thing".
• The problem: what is the "thing" that these correlated variables are measuring?
• How might you "reduce" the data intuitively and make inferences about the "thing" that you're measuring?
• Are there certain "factors" that some of the events measure?
Maturation effect
• Change simply with time and not due to exposure to the treatment/intervention.
What type of responses to have?
• Closed vs. open-ended questions
• Closed - limited number of response alternatives given, e.g. Yes/No
• Open-ended - free text
• Rating scales - e.g. Likert scale, e.g. FCV-19S
• Graphic rating scale - a defined ruler which corresponds to a specific score (not as common). E.g. How would you rate your lecturer? Please draw a mark. Very boring ---------- Very exciting. Or a Likert scale with emotion faces (Wong-Baker) - good for children.
It is best to be consistent with the rating scale and the numbers on the scale.
Confidence intervals
• Confidence interval/margin of error - a statistical measure indicating the accuracy of your estimate of the population value. Usually expressed as a %.
• Commonly a 95% confidence interval is used.
• You are 95% confident that the true score of the population is within the interval above or below the sample mean.
• e.g., election polls: 42% ± 3%
• A larger sample size will reduce the confidence interval, i.e. you become more precise. The effect of sample size on the CI: greater sample size = more confidence (more precision).
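A sketch of computing a 95% CI around a sample mean in Python (made-up measurements; assumes scipy):

    import numpy as np
    from scipy import stats

    sample = np.array([5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7])  # hypothetical scores
    mean = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean

    lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
    print(f"mean = {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")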
Problem with one-group pretest-posttest
• Confounding variables are things that you might have varied inadvertently and that prevent you from concluding that only the IV affected the DV. • e.g. "Hawthorne effect"
How many factors are there?
• Determined by the percentage of the variance that is explained by all factors extracted.
• In the natural sciences 95% variance is desired, whereas in the social sciences 60% is considered sufficient (Hair et al., 1998). Nevertheless, values as low as 32% have sometimes been considered acceptable in the social science literature.
• The best solution is one which explains the most variance with the least number of factors.
Kaiser criterion (Eigenvalues > 1)
• One approach is to accept all factors with an Eigenvalue above 1 (Kaiser criterion).
• Eigenvalues tell us how much of the variance within the set of variables is explained by the factors.
• But Cattell (1966) suggested that the Kaiser criterion (Eigenvalue > 1) results in too many factors being retained. Scree plots were suggested as an alternative.
Scree plot
• Indicates the % of additional variance explained by each factor.
• Look for the point where an additional factor fails to add appreciably to the cumulative explained variance (the elbow of the curve).
• The 1st factor explains the most variance; the last factor explains the least.
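The Kaiser criterion is easy to check by hand from the item correlation matrix; a sketch with simulated items (numpy assumed):

    import numpy as np

    # Simulated respondents x items data with one underlying factor
    rng = np.random.default_rng(3)
    latent = rng.normal(size=(200, 1))
    data = latent + rng.normal(scale=0.7, size=(200, 7))

    corr = np.corrcoef(data, rowvar=False)  # item correlation matrix
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first
    print(eigenvalues)
    print("factors to retain (Kaiser):", int((eigenvalues > 1).sum()))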
Rasch analysis
• Developed by Georg Rasch in 1960
• Based on Item Response Theory
• Operates at the level of individual items => each item must be shown to be measuring the same latent trait or variable
• A precise statistical model defines the relationship between each item and the latent trait or construct of interest
• A fundamental assumption is that the probability that a person will pass or endorse an item depends on the person's ability/construct level and the difficulty of the item.
• Once a scale meets the expectations of the Rasch model, it can be converted from an ordinal to an interval scale.
• Items that do not behave according to the model are removed
• The Rasch model insists on a unidimensional scale
• Provides detailed information about individual item performance
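That fundamental assumption is usually written as a logistic function of the gap between person ability and item difficulty; a tiny sketch of the standard dichotomous Rasch formula (hypothetical values):

    import math

    def rasch_p(ability, difficulty):
        """Probability of passing/endorsing an item under the dichotomous Rasch model."""
        return 1 / (1 + math.exp(-(ability - difficulty)))

    print(rasch_p(1.0, 0.0))  # an abler person vs an average item -> ~0.73
    print(rasch_p(0.0, 1.0))  # the reverse -> ~0.27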
Difficult variables to control
• Experiments need to be kept very simple as you need to make sure you are only changing one parameter between experimental and control condition. • Complexity arises when there is potential for confounding variables that are hard to estimate or control. • You may have multiple raters who may or may not be spread across both conditions, thus adding noise. • You may have attrition affecting both groups differently. • Also, planned and actually arranged levels of the IV may differ due to, for example, treatment delivery, participant compliance.
Types of construct validity
• Extent to which results support a network of research hypotheses based on the assumed characteristics of a theoretical psychological variable (construct) - e.g., is the test a measure of self-esteem? • Predictive validity: How well does your test predict future behaviour or ability? E.g. scholastic aptitude and later exam results. • Convergent validity: How much do the results from your test concur with those of another? E.g. are the high scorers in your intelligence test also high scorers in another already established intelligence test?
Randomised pretest-posttest control group design
• Generally large, well-designed studies
• Random sample selection or group allocation to ensure the groups are similar on the variable of interest.
- Note that the pretest is already an intervention, e.g. writing down how much you do things is an intervention - it is already affecting behaviour. What is a true baseline? Not being aware of being measured - but that has ethical issues.
Repeated-measures designs
• Individual scores always vary - more so if the reliability of your instrument is low. So, your score on a test on a particular day might be lower or higher than your true score for some reason. And the same goes for all participants. • This variability means that we need a larger effect to be able to claim that there are group differences (an effect large enough to be seen above the background of variability). • One way to control for (estimate the size of) this variability is by taking repeated measures. • A sophisticated technique here is Generalisability Theory, which allows for the estimation of state (changeable) and trait (relatively permanent) aspects (see Medvedev et al., 2017 doi:10.1007/s12671-016-0676-8) • If repeated measures coincide with repeated interventions, which may differ slightly, you may need to control for order effects:
Co-variates
• Instead of "mechanical matching", you can also use "statistical matching" - measuring every relevant variable of participants.
• This means that your statistical analyses take into account any values that the groups differ on.
• e.g., if you want to investigate the effect of marriage on quality of life, you will need to control for age
• More about that in the ANOVA lecture.
An example of a co-variate: age and health. As your age goes up, your health worsens - then older people start to take care of their health more - so the correlation may appear to show that more healthy food = worse health -- a confound.
Sometimes mechanical matching cannot be done, e.g. 2 cancer patients here and 2 in another group.
KEY IN ASSIGNMENT - PhD students use this a lot - psych has to control for variables as there are more complex mechanisms.
A confounding variable is no longer a confound if you measure it.
Types of reliability assessment
• Inter-rater reliability
• Test-retest reliability: doing the test twice and seeing to what extent the scores correlate - how similar are the scores?
• Internal consistency reliability:
- Split-half reliability: after the participant sat the test, arbitrarily split the test into two equal halves. How similar are the results from both halves?
- Cronbach's alpha
Nonexperimental designs
• Lack of all of the characteristics required for true experimental research • Conclusions rely on the statistical interpretation of data of correlations, which includes statistically controlling for potential confounding variables • Descriptive research (e.g., cross-sectional surveys): - sometimes the only option (ethical reasons) - most useful when testing concepts or relationships among constructs that occur naturally • Often, the preferred method of analysis here is structural equation modelling. This allows for a detailed analysis of mediators and moderators • Often the only way to investigate the effects of participant variables (demographic variables) • Can try to construct a case for causation with LOTS of correlational data AND • Converging evidence
Nonexperimental vs experimental research
• Nonexperimental research is observational, and as mentioned before, it's hard to prove causation
• Experimental research can demonstrate causation because of its key features:
• Manipulation
• Control of extraneous variables
So why do nonexperimental research?
• Often good as a pilot study to inform us where possible causal relationships could lie • Sometimes you can't manipulate so-called participant variables (or sometimes referred to as subject variables or personal attributes, such as ethnicity) • Sometimes unethical to have an experimental and control group (e.g. forcing people to smoke)
Trade-off
• Often, you are faced with a trade-off situation between internal and external validity. • To obtain high internal validity, you need a lot of experimental control (e.g., in a laboratory). • But more control means you risk compromising external validity - are the results valid in the real world?
SCIENCE VS. PSEUDOSCIENCE
• Peer review (results scrutinised by other scientists before they can be published)
• Experiments precisely described so others can replicate them
• Failures to find something are scrutinised with the aim to learn from them
• Convince through research evidence
• Ideally, no personal interests of scientists in the findings
vs
• No peer review (published as is, without quality control)
• Often lacks precise and detailed description of methods
• Failures are often not mentioned
• Convince with appeal to emotions and beliefs
• Often personal interests involved
- hypotheses generated are usually untestable
- if scientific tests are reported, the methodology is not scientific and the validity of the data is questionable
- claims ignore conflicting evidence
- scientific-sounding, vague, appeals to preconceived ideas, rationalises strongly held beliefs
- claims are never revised
Administering the survey - 1. Questionnaires
• Personally administer to groups or individuals
• Mail surveys - costly, low response rates (easy to forget to do, lost in the mail)
• Internet surveys - cheap, but some may not have internet or a computer (may be a problem if the sample is elderly or rural), needs to be user-friendly, technical issues
Sampling methods
• Probability sampling - every individual has a specified probability of being selected
1. Simple random sampling
• Every individual has an equal chance of being selected
• Usually involves the use of a 'table of random numbers'
2. Stratified random sampling
• Accessible people are broken up into strata, and random sampling is conducted within each stratum (e.g. age bands, or ethnic groups). - GOOD WHEN YOU WANT A REPRESENTATIVE SAMPLE
• Allows for a smaller sample to be used than simple random sampling, whilst ensuring that all strata of a given characteristic are represented
3. Cluster sampling
• Identify clusters of individuals (e.g. suburbs in a region), then randomly sample individuals from these clusters
Randomised waitlist control group design
• When you cannot withhold treatment for ethical reasons
• The control group also receives the treatment at some point, but delayed in time, thus controlling for time.
• However, being on the waiting list may be aversive, and there is no guarantee that participants won't seek help elsewhere in the meantime, even if they say they will wait. Being on a waitlist can cause or worsen psychological distress.
• In therapeutic research, a common comparison is with a so-called treatment-as-usual (TAU) condition. This controls for placebo effects of treatment.
• Also watch out for different attrition rates - this may also affect the results.
Random & systematic variation & real effect
• Random variation (occurs by chance) will always be there
• Expected difference in values that occurs when one examines different subjects from the same sample
• An individual subject's values will vary from the value of the sample mean; therefore a small sample will increase random variation.
• Random error in sampling: when random factors cause the difference between the sample and the population. When we collect a sample, we cannot avoid random error (but we can reduce it by increasing the sample size).
• Systematic variation/bias
• Variation in values due to sampling not being random but affected by biases in selection - therefore consistently selecting subjects who are different or vary from the population in some specific way. Systematic error in sampling: when bias or other systematic factors cause the difference between the sample and the population. That's why we need to make sure we are using a representative sample. For example, if you want to get a sample of the height of New Zealanders but you are collecting your sample at a basketball game, you are introducing a bias. You are really targeting a sub-population (NZ basketball players or fans) who probably won't represent the whole population (all New Zealanders).
• Real effect
• Against this backdrop of random and systematic variation, you need to detect whether there is an actual real effect (e.g., a difference)
Scree plot
• Scree plot shows % of variance in original variables accounted for by each factor. • Look for "elbow" in the curve ‐ point where additional factors don't add much to explained variance.
Pseudo-Random Binary Sequences
• Sometimes, leaving allocation purely to chance can result in "unusual" allocations that may affect your results in unintentional ways.
• One way of dealing with this is through pseudo-random binary sequences (PRBS), where you allocate according to a pre-determined order that simulates randomness and thus often appears more random than randomness :)
Non-randomised controlled trial
• Sometimes, we can't randomly select (ethical reasons), and subjects self-select into groups. This means group allocation of the participants is already determined before the start of the trial:
• experimental group - those who decided themselves that they will get the IV (e.g. elective surgery)
• control group - those who decided themselves that they won't get the IV
Apart from the lack of randomisation of subjects to groups, everything else can be the same as in the RCT. We still compare the two groups. But we can't say for sure whether the groups differ because of some other factor (e.g. smoking doesn't cause cancer, but people who decide to smoke are for some other reason more likely to get cancer).
Non‐probability sampling - 3. Quota sampling
• Strategies are used to try and ensure the inclusion of groups that tend to be under-represented (e.g. minorities, e.g. accessing Maori participants at a marae)
4. Snowball sampling
• Takes advantage of social networks and asks participants to refer others they know who meet the selection criteria
• Used to locate participants that may be difficult to reach (e.g. members in addiction programmes)
Sampling error
• The difference between the values obtained from a sample and those that actually exist in the population. The error associated with a sample statistic compared to the corresponding population parameter. Getting a representative sample reduces this.
• Sampling error decreases with increased sample size (also when the population is more homogeneous).
External validity
• The extent to which the results can be generalised from the sample to the population, or from the experimental setting to the real world.
• You may have found a relationship in your experiment - but does this relationship hold outside the experimental situation?
• E.g., Davis & Meade (2013)
Face validity
• The extent to which the test appears to be measuring what it is designed to measure. • Does it look like we are measuring the right thing?
Reliability
• The extent to which the test provides a reliable (repeatable) measure of whatever it is measuring. • Does a personality test yield a similar result each time I sit it? • Do I get the same result if I do a cholesterol test three times in a row?
Testing
• The fact that people were tested, not the treatment, caused the change in behaviour.
So, which item belongs to which factor?
• The factor loadings can tell us - the loading is the correlation of each item with the factor
• The minimum acceptable magnitude of a loading is between 0.3 and 0.45, depending on the sample size
• After rotation, we will be able to see more clearly which item belongs to which factor
Internal validity
• The reported outcomes are the consequence of the relationship between the IV and the DV. • There are no alternative explanations, as all extraneous variables were controlled for. • Being able to say with confidence that: Change in DV because of change in IV
Manipulation
• The scientist arranges changes in IV. If the DV changes as a result, it's stronger evidence for a causal relationship. • Experimental group: gets the treatment (IV) • Control group: doesn't get the treatment (no IV) • If only the condition of the experimental group changes, it's strong evidence that the IV causes a change in the condition. • It's all about ensuring that the only thing that is different between the two groups is the IV and nothing else
Limitations of RCT
• Usually requires a large number of participants - informs us about averages but not individuals - expensive
• Controlled settings (are the findings valid in an applied setting?)
• Possible outcomes are set at the beginning; the design cannot be changed during the study
• What is the control? A placebo psychotherapy? Setting up a straw man is difficult in psych: for example, when testing two methods of therapy such as CBT versus DBT, there may be different instructors, and that is a confounding variable.
Why do we bother with surveys?
• Way to study behaviour, population attributes (e.g. Census data), consumer trends, uniform policy (e.g. referendum), preferences • Useful for exploratory analyses to frame your research question • Only provides a snapshot at a given time • More cost‐efficient way to collect data than experiments