N230 Test 3
Reliability & Validity
Data collected must be as accurate as possible. The instrument used to measure the study variables determines the reliability and validity of the data.
Validity
Does the measurement method used to collect data actually measure what it is supposed to measure? Validity is the degree to which an instrument measures what it is supposed to be measuring. Validity asks: are you measuring what you think you are measuring? Put another way, does your measure accurately capture the variable it is intended to measure?
There are four types of validity: face validity, content validity, criterion validity, and construct validity.
Why Should You be Concerned about R & V?
Error is introduced into a study when the reliability and validity of measures and procedures are not established. This error can be random or systematic.
Content Validity
Establishes that the measure covers the full range of the concept's meaning. To determine the range of meaning, the researcher may solicit the opinions of experts and review literature that identifies the different aspects, or dimensions, of the concept.
SPSS output: paired t test
Experimental group: shows mean and SD of before and after scores; T = t statistic; P = Sig. (2-tailed)
Control group: shows mean and SD of before and after scores; T = t statistic; P = Sig. (2-tailed)
Analysis of Scale Development Data
Exploratory factor analysis (EFA): determines subscales for an instrument. Uses what is called a factor rotation.
SPSS output of ANOVA: comparison of depression among three groups with different level of activity
F = F ratio. Sig. < 0.05: at least one of the group means is significantly different.
Post hoc tests:
•All "between-group differences" are significantly different.
•All three group means differ from one another.
What Do You Get From a Factor Analysis?
Factor structure. Factor loadings (the correlation of each item with the factor); a loading of .30 to .40 at a minimum is wanted to establish that an item belongs on a factor. Amount of variance explained by the factors; it is desirable to explain at least 60% of the variance in your new measure.
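The loadings-and-variance-explained idea above can be sketched with a plain eigendecomposition of a correlation matrix (a simplified, unrotated stand-in for a full EFA; the 4-item correlation matrix below is invented for illustration):

```python
import numpy as np

# Hypothetical correlation matrix for a 4-item scale:
# items 1-2 hang together, items 3-4 hang together
R = np.array([
    [1.0, 0.7, 0.2, 0.1],
    [0.7, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)        # unrotated loadings (item-factor correlations)
var_explained = eigvals / R.shape[0]         # proportion of variance per factor
```

In a real EFA you would rotate the loadings and keep only factors meeting a retention criterion; this sketch just shows where loadings and "% variance explained" come from.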
Parametric tests
Have stringent assumptions about the normal approximation of population parameters. Are used when: data follow a normal distribution in the population, or the sample size is large (at least 30); then the distribution of sample outcomes tends to be normal regardless of the shape of the data in the population (normality assumption). Require interval/ratio-level data (e.g., t-test, ANOVA, Pearson's correlation).
F ratio
If the true means are really different, then MS between (and hence the F ratio) will tend to be much larger than MS within.
Significance tests of normality - e.g., Kolmogorov-Smirnov test, Shapiro-Wilk test
In normality tests: if p < .05, we cannot assume normality in the population, so do non-parametric tests; if p > .05, we can assume normality, so do parametric tests.
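The decision rule above can be tried out with SciPy's Shapiro-Wilk test on made-up data (both samples below are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=10, size=40)  # drawn from a normal distribution
skewed_sample = rng.exponential(scale=10, size=40)     # clearly non-normal (right-skewed)

w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)

# Decision rule: p < .05 -> reject normality, use non-parametric tests;
#                p > .05 -> normality is plausible, parametric tests are fine.
```

The skewed sample should be flagged as non-normal; the normal sample usually is not, though any single random draw can occasionally fail the test.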
Reliability
Is a measure of: stability, equivalence, or internal consistency.
Tests of stability: a stable instrument can be repeated over and over on the same subject and will produce the same results. Assumes that the variable being measured is constant over time.
One type of test: test/retest.
-Don't do this on things that can change (e.g., pain, sedation)
-Minimum of one week between test and retest
-Same subjects
Chi-Square Test of Association
Is used to test for an association between two categorical variables: whether the frequency distribution of observations for one variable differs depending on the category of the second variable. Is used to test for independence between two categorical variables. Is used to compare the proportions of one variable among two or more groups; if the distributions are identical, p = 1.
Step 1: Set up the hypotheses.
Null hypothesis (H0): Age and gender are independent (= not related). •Age group distribution does not differ by gender.
Alternative hypothesis (H1): Age and gender are dependent (= related). •Age group distribution differs by gender.
Now choose the level of significance: alpha = 0.05.
Step 2: Choose the test statistic: χ2 = Σ (O - E)^2 / E, where O = observed frequency of each category and E = expected frequency of each category. E(r,c) = (sum of row r)(sum of column c) / sample size.
Step 3: Compute the p-value. df = (#rows - 1)(#columns - 1). Example: χ2 ≈ 2.5, df = (2 - 1)(3 - 1) = 2; χ2 is smaller than the critical value (5.99), so the p-value > 0.05.
Step 4: Make a decision by comparing the p-value to the chosen level of significance. p-value > 0.05: retain H0; age and gender are not related (= independent). Conclusion: based on these data, we do not have reason to believe that the age distribution differs significantly by gender.
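The four steps above can be run in one call with SciPy; the 2 (gender) x 3 (age group) frequency table below is hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = gender, columns = age group
observed = np.array([[20, 30, 25],
                     [18, 32, 27]])

chi2, p, df, expected = chi2_contingency(observed)
# df = (2 - 1)(3 - 1) = 2; expected counts come from the row/column sums,
# matching E(r,c) = (row sum)(column sum) / sample size.
```

Here the two rows have nearly identical proportions, so χ2 is small and p > 0.05: retain H0 (age and gender independent).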
The Chi-Square Goodness-of-Fit Test
Is used to test whether the distribution of observations aligns with the expected distribution in the null hypothesis, i.e., whether the observed distribution is significantly different from the expected distribution in the null. This test is known as the chi-square goodness-of-fit (GOF) test.
Cronbach's alpha
Measures how well one item predicts the response to another; in other words, how well the items in a questionnaire gel. Researchers can evaluate their items by examining what happens to the reliability when an item is deleted: if internal consistency goes down when an item is removed, it is a good item; if it goes up, it is a bad item.
Most widely used method for evaluating internal consistency. Can be interpreted like other reliability coefficients: the normal range of values is .00 to +1.00, and higher values reflect higher internal consistency. For an instrument or measure, a Cronbach's alpha of 0.80 is desirable, but 0.70 is often accepted.
Internal consistency reliability is also assessed for subscales of an instrument. Cronbach's alpha may be lower for subscales: the fewer the items, the more difficult it is to establish internal consistency. For a subscale, > 0.70 is desirable, but 0.60 can be acceptable.
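Cronbach's alpha has a short closed form, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), which can be sketched directly (the 6-respondent, 5-item score matrix below is invented):

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical 5-item Likert questionnaire, 6 respondents
scores = np.array([
    [4, 5, 4, 4, 5],
    [3, 3, 3, 2, 3],
    [5, 5, 4, 5, 5],
    [2, 2, 3, 2, 2],
    [4, 4, 4, 5, 4],
    [3, 2, 3, 3, 3],
])
alpha = cronbach_alpha(scores)
```

Because these made-up items move together across respondents, alpha comes out well above the 0.80 benchmark.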
Non-parametric tests
No stringent normality assumptions ("distribution-free" tests). Data measured on any scale can be used: nominal, ordinal, interval, or ratio. E.g., chi-square test, Mann-Whitney U test, Wilcoxon signed rank test, etc.
Kruskal-Wallis test
Non-parametric counterpart of ANOVA.
Assumptions: random sampling; independence of groups (3 or more); dependent variable = continuous variable.
Hypotheses for the Kruskal-Wallis test: H0: Data distribution is the same in all groups. H1: Data distribution of at least one group is shifted to the left or right of the others.
SPSS: if p < 0.05, at least one group has significantly larger scores than at least one of the other groups.
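A quick sketch of the test with SciPy; the three activity-level groups below are hypothetical scores:

```python
from scipy.stats import kruskal

# Hypothetical depression scores for three activity-level groups
low_activity    = [22, 25, 27, 30, 31]
medium_activity = [18, 20, 21, 24, 26]
high_activity   = [10, 12, 15, 16, 19]

h, p = kruskal(low_activity, medium_activity, high_activity)
# p < 0.05 -> at least one group's distribution is shifted relative to the others
```

These made-up groups are clearly separated, so the test rejects H0.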
Spearman's rank-order correlation
Non-parametric counterpart of Pearson's correlation.
Assumptions: random sampling; two continuous variables.
When the assumption of normal distribution is violated, we use Spearman's rank-order correlation coefficient (ρ) instead of r. Spearman's correlation coefficient = ρ (rho); -1 ≤ ρ ≤ 1. Interpretation is the same as for r.
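Because ρ works on ranks, a perfectly monotone (but not necessarily linear) relationship gives ρ = ±1. A sketch with invented pain and sleep-quality ratings:

```python
from scipy.stats import spearmanr

# Hypothetical ordinal-ish ratings: as pain rises, sleep quality falls monotonically
pain          = [2, 4, 5, 7, 8, 9]
sleep_quality = [9, 8, 6, 5, 3, 2]

rho, p = spearmanr(pain, sleep_quality)
# Perfectly monotone decreasing data -> rho = -1
```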
Wilcoxon Signed Rank test
Non-parametric counterpart of the paired t-test.
Assumptions: random sampling; dependent variable = continuous variable.
We first order the absolute values of the difference scores and assign ranks from the smallest to the largest. The ranks for the negative and the positive differences are then summed separately (W- and W+). If there is no significant difference between before- and after-scores, we would expect to see similar numbers of + and - ranks.
Hypotheses for the Wilcoxon signed rank test: H0: Before- and after-scores have the same distribution. H1: After-scores are systematically lower (or higher) than before-scores.
SPSS: if p < 0.05, QoL scores after therapy are significantly higher than those before therapy.
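A sketch of the signed-rank procedure with SciPy; the before/after QoL scores below are invented:

```python
from scipy.stats import wilcoxon

# Hypothetical QoL scores for the same 8 patients before and after therapy
before = [55, 60, 52, 48, 64, 58, 61, 50]
after  = [62, 68, 58, 55, 70, 64, 66, 59]

stat, p = wilcoxon(before, after)
# All 8 differences have the same sign, so one of W-/W+ is 0 and p is small
```

Here every patient improved, so the test rejects H0 even with only 8 pairs.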
Hypotheses in GOF test
Null hypothesis: 90% of the patients are on time, 8% are 10-20 minutes late, and 2% are > 20 minutes late. Alternative hypothesis: The data are not consistent with the specified distribution in the null hypothesis, OR at least one of the specified proportions is not true.
Paired t-test
Paired t-test: compares two means obtained from the same group; determines whether the difference between paired observations is significant or not. "Two related sample means" typically represent: 'before and after' observations on the same subjects, or two different observations from different treatments applied to the same subjects.
Paired t-test: determines whether the difference between the scores before and after discharge is significant in each group (education vs. no education). Two-sample t-test: determines whether the difference in mean anxiety scores between the two groups is significant before and after surgery.
Parametric and Non Parametric Test counterparts
Parametric: Independent-group (two-sample) t-test; Non-parametric: Mann-Whitney U test (or Wilcoxon rank sum test)
Parametric: Paired t-test; Non-parametric: Wilcoxon signed rank test
Parametric: ANOVA; Non-parametric: Kruskal-Wallis test
Parametric: Pearson's correlation; Non-parametric: Spearman's correlation
Post-hoc test
Post hoc tests:
•Needed after the ANOVA test is found to be significant
•To determine which "between-group differences" are significant
Tukey, Scheffé, Duncan, Bonferroni test, etc. The Tukey and Bonferroni tests are most commonly used.
Rank
The practical difference between parametric and non-parametric tests is that non-parametric methods use the ranks of values rather than the actual values. Rank in ascending order; when two values are the same, average their ranks.
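SciPy's `rankdata` implements exactly this tie-averaging rule; the values below reuse the start of the Mann-Whitney example later in these notes:

```python
from scipy.stats import rankdata

data = [71, 72, 73, 73, 75, 75, 78]
ranks = rankdata(data)
# Ties share the average rank: the two 73s get (3 + 4) / 2 = 3.5,
# the two 75s get (5 + 6) / 2 = 5.5
```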
Face Validity
Refers to the confidence gained from careful inspection of the concept to see if it is appropriate "on the face." Every measure should be inspected for face validity, but face validity alone does not provide convincing evidence of measurement validity; it is the crudest form of validity.
The instrument is examined by: experts in the field (colleagues) and users. Each item is inspected for its appropriateness to the construct being measured: wording, clarity, relevance. Instructions must be written and reviewed.
Researchers can use: focus groups, one-on-one assessments, group consensus building. Questionable items should be further evaluated.
Construct Validity
Shows that a measure is valid by demonstrating that it is related to a variety of other measures specified in a theory. Construct validity is used when no clear criterion exists for validation purposes; it determines the extent to which the instrument actually measures the concept.
Two other approaches to construct validity:
Convergent validity: achieved when one measure of a concept is associated with different types of measures of the same concept.
Discriminant validity: scores on the measure to be validated are compared to scores on measures of different but related concepts.
Further approaches: contrasted (known) groups validity and factor analysis.
Contrasted groups: take those expected to be high in the attribute being measured and statistically compare their score(s) to those expected to be low in the attribute. What would you expect to see?
Paired T Test Steps
Step 1: Set up the hypotheses.
Null hypothesis (H0): claims no difference/no effect. "The mean scores do not differ before and after surgery (μd = 0)." d = difference between paired observations.
Alternative hypothesis (H1): choose one of the following. "The mean scores differ before and after surgery (μd ≠ 0)." "The mean score after surgery is greater than that before surgery (μd > 0)." "The mean score after surgery is lower than that before surgery (μd < 0)."
Step 2: Choose the test statistic. t = (observed difference - expected difference) / SE of the difference.
Step 3: Compute the p-value using the t-table. df = n - 1, where n = the number of pairs = sample size (in each group).
Step 4: Make a decision. p < 0.05: reject H0. There is a statistically significant difference between anxiety scores before and after discharge in the experimental group, but the anxiety score after discharge was not significantly different from the score before discharge in the control group.
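The four steps can be computed by hand and checked against SciPy's built-in paired test; the anxiety scores below are invented:

```python
import numpy as np
from scipy import stats

# Hypothetical anxiety scores for the same 8 patients before/after discharge
before = np.array([48, 52, 55, 60, 50, 58, 62, 54])
after  = np.array([42, 47, 50, 57, 44, 54, 58, 49])

# Step 2: t = mean difference / (SD of differences / sqrt(n))
d = after - before
n = len(d)
t = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Step 3: two-tailed p-value with df = n - 1
df = n - 1
p = 2 * stats.t.sf(abs(t), df)

# Cross-check against SciPy's paired t-test
t_check, p_check = stats.ttest_rel(after, before)
```

The hand computation and `ttest_rel` agree, and for this made-up data p < 0.05, so H0 would be rejected.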
Test of significance: correlation steps
Step 1: Set up the hypotheses. H0: There is no correlation. H1: There is a correlation.
Step 2: Choose the test statistic: t statistic.
Step 3: Compute the p-value. df = n - 2.
Step 4: Make a decision.
Test of significance: regression steps
Step 1: Set up the hypotheses. H0: β1 = 0; H1: β1 ≠ 0.
Step 2: Choose the test statistic: t = (β1 - 0) / SE of β1.
Step 3: Compute the p-value. df = n - 2.
Step 4: Make a decision.
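These steps can be reproduced with SciPy's `linregress`, which reports the slope, its standard error, and the p-value for H0: β1 = 0. The sleep data below are invented (loosely echoing the sleepiness example later in these notes):

```python
import numpy as np
from scipy import stats

# Hypothetical data: x = sleep hours, y = daytime sleepiness score
sleep_hours = np.array([4, 5, 5, 6, 7, 7, 8, 9])
sleepiness  = np.array([15, 14, 13, 13, 11, 12, 10, 9])

res = stats.linregress(sleep_hours, sleepiness)

# Step 2: t = (b1 - 0) / SE(b1)
t = res.slope / res.stderr
# Step 3: two-tailed p-value with df = n - 2
df = len(sleep_hours) - 2
p = 2 * stats.t.sf(abs(t), df)
# p should match res.pvalue, the p-value linregress reports for the slope
```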
Summary: ANOVA
Tests differences among the means of three or more groups. Uses the F ratio = signal/noise = between-group variance / within-group variance. Signal > noise (= large F ratio): reject the null; "the groups are significantly different." A post-hoc test is needed to confirm which groups differ from each other.
Test-Retest Reliability
Tests if a measure is consistent across time. For example, a test or survey can be administered, then administered again a month later. Barring an event that would have some bearing on results, one can expect similar results if the attribute is stable: quality of life, attitudes about social injustice, knowledge of pharmacology.
Administer the same test to a sample on two separate occasions, then correlate the scores. A reliability coefficient (correlation coefficient) is computed, ranging from -1.00 through .00 to +1.00. A high correlation coefficient is indicative of test-retest reliability: a correlation > 0.8 is desirable, but sometimes a lower correlation such as 0.6 may be accepted.
Why would you not get test-retest reliability for a pain intensity rating? Chronic pain is more stable than acute pain. Why did the RAs in this study ask patients on the second testing occasion whether they had had any major life events? Such events can affect mood and stress.
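Computing the test-retest coefficient is just a Pearson correlation between the two administrations; the two sets of scores below are invented:

```python
import numpy as np

# Hypothetical scores for 6 subjects on two occasions, one month apart
time1 = np.array([30, 25, 40, 35, 28, 33])
time2 = np.array([32, 24, 41, 34, 27, 35])

r = np.corrcoef(time1, time2)[0, 1]
# r > 0.8 -> acceptable test-retest reliability by the rule of thumb above
```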
Tests of significance
Tests of significance: determine whether an observed difference is just due to chance (when the null is true) or is a significant difference.
One-sample t-test: to test one sample mean.
Two-sample t-test: to test the difference between two independent sample means.
Paired t-test: to test the difference between two dependent sample means (usually 'before and after' observations) from the same group.
Chi-square test: to test the frequency distribution of one categorical variable; to test the association (independence) of two categorical variables.
Criteria for a FA
The Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy: 0.90s marvelous, 0.80s meritorious, 0.70s middling, 0.60s mediocre, 0.50s miserable, and below 0.50 unacceptable. (Example: KMO = 0.626.)
Bartlett's Test of Sphericity tests the hypothesis that the correlation matrix is an identity matrix, which would imply that all of the variables are uncorrelated. A significant value leads to rejecting the null hypothesis and concluding that there are correlations in the data set appropriate for factor analysis (example: chi-square = 458.45, p < 0.001). This verifies that the data meet acceptable criteria for a factor analysis.
Reliability: Tests of equivalence
Attempt to determine if similar tests given at the same time give the same results; used when the variable being measured fluctuates over time. Types of tests: alternate form, inter-rater reliability.
Assumptions in two sample t-test
1. Random sampling
2. Because means are compared, the dependent variable should be a continuous variable
3. Normality assumption: data are normally distributed in the population
4. Variances of the two groups are equal
5. The two samples should be independent (= unrelated): participants (or objects) in each group are different. If the two samples are related, do a paired t-test!
This test is for testing outcomes in a population (e.g., testing whether a treatment worked).
Paired t-test assumptions
1. Random sampling
2. Because means are compared, the dependent variable should be a continuous variable
3. Normality assumption: the data, i.e., the differences between paired observations, are normally distributed in the population (or the sample size is large)
ANOVA
Analysis of variance (ANOVA): testing the difference among the means of three or more groups.
ANOVA: step 2
Step 2: Test statistic = F statistic (F ratio) = signal/noise = between-group variance / within-group variance.
Step 3: Compute the p-value using the F ratio.
Step 4: Make a decision.
Test of normality
Although true normality is considered to be a myth, we can check for normality: visually, using normal plots, or with significance tests that compare the sample distribution to a normal one. We test whether the sample data differ from a theoretical normal population (i.e., whether the sample data were drawn from a normally distributed population). Hypotheses in the test of normality: H0: The sample data are not significantly different from a normal population (this is what we want for the normality assumption). H1: The sample data are significantly different from a normal population.
Assumptions for the ANOVA
Random sampling. Dependent (outcome) variable = continuous variable. Normality assumption: data in each population are normally distributed. Equality of variance: populations have equal variance on the dependent variable; do Levene's test to check equality of variance! Independence: groups being compared have no overlap in membership.
More about the Chi-Square (c2) Test
Both variables should be categorical (nominal or ordinal) variables. It is used to compare frequencies or proportions in two or more groups.
-Comparison of the prevalence of hypertension among younger, middle-aged, and older adult groups.
-Comparison of marital status between individuals with and without insomnia.
The chi-square test is a non-parametric test (i.e., we don't need to make assumptions about normality).
Chi-Square (χ2) GOF Test
Step 1: Set up the hypotheses (null and alternative).
Null hypothesis (H0): 30% of college students sleep less than 6 hours, 40% 6-7 hours, 20% 7-8 hours, and 10% more than 8 hours.
Alternative hypothesis (H1): The distribution of sleeping hours differs from the expected distribution in the null.
Now choose the level of significance: alpha = 0.05.
Step 2: Choose the test statistic: χ2 = Σ (O - E)^2 / E.
Step 3: Compute the p-value. For the chi-square test, you will also need the degrees of freedom as well as the test statistic: df = number of categories - 1.
Step 4: Make a decision by comparing the p-value to the level of significance. p < 0.05: reject the null.
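The GOF steps can be run with SciPy's `chisquare`, using the null proportions from Step 1 and hypothetical observed counts for 200 students:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts for the four sleep categories (n = 200)
observed = np.array([75, 70, 35, 20])

# Expected counts from the null: 30%, 40%, 20%, 10% of 200
expected = np.array([0.30, 0.40, 0.20, 0.10]) * observed.sum()

chi2, p = chisquare(observed, f_exp=expected)
# df = number of categories - 1 = 3; compare p to alpha = 0.05
```

For these made-up counts χ2 = 5.625 with df = 3, which is below the critical value, so p > 0.05 and the null distribution is retained.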
Choosing a test of significance
Continuous outcome variable > continuous independent variable > correlation/regression test of significance
Continuous outcome variable > categorical independent variable > one group > one-sample/paired t-test; two groups > two-sample t-test; three or more groups > ANOVA
Categorical outcome variable > continuous independent variable > ; categorical independent variable > chi-square
Concurrent Validity: During
The correlation coefficient between the two scores was robust and significant, r = 0.93 (p < 0.001), which supported concurrent validity. This method provides evidence of both concurrent validity and alternate-forms reliability.
Reliability: Tests of internal consistency
The extent to which all parts of the instrument measure the same concept. Internal consistency provides a useful measure of reliability in structured quantitative instruments. Tests include: Cronbach's alpha.
ANOVA: step 1
The null hypothesis is that the means are all equal. The alternative hypothesis is that at least one of the means is different (= not all the means are the same).
Calculation of F ratio
To calculate variances within and between groups, we need the following information:
•Degrees of freedom (df)
•Sum of squares (SS)
•Mean squares (MS) = SS/df
F ratio = MS between groups / MS within groups. MS within groups indicates variance within groups.
Example: df within = # subjects - # groups = 17 - 3 = 14. Sum of squares (SS) = sum of the squared deviations of each score from its group's mean. Total within-group sum of squares (WSS) = SSexer + SSseden + SSrun = 64.7. MS within groups = WSS/df within = 64.7/14 = 4.6.
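The same bookkeeping (SS, df, MS, then F) can be done by hand and checked against SciPy's one-way ANOVA; the three activity groups below are invented:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical scores for three activity groups
exercise  = np.array([10, 12, 11, 14, 13, 12])
sedentary = np.array([18, 20, 19, 21, 17])
running   = np.array([8, 9, 7, 10, 9, 8])

groups = [exercise, sedentary, running]
all_vals = np.concatenate(groups)
k, n = len(groups), len(all_vals)

# Within-group sum of squares and mean square: MS within = WSS / (n - k)
wss = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = wss / (n - k)

# Between-group sum of squares and mean square: MS between = BSS / (k - 1)
bss = sum(len(g) * (g.mean() - all_vals.mean()) ** 2 for g in groups)
ms_between = bss / (k - 1)

f_manual = ms_between / ms_within
f_scipy, p = f_oneway(*groups)
```

The manual F ratio matches `f_oneway`, and because these made-up group means are far apart relative to the within-group spread, p < 0.05.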
Test of significance: correlation
To examine association between two continuous variables In descriptive statistics, r indicates strength and direction of the association. In the test of significance, we test if r is significant or not.
Basis of ANOVA
To test for a significant difference among three or more group means, ANOVA compares two types of variance: within-group and between-group variance. Within-group variance (noise): how much subject-to-subject variation is there within a group? Between-group variance (signal): how much variation is there between the group means (x1, x2, and x3)?
Inter-Observer (Rater) Reliability
When researchers use more than one observer to rate the same people, events, or places, inter-observer reliability is their goal. If results are similar, we can have more confidence that the ratings reflect the phenomenon being assessed rather than the orientations of the observers.
Nurses working in a pediatric ICU conducted a study to evaluate the effects of music alone versus music plus the mother's voice in calming patients who were mechanically ventilated. The outcome was an objective sedation scale (ordinal-level data). How would a researcher go about measuring inter-rater reliability among the nurses?
Alternate-Forms Reliability
When subjects' answers to slightly different versions of survey questions are correlated, alternate-forms reliability is being tested A researcher may reverse the order of the response choices, modify question wording in minor ways and then administer two forms of the test to subjects If the two sets of responses are not too different, alternate forms reliability is established
Fisher's Exact Test
When using a chi-square test, estimations may not be very accurate if expected counts are less than 5 in over 20% of the cells. In such cases, the Fisher's Exact test is a better choice than the Chi-square test. For a 2x2 table, this means that a Fisher's Exact test is more appropriate if one or more cells have an expected count <5.
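SciPy exposes this alternative directly; the 2x2 table below is hypothetical and deliberately small so that some expected counts fall under 5:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table (e.g., exposure x outcome) with small cell counts
table = [[2, 8],
         [6, 4]]

odds_ratio, p = fisher_exact(table)
# Fisher's exact test computes the p-value exactly from the hypergeometric
# distribution, so it stays valid when expected counts are below 5
```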
Mann-Whitney U test
a.k.a. Wilcoxon rank sum test. Non-parametric counterpart of the independent-sample t-test.
Assumptions: random sampling; independence of the two groups; dependent variable = continuous variable (or some ordinal-level data).
Procedure: combine all of the data from both groups and assign ranks to each of the observations, then find the rank sum of each group.
Data: 71 72 73 73 75 75 78 79 80 82 83 84
Ranks: 1 2 3.5 3.5 5.5 5.5 7 8 9 10 11 12
Hypotheses for the Wilcoxon rank sum test: H0: The data distributions of the two groups are identical. H1: The data distribution of one population is shifted to the left or right of the other.
Example: sum of the rank sums of the two groups = 24.5 + 53.5 = 78. If there were no difference in QoL between smokers and non-smokers, we would expect the sum of the ranks in each group to be 39 (= half of 78), which implies identical distributions. We reject the null hypothesis when the rank sum W of one group is far from its mean.
SPSS: if p < 0.05, the distribution of QoL scores is significantly different between the two groups.
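A sketch with SciPy, splitting the 12 data points above into two hypothetical groups whose rank sums come out to 24.5 and 53.5:

```python
from scipy.stats import mannwhitneyu, rankdata

# Hypothetical QoL scores (the split into groups is assumed for illustration)
smokers     = [71, 72, 73, 75, 75, 78]   # lower scores, rank sum 24.5
non_smokers = [73, 79, 80, 82, 83, 84]   # higher scores, rank sum 53.5

u, p = mannwhitneyu(smokers, non_smokers, alternative='two-sided')

# Verify the rank sums quoted in the notes
ranks = rankdata(smokers + non_smokers)
smoker_rank_sum = ranks[:6].sum()        # first 6 ranks belong to smokers
```

The smokers' rank sum (24.5) is far from its null expectation of 39, so the test rejects H0.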
Finding regression equation from SPSS output
β0 = the "Constant" row of the coefficients table; β1 is in the row below it (the predictor's unstandardized coefficient, B).
SPSS output: correlation test
r = Pearson Correlation.
Cell A: correlation coefficient of height with itself. Cell B: correlation coefficient of height with weight. Cell C: correlation coefficient of weight with height. Cell D: correlation coefficient of weight with itself.
Cells on the diagonal show the r of one variable with itself. Cells above the diagonal = cells below the diagonal (Cell B = Cell C).
Test of significance: regression
y = β0 + β1x. Example: y = 16.365 - 0.405x, where y = daytime sleepiness and x = sleep time.
Interpretation of β1: for every one additional hour a person sleeps, daytime sleepiness decreases by 0.405 points. Sleep time is negatively correlated with daytime sleepiness.